FOREWORD


Some DARPA IU activities for 1992 of interest to the IU community are
given below.

DARPA  Organization 

In 1992 the program structure of the Software and Intelligent Systems Technology
Office (SISTO) was reorganized. As shown in Figure 1, IU projects are now in the
Autonomous Systems portion of Intelligent Systems. (Demo-II is the Unmanned
Ground Vehicle project.) The goals and missions of IU are unchanged.

Applied Technology Demonstration Support

At DARPA the model for insertion of research into the "real world" is to set up
applied technology demonstrations (ATDs) such as RADIUS or the Unmanned
Ground Vehicle (UGV) as practical demonstrations of the ultimate uses of research.


Unfortunately, by the end of a 3 to 5 year ATD the base technology is no longer on
the cutting edge. Rand Waltzman, my predecessor, came up with a solution for
RADIUS. He suggested a Broad Agency Announcement (BAA) timed so as to
activate a set of research projects a year after RADIUS was initiated to deal with
perceived IU technology gaps. Proposed research projects are intended to supply
current technology to RADIUS, but not to be on the critical path. Five such research
projects in IU-RADIUS were initiated in 1992 and are reported on in these
proceedings. We have used the same approach for the UGV project in the area of
reconnaissance, surveillance, and target acquisition (RSTA). A BAA in six areas
related  to  UGV  RSTA  was  announced  in  December  of  1992,  and  a  set  of  awards 
will  be  made  in  early  1993.  We  hope  that  the  results  of  these  studies  will  transition 
into  later  UGV  demonstrations.  Some  care  must  be  exercised  in  implementing  such 
studies: 

•  someone  must  coordinate  and  integrate  the  various  research  efforts  with  the 
ATD 

•  the  original  contract  with  the  ATD  contractor  must  include  tasks  for  interacting 
with  the  researchers  and  for  integrating  the  research  results 

•  software  environment  standards  for  integrating  the  results  of  the  research  into 
the  ATD  must  be  part  of  the  requirements  of  the  research  contracts 

•  the  research  must  not  be  warped  into  development  by  the  pressures  of  the  ATD 
milestones 

The  major  benefits  of  this  approach  are  that  real-world  problems  and 
associated imagery can be made available to the IU community and that advanced
research  can  be  incorporated  into  an  ATD  in  advanced  stages  some  time  after  the 
ATD  has  been  initiated. 

Special IU Workshops

Several special IU workshops were held in 1992. Profs. Ruzena Bajcsy (U Penn)
and Takeo Kanade (CMU) hosted a computational sensors workshop at the U of
Penn in May 1992. Profs. Ryszard Michalski (GMU) and Azriel Rosenfeld (UMd)
hosted a workshop on learning in IU in October 1992. A session on benchmarking
in IU was held at the Principal Investigators workshop in September 1992. Reports
on these workshops appear in these proceedings.


IU/AI Efforts

Typically, IU researchers do not communicate with researchers in AI and vice
versa. An attempt is being made to bring specialists in AI and IU together. Recent
efforts include IU and learning (UMd/GMU), IU and reasoning (ISI/USC), IU and
natural language (SUNY Buffalo), and IU and neural nets (new BAA; contracts to
be awarded early 1993). Although the current efforts are small, it is hoped that they
will  lead  to  more  extensive  AI/IU  interactions. 

Automatic  Target  Recognition  (ATR) 

An  interoffice  DARPA  working  group  on  ATR  has  been  set  up  to  develop  an 
interdisciplinary approach to ATR problems. The participants are Software and
Intelligent  Systems  Technology  Office  (SISTO),  Microelectronics  Technology 
Office  (MTO),  Advanced  Systems  Technology  Office  (ASTO),  and  Defense 
Sciences  Office  (DSO).  A  joint  BAA  on  ATR  focussed  on  university  participation 
was  issued  in  late  1992;  awards  are  expected  by  Spring  of  1993. 

Oscar  Firschein,  DARPA  SISTO 
Program  Manager 
Image  Understanding 

Abstract

Research  in  the  Computer  Vision  Laboratory 
at  Maryland  is  focused  on  both  theoretical  and 
normative  questions  related  to  vision.  This  re¬ 
port  reviews  our  work  on  these  questions  during 
the  period  October  1991-January  1993.  The 
areas  covered  include  navigation,  recognition, 
and  low-level  vision. 

1  Introduction 

Understanding  the  mechanisms  underlying  the  processes 
of  visual  perception  and  creating  machines  with  visual 
capabilities  requires  that  we  answer  several  questions  of 
different  natures.  Among  these  are  theoretical  questions, 
whose answers will establish the range of possible mechanisms
that could exist in intelligent visual systems; and
normative  questions,  whose  answers  will  suggest  what 
classes  of  systems  (animals  or  robots)  would  be  desir¬ 
able  or  optimal  for  a  given  set  of  tasks. 

Our  theoretical  work  on  navigation  is  devoted  to  the 
analysis  of  correspondence  and  the  investigation  of  the 
amount  of  three-dimensional  information  contained  in 
noisy  correspondence  (or  optical  flow)  fields;  as  well  as 
to such issues as the analysis of localization techniques on
natural  terrain  and  the  problem  of  visibility  as  it  relates 
to  path  planning.  Our  research  on  normative  questions 
related  to  navigation  addresses  the  amount  of  informa¬ 
tion  contained  in  normal  flow  fields  that  is  necessary  for 
robustly  solving  various  specific  problems,  as  opposed 
to  problems  of  general  recovery.  Both  aspects  of  our 
navigation-related  research  are  reviewed  in  Section  2. 

Our  theoretical  work  on  recognition  has  concentrated 
on  the  study  of  local  projective  and  affine  invariants, 
while  our  normative  research  on  recognition  has  been 
devoted to the development of a framework for recognizing
an object’s purpose. Section 3 summarizes the main
results  of  our  recognition-related  research. 

We  have  also  developed  a  collection  of  low-level  vi¬ 
sion  techniques  for  image  segmentation,  segmentation  of 
SAR data, and robust estimation, as well as new representations
for objects that facilitate recognition tasks.
Finally,  we  have  worked  in  several  specific  application 
areas  such  as  handwriting,  face  recognition,  aerial  im¬ 
age  understanding,  image  enhancement  and  morphing, 
as well as on the parallelization of image understanding
algorithms. Our research on low-level vision, applications,
and computational aspects is summarized in Section 4.

Appended  to  this  report  is  a  list  of  the  52  technical  re¬ 
ports  on  computer  vision  issued  by  our  Laboratory  dur¬ 
ing  the  period  October  1991-January  1993.  The  num¬ 
bers  in  brackets  in  the  body  of  this  report  refer  to  these 
technical  reports. 

2  Navigation 

Visual  navigation  constitutes  a  problem  which  is  of  con¬ 
siderable  practical  as  well  as  scientific  interest.  Naviga¬ 
tion,  in  general,  refers  to  the  performance  of  sensory- 
mediated  movement,  and  visual  navigation  is  defined  as 
the  process  of  motion  control  based  on  an  analysis  of  im¬ 
ages.  A  system  with  navigational  capabilities  interacts 
adaptively  with  its  environment.  The  movement  of  the 
system  is  governed  by  sensory  feedback  which  allows  it  to 
adapt  to  variations  in  the  environment;  it  does  not  have 
to  be  limited  to  a  small  set  of  predefined  motions,  as  is 
the case, for instance, with cam-activated machinery.

Visual  navigation  encompasses  a  wide  range  of  percep¬ 
tual  capabilities  that  can  be  classified  hierarchically.  At 
the  bottom  of  the  hierarchy  are  low-level  tasks,  such  as 
obstacle  avoidance;  the  top  is  represented  by  high-level 
abilities  like  homing  or  target  pursuit.  As  a  basic  capa¬ 
bility,  however,  every  visual  navigation  system  must  have 
an  understanding  of  visual  motion.  It  should  be  able  to 
estimate  the  three-dimensional  motions  of  objects  in  its 
environment;  even  more  important,  it  should  be  able  to 
determine  its  own  motion.  Naturally,  a  large  part  of  our 
research  is  devoted  to  problems  of  visual  motion  analy¬ 
sis. 

One  way  to  deal  with  the  problem  of  visual  navigation 
is  to  consider  it  as  a  subproblem  of  the  general  structure 
from  motion  problem  (a  theoretical  question).  By  mak¬ 
ing  various  assumptions  we  can  develop  solutions  to  the 
problem  of  token  correspondence.  In  general,  such  solu¬ 
tions  will  involve  errors,  but  we  can  study  ways  of  identi¬ 
fying  special  instances  of  the  problem  in  which  a  robust 
solution  for  structure  and  motion  is  possible.  Our  work 
along  these  lines  is  described  in  detail  in  Section  2.1. 
Section  2.2  is  devoted  to  a  normative  study  of  the  visual 
motion  analysis  problem,  where  we  do  not  attempt  to  es¬ 
timate  feature  correspondences;  rather,  as  input  to  our 
motion  algorithms  we  use  the  spatiotemporal  derivatives 
of the image intensity function (the so-called “normal
flow”). 

2.1  Motion  and  structure  estimation 

2.1.1  Monocular  and  binocular  recovery  of 
motion  and  structure  parameters 

A central problem in vision-based navigation is to use
2-D information from a sequence of images to infer 3-D
motion and structure information. By its very nature
this problem is ill-posed and most of the algorithms discussed
in the literature have proven to be very sensitive
to  even  moderate  levels  of  noise  in  the  images  and  in  the 
calibration  of  the  camera(s). 

Over  the  last  few  years,  we  have  advocated  the  use  of 
feature-based  algorithms  and  long  sequences  of  images 
for  estimating  the  motion  of  the  observer,  the  motions 
of  objects,  and  the  spatial  structure  of  feature  points. 
These  efforts  have  resulted  in  several  robust  algorithms 
which  have  been  successfully  used  for  both  monocular 
and  binocular  real  image  sequences. 

In  [41],  the  problem  of  estimating  the  kinematics  of  the 
moving  camera  and  the  spatial  structure  of  the  objects 
in  a  stationary  environment  is  considered.  Two  estima¬ 
tion  techniques,  batch  and  recursive,  have  been  used. 
The  batch  technique  applies  a  non-linear  least  squares 
method  to  the  stack  of  images,  while  the  recursive  tech¬ 
nique  uses  an  iterative  extended  Kalman  filter  and  an¬ 
alyzes  one  frame  at  a  time.  The  approach  is  based  on 
modeling  the  motion  of  the  camera  using  nine  parame¬ 
ters,  the  3-D  coordinates  of  the  rotation  center  and  the 
linear  and  angular  velocity  components.  A  perspective 
camera model is used. The structure parameters are the
3-D coordinates of the feature points in the inertial coordinate
system. These choices of parameters give rise
to linear plant models, leading to closed form solutions
for the state and covariance transition differential equations.
Time-consuming numerical integration steps are
not needed.
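
As a rough illustration of why a linear plant model avoids numerical integration, the following sketch uses a hypothetical constant-velocity state (position and velocity along one axis), not the nine-parameter model of [41]. For such a model the state transition matrix has the closed form I + F dt, so the state and covariance can be propagated with matrix products alone. The names `transition` and `predict` and the process-noise value `q` are assumptions made for this example.

```python
import numpy as np

def transition(dt):
    """Closed-form state transition for x' = F x with F = [[0, 1], [0, 0]].

    F is nilpotent (F @ F = 0), so the matrix exponential of F * dt
    is exactly I + F * dt.
    """
    F = np.array([[0.0, 1.0], [0.0, 0.0]])
    return np.eye(2) + F * dt

def predict(x, P, dt, q=1e-3):
    """Propagate state x and covariance P over dt without numerical integration."""
    Phi = transition(dt)
    Q = q * dt * np.eye(2)  # simple process-noise term (an assumption)
    return Phi @ x, Phi @ P @ Phi.T + Q

x0 = np.array([0.0, 2.0])   # position 0, velocity 2
P0 = np.eye(2)
x1, P1 = predict(x0, P0, dt=0.5)
```

Because the transition is exact, propagating over two short intervals gives the same result as one long interval, which is not generally true of a numerically integrated nonlinear plant.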

The  inputs  to  the  algorithm  are  feature  point  corre¬ 
spondences  over  the  image  sequence.  The  task  of  au¬ 
tomatically  detecting  and  tracking  features  over  a  long 
sequence  of  consecutive  frames  is  a  challenging  problem 
when the camera motion is significant. In general, feature
displacement over consecutive frames can approximately
be decomposed into two components: (i) the displacement
due to camera motion, which can be compensated
by image rotation, scaling, and translation; (ii) the
displacement due to object motion and/or perspective
projection. The displacement due to camera motion is
usually much larger and more irregular than the displacement
caused by object motion and perspective deformation.
We have developed a two-step approach: First, the
motion  of  the  camera  is  compensated  using  a  recently  de¬ 
veloped  image  registration  algorithm.  Then  consecutive 
frames  are  transformed  to  the  same  coordinate  system 
and  the  feature  correspondence  problem  is  solved  as  one 
of  tracking  moving  objects  using  a  still  camera.  Methods 
of  subpixel  accuracy  feature  matching  and  tracking  are 
introduced. The approach results in a robust and efficient
algorithm. Results on several real image sequences
are presented in two papers that appear elsewhere in
these Proceedings (parts of which have already been reported
in [31, 41, 45]). The monocular algorithm has also
been extended to the case of a binocular moving camera.
For binocular imagery, the traditional stereo triangulation
method fails when the images are not taken by the
two cameras at the same time. But for our algorithm,
since asynchronism is allowed, the two cameras can function
independently (see [45]).

The  methods  summarized  above  have  attempted  to 
automate  the  problem  of  motion  and  structure  recovery 
under  relatively  general  conditions.  In  practical  applica¬ 
tions,  such  as  the  navigation  of  an  autonomous  vehicle 
or a low-flying aircraft, several simplifications are possible;
for example, the 3-D structure of a (small) set
of landmark points may be available from laser radar
range measurements, or approximate vehicle kinematics
may be known from inertial sensors. Batch and recursive
estimation  procedures  for  including  such  additional  in¬ 
formation  from  the  sensors  and  the  scene  are  described 
in  [16].  For  the  situation  where  the  structure  of  a  set 
of  landmark  points  is  known,  the  absolute  pose  and  ve¬ 
locity  of  the  vehicle  and  the  locations  of  the  unknown 
feature  points  can  be  estimated.  When  the  approximate 
vehicle kinematics are known, the ranges of the feature
points and improved estimates of the vehicle kinematics
can be obtained, as described in [16].

2.1.2  MAP  estimation  techniques  [38,  47] 

We  have  developed  a  Maximum  A  Posteriori  (MAP) 
estimation  algorithm  for  calculating  the  camera  motion 
and  the  structure  of  a  (rigid)  scene.  Our  algorithm  as¬ 
sumes  the  motion  to  be  along  a  smooth  trajectory  and 
the  sequence  of  images  to  be  dense,  so  that  the  displace¬ 
ment  between  successive  frames  obtained  by  each  camera 
is  at  most  n  pixels,  where  typically  n  =  2.  We  calculate 
instantaneous  estimates  of  the  focus  of  expansion  (FOE) 
and  of  the  scene  depth  map,  and  keep  updating  these  es¬ 
timates  through  the  sequence.  Our  algorithm  begins  by 
calculating  a  MAP  estimate  of  the  subpixel  displacement 
at  each  point  and  a  confidence  measure  in  that  estimate. 
Using  points  for  which  the  confidence  is  high  we  calcu¬ 
late  MAP  estimates  for  the  FOE  and  the  magnitudes  of 
the  displacements  at  these  points,  hence  their  relative 
depths.  After  determining  the  FOE  we  know  the  direc¬ 
tion  of  displacement  at  every  point  in  the  image  and  we 
can  again  apply  the  MAP  estimation  method  to  get  the 
displacement  magnitude  at  each  point  and  the  associated 
confidence  measure.  This  information  is  propagated  over 
a  long  sequence  of  images  by  using  the  a  posteriori  dis¬ 
tribution  calculated  from  a  set  of  images  as  a  prior  for 
the  next  set  of  images. 
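
The propagation step can be pictured with a scalar Gaussian example. This is a minimal sketch, assuming Gaussian priors and Gaussian measurement noise (an assumption made here for illustration; the distributions used in [38, 47] are not specified in this summary). The function name `fuse` and all numeric values are hypothetical.

```python
def fuse(prior_mean, prior_var, meas, meas_var):
    """Gaussian MAP update: combine a prior with one noisy measurement.

    Returns the posterior mean (also the MAP estimate) and the posterior variance.
    """
    gain = prior_var / (prior_var + meas_var)
    post_mean = prior_mean + gain * (meas - prior_mean)
    post_var = (1.0 - gain) * prior_var
    return post_mean, post_var

# Displacement magnitude at one pixel, refined over successive image sets.
mean, var = 0.0, 4.0                  # broad initial prior
for z in [1.2, 0.9, 1.1, 1.0]:        # hypothetical per-set measurements
    # The posterior from this set becomes the prior for the next set.
    mean, var = fuse(mean, var, z, meas_var=0.25)
```

In this picture the inverse of the posterior variance plays the role of a confidence measure: it grows as more image sets are incorporated.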

We  have  also  developed  a  MAP  algorithm  for  fusing 
monocular  and  stereo  cues  from  two  image  sequences 
to  get  robust  estimates  of  both  motion  and  structure, 
under  the  same  assumptions.  The  algorithm  starts  by 
calculating  the  instantaneous  FOE,  a  MAP  estimate  of 
the  displacement  at  each  pixel,  an  associated  confidence 
measure,  and  a  relative  depth  map,  as  described  above, 
from  one  of  the  two  frame  sequences.  By  calculating 
the disparities at some feature points and using information
about their relative depths we compute the instantaneous
component of velocity in the direction perpendicular
to the image plane. Using this information
a  depth  map  is  calculated;  this  depth  map  is  then  used 
to  derive  a  prior  probability  distribution  for  disparity 
that  is  used  in  matching  the  two  frames  of  the  stereo 
pairs.  We  use  this  method  to  estimate  the  disparity  at 
each  pixel  independently;  no  assumptions  about  surface 
smoothness  are  used.  Both  the  monocular  and  binocular 
algorithms  have  been  successfully  tested  on  real  image 
sequences. 

2.1.3 Frenet-Serret motion [36]

We have formulated a new model, Frenet-Serret motion,
for the motion of an observer in a stationary environment.
This model relates the motion parameters
of the observer to the curvature and torsion of the path
along which the observer moves. We derive screw-motion
equations for Frenet-Serret motion and use them for geometrical
analysis of the motion as well as analysis of
the  resulting  velocity  patterns  in  3-D  and  motion  field 
patterns  on  the  surface  of  the  velocity  egosphere.  We 
use  normal  flow  to  derive  constraints  on  the  rotational 
and  translational  velocity  of  the  observer  and  compute 
egomotion  by  intersecting  these  constraints.  We  analyze 
the  accuracy  of  egomotion  estimation  for  different  com¬ 
binations  of  observer  motion  and  feature  distance.  We 
suggest  that  depth  of  field  should  be  controlled  in  or¬ 
der  to  make  the  analysis  of  egomotion  on  the  basis  of 
normal  flow  possible,  and  we  derive  the  constraints  on 
depth  which  make  either  rotation  or  translation  domi¬ 
nant.  These  ideas  have  been  validated  by  experiments 
on  real  image  sequences. 

2.1.4  Feature-based  and  flow-based  motion 
estimation: a unified view [23]

State-of-the-art  algorithms  for  computing  3-D  motion 
from  images  can  make  use  of  either  feature  correspon¬ 
dences  or  optical  flow.  In  particular,  noise-robust  algo¬ 
rithms  can  be  formulated  for  the  feature-based  two-view 
problem — computing  the  depths  of  the  feature  points 
and  the  camera  motion  from  correspondences  of  feature 
points between two images. For such algorithms, conditions
for decomposability and for uniqueness of the solution,
as well as direct optimization solutions and “critical
surface” conditions, can be formulated. Similarly, noise-robust
algorithms can be formulated that make use of optical
flow; here too, decomposability, uniqueness, direct
optimization, and the “critical surface” can be treated,
and relationships to the algorithms for finite motion can
be analyzed. In both the feature-based and flow-based
cases,  a  simpler  treatment  can  be  given  for  the  case  of 
motion  on  a  planar  surface. 

2.2  Direct  motion  analysis 

We  have  also  addressed  the  problem  of  estimating  3-D 
motion  directly  without  going  through  the  intermedi¬ 
ate  stage  of  optical  flow  or  correspondence  estimation. 
The  inputs  that  we  have  utilized  are  the  spatiotemporal 
derivatives  of  the  image  intensity  function  (the  normal 
flow). 
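
For reference, normal flow is the component of image motion along the local intensity gradient, and it follows from the brightness constancy relation I_x u + I_y v + I_t = 0. The sketch below computes it from two frames with finite differences. The function name `normal_flow` and the threshold `eps` are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def normal_flow(frame0, frame1, eps=1e-6):
    """Return (magnitude, nx, ny): normal flow along the unit gradient (nx, ny).

    From I_x*u + I_y*v + I_t = 0, the flow component along the gradient
    is -I_t / |grad I|. Pixels with gradient magnitude below eps get NaN.
    """
    f0 = frame0.astype(float)
    f1 = frame1.astype(float)
    Iy, Ix = np.gradient(f0)          # derivatives along rows (y) and columns (x)
    It = f1 - f0                      # temporal derivative, unit time step
    grad = np.hypot(Ix, Iy)
    valid = grad > eps
    safe = np.where(valid, grad, 1.0)  # avoid division by zero
    mag = np.where(valid, -It / safe, np.nan)
    nx = np.where(valid, Ix / safe, 0.0)
    ny = np.where(valid, Iy / safe, 0.0)
    return mag, nx, ny

# A horizontal intensity ramp shifted one pixel to the right.
x = np.arange(16, dtype=float)
frame0 = np.tile(3.0 * x, (8, 1))
frame1 = np.tile(3.0 * (x - 1.0), (8, 1))
mag, nx, ny = normal_flow(frame0, frame1)
```

For this ramp the gradient points along x, so the normal flow coincides with the true one-pixel shift; for general images only the gradient-direction component is recovered, which is why the methods in this section avoid assuming a full optical flow field.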

From measurements on the image we can only compute
the relative motion between the observer and any
point in the 3-D scene. The model that has usually been
employed  in  previous  research  to  relate  2-D  image  mea¬ 
surements  to  3-D  motion  and  structure  is  that  of  rigid 
motion.  Consequently,  egomotion  recovery  for  an  ob¬ 
server  moving  in  a  static  world  has  been  treated  in  the 
same  way  as  the  estimation  of  an  object’s  3-D  motion 
relative  to  an  observer.  The  rigid  motion  model  is  ap¬ 
propriate  if  only  the  observer  is  moving,  but  it  holds  only 
for  a  restricted  subset  of  moving  objects,  mainly  man¬ 
made  ones.  Indeed,  virtually  all  objects  in  the  natural 
world  move  non-rigidly.  However,  if  we  consider  only 
a  small  patch  in  the  image  of  a  moving  object,  a  rigid 
motion approximation is legitimate. For the case of egomotion,
data from all parts of the image plane can be
used,  whereas  for  object  motion  only  local  information 
can  be  employed.  We  have  therefore  developed  concep¬ 
tually  different  techniques  for  explaining  the  mechanisms 
underlying  the  perceptual  processes  of  egomotion  recov¬ 
ery  and  3-D  object  motion  recovery. 

We have developed solutions to the following problems:
(a) Given an active observer viewing an object
moving in a rigid manner (translation + rotation), recover
the direction of the 3-D translation and the time
to collision by using only the spatiotemporal derivatives
of the image intensity function. Although this problem
is not equivalent to “structure from motion” because it
does  not  fully  recover  the  3-D  motion,  it  is  of  importance 
in  a  variety  of  situations.  If  an  object  is  rotating  around 
itself  and  also  translating  in  some  direction,  we  are  usu¬ 
ally  interested  in  its  translation — for  example,  in  prob¬ 
lems  related  to  tracking,  prey  catching,  interception  [27], 
obstacle  avoidance,  etc.  The  basic  idea  of  this  motion 
parameter  estimation  strategy  lies  in  the  employment  of 
fixation  and  tracking  [24,  46].  Fixation  simplifies  much 
of  the  computation  by  placing  the  object  at  the  center 
of  the  visual  field,  and  the  main  advantage  of  tracking 
is  the  accumulation  of  information  over  time.  We  have 
shown  how  tracking  is  accomplished  using  normal  flow 
measurements,  and  have  used  it  for  two  different  tasks  in 
the  solution  process:  First,  as  a  tool  to  compensate  for 
the lack of existence of an optical flow field, and to estimate
the translation parallel to the image plane; and second,
to gather information about the motion component
perpendicular to the image plane. (b) Given an active
observer  moving  rigidly  in  a  static  environment,  recover 
the  direction  of  its  translation  and  its  rotation.  This  is 
the  task  of  passive  navigation,  a  term  used  to  describe 
the  set  of  processes  by  which  a  system  can  estimate  its 
motion  with  respect  to  the  environment.  Our  approach 
to  egomotion  estimation  [32]  is  based  on  a  geometric 
analysis  of  the  properties  of  the  normal  flow  field.  The 
fact  that  the  motion  is  rigid  defines  geometric  relations 
between  certain  values  of  the  spatiotemporal  derivatives 
of  the  image  intensity  function.  We  have  proved  that 
the  normal  flow  gives  rise  to  global  patterns  in  the  im¬ 
age  plane.  The  geometry  of  these  patterns  is  related  to 
the  three  dimensional  motion  parameters.  By  locating 
some  of  these  patterns,  which  depend  only  on  subsets 
of  the  motion  parameters,  using  a  simple  search  tech¬ 
nique,  the  3-D  motion  parameters  can  be  found.  The 
algorithmic  procedure  that  we  have  developed  (which  is 
described in a separate paper in these Proceedings) is
provably  robust,  since  it  is  not  affected  by  small  per¬ 
turbations  in  the  local  image  motion  measurements.  In 
fact,  since  only  the  signs  of  the  normal  flow  measure¬ 
ments  are  employed,  the  direction  of  translation  and  the 
axis  of  rotation  can  be  estimated  in  the  presence  of  up 
to  100%  error  in  the  image  measurements. 

2.3 Localization, visibility, and path planning

2.3.1 Localization

We have developed an approach to autonomous localization
of ground vehicles on natural terrain [4]. The
localization problem is solved using measurements including
altitude, heading, and distances to specific environmental
points. Our algorithm utilizes random acquisition
of distance measurements to prune the possible
location(s) of the viewer. The approach is also applicable
to airborne localization. The computational complexity
of an implementation on the Connection Machine and
the accuracy of the localization have been analyzed.

A method for localization and positioning in an indoor
environment has also been developed [33]. We define
localization as the act of recognizing the environment,
and positioning as the act of computing the exact coordinates
of a robot in the environment. Our method is
based  on  representing  the  scene  as  a  set  of  2-D  views  and 
predicting  the  appearances  of  novel  views  by  linear  com¬ 
binations  of  the  model  views.  The  method  accurately 
approximates  the  appearance  of  scenes  under  weak  per¬ 
spective  projection.  Analysis  of  this  projection  as  well 
as  experimental  results  demonstrate  that  in  many  cases 
this  approximation  is  sufficient  to  accurately  describe  the 
scene.  When  the  weak  perspective  approximation  is  in¬ 
valid,  either  a  larger  number  of  models  can  be  acquired 
or  an  iterative  solution  to  account  for  the  perspective 
distortions  can  be  employed.  The  method  has  several 
advantages  over  other  approaches.  It  uses  relatively  rich 
representations;  the  representations  are  2-D  rather  than 
3-D;  and  localization  can  be  done  from  only  a  single  2- 
D  view.  The  same  general  method  is  applied  to  both 
the  localization  and  positioning  problems,  and  a  simple 
algorithm  for  repositioning,  the  task  of  returning  to  a 
previously  visited  position  defined  by  a  single  view,  can 
be  derived  from  this  method. 
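
The linear-combination idea can be illustrated with synthetic data. The sketch below assumes orthographic projection (weak perspective with unit scale) of a rigid point set and fits each coordinate of a novel view as a linear combination of coordinates taken from two model views plus a constant term. All function names and data are hypothetical, and this is not the implementation of [33].

```python
import numpy as np

def rotation(ax, ay, az):
    """Rotation matrix from angles (radians) about the x, y, and z axes."""
    cx, sx = np.cos(ax), np.sin(ax)
    cy, sy = np.cos(ay), np.sin(ay)
    cz, sz = np.cos(az), np.sin(az)
    Rx = np.array([[1, 0, 0], [0, cx, -sx], [0, sx, cx]])
    Ry = np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]])
    Rz = np.array([[cz, -sz, 0], [sz, cz, 0], [0, 0, 1]])
    return Rz @ Ry @ Rx

def project(points, R, t):
    """Orthographic view: rotate, drop depth, add a 2-D image translation."""
    return (points @ R.T)[:, :2] + t

rng = np.random.default_rng(0)
scene = rng.uniform(-1.0, 1.0, size=(30, 3))            # rigid 3-D point set
view1 = project(scene, rotation(0.0, 0.0, 0.0), np.zeros(2))
view2 = project(scene, rotation(0.1, 0.4, 0.0), np.array([0.2, -0.1]))
novel = project(scene, rotation(-0.2, 0.25, 0.3), np.array([0.5, 0.3]))

# Basis built from the model views: x1, y1, x2, and a constant term.
basis = np.column_stack(
    [view1[:, 0], view1[:, 1], view2[:, 0], np.ones(len(scene))]
)

# Fit the coefficients on a few points, then predict the remaining ones.
coeffs, *_ = np.linalg.lstsq(basis[:8], novel[:8], rcond=None)
predicted = basis[8:] @ coeffs
max_error = np.abs(predicted - novel[8:]).max()
```

Under these assumptions the prediction error is at the level of floating-point rounding, and no 3-D model is ever built. With true perspective projection the fit is only approximate, which corresponds to the case where more model views or an iterative correction are needed.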

2.3.2  Visibility  and  path  planning 

We have investigated [29] two classes of parallel algorithms
for point-to-region visibility analysis on terrain:
ray-structure-based methods and propagation-based
methods. A new propagation-based algorithm
has been developed which avoids problems commonly
occurring with such algorithms. The performance and
characteristics of the two kinds of algorithms have been
compared. The sources of uncertainty in visibility computation
and the importance of taking uncertainty into
consideration have been analyzed. Different methods for
representing  the  uncertainty  have  been  studied,  includ¬ 
ing  Monte  Carlo  simulation,  analytic  estimation,  and 
some  simple  heuristic  indicators.  Our  experiments  show 
that  these  indicators  can  be  used  for  efficient  coarse  clas¬ 
sification  of  the  likelihood  of  point  intervisibility. 
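
A Monte Carlo estimate of intervisibility under elevation uncertainty can be sketched as follows. The 1-D terrain profile, the Gaussian elevation-noise model with standard deviation `sigma`, the observer height, and the function names are all assumptions for illustration; [29] treats 2-D terrain and parallel algorithms, which this example does not attempt.

```python
import numpy as np

def visible(profile, observer_height=2.0, target_height=0.0):
    """Line-of-sight test from the first to the last sample of a 1-D profile."""
    n = len(profile)
    z0 = profile[0] + observer_height
    z1 = profile[-1] + target_height
    s = np.arange(1, n - 1) / (n - 1)      # fractional position of interior samples
    sight_line = z0 + s * (z1 - z0)        # height of the ray above each sample
    return bool(np.all(profile[1:-1] <= sight_line))

def visibility_probability(profile, sigma, trials=2000, seed=0):
    """Fraction of noisy terrain realizations in which the target is visible."""
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(trials):
        noisy = profile + rng.normal(0.0, sigma, size=profile.shape)
        hits += visible(noisy)
    return hits / trials

terrain = np.zeros(20)
terrain[10] = 0.8                          # ridge just below the sight line
p_exact = visibility_probability(terrain, sigma=0.0)
p_noisy = visibility_probability(terrain, sigma=0.5)
```

With no elevation noise the target is visible in every trial. Once noise is added the estimated probability drops, which is the kind of likelihood that a cheaper heuristic indicator would try to approximate.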


Current  approaches  to  robot  motion  planning  are  lim¬ 
ited  in  their  ability  to  deal  with  an  uncertain  and  dy¬ 
namically  changing  environment.  We  have  developed  [5] 
a  probabilistic  model  based  on  discrete  events  that  ab¬ 
stract  the  dynamic  interaction  between  the  robot  and 
the  unknown  part  of  the  environment.  The  resulting 
framework  makes  it  possible  to  design  and  evaluate  mo¬ 
tion  planning  strategies  that  consider  both  the  known 
portion  of  the  environment  and  the  portion  that  is  un¬ 
known  but  satisfies  a  probability  distribution.  We  have 
studied  three  instances  of  the  general  model  that  have 
been  useful  in  designing  efficient  motion  planning  algo¬ 
rithms  under  various  assumptions  about  the  robot’s  en¬ 
vironment  and  its  behavior  with  respect  to  unexpected 
events. 

3  Recognition 

The  problem  of  object  recognition  has  been  traditionally 
treated as one of matching image features or recovered
surface  features  with  geometric  object  models.  Such  ap¬ 
proaches  are  primarily  devoted  to  the  robust  detection 
or  recovery  of  features  and  to  handling  the  combinatorial 
complexity  of  the  matching  process.  In  this  spirit,  the 
problem  of  recognition  is  defined  as  finding  regularity 
across  views,  and  the  theories  of  object  recognition  can 
be classified into three main groups: computation of invariant
properties, object decomposition into parts, and
alignment.  In  Section  3.2  our  recent  work  on  invariants 
is  presented, with  emphasis  on  local  projective  and  affine 
invariants.  Section  3.3  is  devoted  to  a  novel  method 
of  two-dimensional  object  segmentation  and  recognition, 
and  Section  3.4  deals  with  our  recent  work  on  alignment 
(pose  estimation).  Section  3.1  describes  our  recent  work 
on  an  alternative  framework  for  recognition. 

3.1  A  framework  for  object  recognition  [10] 

Vision systems that operate in different environments
and perform different visual tasks do not necessarily recognize
objects using similar algorithms. A vision system
that needs to recognize ten types of objects does not necessarily
work in the same way as a system that needs to
recognize one type or a hundred types. A system that
serves  a  rapidly  moving  agent  is  not  necessarily  built  in 
the  same  way  as  a  system  for  a  stationary  agent.  Object 
recognition  should  be  studied  by  taking  into  account  not 
only  the  objects  that  have  to  be  recognized  but  also  the 
agent  that  has  to  perform  the  recognition.  Since  different 
agents,  working  with  different  purposes  in  different  envi¬ 
ronments,  do  not  recognize  visually  in  the  same  manner, 
we  should  not  seek  a  general,  universal  theory  of  object 
recognition.  Instead,  we  should  concentrate  on  develop¬ 
ing  a  methodology  that,  given  an  agent  in  an  environ¬ 
ment,  will  suggest  how  to  perform  particular  recognition 
tasks. 

An  agent  is  a  robot  that  has  visual  (and  other)  sensing 
capabilities  and  is  able  to  carry  out  a  set  of  behaviors. 
These  behaviors  are  direct  results  of  a  set  of  purposes  or 
intentions  that  the  agent  has.  A  behavior  is  identified 
as  anything  that  changes  the  internal  state  of  the  agent 
and  its  relationship  to  the  environment.  Carrying  out  a 
behavior  calls  for  the  performance  of  various  recognition 
tasks. By performing partial recovery of attributes of
an  object,  we  can  find  out  if  the  object  is  suitable  for 
the  desired  purpose.  In  general  an  object  can  be  used 
for  many  purposes.  The  agent  must  recognize  the  one 
needed  to  carry  out  its  behavior. 

Perception  is  a  causal  and  intensional  transaction  be¬ 
tween  the  mind  and  the  world.  The  intensional  content 
of our visual perception is termed “the visual experience”.
When we see a table there are two elements in the
perceptual  situation:  the  visual  experience  and  the  ta¬ 
ble.  The  two  are  not  independent.  The  visual  experience 
has  the  presence  and  features  of  the  table  as  conditions 
of  satisfaction.  The  content  of  the  visual  experience  is 
self-referential  in  the  sense  that  it  requires  that  the  state 
of  affairs  in  the  world  must  cause  the  visual  experience 
which  is  the  realization  of  the  intensional  content. 

When  we  visually  perceive  an  object  we  have  a  visual 
experience.  This  visual  experience  is  an  experience  of 
the  object.  It  may  be  that  the  conditions  of  satisfaction 
are  not  fulfilled.  This  is  the  case  for  illusions,  halluci¬ 
nations,  etc.  The  visual  experience,  and  not  the  world, 
is  at  fault.  The  visual  experience  that  we  have,  in  this 
case,  is  indistinguishable  from  the  visual  experience  we 
would  have  if  we  actually  saw  the  real  object.  The  in¬ 
tensional  content  of  the  visual  experience  determines  its 
conditions  of  satisfaction.  A  visual  experience  in  that 
sense  is  a  mental  phenomenon  which  is  intrinsically  in¬ 
tensional. 

An agent is defined as a set of intentions, $I_1, I_2, \ldots$.
Each intention $I_k$ is translated into a set of behaviors,
$B_{k1}, B_{k2}, \ldots$. Each behavior $B_{ki}$ calls for
the completion of recognition tasks $T_{ki1}, T_{ki2}, \ldots, T_{kin}$.
The agent acts in behavior $B_{ki}$ under intention $I_k$.
The behavior sets parameters for the
recognition  tasks.  Under  one  behavior  a  chair  will  an¬ 
swer  yes  to  a  recognition  task  that  is  looking  for  obsta¬ 
cles,  under  another  behavior  it  will  answer  yes  to  a  task 
that  is  looking  for  a  sitting  place,  and  under  still  another 
it  will  answer  yes  to  a  task  that  is  looking  for  an  assault 
weapon. 
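
As a minimal illustration of how a behavior can parameterize recognition, the chair example can be sketched in Python. The attribute names, thresholds, and behavior labels below are hypothetical and are not taken from [10]; the sketch only shows that one object can answer yes to different recognition tasks depending on the active behavior.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class ObjectAttributes:
    """Hypothetical attributes obtained by partial recovery from the image."""
    height_m: float
    has_flat_top: bool
    liftable: bool


# A recognition task is modeled as a yes/no predicate over recovered attributes.
RecognitionTask = Callable[[ObjectAttributes], bool]


def is_obstacle(obj: ObjectAttributes) -> bool:
    # Anything taller than an assumed 10 cm blocks the path.
    return obj.height_m > 0.1


def is_sitting_place(obj: ObjectAttributes) -> bool:
    # Assumed seat-height range; purely illustrative.
    return obj.has_flat_top and 0.3 <= obj.height_m <= 0.6


def is_weapon(obj: ObjectAttributes) -> bool:
    return obj.liftable


# Each (hypothetical) behavior selects the recognition task it needs.
TASK_FOR_BEHAVIOR: Dict[str, RecognitionTask] = {
    "navigate": is_obstacle,
    "rest": is_sitting_place,
    "defend": is_weapon,
}


def recognize(behavior: str, obj: ObjectAttributes) -> bool:
    """Answer the recognition task that the given behavior calls for."""
    return TASK_FOR_BEHAVIOR[behavior](obj)


chair = ObjectAttributes(height_m=0.45, has_flat_top=True, liftable=True)
wall = ObjectAttributes(height_m=2.5, has_flat_top=False, liftable=False)
```

Here the same `chair` satisfies all three tasks, while `wall` satisfies only the obstacle task; the object is never named, only tested for suitability.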

We view the recognition process along the axis (intention,
behavior, recognition task). For a theory of purposive
object recognition we should be able to make two basic
transformations: first, from a desired intention to the
set of behaviors that achieve it; second, from a specific
behavior to some needed recognition task(s). We have
shown [10] that the intention-to-behaviors problem with
a finite number of behaviors is undecidable, by reducing
the halting problem to it. We believe that the transformation
from behaviors to recognition tasks is also hard.

If we add constraints to our definition of the problem
we can move from undecidability to intractability. For
example, by constraining ourselves to a constant set of
objects we can show a PSPACE-hard lower bound. This
can be shown by reducing to our problem, for example,
that of motion planning for an object in the presence of
movable obstacles, where the final positions of the obstacles
are specified as part of the goal of the motion. The
reduction is straightforward. The set of objects contains
the moving objects and the obstacles. The positions of
the objects are part of the relation set. The intention
encodes the final state. Grasping, pushing and moving are
the behaviors. Solving the intention-to-behaviors problem
gives a solution to the motion planning problem.

We are interested in object utilization; this is not the
same as naming an object. Under our framework an
agent acts in behavior $B_{ki}$ under intention $I_k$. The
behavior calls for the completion of recognition tasks
$T_{ki1}, \ldots, T_{kin}$. The behavior sets parameters for the
recognition tasks. Each recognition task activates a different
collection of basic perceptual modules. Each module
qualitatively finds a generic object property which is
a result of one or a combination of direct low-level
computations on some sensory data (possibly done by other
modules). The result of a module’s operation is given as
a qualitative value. Each module has its own neighboring
open intervals which are parameter-specific. The $i$th
module can take one of the qualitative values $q_{i1}, \ldots, q_{in}$.

The state of our recognition system, denoted by $Q_i$,
is a tuple of all the qualitative values of our modules
$(q_1, \ldots, q_m)$ under recognition task $T_{kij}$. Each recognition
task $T_{kij}$ defines a system state that will constitute
a  positive  answer  to  that  recognition  task.  Recognition 
is  done  when  we  complete  our  task,  which  means  a  sta¬ 
ble  answer  from  our  modules.  The  conditions  for  this 
kind  of  decision  will  not  be  considered  here  and  proba¬ 
bly  should  take  into  account  utility  measures  (frequency 
of  appearance,  network  complexity,  etc.). 

Under  this  framework  learning  can  be  defined  as  the 
process  of  matching  the  “correct”  system  state  with  the 
recognition  task  needed  by  a  certain  behavior.  This  pro¬ 
cess  is  actually  the  reverse  of  recognition.  A  behavior 
creates  a  need  for  an  object.  An  object  is  segmented  by 
low  level  modules,  and  a  system  state  is  achieved.  The 
object  is  tested  and  a  satisfied  result  for  a  needed  be¬ 
havior  starts  the  creation  or  definition  of  a  recognition 
task. 

When we need to perform a given recognition task
$T_{kij}$ under behavior $B_{ki}$ and intention $I_k$, we may assume
that some parameter setting is done by the intention and
the behavior. These parameters fix the setting for the
task, which includes the required system state (some of
the modules might be in don’t-care states) and possibly
some additional “common knowledge” parameters, such
as environmental parameters (outdoor, indoor), predator,
size, etc. From this point of view the recognition
process makes use of high-level information. For further
discussion of object categories and functional modeling
see [10].
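
The notions of qualitative module values and a required system state with don't-care entries can be sketched as follows. This is only an illustration under stated assumptions: the interval boundaries, the module meanings, and the use of `None` as the don't-care marker are hypothetical choices, not the representation used in [10].

```python
from bisect import bisect_right
from typing import Optional, Sequence, Tuple

State = Tuple[int, ...]                 # one qualitative value per module
Required = Tuple[Optional[int], ...]    # None marks a don't-care module


def qualitative_value(measurement: float, boundaries: Sequence[float]) -> int:
    """Return the index of the interval containing the measurement.

    `boundaries` must be sorted; k boundaries delimit k + 1 neighboring
    intervals. A measurement exactly on a boundary is assigned to the upper
    interval, a convention chosen here for simplicity.
    """
    return bisect_right(boundaries, measurement)


def satisfies(state: State, required: Required) -> bool:
    """True if every module agrees with the required state or is don't-care."""
    return len(state) == len(required) and all(
        r is None or r == s for s, r in zip(state, required)
    )


# Three hypothetical modules: height, top-surface flatness, brightness.
state = (
    qualitative_value(0.45, [0.3, 0.6]),   # height falls in the middle interval
    qualitative_value(0.9, [0.5]),         # flatness falls in the upper interval
    qualitative_value(0.1, [0.2, 0.8]),    # brightness falls in the lowest interval
)

sitting_place: Required = (1, 1, None)     # brightness is irrelevant to this task
tall_obstacle: Required = (2, None, None)
```

A recognition task then reduces to a comparison such as `satisfies(state, sitting_place)`.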

3.2  Invariants  [19,  34,  44] 

Invariants are useful in solving major problems associated
with object recognition. For instance, different images
of the same object often differ from each other because
of the different viewpoints from which they were
taken. To match the two images, standard methods thus
need to find the correct viewpoint, a difficult problem
that can involve search in a large parameter space of all
possible points of view and/or finding feature correspondences.
Geometric invariants are shape descriptors, computed
from the geometry of the shape, that remain unchanged
under geometric transformations such as change
of viewpoint. Thus they can be matched without search.
Deformations of objects are another important class of
changes for which invariance is useful.
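
A classical example of a projective invariant, given here only to make the idea concrete, is the cross-ratio of four collinear points. It is not the local invariant developed in our work; the sketch below, with an arbitrarily chosen projective map, simply checks that the descriptor is unchanged by the transformation, so it could be compared directly without recovering the map.

```python
from fractions import Fraction


def cross_ratio(t1, t2, t3, t4):
    """Cross-ratio of four collinear points given by their line coordinates."""
    return ((t1 - t3) * (t2 - t4)) / ((t1 - t4) * (t2 - t3))


def projective_map(t, a, b, c, d):
    """1-D projective transformation t -> (a t + b) / (c t + d), with ad - bc != 0."""
    return (a * t + b) / (c * t + d)


# Exact rational arithmetic avoids any floating-point tolerance in the comparison.
points = [Fraction(0), Fraction(1), Fraction(2), Fraction(3)]

# An arbitrary (hypothetical) change of viewpoint along the line.
mapped = [projective_map(t, 2, 1, 1, 3) for t in points]

before = cross_ratio(*points)
after = cross_ratio(*mapped)
```

The point coordinates themselves change under the map, while `before` and `after` agree.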

We  have  developed  a  new  and  more  robust  method  of 
obtaining  local  projective  and  affine  invariants.  These 
shape  descriptors  are  useful  for  object  recognition  be¬ 
cause  they  eliminate  the  search  for  the  unknown  view¬ 
point.  Being  local,  our  invariants  are  much  less  sensitive 
to  occlusion  than  the  global  ones  used  by  others.  The 
basic  ideas  underlying  our  method  are:  i)  employing  an 
implicit  curve  representation  without  a  curve  parameter, 
thus  increasing  robustness;  ii)  using  a  canonical  coordi¬ 
nate  system  which  is  defined  by  intrinsic  properties  of 
the  shape,  independently  of  any  given  coordinate  sys¬ 
tem,  and  is  thus  invariant.  Several  shape  configurations 
have been treated using this approach: a general curve
without  any  correspondence,  and  curves  with  known  cor¬ 
respondences  of  one  or  two  feature  points  or  lines.  The 
method  is  applied  by  fitting  an  implicit  polynomial  in 
a  neighborhood  of  each  object  contour  point.  It  has 
been  successfully  implemented  for  real  images  of  vari¬ 
ous  two-dimensional  objects  in  three-dimensional  space. 
This  work  is  described  in  detail  in  a  separate  paper  in 
these  Proceedings. 

3.3  Target  recognition  [52] 

A  multilevel  energy  environment  has  been  developed 
that  simultaneously  performs  delineation,  representation 
and  classification  of  two-dimensional  object  shapes  in  an 
image  utilizing  a  global  optimization  technique.  The  en¬ 
ergy  environment  supports  a  novel  multipolar  represen¬ 
tation  which  allows  the  delineation  and  representation 
tasks  to  be  viewed  as  a  single  operation.  The  delin¬ 
eator  acts  as  a  hypothesis  generator  for  the  multipolar 
representation, which uses description length tests to determine
whether to establish new “centers”. Model information
is then utilized at these centers to identify
pieces of objects. In this way occluded objects can be
recognized. This method is more robust than conventional,
multistaged approaches because it incorporates
all known information into a single decision process. It
has been applied to the delineation and classification of
vehicles in FLIR images. Further details on this work
can  be  found  in  a  separate  paper  in  these  Proceedings. 

3.4  Pose  estimation 

We  have  shown  that  the  bounded  error  recognition  prob¬ 
lem  for  images  of  non-planar  three-dimensional  objects 
using  point  features  can  be  decomposed  into  a  set  of  one- 
dimensional  search  tasks,  involving  searches  along  lines 
joining  the  origin  of  the  object  coordinate  system  to  the 
feature points chosen to model the object. Points are
selected  along  these  lines  at  locations  given  by  the  coor¬ 
dinates  of  the  detected  image  points;  concurrent  brack¬ 
eting  of  these  points  by  segment  tree  search  along  each 
line  provides  maximal  matchings  between  feature  points 
and  image  points.  The  depth  of  search  is  limited  by  pixel 
resolution.  This  method  is  well  adapted  to  the  task  of 
tracking  objects  in  the  presence  of  variable  occlusion  and 
clutter. This work is described in greater detail in a separate
paper in these Proceedings. Some of our earlier
work  on  object  pose  estimation  is  described  in  [12]. 

4  Low-level  vision 

4.1  Estimation  and  segmentation 

4.1.1  Image  segmentation 

The  problems  of  image  estimation  and  segmentation 
can  be  cast  in  a  joint  Maximum  A  Posteriori  (MAP) 
framework  using  Gibbs  distributions  defined  over  the  im¬ 
age  intensities  and  over  line  processes  representing  the 
boundaries of image regions. MAP estimation is then
reduced to minimizing an appropriate energy function
defined on the intensity and line processes.

The energy function typically has three components:
(a) a measure of closeness to the data, (b) a weak constraint
which assumes that the image is mostly smooth
except at the discontinuities, and (c) penalties on broken
contours, multiple edges, etc. In its most general form,
the energy is highly non-convex, causing deterministic
relaxation techniques to converge to shallow, local minima.
Stochastic relaxation is not always a viable alternative
due to the computational complexity of the problem.
We are interested in deterministic, continuation methods
to solve the problem.
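
The three components can be made concrete with a one-dimensional sketch. The energy below is a simplified "weak string" form chosen for illustration: the weights `lam` and `alpha` are hypothetical, and component (c) is reduced to a constant penalty per break, omitting the interactions between line elements discussed in the text.

```python
from typing import List, Sequence


def energy(f: Sequence[float], l: Sequence[int], d: Sequence[float],
           lam: float, alpha: float) -> float:
    """Energy of intensity estimate f and binary line process l given data d.

    l[i] = 1 marks a discontinuity between samples i and i + 1.
    """
    data_term = sum((fi - di) ** 2 for fi, di in zip(f, d))           # (a)
    smooth_term = lam * sum((1 - li) * (f[i + 1] - f[i]) ** 2         # (b)
                            for i, li in enumerate(l))
    break_penalty = alpha * sum(l)                                    # (c), simplified
    return data_term + smooth_term + break_penalty


def best_line_process(f: Sequence[float], lam: float, alpha: float) -> List[int]:
    """For fixed f, each l[i] can be chosen independently in this simplified
    energy: a break pays off only where the smoothness cost exceeds alpha."""
    return [1 if lam * (f[i + 1] - f[i]) ** 2 > alpha else 0
            for i in range(len(f) - 1)]


d = [0.0, 0.0, 0.0, 5.0, 5.0, 5.0]        # a noiseless step edge
lines = best_line_process(d, lam=1.0, alpha=2.0)
with_break = energy(d, lines, d, lam=1.0, alpha=2.0)
without_break = energy(d, [0] * 5, d, lam=1.0, alpha=2.0)
```

Marking the single discontinuity costs `alpha`, whereas forcing smoothness across the step costs far more, which is the sense in which the smoothness constraint is weak.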

Alternative  energy  functions  have  been  suggested 
which  depend  primarily  on  the  intensities  and  usually  ig¬ 
nore  the  interactions  between  the  line  processes.  We  can 
utilize  the  insights  gained  by  these  methods  by  showing 
that  each  of  the  alternative  energy  function  sequences 
has  an  equivalent  sequence  in  the  domain  of  the  inten¬ 
sity  and  line  processes.  Interactions  can  then  be  added 
once  the  equivalent  energy  functions  have  been  obtained. 
There  are  many  equivalent  energy  functions  in  the  do¬ 
main  of  the  intensity  and  line  processes;  the  concept  of 
an  uncertainty  function  can  help  us  to  choose  the  proper 
equivalent  energy  function.  The  uncertainty  function  is 
analogous  to  the  entropy  in  a  statistical  mechanical  sys¬ 
tem. 

The resulting algorithm is a combination of the Conjugate
Gradient and the Iterated Conditional Modes algorithms
and is completely deterministic. It has been
applied successfully to the segmentation of aerial images
[1].

A segmentation-based image coding technique has also
been developed [17]. Both uniform and textured region
extraction algorithms are used for segmentation.
Textured regions are reconstructed using 2-D noncausal
Gaussian-Markov random field models. Uniform regions
are reconstructed using polynomial expansions. An
arithmetic coder is used for coding the boundaries of regions.
Reasonable quality images have been obtained at
a compression factor of 82:1.

4.1.2 Segmentation of SAR data [18, 40]

A statistical image model has been developed for segmenting
polarimetric synthetic aperture radar (SAR)
data into regions of homogeneous and similar polarimetric
backscatter characteristics. A model for the conditional
distribution of the polarimetric complex data is
combined with a Markov random field representation for
the distribution of the region labels to obtain the posterior
distribution. Optimal region labeling of the data
is  then  defined  as  maximizing  the  posterior  distribution 
of  the  region  labels  given  the  polarimetric  SAR  complex 
data.  This  MAP  technique  has  been  implemented  on  a 
parallel  optimization  network.  Two  procedures  can  be 
used for selecting the characteristics of the regions: one is
supervised and requires training areas, the other is unsupervised
and is based on multidimensional clustering of
the logarithms of the parameters composing the polarimetric
covariance matrix of the data. Experiments using
real  multilook  polarimetric  SAR  complex  data,  dual  po¬ 
larization  SAR  data,  and  fully  polarimetric  SAR  data 
indicate  that  all  three  types  of  data  yield  generally  sim¬ 
ilar  segmentation  results. 

For  unsupervised  segmentation,  classes  of  polarimet¬ 
ric  backscatter  have  been  selected  based  on  multidimen¬ 
sional  fuzzy  clustering.  The  clustering  procedure  uses 
both  polarimetric  amplitude  and  phase  information,  is 
adapted  to  the  presence  of  image  speckle,  and  does  not 
require  an  arbitrary  weighting  of  the  different  polarimet¬ 
ric  channels;  it  also  provides  a  partitioning  of  each  data 
sample  used  for  clustering  into  multiple  clusters.  Given 
the  classes,  the  entire  image  can  then  be  classified  using 
a  MAP  polarimetric  classifier.  Successful  segmentation 
results  have  been  obtained  using  four-look  polarimetric 
SAR  complex  data  of  lava  flows  and  of  sea-ice  acquired 
by the NASA/JPL airborne polarimetric radar (AIRSAR).
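
The idea that each data sample is partitioned among several clusters can be illustrated with the standard fuzzy c-means iteration. The sketch below is not the procedure of [18, 40]: it works on scalar samples, uses the common fuzzifier m = 2, and makes no attempt to model polarimetric amplitude, phase, or speckle. All names are hypothetical.

```python
from typing import List, Sequence, Tuple


def memberships(x: float, centers: Sequence[float], m: float = 2.0) -> List[float]:
    """Degrees of membership of sample x in each cluster; they sum to 1."""
    dist = [abs(x - c) for c in centers]
    zeros = [i for i, di in enumerate(dist) if di == 0.0]
    if zeros:
        # A sample lying exactly on a center belongs entirely to it.
        return [1.0 / len(zeros) if i in zeros else 0.0
                for i in range(len(centers))]
    power = 2.0 / (m - 1.0)
    inverse = [di ** (-power) for di in dist]
    total = sum(inverse)
    return [v / total for v in inverse]


def fuzzy_c_means(data: Sequence[float], centers: Sequence[float],
                  m: float = 2.0, iterations: int = 50
                  ) -> Tuple[List[float], List[List[float]]]:
    """Alternate membership and center updates for a fixed number of iterations."""
    centers = list(centers)
    for _ in range(iterations):
        u = [memberships(x, centers, m) for x in data]
        centers = [
            sum((u[k][i] ** m) * data[k] for k in range(len(data)))
            / sum(u[k][i] ** m for k in range(len(data)))
            for i in range(len(centers))
        ]
    return centers, [memberships(x, centers, m) for x in data]


data = [0.0, 0.1, 0.2, 5.0, 5.1, 5.2]
final_centers, final_u = fuzzy_c_means(data, centers=[0.0, 5.0])
```

Each row of `final_u` gives the soft partition of one sample; a MAP classifier of the kind described above would then be applied using the resulting classes.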

4.1.3  Robust  estimation  [30] 

Data  processing  for  scientific  and  industrial  tasks  of¬ 
ten  involves  accurate  extraction  of  theoretical  model  pa¬ 
rameters  from  empirical  data,  and  requires  automated 
estimation  methods  that  are  robust  in  the  presence  of 
“noisy”  (i.e.,  contaminated)  data.  Robust  estimation  is 
thus  an  important  statistical  tool  that  is  frequently  ap¬ 
plied  in  numerous  fields  of  science  and  engineering. 

Since  the  computational  complexity  of  an  estimator  is 
one  of  the  most  important  measures  of  its  practicality, 
searching  for  methods  that  reduce  the  time  (and  space) 
complexity  of  robust  estimators  is  a  desirable  research 
goal.  We  have  developed  several  computationally  ef¬ 
ficient  algorithms  for  the  exact  computation  of  robust 
statistical  estimators.  In  particular,  we  have  studied  the 
design  and  analysis  of  such  algorithms  for  various  prob¬ 
lem  domains,  including  line,  curve,  and  surface  fitting. 
In  this  connection,  we  have  developed  a  general  method¬ 
ology  for  the  efficient  computation  of  these  classes  of 
estimators  through  the  application  of  computational  ge¬ 
ometry  techniques.  In  particular,  we  have  developed  ran¬ 
domized  algorithms  for  the  above  tasks  that  have  the  fol¬ 
lowing  properties:  (1)  The  algorithms  always  terminate 
and  return  the  correct  computational  results;  (2)  the  im¬ 
proved  (expected)  running  times  occur  with  extremely 
high  probability;  (3)  the  algorithms  are  quite  easy  to  im¬ 
plement;  (4)  the  constants  of  proportionality  (hidden  by 
the  asymptotic  notation)  are  small  (i.e.,  the  algorithms 
are  practical);  and  (5)  the  algorithms  are  space  optimal 
(i.e.,  they  require  linear  storage). 

The problem of fitting a straight line to a set of data
points is an important task in many application areas.
Recently the computation of linear estimators that are
robust has been recognized as important, since these estimators
are insensitive to outlying data points, which
arise often in practice. We have studied one such robust
estimator [13], Siegel’s repeated median (RM) line estimator,
which achieves the highest possible breakdown
point of 50%, and have developed: (1) a simple practical
randomized RM algorithm that runs in $O(n \log^2 n)$
time with high probability, and (2) a slightly more complex
randomized RM algorithm which performs as well
asymptotically, and which is shown by empirical evidence
to perform in time $O(n \log n)$ on many realistic
input distributions.
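
For reference, the estimator itself can be evaluated directly from its definition in quadratic time. The sketch below is this direct evaluation, not either of the randomized algorithms of [13]; it assumes distinct x-coordinates and, as a common simplification, takes the intercept as the median of the residual offsets.

```python
from statistics import median
from typing import Sequence, Tuple


def repeated_median_line(points: Sequence[Tuple[float, float]]) -> Tuple[float, float]:
    """Siegel's repeated median line, computed directly in O(n^2) time.

    slope = median over i of ( median over j != i of slope(p_i, p_j) )
    Assumes all x-coordinates are distinct.
    """
    inner_medians = []
    for i, (xi, yi) in enumerate(points):
        slopes = [(yj - yi) / (xj - xi)
                  for j, (xj, yj) in enumerate(points) if j != i]
        inner_medians.append(median(slopes))
    slope = median(inner_medians)
    # Simplified intercept: median of the offsets left after removing the slope.
    intercept = median(y - slope * x for x, y in points)
    return slope, intercept


# Five points on y = 2x + 1 and two gross outliers.
pts = [(0, 1), (1, 3), (2, 5), (3, 7), (4, 9), (5, 100), (6, -50)]
rm_slope, rm_intercept = repeated_median_line(pts)
```

With two of seven points corrupted, the fit still recovers the underlying line, which illustrates the high breakdown point.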

Existing  algorithms  for  affine  equivariant  regression 
estimators  with  high  breakdown  point  are  computation¬ 
ally  intensive.  Heuristically,  this  appears  to  be  due 
to  combinatorial  and  geometric  reasons.  Consequently, 
non-affine  estimators  may  allow  faster  computation.  We 
have developed [43] an RM algorithm which runs in
$O(n \log^2 n)$ time, a substantial improvement over the
naive $O(n^2)$ method. The new algorithm allows an empirical
study of this estimator for $n$ up to 40,000. It turns
out  that  the  finite-sample  efficiencies  converge  extremely 
slowly  although  the  estimator  is  asymptotically  normal. 

More  generally,  given  a  set  of  n  distinct  points  in 
d-dimensional  space  that  are  hypothesized  to  lie  on  a 
hyperplane,  robust  statistical  estimators  have  been  re¬ 
cently  proposed  for  the  parameters  of  the  model  that 
best  fits  these  points.  We  have  developed  [48]  efficient 
algorithms for computing median-based robust estimators
(e.g., the RM and Theil-Sen estimators) in high-dimensional
space, by generalizing the computational
geometry techniques that were used to achieve efficient
algorithms in the 2-D case. This yields $O(n^{d-1} \log n)$ expected
time algorithms for the $d$-dimensional Theil-Sen
and RM estimators. Both algorithms are space optimal,
i.e., they require $O(n)$ storage, for fixed $d$. An extension
of  the  methodology  to  nonlinear  domain(s)  such  as  circle 
fitting  has  also  been  demonstrated. 
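
In the 2-D case the Theil-Sen estimator takes a single median over all pairwise slopes, in contrast to the nested medians of the RM estimator. The brute-force sketch below enumerates all pairs and is meant only to fix the definition; it is not the efficient algorithm of [48] and does not address the d-dimensional case. It assumes distinct x-coordinates.

```python
from itertools import combinations
from statistics import median
from typing import Sequence, Tuple


def theil_sen_line(points: Sequence[Tuple[float, float]]) -> Tuple[float, float]:
    """2-D Theil-Sen line: the median of the slopes of all point pairs."""
    slopes = [(y2 - y1) / (x2 - x1)
              for (x1, y1), (x2, y2) in combinations(points, 2)]
    slope = median(slopes)
    intercept = median(y - slope * x for x, y in points)
    return slope, intercept


# Six points on y = 2x + 1 and one gross outlier.
sample = [(0, 1), (1, 3), (2, 5), (3, 7), (4, 9), (5, 11), (6, 100)]
ts_slope, ts_intercept = theil_sen_line(sample)
```

Because only 6 of the 21 pairs involve the outlier, the median slope is unaffected by it.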


4.1.4 Reliability of geometric computations [22]

The  reliability  of  3-D  interpretations  computed  from 
images  can  be  analyzed  in  statistical  terms  by  employ¬ 
ing  a  realistic  model  of  image  noise.  First,  the  reliability 
of  edge  fitting  can  be  evaluated  in  terms  of  image  noise 
characteristics.  Then,  the  reliability  of  vanishing  point 
estimation  can  be  deduced  from  the  reliability  of  edge 
fitting.  The  result  can  then  be  applied  to  focal  length 
calibration,  and  an  optimal  scheme  derived  in  such  a  way 
that  the  reliability  of  the  computed  estimate  is  maxi¬ 
mized.  The  confidence  interval  of  the  optimal  estimate 
can also be computed. We can also evaluate the reliability
of fitting an orthogonal frame to three orientations
obtained by sensing. Finally, we can derive statistical
criteria for testing edge groupings, vanishing points, focuses
of expansion, and vanishing lines.

4.2 Representation and geometry

4.2.1  Multipolar  representation  [3] 

We have developed a new method of delineating and
representing lobed objects, i.e., objects containing multiple
compact parts, by using a multi-polar representation
(MPR). The process begins by constructing a single-center
polar representation somewhere inside the object.
This establishes a 1-D, cyclic Markov Random Field
(1DCMRF) which optimizes edge sharpness and contour
smoothness. The optimization is done using simulated
annealing. The 1DCMRF is segmented into sectors
at significant minima of the radius; these sectors define
lobes. (An alternative is to segment at zero-crossings of
the contour curvature; see [52].) Next, each lobe is assigned
a candidate polar sector representation centered
at the lobe’s centroid. This new set of representations
is then compared to the previous one using a “radius
entropy test” (RET). This test selects the representation
with the highest degree of roundness, i.e., the representation
in which the radii are most uniform in their
sizes. When a new representation supersedes an old one,
each new polar center establishes a 1-D Markov Random
Field for its sector and boundary conditions are determined
between neighboring MRFs. This process continues
recursively within each of these MRFs. To deal with
deep lobes and concavities we define two special classes
of MRFs. Tunnel MRFs are used to explore long narrow
lobes; these MRFs extend the original lobe center by dividing
the lobe’s radii into fore and aft groups. External
MRFs, whose centers lie outside of the object, are used
to delineate concavities. These are detected when significant
contour gaps appear between neighboring MRFs.
These MRFs must also meet the RET criterion in their
creation. The method can operate on raw image data
without preprocessing; the object representation is constructed
simultaneously with the delineation process, an
important advantage when using the representation for
object recognition in real images, as described in [52].
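
The step of segmenting a single-center polar representation into sectors at minima of the radius can be sketched as follows. This is a simplified illustration that starts from an already delineated contour and omits the MRF optimization and the radius entropy test; the significance criterion (`depth_ratio` times the mean radius) is a hypothetical stand-in for whatever test is actually used in [3].

```python
import math
from typing import List, Sequence, Tuple


def radius_profile(contour: Sequence[Tuple[float, float]],
                   center: Tuple[float, float]) -> List[float]:
    """Distance from the polar center to each contour point, in contour order."""
    cx, cy = center
    return [math.hypot(x - cx, y - cy) for x, y in contour]


def significant_minima(radii: Sequence[float], depth_ratio: float = 0.8) -> List[int]:
    """Indices of cyclic local minima that lie well below the mean radius."""
    n = len(radii)
    mean_r = sum(radii) / n
    return [i for i in range(n)
            if radii[i] < radii[i - 1]
            and radii[i] < radii[(i + 1) % n]
            and radii[i] < depth_ratio * mean_r]


def split_into_sectors(n: int, minima: Sequence[int]) -> List[List[int]]:
    """Group the n angular samples into sectors delimited by the minima."""
    if not minima:
        return [list(range(n))]
    ends = list(minima[1:]) + [minima[0] + n]
    return [[k % n for k in range(a, b)] for a, b in zip(minima, ends)]


# A synthetic two-lobed contour: r(theta) = 2 + cos(2 theta), 16 samples.
n = 16
angles = [2 * math.pi * k / n for k in range(n)]
contour = [((2 + math.cos(2 * t)) * math.cos(t),
            (2 + math.cos(2 * t)) * math.sin(t)) for t in angles]

radii = radius_profile(contour, (0.0, 0.0))
minima = significant_minima(radii)
sectors = split_into_sectors(n, minima)
```

Each resulting sector would then receive its own candidate polar center at the lobe's centroid, as described above.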

4.2.2  Multiresolution  curve  representation  [25] 

We have developed a robust method for describing planar
curves at multiple resolutions using curvature information.
The method takes into account the discrete nature
of digital images as well as the discrete aspect of
a multi-resolution structure (pyramid). The robustness
of the technique is due to information that is extracted
by observing the behavior of corners in the pyramid.
The algorithm is conceptually simple and easily parallelizable.
We have analyzed the curvature of continuous
curves in scale-space, and studied the behavior of curvature
extrema under varying scale. These results are
used to eliminate any ambiguities that might arise from
sampling problems due to the discreteness of the representation.

4.2.3  Growth  models  for  shapes  [51] 

We  have  developed  discrete  models  for  growth  of  a 
shape  from  a  point  on  a  two-dimensional  Cartesian  grid. 
By  growth  is  meant  an  accretionary  process  occurring  at 
the boundary of the shape. We have studied three types
of growth models: deterministic (periodic), probabilistic
(stochastic), and probabilistic mixing of deterministic
processes. We have found that the probabilistic models
can produce smooth isotropic or elongated shapes, concavities,
and protrusions.
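
A probabilistic accretionary process on a Cartesian grid can be sketched as below. The specific rule (choosing a boundary cell at random, with a direction-dependent weight `p_horizontal`) is a hypothetical example in the spirit of such models and is not one of the models analyzed in [51].

```python
import random
from typing import Set, Tuple

Cell = Tuple[int, int]


def grow_shape(steps: int, seed: int = 0, p_horizontal: float = 0.5) -> Set[Cell]:
    """Grow a 4-connected shape from the origin, one boundary cell per step.

    Each empty 4-neighbor of the shape is a candidate. It is listed once per
    occupied neighbor, so cells touching more of the shape are more likely.
    Horizontal and vertical accretion are weighted by p_horizontal and
    1 - p_horizontal respectively.
    """
    rng = random.Random(seed)
    shape: Set[Cell] = {(0, 0)}
    for _ in range(steps):
        candidates = []
        weights = []
        for x, y in sorted(shape):
            for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                cell = (x + dx, y + dy)
                if cell not in shape:
                    candidates.append(cell)
                    weights.append(p_horizontal if dy == 0 else 1.0 - p_horizontal)
        shape.add(rng.choices(candidates, weights=weights, k=1)[0])
    return shape


isotropic = grow_shape(50, seed=1)                        # roughly round blob
elongated = grow_shape(20, seed=1, p_horizontal=1.0)      # purely horizontal growth
```

Varying `p_horizontal` between 0.5 and 1.0 moves the result from roughly isotropic shapes toward elongated ones.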

4.2.4  Polygonal  ribbons  [7,  20] 

A  polygonal  ribbon  is  a  finite  sequence  of  polygons 
such  that  each  pair  of  successive  polygons  intersect  ex¬ 
actly  in  a  common  side.  We  have  investigated  various 
geometric  properties  of  such  ribbons  in  two  and  three 
dimensions,  including  properties  such  as  nonselfintersec¬ 
tion,  orientability,  and  twist.  When  the  polygons  are 
all  of  simple  types — for  example,  when  they  are  all  tri¬ 
angles  or  all  rectangles — they  can  be  represented  com¬ 
pactly  in  terms  of  such  quantities  as  vertex  coordinates, 
side  lengths,  and  dihedral  angles.  For  nonselfintersecting 
ribbons,  we  have  established  basic  connectivity  proper¬ 
ties  of  the  ribbon  and  its  border. 

4.2.5  Connectedness  [6] 

In  collaboration  with  Prof.  A.  Nakamura  of  Japan, 
we  have  solved  the  long-outstanding  problem  of  proving 
that  two-dimensional  connected  pictures  over  {0,1}  are 
not  recognizable  by  finite-state  acceptors. 

4.3  Applications 

4.3.1  Aerial  image  understanding 

The University of Maryland (with TASC as a subcontractor)
is one of the group of institutions doing research
on aerial image understanding in support of the RADIUS
program. The emphasis of our research is on knowledge-based
change detection using site models and image analysts’
(IA) domain expertise. Typically, an IA uses a set
of object and background models to build a site model
for an area of interest. Change detection consists of classifying
changes in the imagery into changes due to site
updates, changes due to activity, and irrelevant changes
due to illumination, seasonal variations, etc.

Before  change  detection  can  be  attempted,  the  newly 
acquired  images  have  to  be  registered  to  the  site  model. 
Site  models  can  also  be  used  to  mediate  registration  be¬ 
tween  two  severely  off-nadir  images.  We  are  developing 
methods  of  using  site  models  in  image  registration;  this 
is  especially  important  in  oblique  acquisition  situations 
where  3-D  information  is  critical  due  to  foreshortening, 
etc. 

Even if the task of image registration can be accomplished,
the IA’s expertise is crucial in identifying relevant
changes. This expertise is dependent on the site
and the specific intelligence agenda. We are studying
how image understanding (IU) techniques can aid the
IA in change detection. This will be accomplished by
designing an interface that allows the IA to specify what
are to be considered as relevant changes, and to select
appropriate IU algorithms for detecting these changes.

Once changes have been identified, updating of the site
models may be necessary. We plan to use non-monotonic
reasoning based techniques such as assumption based
truth maintenance systems (ATMS) and their variants to
efficiently perform the searches required for image registration;
to formulate constraints on IU algorithms (e.g.,
for stereo or building detection); and to provide interactive
user guidance in change detection.

A significant part of our research effort will be the
employment of usability analysis techniques and guidance
from experienced IAs to ensure the utility of the
algorithms that we develop. A more detailed paper reporting
our progress to date can be found elsewhere in
these Proceedings.

4.3.2 Trees [11, 39]

Plants  such  as  trees  can  be  modeled  by  three- 
dimensional  hierarchical  branching  structures.  If  these 
structures  are  sufficiently  sparse,  so  that  self-occlusion  is 
relatively  minor,  we  have  shown  that  their  geometrical 
properties  can  be  recovered  from  a  single  image. 

The  distribution  of  leaves  in  a  tree  crown  can  be  mod¬ 
eled  by  a  random  geometric  process.  Statistical  proper¬ 
ties  of  such  distributions  can  then  be  derived,  including 
the  probability  of  seeing  through  the  leaves,  and  the  dis¬ 
tribution  of  leaf  gray  levels  under  various  illumination 
models. 

4.3.3  Faces  [15,  42] 

Faces  represent  one  of  the  most  common  visual  pat¬ 
terns  in  our  environment,  and  humans  have  a  remarkable 
ability  to  recognize  them.  Face  recognition  does  not  fit 
into  the  traditional  approaches  of  model  based  recogni¬ 
tion  in  vision.  Like  most  other  natural  objects,  a  geo¬ 
metrical  interpretation  of  faces  is  difficult  to  achieve.  We 
have  developed  a  feature  based  approach  to  face  recog¬ 
nition,  where  the  features  are  derived  from  the  inten¬ 
sity  data  without  assuming  any  knowledge  of  the  face 
structure.  The  feature  extraction  model  is  biologically 
motivated,  and  the  locations  of  the  features  often  corre¬ 
spond  to  salient  facial  features  such  as  the  eyes,  nose,  etc. 
Topological  graphs  are  used  to  represent  relations  be¬ 
tween  features,  and  a  simple  deterministic  graph  match¬ 
ing  scheme  which  exploits  the  basic  structure  is  used  to 
recognize  familiar  faces  from  a  database.  Each  of  the 
stages  in  the  system  can  be  fully  implemented  in  paral¬ 
lel  to  achieve  real  time  recognition. 

We have also developed an approach to labeling the
components  of  faces  from  range  images.  The  compo¬ 
nents  of  interest  are  those  which  humans  usually  find 
significant  for  recognition.  To  cope  with  the  non-rigidity 
of  faces,  an  entirely  qualitative  approach  is  used.  A  pre¬ 
processing  stage  employs  a  multi-stage  diffusion  process 
to  identify  convexity  and  concavity  points.  These  points 
are  grouped  into  components  and  qualitative  reasoning 
about  possible  interpretations  of  the  components  is  per¬ 
formed.  Consistency  of  hypothesized  interpretations  is 
verified  using  context-based  reasoning. 

4.3.4  Handwriting  [8,  21] 

The  primary  intention  of  the  handwriting  process  is 
to  produce  a  series  of  perceptually  meaningful  strokes 
which  collectively  relay  a  message  to  the  reader.  Un¬ 
fortunately,  the  process  may  be  quite  complex,  so  noise 
is  easily  introduced  and  correct  interpretation  may  not 
be possible in the absence of contextual knowledge (linguistic
or graphic) about the domain. We believe that a handwritten
document should be analyzed within the context of the process
which created it.

The  problem  of  off-line  handwritten  character  recog¬ 
nition  has  eluded  a  satisfactory  solution  for  several 
decades.  Researchers  working  in  the  area  of  on-line 
recognition  have  had  greater  success,  but  the  possibil¬ 
ity  of  extracting  on-line  information  from  static  images 
has  not  been  fully  explored.  The  experience  of  forensic 
document  examiners  assures  us  that  in  many  cases,  such 
information  can  be  successfully  recovered. 

We  have  designed  a  system  for  the  recovery  of  tempo¬ 
ral  information  from  static  handwritten  images,  based 
on  local,  regional  and  global  temporal  clues  which  are 
often  found  in  hand-written  samples,  and  have  shown 
how  these  clues  can  be  recovered  from  an  image.  Our 
approach  attempts  to  understand  the  handwriting  sig¬ 
nal  and  to  perform  a  detailed  analysis  of  stroke  and  sub¬ 
stroke  properties.  We  believe  that  the  recovery  task  re¬ 
quires  that  we  break  away  from  traditional  thresholding 
and  thinning  techniques,  and  we  have  developed  a  frame¬ 
work  for  such  analysis.  We  have  shown  how  temporal 
clues  can  reliably  be  extracted  from  this  framework  and 
have  developed  a  control  structure  for  integrating  the 
partial  information.  Many  seemingly  ambiguous  situa¬ 
tions  can  be  resolved  by  the  derived  clues  and  by  knowl¬ 
edge  about  the  writing  process. 

If  we  view  handwriting  as  a  parameterized  process, 
then  problems  such  as  signature  verification  can  be  posed 
as  the  recovery  of  specific  parameters.  To  demonstrate 
this  approach,  we  have  studied  the  mechanical  aspects 
of  instrument  grasp  and  have  qualitatively  demonstrated 
the  recovery  of  parameters  which  have  stable  and  mean¬ 
ingful  effects  on  the  static  image.  Our  model  for  the 
grasping  of  a  writing  instrument  makes  explicit  the  forces 
exerted  in  the  hand/instrument/paper  system  while  the 
instrument  is  in  motion.  We  have  used  this  model  as 
a  basis  for  analyzing  the  pressure  exerted  on  the  writ¬ 
ing  surface  for  strokes  of  different  orientations.  We  have 
shown  that  relative  pressure  information  is  preserved  in 
static  handwriting  images.  This  information  is  a  valu¬ 
able  heuristic  in  recovering  the  direction  of  motion  of  the 
instrument  which  created  the  stroke,  and  can  be  applied 
to  the  development  of  on-line  recognition  techniques  that 
can  be  used  on  off-line  handwritten  data. 

4.4  Parallel  processing 

Single instruction stream, multiple data stream (SIMD)
processor  array  machines  are  popular  in  practical  par¬ 
allel  computing.  Such  machines  differ  from  one  another 
considerably  in  the  level  of  autonomy  provided  to  each 
processing  element  (PE)  of  the  array.  An  understanding 
of  the  levels  of  autonomy  provided  by  the  architectures  is 
important  in  the  design  of  efficient  algorithms  for  them. 
We  can  classify  SIMD  architectures  into  six  categories 
differing  in  key  aspects  such  as  the  selection  of  the  in¬ 
structions  to  be  executed,  operands  for  the  instructions, 
and  the  source/destination  of  communications. 

The data parallel model of computation used in processor
arrays exploits the parallelism in the data by processing
multiple data elements (pixels, in image analysis)
simultaneously, assigning one PE to each data element.
This scheme does not make efficient use of the processor
array when processing relatively small data structures.
We have developed [37] a technique of data replication
that combines operation parallelism with data
parallelism, to process small data structures efficiently
on  large  processor  arrays.  It  decomposes  the  main  op¬ 
eration  into  suboperations  that  are  performed  simulta¬ 
neously  on  separate  copies  of  the  data  structure.  The 
autonomy  of  the  individual  PEs  is  critical  to  this  de¬ 
composition.  We  have  developed  replicated  data  algo¬ 
rithms  for  several  low  level  image  operations  such  as  his- 
togramming,  convolution,  and  rank  order  filtering  [2]. 
Additionally,  we  have  developed  a  way  of  constructing 
a  replicated  data  algorithm  for  an  operation  automat¬ 
ically  from  an  image  algebra  expression  for  it,  demon¬ 
strating  its  generality.  We  have  also  devised  a  replicated 
data  algorithm  to  compute  single  source  shortest  paths 
on general graphs, demonstrating its applicability beyond
image analysis. We have analyzed the speedup
performance  of  the  algorithms  on  various  interconnec¬ 
tion  networks  to  determine  the  conditions  under  which 
the  technique  results  in  a  speedup.  Implementations  of 
the  algorithms  on  a  Connection  Machine  CM-2  and  a 
MasPar  MP-1  have  yielded  impressive  speedups. 
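
The idea of combining operation parallelism with data parallelism can be illustrated with histogramming. The following is a minimal sequential sketch, not the algorithm of [2] or [37]: it assumes that each copy of a small image is made responsible for a disjoint range of histogram bins, so that on a processor array the copies could work simultaneously. The function name and the bin-partitioning rule are assumptions made for the example.

```python
def replicated_histogram(pixels, num_bins, num_copies):
    """Histogram a small image by splitting the bins among data copies.

    Each copy of the data handles its own slice of bins (a suboperation).
    The outer loop stands in for copies that would run at the same time
    on separate groups of processing elements.
    """
    bins_per_copy = -(-num_bins // num_copies)  # ceiling division
    histogram = [0] * num_bins
    for copy in range(num_copies):
        lo = copy * bins_per_copy
        hi = min(lo + bins_per_copy, num_bins)
        for b in range(lo, hi):
            # Within one copy, every pixel is compared with bin b.
            histogram[b] = sum(1 for p in pixels if p == b)
    return histogram
```

With as many copies as bins, each copy performs a single comparison pass, which is the regime where a large array processing a small image stays busy.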

We  have  also  developed  [37]  a  parallel  search  scheme 
for  the  model-based  interpretation  of  aerial  images  un¬ 
der  a  focus-of-attention  paradigm  and  have  implemented 
it  on  a  CM-2.  Candidate  objects  are  generated  as  con¬ 
nected  combinations  of  the  connected  components  of  the 
image and are matched against the model by checking
if  the  parameters  computed  from  the  region  satisfy  the 
model  constraints.  This  process  is  posed  as  a  search  in 
the  space  of  combinations  of  connected  components  with 
the  finding  of  an  (optimally)  successful  region  as  the  goal. 
Our  implementation  exploits  parallelism  at  multiple  lev¬ 
els  by  parallelising  control  tasks  such  as  the  management 
of  the  open  list.  The  level  of  processor  autonomy  and 
other  details  of  the  architecture  play  important  roles  in 
the  search  scheme. 

We  have  defined  [9]  a  class  of  routing  operations  which 
can be performed in n unit-routes on a hypercube with
2^n nodes. Specifically, we have shown that the conjunction
of two conditions, called wrap-contiguity and
wrap-monotonicity, is sufficient to allow efficient routability.
This result subsumes some earlier results on hypercube
routing.

The use of dynamic programming for stereo matching
has  been  studied  extensively.  It  has  been  pointed  out 
that  this  approach  is  suitable  for  parallel  processing,  but 
there  have  been  no  attempts  to  implement  a  dynamic 
programming  stereo  matching  algorithm  on  a  parallel 
machine. We have developed [26] a massively parallel
implementation of Baker’s dynamic programming stereo
algorithm; our implementation can use many processors
per scanline, compared to a naive approach of one processor
per scanline. This is important because typical
images contain 256 to 1024 scanlines, while massively
parallel  machines  can  have  many  more  processors.  We 
have  also  introduced  a  method  of  handling  inter-scanline 
inconsistencies that is very well suited for parallel implementation.
The method increases the amount of processing
needed to solve the stereo matching problem by only
a small fraction. On a 16K processor Connection Machine
the entire algorithm requires as little as 1 second
for  simple  512  x  512  images. 
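
For readers unfamiliar with dynamic programming stereo, the sketch below shows a generic scanline matcher. It is not Baker's algorithm and not the implementation of [26]: it is a simplified intensity-based variant, and the occlusion cost, the absolute-difference match cost, and the function name are assumptions chosen for illustration.

```python
def scanline_match_cost(left, right, occlusion_cost=1.0):
    """Minimum cost of aligning two scanlines of pixel intensities.

    A pixel is either matched (cost = absolute intensity difference)
    or declared occluded in one of the images (fixed cost).
    """
    n, m = len(left), len(right)
    cost = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i * occlusion_cost
    for j in range(1, m + 1):
        cost[0][j] = j * occlusion_cost
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            match = cost[i - 1][j - 1] + abs(left[i - 1] - right[j - 1])
            skip_left = cost[i - 1][j] + occlusion_cost
            skip_right = cost[i][j - 1] + occlusion_cost
            cost[i][j] = min(match, skip_left, skip_right)
    return cost[n][m]
```

In this recurrence, entries on the same anti-diagonal of the table depend only on earlier anti-diagonals, which is one standard way such a table can be filled by several processors per scanline; whether [26] uses this decomposition is not stated here.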

4.5  Miscellaneous 

4.5.1  Enhancement  [35]  and  morphing  [50] 

We  are  studying  the  use  of  a  multi-stage  physical  dif¬ 
fusion  process  in  early  vision  processing  of  range  images. 
The input range data is interpreted as occupying a volume
in 3-D space. Each diffusion stage simulates the
process  of  diffusing  the  boundary  of  the  volume  into  the 
volume.  The  result  appears  to  be  useful  for  both  discon¬ 
tinuity  detection  and  segmentation  into  shape  coherent 
regions. 

Image  interpolation  and  metamorphosis  can  also  be 
performed  by  using  a  scale  space  created  by  diffusing 
the difference function of the source and the goal images.
This formulation allows us to minimize the need
for  human  intervention  in  the  selection  of  features  in  a 
process  such  as  image  metamorphosis.  The  smooth  tran¬ 
sitions  are  accompanied  by  a  moderated  blurring  that  is 
useful  in  displaying  the  metamorphosis  process.  The  ap¬ 
proach  can  also  be  applied  to  motion  image  sequences  as 
a  method  of  enhancing  animation. 

4.5.2  Matching  [28] 

In its original form the point pattern-matching relaxation
scheme of Ranade and Rosenfeld did not easily
permit  the  representation  of  uncertainty,  and  it  did  not 
exhibit  the  desirable  property  that  confidence  in  consis¬ 
tent  pairings  of  features  should  increase  from  one  itera¬ 
tion  to  the  next.  Because  the  process  of  pooling  intrinsic 
support  with  contextual  support  is  essentially  a  process 
of  evidence  combination,  it  was  suggested  by  Faugeras 
over  ten  years  ago  that  the  evidence  theory  of  Dempster 
and  Shafer  might  be  an  appropriate  framework  for  relax¬ 
ation  labeling.  We  have  implemented  such  an  approach 
and  applied  it  to  the  domain  of  object  recognition  in 
simulated SAR imagery.
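
The evidence combination step at the heart of such a scheme is Dempster's rule. The sketch below implements the standard rule for two mass functions; how [28] maps intrinsic and contextual support onto mass functions is not described here, so the frame {match, no_match} and the numbers in the usage note are purely illustrative assumptions.

```python
def combine(m1, m2):
    """Dempster's rule of combination.

    m1, m2: dicts mapping frozenset hypotheses to masses summing to 1.
    Returns the normalized combined mass function.
    """
    combined = {}
    conflict = 0.0
    for a, mass_a in m1.items():
        for b, mass_b in m2.items():
            common = a & b
            if common:
                combined[common] = combined.get(common, 0.0) + mass_a * mass_b
            else:
                conflict += mass_a * mass_b  # mass assigned to the empty set
    if conflict >= 1.0:
        raise ValueError("sources are in total conflict")
    return {h: mass / (1.0 - conflict) for h, mass in combined.items()}
```

For example, combining intrinsic support of 0.6 and contextual support of 0.5 for the same pairing (with the remaining mass left on the whole frame) yields a belief of 0.8, so confidence in a consistent pairing increases when both sources agree.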

4.5.3  Bibliographies  [14,  49] 

Bibliographies  containing  a  total  of  over  3000  refer¬ 
ences  on  computer  vision,  image  analysis,  and  related 
topics  have  been  prepared  for  the  years  1991  and  1992. 

References 

[1]  A.  Rangarajan  and  R.  Chellappa,  “A  Continua¬ 
tion  Method  for  Image  Estimation  and  Segmenta¬ 
tion”,  CAR-TR-586,  F49620-88-C-0067  and  MIP- 
84-51010,  October  1991. 

[2]  P.J.  Narayanan  and  L.S.  Davis,  “Rank  Order  Filter¬ 
ing  on  Processor  Array  Machines”,  CAR-TR-587, 
DACA76-88-C-0008,  October  1991. 

[3] N.S. Friedland and A. Rosenfeld, “Lobed Object
Delineation  Using  a  Multipolar  Representation”, 
CAR-TR-590,  AFOSR-91-0239,  October  1991. 

[4] Y. Yacoob and L.S. Davis, “Computational Ground
and Airborne Localization Over Rough Terrain”,
CAR-TR-591, DACA76-89-C-0019, November 1991.

[5] R. Sharma, “A Probabilistic Framework for Dynamic
Motion Planning in Partially Known Environments”,
CAR-TR-592, DACA76-89-C-0019 and
IRI-90-57934,  October  1991. 

[6]  A.  Nakamura  and  A.  Rosenfeld,  “Two-Dimensional 
Connected  Pictures  Are  Not  Recognizable  By 
Finite-State  Acceptors”,  CAR-TR-593,  November 
1991. 

[7] P. Bhattacharya and A. Rosenfeld, “Triangulated
Ribbons in Two and Three Dimensions”, CAR-TR-594,
AFOSR-91-0239, November 1991.

[8]  D.S.  Doermann  and  A.  Rosenfeld,  “Recovery  of 
Temporal  Information  from  Static  Images  of  Hand¬ 
writing”,  CAR-TR-595,  December  1991. 

[9]  T.  Bestul,  “A  Class  of  Efficiently  Routable  Hyper¬ 
cube  Operations”,  CAR-TR-596,  December  1991. 

[10] E. Rivlin, Y. Aloimonos, and A. Rosenfeld, “Purposive
Recognition: A Framework”, CAR-TR-597,
DACA76-89-C-0019  and  IRI-90-57934,  December 
1991. 

[11]  A.  Waksman  and  A.  Rosenfeld,  “Sparse,  Opaque 
Three-Dimensional  Texture,  1:  Arborescent  Pat¬ 
terns”,  CAR-TR-598,  DACA76-89-C-0019,  Decem¬ 
ber  1991. 

[12]  D.F.  DeMenthon  and  L.S.  Davis,  “Model-Based 
Object Pose in 25 Lines of Code”, CAR-TR-599,
DACA76-89-C-0019, December 1991.

[13]  D.M.  Mount  and  N.S.  Netanyahu,  “Computation¬ 
ally  Efficient  Algorithms  for  a  Highly  Robust  Line 
Estimator”,  CAR-TR-600,  DACA76-89-C-0019 and 
CCR-89-08901,  December  1991. 

[14]  A.  Rosenfeld,  “Image  Analysis  and  Computer  Vi¬ 
sion:  1991”,  CAR-TR-601,  AFOSR-91-0239,  Jan¬ 
uary  1992. 

[15]  B.S.  Manjunath,  R.  Chellappa,  and  C.  von  der 
Malsburg,  “A  Feature  Based  Approach  to  Face 
Recognition”,  CAR-TR-604,  AFOSR-90-0133,  Jan¬ 
uary  1992. 

[16]  C.  Shekhar  and  R.  Chellappa,  “Visual  Motion  Anal¬ 
ysis  for  Autonomous  Navigation”,  CAR-TR-607, 
N00014-89-J-1598,  March  1992. 

[17]  O.-J.  Kwon  and  R.  Chellappa,  “Segmentation- 
Based  Image  Compression” ,  CAR-TR-608,  MIP-91- 
00655,  March  1992. 

[18]  E.  Rignot  and  R.  Chellappa,  “Segmentation  of  Po- 
larimetric  Synthetic  Aperture  Radar  Data” ,  CAR- 
TR-609,  F49620-92-J-0130,  March  1992. 

[19]  I.  Weiss,  “Local  Projective  and  Affine  Invariants”, 
CAR-TR-612,  DACA76-92-C-0009,  April  1992. 

[20]  P.  Bhattacharya  and  A.  Rosenfeld,  “Rectangular 
Ribbons  in  Two  and  Three  Dimensions” ,  CAR-TR- 
613,  AFOSR-91-0239,  April  1992. 


[21] D.S. Doermann and V. Varma, “Instrument Grasp:
A  Model  and  its  Effects  on  Handwritten  Strokes”, 
CAR-TR-614,  April  1992. 

[22]  K.  Kanatani,  “Statistical  Reliability  of  3-D  Inter¬ 
pretation  from  Images”,  CAR-TR-615,  DACA76- 
92-C-0009,  April  1992. 

[23]  K.  Kanatani,  “Computation  of  3-D  Motion  from 
Images”,  CAR-TR-617,  DACA76-92-C-0009,  April 
1992. 

[24] C. Fermüller and Y. Aloimonos, “Tracking Facilitates
3-D Motion Estimation”, CAR-TR-618,
IRI-90-57934, April 1992.

[25] C. Fermüller and W. Kropatsch, “Multi-Resolution
Shape  Description  by  Corners”,  CAR-TR-619,  IRI- 
90-57934,  April  1992. 

[26]  L.T.  Chen  and  L.S.  Davis,  “Parallel  Stereo  Match¬ 
ing  Using  Dynamic  Programming”,  CAR-TR-620, 
DACA76-92-C-0009,  April  1992. 

[27]  L.  Huang  and  Y.  Aloimonos,  “The  Geometry  of 
Visual  Interception”,  CAR-TR-622,  DACA76-92-C- 
0009  and  IRI-90-57934,  April  1992. 

[28]  P.  Cucka  and  A.  Rosenfeld,  “Evidence-Based 
Pattern-Matching  Relaxation”,  CAR-TR-623, 
AFOSR-91-0239,  May  1992. 

[29]  Y.A.  Teng  and  L.S.  Davis,  “Visibility  Analysis  on 
Digital  Terrain  Models  and  Its  Parallel  Implemen¬ 
tation”,  CAR-TR-625,  DACA76-92-C-0009,  May 
1992. 

[30]  N.S.  Netanyahu,  “Computationally  Efficient  Al¬ 
gorithms  for  Robust  Estimation”,  CS-TR-1898, 
DACA76-89-C-0019  and  AFOSR-91-0239,  May 
1992. 

[31] Q. Zheng and R. Chellappa, “Automatic Feature
Point Extraction and Tracking in Image Sequences
for Arbitrary Camera Motion”, CAR-TR-628,
DACA76-92-C-0009, June 1992.

[32] C. Fermüller and Y. Aloimonos, “Qualitative Egomotion”,
CAR-TR-629, DACA76-92-C-0009 and
IRI-90-57934,  June  1992. 

[33]  E.  Rivlin  and  R.  Basri,  “Localization  and  Position¬ 
ing  Using  Combinations  of  Model  Views”,  CAR- 
TR-631,  DACA76-92-C-0009,  July  1992. 

[34]  I.  Weiss,  “Geometric  Invariants  and  Object 
Recognition”,  CAR-TR-632,  N00014-91-J-1222  and 
DACA76-92-C-0009,  August  1992. 

[35]  Y.  Yacoob  and  L.S.  Davis,  “Early  Vision  Processing 
Using  a  Multi-Stage  Diffusion  Process” ,  CAR-TR- 
633,  DACA76-92-C-0009,  August  1992. 

[36]  Z.  Duric,  A.  Rosenfeld,  and  L.S.  Davis,  “Ego- 
motion  Analysis  Based  on  the  Frenet-Serret  Mo¬ 
tion  Model”,  CAR-TR-634,  AFOSR-91-0239,  Au¬ 
gust  1992. 

[37] P.J. Narayanan, “Effective Use of SIMD Machines
for Image Analysis”, CAR-TR-635, DACA76-92-C-0009,
August 1992.




[38] M. Abdel-Mottaleb, R. Chellappa, and A. Rosenfeld,
“Passive Navigation Using Bayesian Estimation”,
CAR-TR-636, AFOSR-91-0239, August
1992. 

[39]  A.  Waksman  and  A.  Rosenfeld,  “Sparse,  Opaque 
Three-Dimensional  Texture,  2:  Foliate  Pat¬ 
terns”,  CAR-TR-637,  DACA76-92-C-0009,  Septem¬ 
ber  1992. 

[40] E. Rignot, P. Dubois, and R. Chellappa, “Unsupervised
Segmentation of Polarimetric SAR Data Using
the Covariance Matrix”, CAR-TR-638,
F49620-92-J-0130, September 1992.

[41]  T.-H.  Wu  and  R.  Chellappa,  “Experiments  on  Es¬ 
timating  Motion  and  Structure  Parameters  Using 
Long Monocular Image Sequences”, CAR-TR-640,
DACA76-92-C-0009,  October  1992. 

[42]  Y.  Yacoob  and  L.S.  Davis,  “Qualitative  Labeling  of 
Human  Face  Components  from  Range  Data” ,  CAR- 
TR-642,  DACA76-92-C-0009,  October  1992. 

[43]  P.J.  Rousseeuw,  N.S.  Netanyahu,  and  D.M.  Mount, 
“New  Statistical  and  Computational  Results  on  the 
Repeated  Median  Line” ,  CAR-TR-643,  AFOSR-91- 
0239,  October  1992. 

[44]  E.  Rivlin  and  I.  Weiss,  “Local  Invariants  for 
Recognition”, CAR-TR-644, DACA76-92-C-0009
and  F49620-92-J-0332,  October  1992. 

[45]  T.-H.  Wu  and  R.  Chellappa,  “Stereoscopic  Recovery 
of  Egomotion  and  Environmental  Structure:  Mod¬ 
els,  Uniqueness  and  Experiments”,  CAR-TR-646, 
DACA76-92-C-0009, November 1992.

[46] C. Fermüller and Y. Aloimonos, “The Role of Fixation
in Visual Motion Analysis”, CAR-TR-647,
DACA76-92-C-0009  and  IRI-90-57934,  November 
1992. 

[47] M. Abdel-Mottaleb, R. Chellappa, and A. Rosenfeld,
“Binocular Motion Stereo using MAP Estimation”,
CAR-TR-650, F49620-93-1-0039, November 1992.

[48]  D.M.  Mount  and  N.S.  Netanyahu,  “Computation¬ 
ally  Efficient  Algorithms  for  High-Dimensional  Ro¬ 
bust  Estimators”,  CAR-TR-652,  CCR-89-08901 
and  AFOSR-91-0239,  December  1992. 

[49]  A.  Rosenfeld,  “Image  Analysis  and  Computer  Vi¬ 
sion:  1992”,  CAR-TR-653,  F49620-93-1-0039,  Jan¬ 
uary  1993. 

[50]  Y.  Yacoob,  L.S.  Davis,  and  H.  Samet,  “Scale 
Space  Interpolation  and  Metamorphosis”,  CAR- 
TR-654,  DACA76-92-C-0009  and  IRI-90-17393, 
January  1993. 

[51]  S.  Thompson  and  A.  Rosenfeld,  “Discrete  Stochas¬ 
tic  Growth  Models  for  Two-Dimensional  Shapes” , 
CAR-TR-656,  F49620-93-1-0039,  January  1993. 

[52] N.S. Friedland, “Utilizing Energy Function and Description
Length Minimization for Integrated Delineation,
Representation, and Classification of Objects”,
CAR-TR-657, F49620-93-1-0039, January 1993.




Image  Understanding  Research  at  SRI  International 

Martin  A.  Fischler  and  Robert  C.  Bolles 

Artificial  Intelligence  Center 
SRI  International 

333 Ravenswood Ave., Menlo Park, CA 94025
(fischler@ai.sri.com  bolles@ai.sri.com) 


Abstract 

The  image  understanding  program  at  SRI  Interna¬ 
tional  is  a  broad  effort  spanning  the  entire  range  of  ma¬ 
chine  vision  research.  In  this  report  we  describe  our 
progress  in  two  major  scientific  areas:  (1)  recovery  of 
scene  geometry,  and  (2)  semantic  labeling  and  scene 
modeling.  We  have  addressed  the  problems  of  auto¬ 
mated  and  interactive  scene  analysis  in  two  application 
domains:  the  first  is  concerned  with  modeling  the  earth’s 
surface  from  aerial  imaging  sensors;  the  second  is  con¬ 
cerned  with  ground-level  vision  and  vision-based  land 
navigation. 

In  particular,  we  describe  progress  in  automated  and 
interactive  scene  modeling  and  visualization;  in  auto¬ 
matic  image  segmentation  and  delineation  of  both  nat¬ 
ural  and  man-made  objects;  in  detecting  and  tracking 
moving  objects;  and  in  using  knowledge  beyond  shape 
and  immediate  appearance  to  recognize  objects  in  natu¬ 
ral  scenes  and  other  complex  domains. 

1  Introduction 

The  overall  goal  of  Image  Understanding  research  at  SRI 
International  is  to  obtain  solutions  to  fundamental  prob¬ 
lems  in  computer  vision  that  are  essential  in  allowing  ma¬ 
chines  to  model,  manipulate,  and  understand  their  en¬ 
vironment  from  sensor-acquired  data  and  stored  knowl¬ 
edge. 

In  this  report  we  describe  progress  on  the  fundamen¬ 
tal  problems  of  geometric  recovery  and  semantic  label¬ 
ing  in  two  application  domains:  aerial  and  ground-based 
vision.*  The  first  application  domain  is  concerned  with 
modeling  the  earth’s  surface  from  photographs  taken 
from  aircraft  and  satellites;  the  second  application  do¬ 
main  is  concerned  with  modeling  a  natural  environment 
in  real  time  from  data  acquired  by  a  robotic  device  mov¬ 
ing  through,  and  interacting  with,  this  environment. 

*Supported by various Defense Advanced Research Projects
Agency contracts.

In both application domains, the underlying problems
in geometric modeling involve making our recovery
techniques robust enough to operate in complex scene
domains  where  conventional  correlation  based  stereo  is 
known  to  be  inadequate  (e.g.,  in  urban  scenes  where  oc¬ 
clusions  prevent  direct  matching,  and  in  ground-level 
views  of  natural  terrain  where  there  are  radical  depth 
changes  over  very  small  angular  displacements  and  thus 
the  continuity  assumption  is  often  violated).  We  de¬ 
scribe  progress  in  using  multiple  images  to  obtain  an  in¬ 
tegrated  geometric  and  physical  description  of  the  scene 
in  terms  of  surfaces  and  their  reflectance  properties 
rather  than  just  isolated  points.  We  are  also  involved 
in  an  effort  to  assess  the  capabilities  of  existing  stereo 
approaches  and  techniques  to  support  land-based  navi¬ 
gation,  in  developing  techniques  for  building  object  de¬ 
scriptions  that  evolve  gradually  over  time  as  more  data 
are  obtained,  and  in  devising  motion  analysis  techniques 
for  detecting  and  tracking  moving  objects  in  data  taken 
by  moving  sensors. 

The problems of semantic labeling (e.g., object recognition)
are known to be extremely difficult, and we have
taken  two  different  approaches  in  our  work.  Where  it 
is  feasible  for  a  human  to  be  part  of  the  process,  as  is 
typically  true  in  such  tasks  as  building  terrain  and  “site” 
models  from  aerial  imagery,  the  problem  becomes  one  of 
designing  an  effective  interactive  environment  with  the 
proper  man-machine  interface,  feature  extraction  proce¬ 
dures,  and  visualization  capabilities.  In  the  robotic  do¬ 
main,  where  human  intervention  is  limited,  we  have  de¬ 
vised  a  context-based  methodology  for  recognizing  com¬ 
plex  man-made  and  natural  objects,  and  have  developed 
and  refined  a  set  of  procedures  for  delineating  specified 
natural  and  man-made  objects  that  are  “line-like”  in  ap¬ 
pearance  (e.g.,  trees  and  roads). 

An  important  theme  in  much  of  our  current  work  is  an 
emphasis  on  robustness  and  computational  performance 
—  especially  through  the  development  of  algorithms  ca¬ 
pable  of  exploiting  the  parallel  machine  architectures 
now available (e.g., the Connection Machine™).†

†Use of a Connection Machine was provided by DARPA.


2  Geometric  Recovery 

The  goal  of  geometric  recovery  is  to  build  a  three- 
dimensional  structural  description  of  a  scene  to  support 
such  tasks  as  robot  navigation  and  cartographic  mod¬ 
eling.  Ideally  the  description  would  consist  of  several 
interconnected  representations,  including  a  detailed  rep¬ 
resentation  of  the  support  surface  (i.e.,  the  ground),  a  list 
of  material  types  and  semantic  labels  for  all  scene  “ob¬ 
jects,”  and  a  set  of  accurate  transformations  from  the 
local  vehicle  coordinate  system  to  the  global  reference 
system.  For  many  tasks,  especially  robot  navigation,  the 
process  of  building  a  scene  model  should  be  viewed  as  an 
on-going  process  in  which  a  continuous  stream  of  data 
is  used  for  incrementally  updating  the  representations. 
In  practice,  however,  current  scene  modeling  techniques 
typically  analyze  each  snapshot  of  a  scene  independently 
and  produce  a  loose  patchwork  of  representations,  in¬ 
cluding  such  things  as  ground  surface  patches,  clouds  of 
x-y-z points associated with objects, and a set of imprecise
transformations from the local coordinate system to
the global system.

Our  research  goals  in  this  area  are  to  develop  compact 
and  expressive  representations  for  modeling  natural  ob¬ 
jects,  such  as  rocks  and  bushes,  and  to  develop  effective 
techniques  for  incrementally  compiling  a  complete  scene 
model  from  multiple  views.  In  this  section  we  briefly 
describe  our  recent  progress  on  the  following  topics  that 
are  steps  toward  our  broader  goals: 

•  Representation  of  objects  as  three-dimensional  sur¬ 
faces,  instead  of  viewpoint-dependent  two-and-a- 
half-dimensional  surfaces. 

•  Extraction  of  both  depth  and  reflectance  maps  by 
integrating  stereo  and  photometric  analysis. 

•  Use  of  temporal  analysis  to  detect,  track,  and  iden¬ 
tify  independently  moving  objects  in  a  scene. 

•  Evaluation  of  current  stereo  techniques,  including 
characterization  of  their  strengths  and  weaknesses. 

2.1  Three-Dimensional  Object  Models 

In the past we have developed a number of three-dimensional
object representations, including fractal-based
descriptions [Pentland], contextual representations
[Strat & Fischler], and a “representation space”
approach  [Bobick  &  Bolles].  Recently  we  have  devel¬ 
oped  a  triangulated  mesh  model  that  supports  both  ob¬ 
ject  segmentation  and  surface  refinement  techniques.  In 
the  previous  Image  Understanding  Workshop  Proceed¬ 
ings  we  described  a  technique  for  coalescing  clouds  of 
three-dimensional  points  into  a  small  number  of  repre¬ 
sentative  surfaces  [Fua  and  Sander].  In  the  past  year  we 
have concentrated on a specialization of the triangulation
representation that we call hexagonally-connected
triangular meshes. These meshes have the advantage
that  they  can  be  easily  deformed  to  refine  their  local 
shape  so  that  they  satisfy  both  photometric  and  depth 
constraints.  The  use  of  these  meshes  as  part  of  a  tech¬ 
nique  for  integrating  stereo  processing  and  photometric 
analysis is presented in a separate paper in these proceedings
[Fua & Leclerc].

One  advantage  of  a  triangular  mesh  representation  is 
that  many  computers  now  incorporate  special  hardware 
to  support  and  perform  graphics  operations  on  such  rep¬ 
resentations.  This  same  hardware  can  be  used,  with 
appropriate  analysis  routines,  to  predict  such  things  as 
scene  depth  values,  surface  orientations,  and  observed  in¬ 
tensities.  We  have  implemented  some  of  our  techniques 
on  Silicon  Graphics  computers  that  support  these  oper¬ 
ations. 

Since  these  mesh  representations  are  three-dimen¬ 
sional,  they  can  directly  encode  all  aspects  of  an  ob¬ 
ject’s  appearance  in  a  single  structure.  This  structure, 
in  conjunction  with  rendering  techniques,  provides  a  con¬ 
venient  way  to  work  with  complex,  convoluted  objects. 

In  most  of  our  experiments,  we  have  used  regular 
meshes.  While  this  is  appropriate  for  surfaces  whose 
properties  remain  relatively  constant,  it  is  not  optimal 
for  complex  surfaces  that  require  the  combined  efficiency 
and  accuracy  provided  by  irregular  networks.  The  rela¬ 
tively  smooth  parts  of  such  surfaces  can  be  represented 
by  large  patches,  while  the  rougher  parts  could  be  de¬ 
scribed  by  finer,  more  precise  triangulations.  We  are  in 
the  process  of  implementing  irregular  networks  formed 
by  allowing  selected  facets  to  be  subdivided. 

2.2  Integration  of  Stereo  and  Photomet¬ 
ric  Analysis 

Over  the  past  few  years  we  have  investigated  techniques 
for  integrating  stereo  and  photometric  analysis  because 
these two techniques are complementary: stereo works well
when the imagery contains distinctive photometric patterns,
and photometric analysis works well when the imagery only
contains gradual shading. In 1991 we reported the results
of our first technique of this type [Leclerc & Bobick].
Recently we have developed a new approach that
functions well even though we have relaxed several assumptions
commonly used in shape-from-shading techniques.
This new technique computes both the shape
and  reflectance  properties  of  physical  surfaces  from  the 
information  present  in  multiple  images.  It  considers  two 
classes  of  information.  The  first  is  the  information  that 
can  be  extracted  from  a  single  image,  such  as  texture 
gradients,  shading,  and  occlusion  edges.  The  technique 
takes  advantage  of  the  fact  that  multiple  images  enhance 
the  utility  of  this  type  of  information  by  allowing  both 
consistency  checks  to  filter  out  mistakes  and  averaging 
to  improve  precision.  The  second  class  of  information 
includes the stereo depth values computed from two or
more images.

Our surface reconstruction method uses an object-centered
representation, specifically, a hexagonally-connected
three-dimensional mesh of vertices with triangular
facets. Such a representation accommodates the
two  classes  of  information  mentioned  above,  as  well  as 
multiple  images  (including  motion  sequences  of  a  rigid 
object)  and  self-occlusions.  We  have  chosen  to  model  the 
surface  material  using  a  Lambertian  reflectance  model 
with  variable  albedo,  though  generalizations  to  specular 
surfaces  are  possible.  Consequently,  the  natural  choice 
for  the  monocular  information  source  is  shading,  while 
intensity  is  the  natural  choice  for  the  image  feature  used 
in  multi-image  correspondence.  Not  only  are  these  the 
natural  choices  when  we  are  able  to  assume  a  Lamber¬ 
tian  reflectance  model,  they  are  complementary:  inten¬ 
sity  correlation  is  most  accurate  wherever  the  input  im¬ 
ages  are  highly  textured,  and  shading  is  most  accurate 
when  the  input  images  have  smooth  intensity  variation. 
Since  we  wish  to  deal  with  surfaces  with  non-uniform 
albedo,  we  have  developed  a  new  approach  that  ana¬ 
lyzes  the  facet-to-facet  geometry  and  albedo  pattern  to 
recover  surface  models. 
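
The Lambertian model with variable albedo predicts the intensity of a facet as its albedo times the cosine of the angle between the facet normal and the light direction. The sketch below shows this standard relation for one triangular facet; the function names and the single distant light source are assumptions for illustration and are not taken from the method described above.

```python
import math

def facet_normal(p0, p1, p2):
    """Unit normal of a triangular facet given its three 3-D vertices."""
    ux, uy, uz = (p1[i] - p0[i] for i in range(3))
    vx, vy, vz = (p2[i] - p0[i] for i in range(3))
    # Cross product of the two edge vectors.
    nx, ny, nz = uy * vz - uz * vy, uz * vx - ux * vz, ux * vy - uy * vx
    length = math.sqrt(nx * nx + ny * ny + nz * nz)
    return (nx / length, ny / length, nz / length)

def lambertian_intensity(albedo, normal, light_dir):
    """Predicted intensity: albedo * max(0, cos(angle to the light))."""
    length = math.sqrt(sum(c * c for c in light_dir))
    cos_angle = sum(n * l for n, l in zip(normal, light_dir)) / length
    return albedo * max(0.0, cos_angle)
```

A facet facing the light directly returns its albedo, and a facet facing away returns zero, which is why shading constrains facet orientation only where albedo can be assumed to vary slowly.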

We  use  an  optimization  approach  to  reconstruct  the 
surface  shape  and  its  albedo  from  the  input  images.  We 
alter  the  shape  and  reflectance  properties  of  the  surface 
mesh  to  minimize  an  objective  function,  given  an  initial 
surface  estimate  provided  by  other  means,  such  as  a  stan¬ 
dard  stereo  algorithm.  The  objective  function  is  a  lin¬ 
ear  combination  of  an  intensity  correlation  component, 
an  albedo  variation  component,  and  a  surface  smooth¬ 
ness  component.  The  first  two  components  are  a  func¬ 
tion  of  the  intensities  projected  onto  the  triangular  facets 
from  the  input  images  (taking  occlusions  into  account), 
and  are  weighted  according  to  the  amount  of  texture  in 
the  intensities,  for  the  reasons  mentioned  in  the  previ¬ 
ous  paragraph.  The  geometric  smoothness  component 
is  slowly  decreased  during  the  optimization  process  to 
allow  for  an  accurate  final  estimate  of  the  surface  shape 
and  reflectance. 
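
As a minimal sketch of the structure just described, and not the authors' implementation, the objective can be written as a weighted sum whose smoothness weight shrinks over successive optimization passes. The function names, the geometric decay, and all numeric values are illustrative assumptions; the text states only that the combination is linear and that the smoothness component is slowly decreased.

```python
def total_objective(e_correlation, e_albedo, e_smooth,
                    w_correlation, w_albedo, w_smooth):
    """Linear combination of the three component energies."""
    return (w_correlation * e_correlation
            + w_albedo * e_albedo
            + w_smooth * e_smooth)


def smoothness_schedule(w_start, decay, n_passes):
    """Hypothetical schedule: the smoothness weight decays geometrically."""
    return [w_start * decay ** k for k in range(n_passes)]


# Three passes, halving the smoothness weight each time (made-up energies).
weights = smoothness_schedule(w_start=1.0, decay=0.5, n_passes=3)
energies = [total_objective(2.0, 1.0, 4.0, 1.0, 1.0, w) for w in weights]
```

In a real optimizer the component energies would be recomputed from the mesh at every pass; they are held fixed here only to show how the weighting changes the total.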

We  have  implemented  an  algorithm  employing  these 
three  terms  and  have  performed  extensive  experiments 
using  synthetic  images  as  well  as  aerial  and  face  images. 
The  strengths  of  the  approach  include: 

•  The  use  of  the  3-D  surface  mesh  allows  us  to  deal 
with  self-occlusions  and  thus  effectively  merge  infor¬ 
mation  from  several  potentially  very  different  view¬ 
points  to  eliminate  “blind-spots.” 

•  By  combining  stereo  and  shape  from  shading,  and 
weighing  appropriately  the  reliability  of  their  re¬ 
spective  contributions,  we  can  obtain  results  that 
are  better  than  those  produced  by  either  technique 
alone. 

•  Using  the  facets  to  perform  the  stereo  computation 
frees us from the constant-depth assumption that
standard correlation-based stereo techniques make.
It  becomes  possible  to  recover  accurately  the  depth 
of  sharply  sloping  surfaces  (such  as  that  of  a  sharp 
ridge). 

•  The  shape-from-shading  component  does  not  make 
the  constant-albedo  assumption  common  to  most 
shading  algorithms.  Instead,  we  only  make  the 
weaker  and  much  more  general  assumption  that 
albedos vary slowly across textureless areas.

More  complete  details  of  this  technique  can  be  found  in 
a separate paper in these proceedings [Fua & Leclerc].

2.3  Moving  Object  Detection,  Tracking, 
and  Recognition 

Our  goal  in  this  research  effort  is  to  develop  auto¬ 
mated  methods  for  producing  three-dimensional  models 
of  scenes  containing  moving  objects.  Our  approach  is 
to  analyze  sequences  of  temporally  coherent  images,  be¬ 
cause  they  provide  the  machine  with  both  “redundant” 
information  and  new  information  about  the  scene.  The 
redundant  information  can  be  used  to  increase  the  pre¬ 
cision  and  reliability  of  computed  models;  the  new  in¬ 
formation  can  be  used  to  extend  models  into  previously 
unseen  areas.  Recently  we  have  developed  two  new  tech¬ 
niques  of  this  type.  One  is  a  real-time  technique  designed 
to  provide  feedback  within  an  “active  vision”  paradigm. 
The  other  integrates  object  recognition  into  the  track¬ 
ing  process  in  order  to  bridge  gaps  in  tracking-continuity 
caused  by  such  things  as  occlusion  and  low-level  process¬ 
ing  mistakes. 

The  first  technique,  which  is  the  product  of  a  joint  ef¬ 
fort  between  Xerox  PARC,  Stanford  University,  and  SRI, 
produces  motion  results  (or  stereo  disparities)  at  10  to  15 
hertz.  It  has  been  implemented  on  two  multi-processor 
configurations,  a  16k-processor  Connection  Machine  and 
a  5-processor  VX/MVX  graphics  accelerator  system 
(200-MIPS).  With  these  systems  we  have  demonstrated 
real-time  control  of  a  five  degree-of-freedom  camera  sys¬ 
tem  tracking  a  person  walking  around  a  room  [Woodfill]. 

The  second  technique  incorporates  object  recognition 
procedures  into  the  tracking  process  in  order  to  improve 
tracking  reliability  and  facilitate  object  identification. 
Our  strategy  has  involved  four  steps.  First,  we  “train” 
the  system  to  recognize  an  object,  such  as  a  truck,  by 
showing  it  to  the  system  from  several  viewpoints.  Sec¬ 
ond,  given  an  image  sequence  of  the  truck  moving  in 
front  of  the  camera  system,  we  apply  our  “weaving-wall” 
tracking technique [Baker & Garvey] to build a temporal
model  of  the  moving  objects  in  the  sequence.  Third,  we 
apply the PRS recognition system [Chen & Mulgaonkar]
to  identify  the  truck  in  individual  images.  And  fourth, 
we use the recognition results to “explain” discontinuities
in tracking the various objects so that we can produce
a more coherent description of the motion in the
scene.

2.4 Evaluation of Current Stereo Techniques

Stereo  analysis,  which  for  a  long  time  had  been  viewed  as 
an  interesting,  but  too-costly-to-be-practical  technique, 
has  emerged  as  a  viable  tool  for  realtime  applications, 
such  as  vehicle  navigation.  This  has  happened  for  two 
reasons.  First,  advances  in  hardware  have  made  it  prac¬ 
tical  to  compute  stereo  matches  “in  real  time.”  And 
second,  advances  in  algorithm  development  have  made 
it  possible  to  correctly  match  large  portions  of  outdoor 
scenes. 

An  important  next  step  in  the  development  and  use  of 
practical  stereo  systems  is  the  characterization  of  their 
capabilities.  Potential  users,  such  as  system  integrators 
and  automatic  task-planners,  need  to  know  the  tech¬ 
niques’  computational  requirements,  their  speeds,  the 
precision  of  their  results,  their  common  mistakes,  etc. 
in  order  to  model  the  behavior  of  these  stereo  systems 
and  reason  about  their  use.  With  this  in  mind,  SRI, 
JPL,  and  Teleos  began  a  multi-phase  evaluation  process 
last  year  within  the  DARPA  Unmanned  Ground  Vehicle 
(UGV)  Project.  The  first  phase  of  that  evaluation  has 
been  completed  and  the  second  phase  has  begun. 

The  overall  plan  for  that  evaluation  was  (and  con¬ 
tinues  to  be)  to  pursue  a  three-pronged  approach,  in¬ 
cluding  analytic  models,  qualitative  “behavioral”  mod¬ 
els,  and  statistical  performance  models.  The  analytic 
models  would  be  used  to  estimate  such  things  as  the  ex¬ 
pected  depth  precision  computable  with  a  specific  cam¬ 
era  configuration.  The  qualitative  models  would  be  used 
to  identify  key  problems  for  future  research,  for  exam¬ 
ple,  detection  of  holes,  analysis  of  shadowed  regions,  and 
depth  measurements  in  bland  areas.  The  statistical  mod¬ 
els  would  be  used  to  produce  quantitative  estimates  of 
such  key  factors  as  the  smallest  obstacle  detectable  at  a 
specified  distance.  SRI  has  taken  the  lead  in  the  quali¬ 
tative  evaluation;  JPL  has  taken  the  lead  in  the  quanti¬ 
tative  analysis. 
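
The text does not say which analytic models were used. As one illustration of the kind of estimate meant, the sketch below applies the standard first-order relation for a rectified, parallel-axis stereo pair: depth Z = fB/d, so a disparity error maps to a depth error of roughly Z² times that error divided by fB. The function and parameter names are assumptions made for this example.

```python
def depth_from_disparity(focal_px, baseline_m, disparity_px):
    """Depth in meters for a rectified, parallel-axis stereo pair."""
    return focal_px * baseline_m / disparity_px


def depth_precision(focal_px, baseline_m, depth_m, disparity_error_px):
    """First-order depth uncertainty caused by a given disparity error."""
    return depth_m ** 2 * disparity_error_px / (focal_px * baseline_m)


# Hypothetical camera: 1000-pixel focal length, 0.5 m baseline.
z = depth_from_disparity(1000.0, 0.5, 50.0)   # depth of a 50-pixel disparity
dz = depth_precision(1000.0, 0.5, z, 0.5)     # effect of a half-pixel error
```

Because the error grows with the square of the depth, doubling the range quadruples the expected uncertainty for the same camera configuration.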

For  the  qualitative  analysis  we  decided  to  start  by  ex¬ 
amining  a  small  number  of  techniques  in  order  to  debug 
the  process,  and  then  expand  the  evaluation  to  include 
a  much  larger  set  of  participants.  The  goals  of  the  first 
phase  were  to  get  an  initial  estimate  of  the  effectiveness 
of current stereo techniques applied to UGV tasks, to
identify  key  problems  for  future  research,  and  to  debug 
the  evaluation  process. 

One  of  the  high-level  guidelines  we  adopted  was  to  de¬ 
velop  and  maintain  an  atmosphere  of  cooperation  and 
constructive  criticism  among  the  researchers  participat¬ 
ing  in  the  evaluation.  Without  this  we  would  not  be 
able to focus on our ultimate goal of producing a sequence
of increasingly capable stereo systems. To help
establish a cooperative atmosphere we decided to concentrate
on the positive aspects of each technique and
emphasize  potential  extensions,  realizing  that  existing 
techniques  were  developed  for  different  domains  and  dif¬ 
ferent  applications.  We  also  decided  to  share  all  the  raw 
results  with  the  participants  so  they  could  duplicate  our 
analysis  or  develop  their  own. 

For  the  first  phase  of  the  qualitative  evaluation, 
SRI  collected  imagery  from  five  groups,  JPL,  INRIA 
(in  France),  SRI,  CMU,  and  Teleos  (hence  the  name 
“JISCT”  for  the  first  evaluation  phase);  selected  49  im¬ 
age  pairs  for  analysis;  converted  them  into  a  standard 
format;  distributed  the  dataset  to  the  five  groups  for 
processing,  along  with  an  extensive  set  of  instructions; 
collected  the  results;  characterized  them;  and  finally  dis¬ 
tributed  the  results  and  the  associated  report  to  the  par¬ 
ticipants. 

We  intentionally  asked  each  group  to  process  a  large 
number of pairs (10 training pairs and 45 “test” pairs;
6 pairs were in both the training and test sets), because
we wanted to force each group to establish a standard algorithm
that was automatically applied. As a result of this
approach,  there  are  now  3  or  4  groups  around  the  world 
that  can  readily  apply  end-to-end  stereo  techniques  to 
new  data  and  compare  their  results.  As  part  of  the  sec¬ 
ond  phase  we  hope  to  expand  this  community  to  10  or 
more  groups.  This  process  has  opened  up  a  new  form  of 
interaction  within  the  computer  vision  community  that 
we  feel  will  help  stimulate  advances  and  reduce  redun¬ 
dant  development. 

In  the  instructions  to  the  participants  we  asked  each 
group  to  produce  several  results  for  each  match  point 
in  addition  to  its  computed  disparity.  For  each  point 
we  asked  for  an  x  and  a  y  disparity,  an  estimate  of  the 
precision  associated  with  each  reported  disparity,  an  es¬ 
timate  of  the  confidence  associated  with  each  match,  and 
an  annotation  for  each  unmatched  point,  indicating  why 
the  technique  could  not  find  a  match.  Possible  explana¬ 
tions  for  no  match  included  “area  too  bland,”  “multiple 
choices,”  and  “inconsistent  with  neighbors.”  Although 
none  of  the  groups  produced  all  this  additional  informa¬ 
tion  (they  all  produced  some  of  it),  we  felt  that  it  was 
important  to  begin  the  process  with  the  goal  of  produc¬ 
ing  this  auxiliary  information,  which  will  be  invaluable 
for  the  higher-level  routines  using  the  stereo  results.  We 
foresee  a  time  in  the  not  too  distant  future  when  the 
calling  routine  will  use  the  precisions,  confidences,  and 
annotations  to  actively  control  the  sensor  parameters  for 
the next data acquisition step. For example, if the current
stereo results contain a large region of no disparities
and  the  image  regions  are  quite  dark,  the  controlling  rou¬ 
tine  could  open  the  irises  or  increase  the  integration  time 
to  re-examine  these  dark  regions. 
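
The per-point output requested above, and the kind of sensor-control decision it could support, can be sketched as follows. This is a hypothetical illustration: the record layout, the field names, the darkness threshold, and the decision rule are all assumptions, not the format actually distributed to the participants.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MatchResult:
    """One image point's stereo output (hypothetical layout)."""
    dx: Optional[float] = None             # x disparity, None if unmatched
    dy: Optional[float] = None             # y disparity, None if unmatched
    precision: Optional[float] = None      # estimated disparity precision
    confidence: Optional[float] = None     # estimated match confidence
    no_match_reason: Optional[str] = None  # e.g. "area too bland"

    @property
    def matched(self):
        return self.dx is not None and self.dy is not None


def should_increase_exposure(results, intensities,
                             dark_level=40, min_fraction=0.5):
    """Suggest opening the iris when most unmatched points are dark."""
    unmatched = [i for r, i in zip(results, intensities) if not r.matched]
    if not unmatched:
        return False
    dark = sum(1 for i in unmatched if i < dark_level)
    return dark / len(unmatched) >= min_fraction
```

A calling routine would pass the results for a region together with the corresponding image intensities and act on the returned flag before the next acquisition step.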

To  assist  in  the  analysis  of  the  results,  SRI  developed 
two  sets  of  routines,  one  to  gather  statistics  and  one  to 
display the disparities in a variety of ways. Since we
did not have ground truth for the distributed imagery,
we  were  not  able  to  compare  the  computed  disparities 
with  objective  values.  However,  we  were  able  to  gather 
statistics  on  two  of  the  three  types  of  mistakes  that  we 
were  interested  in  by  outlining  selected  regions  in  the  im- 
agery  and  counting  the  occurrence  of  results/no-results 
within  these  regions.  We  made  a  distinction  between  the 
following three types of mistakes:

False  Negatives:  No  disparities  computed  for  points 
that  should  have  results. 

False Positives in Unmatchable Regions: Disparities
reported  for  points  that  don’t  have  matches  in  the 
second  image,  for  example,  points  occluded  in  one 
image  or  points  out  of  the  field  of  view  of  one  of  the 
images. 

False  Positives  in  Matchable  Regions:  Incorrect  dis¬ 
parities  reported  for  matchable  points. 

By  interactively  outlining  regions  of  occluded  points,  re¬ 
gions  of  points  out  of  the  field  of  view  of  the  second 
image,  and  regions  of  points  in  the  sky,  we  were  able  to 
directly  measure  statistics  for  the  first  two  types  of  mis¬ 
takes.  In  addition,  we  outlined  regions  corresponding  to 
expected  problems,  such  as  dark  shadows,  foliage,  and 
bland  areas.  In  this  way  we  could  gather  statistics  on  the 
behavior  of  the  algorithms  on  these  special  problems. 
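
The region-based counting described above can be sketched as follows. This is an illustrative reconstruction, not SRI's statistics routines; it assumes the disparity map has been reduced to a 2-D grid of booleans (result or no result) and that each outlined region is a boolean mask of the same shape.

```python
def region_counts(has_disparity, region):
    """Count points with and without a result inside an outlined region."""
    with_result = without_result = 0
    for disp_row, region_row in zip(has_disparity, region):
        for has_disp, inside in zip(disp_row, region_row):
            if inside:
                if has_disp:
                    with_result += 1
                else:
                    without_result += 1
    return with_result, without_result


def false_negative_rate(has_disparity, matchable_region):
    """Fraction of matchable points for which no disparity was reported."""
    hit, miss = region_counts(has_disparity, matchable_region)
    return miss / (hit + miss) if hit + miss else 0.0


def false_positive_rate(has_disparity, unmatchable_region):
    """Fraction of unmatchable points (occluded, out of view, or sky)
    for which a disparity was reported anyway."""
    hit, miss = region_counts(has_disparity, unmatchable_region)
    return hit / (hit + miss) if hit + miss else 0.0
```

The third type of mistake, incorrect disparities at matchable points, cannot be measured this way because it requires ground-truth disparity values.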

A high-level summary of the results of the first-phase
evaluation is as follows:

•  We  were  surprised  by  the  completeness  of  the  re¬ 
sults.  Even  though  the  dataset  contained  a  wide 
range  of  imagery,  including  some  sequences  designed 
to  stretch  the  analysis  along  specific  dimensions, 
such  as  noise  tolerance  and  disparity  range,  the 
stereo  systems  computed  disparities  for  64%  of  the 
matchable  points.  On  8  image  pairs  selected  to 
be  the  most  appropriate  for  UGV  applications,  the 
techniques  computed  disparities  for  up  to  87%  of  the 
points.  Although  the  missing  points  (and  mistakes 
in  the  reported  matches)  could  cause  problems  for 
vehicle  navigation,  this  level  of  completeness  is  an 
indication  that  there  is  a  solid  basis  for  building  a 
passive  ranging  system  for  an  outdoor  vehicle. 

•  For  the  UGV-related  imagery  the  number  of  gross 
errors  was  relatively  small,  ranging  from  a  few 
“spike”  errors  to  small  regions  of  mistakes.  We  esti¬ 
mate  that  there  were  between  1  and  5%  gross  errors 
in  these  results.  Many  of  these  errors  would  have 
to  be  eliminated  in  order  for  the  data  to  be  used 
directly  for  planning  navigable  routes. 

•  The  stereo  systems  made  different  mistakes,  most  of 
which  could  be  explained  by  their  correlation  patch 
size, search technique, or match verification technique.
However, since they made different mistakes,
there is a possibility of combining them in a way to
check  each  other  and  fill  in  missing  data. 

•  All  the  stereo  systems  could  be  improved  signifi¬ 
cantly  with  a  relatively  small  amount  of  effort.  This 
was  the  first  test  of  this  type,  requiring  the  analy¬ 
sis  of  a  large  dataset,  and  it  uncovered  some  weak¬ 
nesses  in  the  different  stereo  systems  that  can  be 
corrected.  One  area  to  be  considered  is  the  devel¬ 
opment  of  pre-analysis  techniques  to  automatically 
set  key  parameters,  such  as  patch  size  and  search  ar¬ 
eas  (as  Teleos  did).  Another  place  for  improvement 
is  in  the  filtering  of  results,  eliminating  matches  that 
differ  significantly  from  their  neighbors  (as  SRI  did). 

•  There  were  a  few  surprises,  such  as  Teleos’s  suc¬ 
cessful  solution  to  one  set  of  image  pairs  from  CMU 
that  includes  a  carpet  with  a  repetitive  pattern  on 
it.  Teleos’s  large  patches  were  able  to  detect  large 
regions  of  subtle  differences,  which  allowed  recovery 
of  the  correct  disparities. 

Additional  information  about  the  JISCT  evaluation,  its 
results,  and  our  goals  for  the  second  phase,  can  be  found 
in a separate paper in these proceedings [Bolles, Baker,
& Hannah].

3  Interactive  Modeling 

Our  work  in  the  area  of  Interactive  Modeling  is  con¬ 
cerned  with  the  development  of  an  interactive  environ¬ 
ment  and  associated  feature  extraction  and  visualization 
techniques  to  enable  effective  human  assisted  scene/site 
model  construction  -  especially  for  applications  in  the 
areas  of  cartography,  intelligence  analysis,  and  mission 
planning. 

3.1 Interactive Techniques for Scene
Modeling: A Cartographic Modeling Environment

Manual  photointerpretation  is  a  difficult  and  time- 
consuming  step  in  the  compilation  of  cartographic  in¬ 
formation.  However,  fully  automated  techniques  for  this 
purpose  are  currently  incapable  of  matching  the  human’s 
ability  to  employ  background  knowledge,  common  sense, 
and reasoning in the image-interpretation task. Near-term
solutions to computer-based cartography must include
both interactive extraction techniques and new
ways  of  using  computer  technology  to  provide  the  end- 
user  with  useful  information  in  the  form  of  both  image 
and  map-like  interactive  computer  displays. 

In  order  to  support  research  in  semiautomated  and 
automated  computer-based  cartography,  we  have  de¬ 
veloped  the  SRI  Cartographic  Modeling  Environment 
(CME).  In  the  context  of  an  interactive  workstation- 
based  system,  the  user  can  manipulate  multiple  images; 
camera models; digital terrain elevation data; point, line,
and  area  cartographic  features;  and  a  wide  assortment 
of  three-dimensional  objects.  Interactive  capabilities  in¬ 
clude  free-hand  feature  entry,  feature  editing  in  the  con¬ 
text  of  task-based  constraints,  and  adjustment  of  the 
scene  viewpoint.  Synthetic  views  of  a  scene  from  arbi¬ 
trary  viewpoints  may  be  constructed  using  terrain  and 
feature  models  in  combination  with  texture  maps  ac¬ 
quired  from  aerial  imagery.  This  ability  to  provide  an 
end-user  with  an  interactively  controlled  scene-viewing 
capability  could  eliminate  the  need  to  produce  hard¬ 
copy  maps  in  many  application  contexts.  Additional  ap¬ 
plications  include  high-resolution  cartographic  compila¬ 
tion,  direct  utilization  of  cartographic  products  in  digital 
form,  and  generation  of  mission-planning  and  training 
scenarios. 

In  cooperation  with  General  Electric,  SRI  has  en¬ 
hanced  the  CME  to  fully  support  the  needs  of  the  RA¬ 
DIUS  Program  (as  described  below). 

3.1.1 The RADIUS Common Development Environment

Progress  in  image  understanding  has  often  been  ham¬ 
pered  by  the  difficulty  involved  in  sharing  results  and 
software  among  collaborating  institutions.  This  prob¬ 
lem  is  being  addressed  within  the  RADIUS  Program 
through  development  of  a  common  development  envi¬ 
ronment.  RADIUS  is  a  government  sponsored  program 
whose  aim  is  to  support  image  analysts  through  the  con¬ 
struction  and  use  of  site  models.  Image  understanding 
techniques  are  expected  to  play  a  major  role,  in  both 
the  extraction  of  cartographic  features  to  populate  site 
models,  as  well  as  in  the  employment  of  site  models  in 
more  detailed  scene  analysis.  Transfer  of  research  results 
within  RADIUS  is  being  facilitated  through  the  use  of  a 
common development environment, based on SRI’s Cartographic
Modeling Environment (as described above).

The  resulting  environment  has  been  named  the  RA¬ 
DIUS  Common  Development  Environment  (RCDE). 
The  RCDE  has  been  distributed  to  participants  in  the 
RADIUS  Program  and  is  now  available  for  distribu¬ 
tion  to  all  members  of  the  DARPA  Image  Understand¬ 
ing  community.  Widespread  adoption  of  the  RCDE 
throughout the IU community has the potential for payoffs
at 3 levels:

-  Sharing  research  results  among  the  various  research 
laboratories  to  foster  collaborative  work  and  to  build  on 
the  successes  of  others. 

-  Validation  of  research  results  by  other  laboratories  to 
ensure the soundness of the work, to compare alternative
techniques,  and  to  motivate  further  investigations. 

-  Technology  transfer  from  research  laboratories  to  de¬ 
velopment  organizations  which  also  utilize  the  common 
development  environment. 

The RCDE is the culmination of many years of research
at SRI and GE on interactive programming environments
for image understanding. SRI and GE are
pleased  to  be  able  to  make  it  available  for  use  by 
other  image  understanding  laboratories.  (See  references 
[CME91, RCDE92, RCDE93a, and RCDE93b].)

3.2  Terrain  Modeling  and  Visualization 

The  DARPA  sponsored  MAGIC  (Multidimensional  Ap¬ 
plications  and  Gigabit  Internetwork  Consortium)  project 
has  been  established  to  develop  a  very  high-speed,  wide- 
area  networking  testbed  that  will  demonstrate  real¬ 
time,  interactive  exchange  of  data  at  gigabit-per-second 
(Gbps)  rates  among  multiple  distributed  servers  and 
clients.  Participants  in  the  project  include  a  variety  of 
organizations  from  government,  industry,  and  academia. 

The  SRI  role  in  this  project  includes  the  design  and 
implementation  of  the  MAGIC  terrain  visualization  ap¬ 
plication,  as  well  as  the  production  of  the  digital  eleva¬ 
tion  models  (DEMs)  and  high-resolution  ortho-rectified 
aerial  images  (ortho-images)  used  by  the  application. 

This  application  will  allow  a  U.S.  Army  commander  to 
“fly  over”  or  “walk/drive  through”  a  battlefield  at  his  or 
her  own  speed  in  real  time.  Views  of  the  (on-going)  bat¬ 
tle  available  for  selection  will  range  from  low-resolution, 
wide-area coverage to high-resolution, limited-area coverage
and will include fixed and mobile objects such as
buildings  and  vehicles.  The  positions  of  mobile  objects 
will  be  updated  in  real  time. 

What makes this application unique is that the database for the area
of interest is quite large (tens to hundreds of gigabytes),
and  that  the  data  must  be  accessed  across  a  high-speed 
network.  In  the  first  phase  of  the  project,  to  be  com¬ 
pleted  by  the  end  of  1993,  the  area  of  interest  is  a  900 
sq. km. exercise area of the National Training Center
at  Fort  Irwin,  California.  The  full-color  ortho-images 
alone,  at  one  meter  ground-level  resolution,  require  2.7 
gigabytes  of  storage.  The  next  phases  will  involve  much 
larger  areas. 
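
The 2.7-gigabyte figure is consistent with a simple pixel count. The sketch below assumes three bytes per pixel for full color and decimal gigabytes; neither the pixel format nor the unit convention is stated in the text.

```python
AREA_SQ_KM = 900
METERS_PER_KM = 1000
BYTES_PER_PIXEL = 3  # assumption: 24-bit color, one byte per channel

# At one-meter resolution there is one pixel per square meter.
pixels = AREA_SQ_KM * METERS_PER_KM ** 2
total_bytes = pixels * BYTES_PER_PIXEL
gigabytes = total_bytes / 1e9
```

Under these assumptions the count comes to 900 million pixels and 2.7 billion bytes for the finest pyramid level alone.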

The size of the database, the need for real-time network
access, and the wide range of views have led us to
represent the ortho-images and DEMs as a “tiled Gaussian
pyramid.” That is, the ground-level distance between
pixels doubles from one level to the next (starting
with one-meter resolution for the ortho-images, and
32  meters  for  the  DEMs),  and  each  level  is  divided  into 
equal-size image windows called “tiles.” Using equal-size
tiles  (initially  planned  to  be  128  x  128  pixels)  facilitates 
the  storage,  transmission,  and  display  of  the  data. 
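
The layout of such a pyramid can be sketched as below. The code computes only the tiling geometry (how many tiles each level needs); it does not perform the Gaussian filtering itself. The stopping rule (stop once a level fits in a single tile) and the square 30 km by 30 km footprint used in the example are assumptions, since the text gives only the total area.

```python
import math


def pyramid_levels(width_px, height_px, tile_px=128):
    """List (level, pixel-spacing multiplier, tiles across, tiles down),
    from the finest level to the first level that fits in one tile."""
    levels = []
    level = 0
    w, h = width_px, height_px
    while True:
        cols = math.ceil(w / tile_px)
        rows = math.ceil(h / tile_px)
        levels.append((level, 2 ** level, cols, rows))
        if cols == 1 and rows == 1:
            break
        # Pixel spacing doubles, so each dimension halves (rounded up).
        w = math.ceil(w / 2)
        h = math.ceil(h / 2)
        level += 1
    return levels


# Hypothetical square footprint: 30,000 x 30,000 pixels at one meter.
layout = pyramid_levels(30000, 30000)
total_tiles = sum(cols * rows for _, _, cols, rows in layout)
```

Because every level uses the same tile size, a single fixed-size buffer and a single request format serve all resolutions.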

Although  the  network  itself  has  a  very  high  band¬ 
width,  there  are  inherent  delays  in  requesting  tiles  from 
a  disk-based  storage  system  (ISS)  across  large  distances. 
Consequently,  the  application  must  request  tiles  well  in 
advance  of  their  being  required  for  display.  This  requires 
processes  for  predicting  future  viewpoints,  determining 
which tiles are visible from a particular viewpoint, requesting
these visible tiles in an appropriately prioritized
fashion,  and  receiving  and  buffering  the  tiles  in  a  shared- 
memory  data  structure  accessible  to  the  display  process. 

Also,  the  application  must  be  able  to  deal  with  lost 
and/or  delayed  tiles  in  a  relatively  seamless  fashion.  This 
is  accomplished  in  two  ways.  First,  the  tiled  Gaussian 
pyramid  representation  allows  the  entire  area  of  interest 
to  be  represented  at  some  coarse  resolution  using  a  rela¬ 
tively  small  number  of  tiles.  These  coarse-resolution  tiles 
can  reside  entirely  in  the  rendering  engine,  thereby  guar¬ 
anteeing  that  any  viewpoint  can  be  rendered  instantly, 
no  matter  how  radical  the  change  in  viewpoint,  albeit  at 
a coarser resolution than perhaps desired.

Of  course,  it  is  not  sufficient  to  merely  guarantee  that 
any  viewpoint  can  be  displayed  at  any  time  at  some 
coarse  resolution.  The  tile  pre-fetch  process  must  at¬ 
tempt  to  have  all  of  the  appropriate  resolution  tiles  in 
memory when they are needed for display. This is done
by predicting the user’s path, and searching for tiles in
all of the views along that path from the current moment
to  some  future  moment.  (In  fact,  as  one  attempts  to  pre¬ 
dict  further  into  the  future,  the  “view”  needs  to  become 
more  encompassing  in  order  to  account  for  the  increas¬ 
ing  uncertainty  in  the  predicted  path.)  Tiles  from  any  of 
these  views  that  are  not  currently  in  memory  are  added 
to  the  list  of  tiles  to  be  requested  from  the  ISS  using  a 
coarse-to-fine  search  algorithm,  where  coarser-resolution 
tiles  are  given  a  higher  priority  than  finer-resolution  tiles. 
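
The request-planning step just described can be sketched as follows. This is a simplified illustration rather than the MAGIC application's code: tiles are identified by hypothetical (level, column, row) triples in which a larger level number means a coarser resolution, and `visible_tiles` is an assumed callback that returns the tiles visible from a predicted view.

```python
def plan_requests(predicted_views, in_memory, visible_tiles):
    """Return the tiles to request, coarsest resolution first.

    predicted_views: views along the predicted path, now to some horizon.
    in_memory: set of tile ids already buffered.
    visible_tiles: hypothetical callback, view -> iterable of tile ids.
    """
    wanted = set()
    for view in predicted_views:
        for tile in visible_tiles(view):
            if tile not in in_memory:
                wanted.add(tile)
    # Coarser tiles (larger level) sort first; ties broken by position.
    return sorted(wanted, key=lambda t: (-t[0], t[1], t[2]))


# Toy example with two predicted views.
views = {
    "now": [(2, 0, 0), (0, 5, 5)],
    "next": [(2, 0, 0), (1, 2, 2), (0, 5, 6)],
}
requests = plan_requests(["now", "next"], {(2, 0, 0)}, views.__getitem__)
```

Since the plan is rebuilt from whatever is currently in memory, a tile that never arrives simply shows up again the next time the plan is computed.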

Searching all views in the above manner, and requesting
all visible tiles that are not currently in memory in a
coarse-to-fine  fashion  has  several  consequences.  First,  in 
the  worst  case  of  a  change  to  a  completely  new  viewpoint, 
one  will  immediately  be  able  to  display  a  coarse  resolu¬ 
tion  representation  of  the  scene  followed  by  increasingly 
finer  resolution  representations.  (It  is  currently  expected 
that  this  should  occur  over  at  most  a  few  frames.)  Sec¬ 
ond,  the  coarse-to-fine  ordering  of  the  requests  should 
prove  useful  even  when  the  user  is  not  radically  chang¬ 
ing  viewpoints  because  it  will  increase  the  likelihood  of 
intermediate  resolution  tiles  being  available,  since  these 
cover a larger area than the finest resolution tiles, and
will  be  requested  with  a  higher  priority.  Third,  since 
tiles  not  currently  in  memory  are  always  requested,  lost 
or  delayed  tiles  are  automatically  re-requested  without 
the  need  for  special  protocols  between  the  ISS  and  the 
application. 

We see our work on the MAGIC project not only
as a specific application of technology developed in the
IU program, but also as providing important new tools
needed for the visualization component of our efforts in
interactive scene modeling and IU-related applications.


3.3  Additional  Topics  In  Interactive 
Scene  Analysis 

We  recently  made  a  significant  advance  in  the  long¬ 
standing  problem  of  duplicating  human  performance  in 
recovering  3-D  models  of  objects  with  straight  edges  and 
planar  faces  from  qualitative  and  imprecise  line  draw¬ 
ings  (e.g.,  building  edges  as  in  a  single  approximate 
projection  of  the  corresponding  wire-frame).  This  work 
can  greatly  simplify  communication  problems  between 
man and machine in such applications as robotic mission
planning and in construction of databases for use
in  robotic  navigation.  A  paper  describing  this  work  has 
been  published  in  the  International  Journal  of  Computer 
Vision [Leclerc & Fischler]. Our goal in this on-going
work  is  to  extend  the  basic  approach  to  both  curved- 
line  drawings  (e.g.,  of  terrain  elevations  as  in  an  approx¬ 
imate  and  uncalibrated  contour  map)  and  to  sketches 
and  actual  images  of  natural  terrain.  The  key  elements 
of  our  current  optimization-based  approach  are  a  way 
to  automatically  extract  planar  constraints  from  the  line 
drawings;  an  objective  function  that  honors  but  does  not 
insist  on  the  planarity  constraints  in  combination  with 
terms  measuring  regularity  and  symmetry  of  proposed 
solutions;  and  a  continuation  technique  (we  developed) 
to  find  the  solution.  For  the  limited  class  of  20-30  objects 
we  used  in  our  experiments,  we  have  been  able  to  con¬ 
sistently  obtain  the  desired  solution  for  the  same  object 
in  almost  all  of  its  possible  projections. 

4 Semantic Labeling, Partitioning,
and Delineation

The  natural  outdoor  environment  poses  significant  ob¬ 
stacles  to  the  design  and  successful  integration  of  the 
interpretation,  planning,  navigational,  and  control  func¬ 
tions  of  a  robotic  device  supported  by  a  general-purpose 
vision  system.  Many  of  these  functions  cannot  yet  be 
performed  at  a  level  of  competence  and  reliability  neces¬ 
sary  to  satisfy  the  needs  of  an  autonomous  robot.  The 
problem  lies  in  the  inability  of  available  techniques,  es¬ 
pecially  those  involved  in  sensory  interpretation,  to  use 
contextual  information  and  stored  knowledge  in  recog¬ 
nizing  objects  and  environmental  features,  inability  to 
provide  adequate  models  to  the  machine  as  a  basis  for 
recognizing  complex  man-made  and  natural  objects,  and 
the  lack  of  an  adequate  collection  of  low-level  feature  ex¬ 
traction  techniques  capable  of  robust  performance  over 
a  wide  range  of  scene  domains. 

Our  current  work  in  this  topic  area,  presented  below, 
exploits  three  key  ideas: 

(1)  use  of  an  objective  function  and  optimization  as  a 
descriptive  mechanism, 

(2) use of evolving context in a production system formalism
to select feature extraction techniques and parameter
settings in a knowledge-based paradigm for low-level
vision and interactive modeling, and

(3)  use  of  a  few  highly  refined  and  reliable  low-level 
techniques  as  the  base  for  a  much  wider  class  of  feature 
extraction  methods. 

In  this  section  we  discuss  the  development  of  tech¬ 
niques  for  automatically  recognizing  and  delineating 
complex  man-made  and  natural  objects,  especially  for 
applications  to  robotic  navigation  in  the  outdoor  world, 
and  to  provide  robust  automated  techniques  for  aerial 
image  analysis. 

4.1  Condor:  A  Context  Based  Approach 
to  Scene  Modeling 

Much  of  the  progress  that  has  been  made  to  date  in  ma¬ 
chine  vision  has  been  based,  almost  exclusively,  on  shape 
comparison and classification employing locally measurable
attributes of the imaged objects (e.g., color and texture).
Natural objects viewed under realistic conditions
do  not  have  uniform  shapes  that  can  be  matched  against 
stored  prototypes,  and  their  local  surface  properties  are 
too  variable  to  be  unique  determiners  of  identity.  The 
standard  machine  vision  recognition  paradigms  fail  to 
provide  a  means  for  reliably  recognizing  any  of  the  object 
classes common to the natural outdoor world (e.g., trees,
bushes, rocks, and rivers). In this effort [Strat & Fischler],
we  have  devised  a  new  paradigm  that  explicitly  invokes 
context  and  stored  knowledge  to  control  the  complex¬ 
ity  of  the  decision-making  processes  involved  in  correctly 
identifying  natural  objects  and  describing  natural  scenes. 

The  conceptual  architecture  of  the  system  we  describe, 
called  Condor  (for  context-driven  object  recognition),  is 
much  like  that  of  a  production  system;  there  are  many 
computational  processes  interacting  through  a  shared 
data  structure.  Interpretation  of  an  image  involves  the 
following  four  process  types. 

•  Candidate  generation  (hypothesis  generation) 

•  Candidate  comparison  (hypothesis  evaluation) 

•  Clique  formation  (grouping  mutually  consistent  hy¬ 
potheses) 

•  Clique  selection  (selection  of  a  “best”  description) 

Each  process  acts  like  a  daemon,  watching  over  the 
knowledge  base  and  invoking  itself  when  its  contextual 
requirements  are  satisfied.  The  input  to  the  system  is  an 
image  or  set  of  images  that  may  include  intensity,  range, 
color,  or  other  data  modalities.  The  primary  output  of 
the system is a labeled 3D model of the scene. The
labels  included  in  the  output  description  denote  object 
classes  that  the  system  has  been  tasked  to  recognize, 
plus  others  from  the  recognition  vocabulary  that  happen 
to be found useful during the recognition process. An
object class is a category of scene features such as sky,
ground, geometric-horizon, etc.

Visual  interpretation  knowledge  is  encoded  in  context 
sets,  which  serve  as  the  uniform  knowledge  representa¬ 
tion  scheme  used  throughout  the  system.  The  invoca¬ 
tion  of  all  processing  operations  in  Condor  is  governed 
by  context  through  the  use  of  various  types  of  context 
sets;  an  action  is  initiated  only  when  one  or  more  of  its 
controlling  context  sets  is  satisfied.  Thus,  the  actual  se¬ 
quence  of  computations,  and  the  labeling  decisions  that 
are  made,  are  dictated  by  contextual  information,  by 
the  computational  state  of  the  system,  and  by  the  image 
data  available  for  interpretation. 
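
The control regime described above can be illustrated with a small sketch. It is not the Condor implementation: the class name, the action names, and the context elements (such as "camera-upright") are hypothetical, and each action is allowed to fire at most once to keep the example short. It shows only the central idea that an action runs when at least one of its context sets is satisfied, so that the order of computation follows from the context rather than from the order in which the actions are listed.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, List, Set


@dataclass
class Action:
    """A processing step guarded by one or more context sets (hypothetical)."""
    name: str
    context_sets: List[FrozenSet[str]]      # any one satisfied set enables the action
    effect: Callable[[Set[str]], Set[str]]  # returns new facts for the shared state


def run_until_quiescent(actions, context):
    """Fire every enabled action once, repeating until nothing new can fire."""
    state = set(context)
    fired = []
    progress = True
    while progress:
        progress = False
        for action in actions:
            if action.name in fired:
                continue
            # A context set is satisfied when all of its elements hold.
            if any(cs <= state for cs in action.context_sets):
                state |= action.effect(state)
                fired.append(action.name)
                progress = True
    return fired, state


# Listed deliberately "out of order": comparison before generation.
actions = [
    Action("compare-sky-candidates",
           [frozenset({"sky-candidates"})],
           lambda s: {"sky-ranking"}),
    Action("generate-sky-candidates",
           [frozenset({"camera-upright", "image-is-color"})],
           lambda s: {"sky-candidates"}),
    Action("generate-sky-candidates-ir",
           [frozenset({"image-is-infrared"})],
           lambda s: {"sky-candidates"}),
]

fired, state = run_until_quiescent(actions, {"camera-upright", "image-is-color"})
```

In this toy run the generator fires before the comparator even though it is listed second, and the infrared operator never fires because its context set is never satisfied.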

The  successful  processing  of  a  significant  number  of 
outdoor  images  has  demonstrated  the  validity  and  im¬ 
portance  of  the  Condor  paradigm.  Our  continuing  work 
on  Condor  addresses  the  problems  of  (1)  how  to  ef¬ 
ficiently  construct  (or  acquire)  the  large  site-specific 
database  needed  for  successful  operation,  and  (2)  how 
to  improve  the  effectiveness  of  a  few  key  low-level  rou¬ 
tines  that  Condor  depends  on.  We  have  also  modified 
and  extended  the  Condor  paradigm  to  permit  its  use  in 
an  interactive  environment;  this  work  is  presented  in  a 
separate  paper  in  these  proceedings  [Strat]  and  briefly 
discussed  in  the  following  paragraphs. 

The  semiautomated  nature  of  RADIUS  (see  earlier 
discussion  of  the  RADIUS  program)  obviates  the  need 
for  some  of  the  machinery  employed  in  the  fully  auto¬ 
mated  version  of  Condor.  The  availability  of  a  human 
operator  permits  access  to  some  kinds  of  context  that 
were  not  available  to  Condor,  such  as  the  level  of  interac¬ 
tivity  desired  and  manual  sketches  of  individual  features. 
The existence of a human to review and edit the IU
results offers the opportunity to use a supervised learning
scheme  to  improve  the  quality  of  the  knowledge  base  and 
to  extend  its  range  of  competence. 

The  large  number  of  features  and  wide  range  of  imag¬ 
ing  conditions  that  must  be  considered  for  site-model 
construction  in  RADIUS  stress  the  context  set  represen¬ 
tation  employed  in  Condor.  While  context  sets  were  ad¬ 
equate  to  represent  Condor's  knowledge  base,  it  has 
been  necessary  to  consider  more  effective  representa¬ 
tions  that  will  extend  to  the  requirements  of  site-model 
construction. Two new constructs, context tables and
context rules, offer a more systematized organization
for  the  context  knowledge  base  that  should  facilitate 
its  construction.  These  representations  offer  additional 
economies  in  both  storage  and  computation  that  may  be 
vital to implementation of large systems. The symmetry
of context tables and rules encourages their use in either
direction: to select algorithms and set their parameters,
as well as to describe the conditions that must be
satisfied for a given algorithm to be applicable. This final
capability raises the possibility of using context rules to
choose the most appropriate images for interpretation.

4.2 Optimization-Based Methods for
Partitioning, Delineation and Recognition

It  is  commonly  accepted  that  the  problems  of  partition¬ 
ing,  delineation,  and  recognition  require  non-local  infor¬ 
mation  for  their  solution.  Optimization  is  one  of  the 
major  paradigms  for  combining  and  evaluating  global 
information,  but  to  use  this  paradigm  one  must  address 
the  issues  of  how  to  select  an  objective  function  that  ac¬ 
curately  reflects  the  intended  solution,  and  how  to  devise 
a  procedure  for  the  optimization  process  that  will  return 
an  answer  with  a  practical  amount  of  computation. 

In  previous  reports  we  described  two  lines  of  research 
using  optimization-based  methods  for  partitioning  and 
delineation: 

1)  An  optimization-based  approach,  applicable  both 
to  image  partitioning  and  to  subsequent  steps  in  the 
scene  analysis  process,  that  involves  finding  the  “best” 
description  of  the  image  in  terms  of  some  specified  de¬ 
scriptive  language.  In  the  case  of  image  partitioning 
[Leclerc88,89a,89b,89c,89d,90a,90b,91],  we  employ  a  lan¬ 
guage  that  describes  the  image  in  terms  of  regions  hav¬ 
ing  a  low-order  polynomial  intensity  variation  plus  white 
noise;  region  boundaries  are  described  by  a  differential 
chain  code.  The  best  description  is  defined  as  the  sim¬ 
plest  one  (in  the  sense  of  least  encoding  length)  that 
is  also  stable  (i.e.,  minor  perturbations  in  the  viewing 
conditions  should  not  alter  the  description).  A  continu¬ 
ation  method,  specially  designed  to  match  the  problem 
constraints,  was  used  to  solve  the  optimization  problem. 

2)  In  situations  where  the  required  image  description 
must  extend  beyond  that  of  a  delineation  of  coherent 
regions,  we  require  an  extended  vocabulary  relevant  to 
the  semantics  of  the  given  task.  Fua  and  Leclerc  deal 
with the problem of boundary/shape detection given a
rough  estimate  of  where  the  boundary  is  located  and 
a  set  of  photometric  (intensity-gradient)  and  geomet¬ 
ric  (shape-constraint)  models  for  a  given  class  of  objects 
[Fua&Leclerc88,  Fua&Leclerc90].  They  define  an  energy 
(objective)  function  that  assumes  a  minimal  value  when 
the  models  are  exactly  satisfied.  An  initial  estimate  of 
the  shape  and  location  of  the  curve  is  used  as  the  starting 
point  for  finding  a  local  minimum  of  the  energy  function 
by  embedding  this  curve  in  a  viscous  medium  and  solving 
the dynamic equations. This energy-minimization technique
has been applied to straight-line boundary models
and  to  more  complex  models  that  include  constraints  on 
smoothness,  parallelism,  and  rectilinearity.  In  an  inter¬ 
active  mode,  the  user  supplies  an  initial  estimate  of  the 
boundary  of  some  object  (which  may  be  quite  complex, 
like  the  outline  of  an  aeroplane)  and  then,  if  need  be, 
corrects  the  optimized  curve  by  applying  forces  to  the 
curve  or  by  changing  one  of  a  few  optimization/model 
parameters. 
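
The idea of relaxing an initial curve in a viscous medium can be sketched on a deliberately simplified problem. The energy below is an assumption made for illustration, not the photometric and geometric models of [Fua&Leclerc88, Fua&Leclerc90]: the curve is a list of heights y, the "image" term pulls each height toward a noisy edge position d, and the "geometric" term penalizes differences between neighbors. With viscosity gamma, the dynamic equation gamma * dy/dt = -dE/dy is integrated with a semi-implicit step (smoothness implicit, data term explicit). All function names are hypothetical.

```python
import numpy as np


def difference_matrix(n):
    """K such that y @ K @ y == sum((y[i+1] - y[i])**2)."""
    K = np.zeros((n, n))
    for i in range(n - 1):
        K[i, i] += 1.0
        K[i + 1, i + 1] += 1.0
        K[i, i + 1] -= 1.0
        K[i + 1, i] -= 1.0
    return K


def energy(y, d, lam, K):
    """Toy objective: data attraction plus a smoothness penalty."""
    return 0.5 * np.sum((y - d) ** 2) + 0.5 * lam * (y @ K @ y)


def relax_in_viscous_medium(y0, d, lam=5.0, gamma=2.0, steps=200):
    """Iterate (gamma*I + lam*K) y_new = gamma*y + (d - y) from the estimate y0."""
    n = len(y0)
    K = difference_matrix(n)
    A = gamma * np.eye(n) + lam * K
    y = np.asarray(y0, dtype=float).copy()
    for _ in range(steps):
        y = np.linalg.solve(A, gamma * y + (d - y))
    return y


rng = np.random.default_rng(0)
n, lam = 50, 5.0
d = 3.0 + 0.2 * rng.standard_normal(n)   # noisy edge positions (synthetic)
y0 = np.zeros(n)                          # rough initial estimate of the curve
y = relax_in_viscous_medium(y0, d, lam=lam)

K = difference_matrix(n)
exact = np.linalg.solve(np.eye(n) + lam * K, d)   # closed-form minimum of the toy energy
```

Because this toy energy is quadratic, the relaxed curve can be checked against the closed-form minimum; the energies used in practice are not quadratic, and the same iteration then finds only a local minimum near the starting curve.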

We believe that the above techniques represent significant
advances in the state of the art in their respective
areas of image partitioning and delineation. The implemented
systems based on these techniques have been
able to produce excellent results in complex situations
where existing (typically local) approaches often fail. In
the following subsection we describe ongoing work that
employs the above techniques under a Condor-like control
structure to deal with the problem of efficient site-model
construction.

4.2.1 Model-Based Optimization for Site Modeling

As part of our work on the DARPA-sponsored RADIUS
Program (described earlier), our research seeks to
increase the speed and accuracy with which site models
can be constructed from available imagery by developing
a new family of image understanding (IU) techniques.

Model-Based  Optimization  (MBO)  is  a  paradigm  in 
which  an  objective  function  is  used  to  express  both  geo¬ 
metric  and  photometric  constraints  on  features  of  inter¬ 
est.  A  model  of  a  feature  (such  as  a  road,  a  building, 
or  coastline)  is  extracted  from  an  image  by  adjusting 
the  model  until  a  minimum  value  of  the  objective  func¬ 
tion  is  obtained.  The  optimization  procedure  yields  a 
description  that  simultaneously  satisfies  (or  nearly  satis¬ 
fies)  all  geometric  and  image  constraints,  and,  as  a  result, 
is  likely  to  be  a  good  model  of  the  feature. 

The  applicability  of  MBO  is  currently  limited  by  the 
expressive  power  of  terms  in  the  objective  function  and 
by  the  difficulty  of  optimization.  We  are  attempting  to 
extend  the  range  of  objects  that  can  be  modeled  within 
the  MBO  paradigm,  and  to  develop  suitable  optimiza¬ 
tion  procedures  to  support  their  extraction  from  im¬ 
agery. 

With  these  extensions,  the  MBO  technology  can  be 
used  (1)  automatically,  to  extract  from  an  image  all  ob¬ 
jects  of  a  given  type,  or  (2)  interactively,  to  extract  ob¬ 
jects  that  are  of  special  interest  or  that  were  missed  by 
a  fully  automated  system. 

This research seeks not only to develop new IU
techniques for cartographic feature extraction, but also to
develop  a  language  with  which  an  image  analyst  can 
communicate  with  such  a  system.  The  foundation  for 
this  language  lies  in  the  creation  and  implementation  of 
a  large  number  of  feature  extraction  operations,  each  of 
which  is  sensitive  to  the  context  of  a  particular  task. 

To this end, we have developed a means to utilize
contextual information (see discussion of Condor) to
automatically produce IU operators that are tailored to the
specific extraction problem of interest, and hence are
more likely to succeed than generic IU algorithms. A
description of this approach appears as a separate paper
in these proceedings [Strat].

The  technology  is  currently  being  implemented  as  a 
customized system built using the RADIUS Common
Development Environment (RCDE). Its design is being
shaped  through  experiments  using  multiple  overhead  im¬ 
ages  in  a  site  model  construction  scenario.  The  benefits 
of  the  new  technology  will  be  measured  by  comparing 
its  performance  on  site-model  construction  tasks  with 
that  achievable  using  other  manual  and  semiautomated 
techniques. 

4.3  Curve  Partitioning  and  Delineation 
of  Man-Made  and  Natural  Objects 

We have identified a few key low-level routines,
partitioning and delineation algorithms in particular, that, if made
sufficiently robust, could form the basis of a wide variety
of feature extraction algorithms. In addition to our
optimization-based methods, we have made new progress
in extending some of our past work in this problem
domain.

A  critical  problem  in  machine  vision  is  how  to  break  up 
(partition) the perceived world into coherent or meaningful
parts prior to knowing the identity of these parts.
Almost all current machine vision paradigms require some
form  of  partitioning  as  an  early  simplification  step  to 
avoid  having  to  resolve  a  combinatorially  large  number 
of  alternatives  in  the  subsequent  analysis  process. 

Finding  salient  points  on  image  curves  (potential  par¬ 
tition  points)  plays  a  critical  role  in  both  two  and  three 
dimensional  object  recognition,  in  curve  approximation, 
in  tracking  moving  objects,  and  in  many  other  tasks  in 
machine  vision.  For  example,  in  cartography,  computer 
graphics,  and  in  many  scene  analysis  tasks,  it  is  often 
desirable  to  partition  an  extended  boundary  or  a  con¬ 
tour  into  a  sequence  of  simply  represented  primitives 
(e.g.,  straight  line  segments  or  polynomial  curves  of  some 
higher  degree)  to  simplify  subsequent  analysis  and  to 
minimize  storage  requirements.  “Corners”  on  the  con¬ 
tours  of  imaged  objects  are  often  used  as  features  for 
tracking  the  motion  of  these  objects  and  for  computing 
optical  flow. 

In  a  paper  (by  Fischler  and  Wolf,  appearing  elsewhere 
in  these  proceedings)  describing  our  current  work  in  this 
topic area, we present the underlying ideas and
algorithmic details of a computer program (the SSS
algorithm) that performs at a human level of competence
for  a  significant  subset  of  the  curve  partitioning  task.  It 
extends  and  “rounds  out”  the  technique  and  philosoph¬ 
ical  approach  originally  presented  in  a  1986  paper  by 
Fischler  and  Bolles.  In  particular,  it  provides  a  unified 
strategy  for  selecting  and  dealing  with  interactions  be¬ 
tween  salient  points,  even  when  these  points  are  salient 
at  “different  scales  of  resolution.”  Experimental  results 
are  described  involving  on  the  order  of  1000  real  and 
synthetically  generated  images. 

A technique [Fischler & Wolf] was developed for
detecting and delineating low resolution linear structures
appearing  in  aerial  imagery,  such  as  roads  and  rivers. 


The  algorithm  was  effective  in  finding  such  structure, 
but  it  provided  no  mechanism  for  distinguishing  between 
the  semantically  meaningful  objects  and  the  “accidental” 
and  irrelevant  linear  features  found  in  most  real  images. 
In  related  work  now  in  progress  to  automatically  detect 
and  delineate  roads  in  aerial  images,  we  use  the  SSS  al¬ 
gorithm  to  “slice  up”  the  individual  curves  found  by  our 
existing  delineation  algorithm.  We  throw  away  the  very 
small  resulting  segments  which  are  typical  of  acciden¬ 
tal  linear  formations,  and  then  further  filter  the  longer 
segments  with  respect  to  a  set  of  semantic  constraints. 
Those  segments  that  pass  through  the  filtering  process 
are  then  “glued”  back  together  to  produce  the  desired 
delineation.  The  robustness  of  the  SSS  algorithm  is  es¬ 
sential  in  carrying  out  the  filtering  operation.  Insertion 
of extraneous partition points would cause the loss of
portions of the road network; absence of valid partition
points would allow meaningless appendages to become
part of the extracted network.
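
The slice, filter, and glue steps can be sketched as follows. The SSS algorithm itself is not reproduced: the partition indices are assumed to be supplied by it, and the length threshold and the straightness test stand in for the semantic constraints, which the text does not specify. All names are hypothetical.

```python
import math


def slice_curve(curve, cut_indices):
    """Split a polyline at the given point indices; neighbors share the cut point."""
    bounds = [0] + sorted(cut_indices) + [len(curve) - 1]
    return [curve[a:b + 1] for a, b in zip(bounds, bounds[1:]) if b > a]


def arc_length(segment):
    return sum(math.dist(p, q) for p, q in zip(segment, segment[1:]))


def filter_and_glue(curve, cut_indices, min_length, is_plausible):
    """Drop short or implausible pieces, then rejoin pieces that still touch."""
    kept = [s for s in slice_curve(curve, cut_indices)
            if arc_length(s) >= min_length and is_plausible(s)]
    glued = []
    for seg in kept:
        if glued and glued[-1][-1] == seg[0]:
            glued[-1] = glued[-1] + list(seg[1:])   # continue the previous piece
        else:
            glued.append(list(seg))                 # start a new piece
    return glued


def roughly_straight(segment):
    """Stand-in semantic constraint: chord length close to arc length."""
    return math.dist(segment[0], segment[-1]) >= 0.95 * arc_length(segment)


# A straight "road" followed by a short spur that should be discarded.
curve = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0), (4.0, 0.0), (4.0, 0.5)]
cuts = [2, 4]   # assumed output of a curve partitioner
roads = filter_and_glue(curve, cuts, min_length=1.0, is_plausible=roughly_straight)
```

Here the cut at index 4 is what separates the spur from the road; without it the spur would survive as part of a longer segment, which is the failure mode described above for missing partition points.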

Abstract

At CMU, research in Image Understanding
spans the range of topics from the basic science
of imaging to applications in autonomous
intelligent systems. The focal areas
are:

•  Physics-Based  Vision 

•  3D  Shape  and  Motion  Recovery 

•  Computational  Range  Sensor 

•  Parallel  Vision 

•  Vision  for  Object  Recognition 

•  Vision  for  Robot  Vehicles 

•  Vision  for  Human-Computer  Interaction 

1.  Physics-Based  Vision 

While  many  vision  systems  have  been  demon¬ 
strated  in  principle,  few  have  been  highly  reliable 
when  deployed.  This  is  largely  because  of  the  reli¬ 
ance  on  feature  detectors  such  as  edge-finding, 
which are based on over-simple approximations to
the principles of imaging physics. The detailed
study  of  the  imaging  process  can  lead  to  vision  sys¬ 
tems  with  both  greater  power  and  more  reliability. 
At  CMU,  we  have  pioneered  in  the  exploration  of 
physics-based  vision,  and  our  work  in  this  area  has 
led  to  major  advances  in  deployed  vision  systems. 
We  continue  to  study  these  fundamental  issues, 
particularly in the measurement of object color,
shape,  and  surface  roughness. 

The basic principle of physics-based vision is that
most (non-metallic) objects display two colors: the
“object color” of the material, and highlights
caused by reflection from the outer surface. The
object color is important for identifying objects,
whereas highlights reveal the smoothness or
roughness of the surface. Both can be used to identify
the object shape and material. Metals display
only highlights (i.e., surface reflection).

1.1.  Color  and  Highlights 

In 1984, CMU vision researchers showed that the
colors of pixels in an image of a non-metal
(“inhomogeneous dielectric”) have an important linear
relationship to the two types of reflection. We
made the prediction that the colors should lie on a
plane in the RGB color space, and could be analyzed
by linear algebra to actually measure the
amounts of each reflection component at each
pixel. In 1987, we accomplished this analysis, and
found that the pixel colors did not fill the plane in
RGB space, but formed a “skewed-T” shape.
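
The linear-algebra step can be illustrated under a strong simplifying assumption: that the body (object) color and the surface (highlight) color are already known as RGB vectors. Each pixel color is then a weighted sum of the two, and the weights follow from a least-squares solve. The colors and weights below are synthetic, and the function name is hypothetical; estimating the two color vectors from the histogram itself is not shown.

```python
import numpy as np


def reflection_components(pixels, body_color, surface_color):
    """Solve pixel = m_body * body_color + m_surface * surface_color per pixel.

    pixels: (N, 3) array of RGB values.
    Returns the (N, 2) weights and each pixel's distance from the color plane.
    """
    basis = np.column_stack([body_color, surface_color])        # 3 x 2
    weights, *_ = np.linalg.lstsq(basis, pixels.T, rcond=None)  # 2 x N
    off_plane = np.linalg.norm(pixels.T - basis @ weights, axis=0)
    return weights.T, off_plane


body = np.array([0.8, 0.2, 0.1])       # assumed object color (reddish)
surface = np.array([1.0, 1.0, 1.0])    # assumed highlight color (white light)
true_weights = np.array([[0.5, 0.0],   # matte pixel
                         [0.3, 0.4],   # pixel inside a highlight
                         [0.9, 0.1]])
pixels = true_weights @ np.vstack([body, surface])

weights, off_plane = reflection_components(pixels, body, surface)
```

A pixel whose color does not lie on the plane spanned by the two vectors shows up as a nonzero entry in `off_plane`, which gives a simple check on the two-component assumption.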

We (Novak and Shafer) have now added to this
theory by showing that the exact dimensions of the
skewed-T shape are quantitatively determined by
the equations of body and surface reflection (such
as Torrance-Sparrow). In our new method, we form
the color histogram of the pixel values, and measure
the dimensions including the ratio of the baseline
to the stem height of the T, the angle between
the parts, and the position of their intersection.
From these measures, by inverting the equations,
we can determine the surface roughness and
illumination geometry. Since the inversion is
mathematically intractable, we use a table lookup. With
this method, we have analyzed images of several
real plastic objects [Novak and Shafer 92]. This
method of analysis is valuable because it is based
entirely on the color histogram, and is thus
independent of object shape. It also contributes to our
basic understanding of the relationship between
color, shape, and surface roughness.

Unfortunately, such analysis assumes that the
object is reasonably smooth and that the directions
of surface normals in the image are widely distributed
in all directions. Therefore, it cannot handle
cases where only a few planar surface patches exist
in an image. To overcome these limitations, we
(Sato and Ikeuchi) have developed a novel method
for measuring surface and object properties by analyzing
a sequence of color images taken with a
moving light source [Sato and Ikeuchi 92].

We project the data into a four-dimensional space,
which we call the temporal-color space, whose
axes are the three color axes (RGB) and one temporal
axis. The term "temporal-color space"
implies an augmentation of the RGB color space
with an additional dimension that varies with time.
Because the light source is moving, this dimension
represents the geometric relationship between the
viewing direction, illumination direction, and surface
normal. Thus, many geometries are sampled
over time.

Figure 1: Solder joint measured by photometric stereo: (a) Picture (b) Elevation map

The  significance  of  the  temporal-color  space  lies  in 
its  ability  to  represent  the  change  of  image  color 
with  time,  whereas  a  conventional  color  space 
analysis  yields  the  histogram  of  the  colors  in  an 
image,  only  at  an  instant  of  time.  Conceptually,  the 
two  reflection  components,  surface  reflection  and 
body  reflection,  form  two  subspaces  in  the  tempo¬ 
ral-color  space.  These  two  components  can  be 
extracted  by  principal  component  analysis  using 
the singular value decomposition technique.
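
The subspace claim can be checked numerically on synthetic data. The sketch below is an illustration under assumed conditions, not the method of [Sato and Ikeuchi 92]: one pixel is observed over T frames, its body term varies as a cosine of the light angle, and its surface term is a narrow lobe (both shapes are assumptions). The T x 3 matrix of observed colors then has two significant singular values, and the two leading right singular vectors span the plane containing the body and surface colors. Separating the two components within that plane needs further constraints, which are not shown.

```python
import numpy as np


def reflection_subspace(observations, rel_tol=1e-8):
    """Return the numerical rank, an orthonormal basis of the color subspace,
    and the singular values of a (T, 3) matrix of colors observed over time."""
    _, s, vt = np.linalg.svd(observations, full_matrices=False)
    rank = int(np.sum(s > rel_tol * s[0]))
    return rank, vt[:rank], s


theta = np.linspace(-1.2, 1.2, 30)            # light-source angle per frame (radians)
body = np.array([0.8, 0.2, 0.1])              # assumed body color
surface = np.array([1.0, 1.0, 1.0])           # assumed surface (illuminant) color
body_weight = np.cos(theta)                   # slowly varying body reflection
surface_weight = np.exp(-(theta / 0.1) ** 2)  # narrow highlight lobe

observations = np.outer(body_weight, body) + np.outer(surface_weight, surface)
rank, basis, singular_values = reflection_subspace(observations)
```

Projecting the assumed body and surface colors onto `basis` reproduces them, confirming that both lie in the recovered two-dimensional subspace.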

This  technique  has  several  advantages  over  other 
methods  for  color  image  analysis:  it  does  not 
require  any  prior  knowledge  about  surface  reflec¬ 
tance  and  shape;  it  can  recover  the  surface  orienta¬ 
tion  and  reflectance  at  each  pixel  individually;  and 
it  does  not  depend  on  the  assumption  of  a  global 
distribution  of  surface  normals  in  the  image.  This 
method  has  been  successfully  applied  to  images  of 
real  colored  objects,  resulting  in  the  measurement 
of  the  specular  reflection  component  and  the  body 
reflection  component.  These  components  are  sub¬ 
sequently  used  to  recover  surface  orientation  and 
reflectance  at  each  point,  providing  an  easy-to- 
implement  method  for  visually  localizing  and 
inspecting  an  object. 


1.2.  Measuring  Shape  and  Roughness  of 
Metal  Surfaces 

Metal  surfaces  are  of  special  interest  in  physics- 
based  vision  because  they  are  important  for  many 
manufacturing  and  inspection  tasks,  because  they 
exhibit  shininess  or  roughness  clearly,  and  because 
they  are  not  directly  amenable  to  analysis  by  color. 
However,  they  yield  to  methods  of  controlled  illu¬ 
mination,  provided  that  the  illumination  and  the 
reflection  model  are  precisely  enough  known. 
Unfortunately, the physics community has primarily
chosen to model surfaces that are very smooth,
which are important for specialized applications
but not for general visual inspection. Therefore, the
image understanding community has been forced
to  develop  its  own  models  that  are  more  directly 
useful  for  machine  vision. 

At  CMU,  we  proposed  a  model  in  1991  to  unify 
the  well-known  models  of  Torrance-Sparrow, 
Beckmann-Spizzichino, and Lambert. Later, we
built a 3D photosampling device and successfully
applied it to measure smooth surfaces such as silicon
wafers and transparent plastic lenses. However,
since the algorithm used for analysis ignored
the  specular  diffuse  lobe  component,  it  could  not 
be  applied  to  rough  surfaces  such  as  solder  joints 
or  sand-blast  finished  surfaces,  which  are  com¬ 
monly  found  in  many  industrial  parts. 

We (Kiuchi and Ikeuchi) have developed a novel
algorithm to recover the surface shape and the
roughness of such surfaces (Figure 1) [Kiuchi and
Ikeuchi 92]. Since the reflectance model depends
on surface roughness as well as surface orientation,
we are able to recover both shape and roughness
with our method. We take a set of image brightness
values measured at each surface point by using our
3D photosampler device. Each brightness value
provides  one  non-linear  image  irradiance  equation, 
which  contains  unknown  parameters  for  surface 
orientation,  reflectance,  and  surface  roughness. 
The  algorithm  iteratively  solves  the  set  of  image 
irradiance  equations  at  each  pixel  with  respect  to 
these  parameters.  Experiments  conducted  on  sev¬ 
eral  rough  surfaces  show  a  high  accuracy  in  esti¬ 
mated  surface  orientations,  and  a  good  estimation 
of  surface  roughness.  We  have  been  able  to  deter¬ 
mine  the  shape  of  a  brass  cylindrical  object,  the 
shape  of  a  solder  joint,  and  the  roughness  of  two 
nickel  microfinish  comparators,  and  we  have  been 
able  to  detect  defects  in  a  thick  film  of  gold  depos¬ 
ited  on  a  LSI  chip  package. 

2. 3D  Shape  and  Motion  Recovery 

Geometric  methods  for  recovering  surface  shape 
and camera motion are crucial for tasks such as site
modeling  -  the  generation  of  a  detailed  three- 
dimensional  model  of  a  surveyed  site  -  and  for 
robot  vehicle  navigation  and  control.  These  tech¬ 
nologies,  in  turn,  are  important  for  a  wide  range  of 
military  and  civilian  applications  including  cartog¬ 
raphy,  reconnaissance,  and  damage  assessment. 

These applications require the development of efficient
and reliable Image Understanding methods to
determine precise three-dimensional shape information
from a stationary or moving platform such
as a ground or air vehicle, or a stereoscopic camera.
We have developed four new approaches to
this problem: (1) the image spectrogram for texture
and stereo analysis; (2) the multi-baseline stereo
method for depth mapping from multiple images;
(3) the factorization method for motion analysis
under orthography and perspective; and (4) reliably
obtaining shape from lens focus and defocus.

2.1.  The  Image  Spectrogram 

Image texture is confusing to nearly all vision
methods  for  3D  shape  recovery,  particularly  out¬ 
doors.  Yet,  texture  can  actually  be  a  rich  source  of 
information  for  measuring  surface  and  terrain 
shape,  classifying  vegetation,  and  defeating  cam¬ 
ouflage.  We  (Krumm,  Maimone,  and  Shafer)  have 
developed  a  new  approach  for  image  texture  analy¬ 
sis  using  the  image  spectrogram,  which  comprises 
the “local Fourier transforms” measured in the
neighborhood  around  each  pixel.  Patterns  in  the 
texture  of  the  image  are  revealed  in  the  structure  of 
each such Fourier transform, and 3D shape is
revealed as a systematic variation from one to the
next.
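
A minimal sketch of an image spectrogram follows, assuming a fixed square window, a Hann taper, and a regular sampling step; these choices and the function names are illustrative and are not taken from [Krumm and Shafer 92]. The synthetic image carries a horizontal sinusoid whose frequency doubles from the left half to the right half, and the local spectra pick up the change.

```python
import numpy as np


def image_spectrogram(image, window=16, step=16):
    """Magnitude of the local 2D Fourier transform around a grid of pixels.

    Returns an array of shape (rows, cols, window, window); each local
    spectrum is centered so that zero frequency sits at [window//2, window//2].
    """
    taper = np.outer(np.hanning(window), np.hanning(window))
    row_starts = range(0, image.shape[0] - window + 1, step)
    col_starts = range(0, image.shape[1] - window + 1, step)
    spec = np.empty((len(row_starts), len(col_starts), window, window))
    for i, r in enumerate(row_starts):
        for j, c in enumerate(col_starts):
            patch = image[r:r + window, c:c + window]
            patch = (patch - patch.mean()) * taper   # remove DC, reduce leakage
            spec[i, j] = np.abs(np.fft.fftshift(np.fft.fft2(patch)))
    return spec


def dominant_horizontal_frequency(local_spectrum):
    """Horizontal frequency (cycles per window) of the strongest component."""
    _, col = np.unravel_index(np.argmax(local_spectrum), local_spectrum.shape)
    return abs(int(col) - local_spectrum.shape[1] // 2)


x = np.arange(64)
cycles = np.where(x < 32, 2, 4)                    # 2 cycles/window left, 4 right
row = np.cos(2 * np.pi * cycles * x / 16)
image = np.tile(row, (32, 1))                      # 32 x 64 synthetic texture

spec = image_spectrogram(image)
left = dominant_horizontal_frequency(spec[0, 0])
right = dominant_horizontal_frequency(spec[0, 3])
```

On a slanted textured surface the local frequency would drift gradually from window to window rather than jump, and that systematic drift is what carries the shape information.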

There  are  two  classical  problems  in  image  texture 
analysis:  segmenting  flat  (2D)  texture  regions  in 
the  image,  and  measuring  3D  shape  from  texture 
gradients on slanted or curved surfaces. We developed
methods for each of these problems that
showed the spectrogram can solve each problem as
well as other common approaches [Krumm and
Shafer 92]. However, the unique power of the
spectrogram lies in its ability to integrate these
problems in the same representational framework.
Using  the  spectrogram  we  can  solve  the  combined 
problem  of  texture  segmentation  and  shape  analy¬ 
sis.  The  local  Fourier  transforms  of  pixels  on  the 
same  surface  are  similar,  within  a  linear  (affine) 
transform  of  each  other,  but,  across  surfaces,  the 
Fourier transforms differ significantly. Our analysis
proceeds by creating several hypotheses about
surface  orientation  based  on  small  regions  in  the 
image.  Each  hypothesis  consists  of  an  estimate  of 
the  local  surface  normal  and  an  estimate  of  what 
the  local  frequency  distribution  of  the  texture 
would  look  like  if  viewed  frontally.  This  "frontal- 
ized"  frequency  distribution  is  computed  by  undo¬ 
ing  the  effects  of  the  estimated  surface  normal  on 
the  local  frequency  distribution.  Thus,  two  adjoin¬ 
ing  regions  of  similar  texture  and  different  surface 
normal  will  have  the  same  hypothesized  local  fre¬ 
quency  distribution.  We  then  merge  similar 
hypotheses  to  form  regions.  This  method  is  now 
being  tested  in  the  Calibrated  Imaging  Laboratory. 
In  addition,  we  (Maimone  and  Shafer)  are  now 
applying  the  spectrogram  to  stereo  vision  in  tex¬ 
tured  environments,  where  it  appears  promising  for 
addressing  long-unsolved  problems  in  avoiding 
false matches and occlusions. The spectrogram is
proving to be a powerful bridge between the
Fourier-based theories of signal processing and the
geometry-based approaches of machine vision.

2.2.  Multi-Baseline  Stereo 

In stereo matching, a longer baseline gives a more
precise depth estimate, because the depth is calculated
by  triangulation.  A  longer  baseline,  however,  poses 
problems  in  matching:  a  longer  disparity  range 
must  be  searched,  some  parts  in  the  scene  may  be 
occluded,  and  the  appearance  of  some  objects  in 
the  scene  may  change  significantly  between 
images.  Matching  becomes  more  difficult,  and 
there  is  a  greater  possibility  of  false  matches.  Con¬ 
versely,  a  shorter  baseline  makes  matching  easier, 
but  reduces  the  precision  of  the  estimate.  There  is  a 
trade-off between precision and correctness.

We (Okutomi and Kanade) developed a multiple-baseline
stereo technique to solve this problem.
This method uses multiple stereo pairs with different
baselines generated by a lateral displacement of
a camera. Matching is performed simply by computing
the sum of squared-difference (SSD) values
between multiple stereo pairs. The SSD functions
for individual stereo pairs are represented with
respect to the inverse distance, and they are simply
added to produce the sum of the SSDs. This resulting
function is called the SSSD-in-inverse-distance.
The range estimate is calculated by finding
the minimum of the SSSD-in-inverse-distance
curve. This curve shows a unique and clear minimum
at the correct matching position even when
the underlying intensity patterns of the scene
include ambiguities or repetitive patterns.
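
The effect of summing SSD curves over inverse distance can be reproduced on a synthetic scanline. The sketch below makes several assumptions that are not in the text: a one-dimensional, strictly periodic scene at a single depth, integer-pixel disparities, and a disparity equal to the product of baseline and focal length (the hypothetical value `bf`) times the inverse distance. Each stereo pair alone has several exact matches because of the repetition, but the false matches fall at different inverse distances for different baselines, so only the true one survives the sum.

```python
import numpy as np


def ssd_curve(ref, img, x0, half, bf, inv_distances):
    """SSD between a window of `ref` and the displaced window of `img`,
    for each candidate inverse distance (disparity = bf * inverse distance)."""
    window = np.arange(x0 - half, x0 + half + 1)
    values = []
    for zeta in inv_distances:
        disparity = int(round(bf * zeta))          # integer pixels for simplicity
        values.append(np.sum((ref[window] - img[window + disparity]) ** 2))
    return np.array(values)


def sssd_in_inverse_distance(ref, images, bfs, x0, half, inv_distances):
    """Sum the per-pair SSD curves over a common inverse-distance axis."""
    curves = [ssd_curve(ref, img, x0, half, bf, inv_distances)
              for img, bf in zip(images, bfs)]
    return np.sum(curves, axis=0), curves


rng = np.random.default_rng(0)
scene = np.tile(rng.random(12), 30)                # repetitive pattern, period 12


def render(disparity, width=120, offset=100):
    """Scanline in which every scene feature is shifted right by `disparity`."""
    return scene[offset + np.arange(width) - disparity]


inv_distances = 0.2 * np.arange(12)                # candidate inverse distances
true_index = 5
bfs = [15.0, 20.0, 30.0]                           # three baselines (times focal length)
ref = render(0)
images = [render(int(round(bf * inv_distances[true_index]))) for bf in bfs]

sssd, curves = sssd_in_inverse_distance(ref, images, bfs, x0=20, half=5,
                                        inv_distances=inv_distances)
best = int(np.argmin(sssd))
```

Plotting each entry of `curves` against `inv_distances` shows several zeros per pair, while `sssd` has a single zero at the true inverse distance.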

Our method has been implemented in software and
tested with images from both indoor and outdoor
scenes under a wide variety of conditions [Okutomi
and Kanade 92]: indoors, in a calibrated laboratory,
at a distance of 0.5-1.0 m; and outdoors, at a
distance of approximately 15-35 m. We have also
tested the method on a large scale outdoor scene
shown in Figure 2(a) at the Westinghouse Research
Center in Pittsburgh where Ambler, the CMU
Planetary Rover, was tested. The scene contains a
grassy field with a line of trees at a distance of 60
m. Six images with horizontal displacements and
six additional images with vertical displacement
were used. The widest horizontal and vertical baseline
in this set was 9 cm. Figure 2(b) shows the disparity
image. The noisy region is due to lack of
features in the area of sky in the original image.
This noise is successfully detected by the uncertainty
estimate. The plots in Figure 2(c) are the 3D
terrain profiles shown as height vs. horizontal distance
along the vertical columns drawn in the figure
above. We can observe that the terrain features
of the scene are correctly recovered: a flat and
somewhat descending region at the front, a slope in
the middle, and then a more gentle slope at the rear
before the tree line. The measured distances to a
few points in the scene have been verified to be
correct to within 1%.

Figure 2: Test site for Multi-Baseline Stereo: (a) Image (b) Depth map (c) Elevation profiles

The multi-baseline stereo system has been put to
practical use on a CMU robot named Dante, a
large, 8-legged walking machine sponsored by
NASA and NSF for the exploration of a live volcano
in Antarctica. The system uses three cameras
arranged along a 1-meter horizontal baseline. The
system software was highly optimized in favor of
speed in exchange for somewhat reduced precision.
Output is a dense depth map of 256x256 pixels
with 40 disparity levels in 7 seconds, which
enables the robot to move slowly (2-3 MPH)
through a field of obstacles. The system has also
been used to guide the CMU robotic truck,
NAVLAB II.

2.3.  Factorization  for  Motion  Analysis 

Recovery of 3D shape and camera motion information
is important for robot vehicle control and terrain
mapping. Unfortunately, the problem can be
mathematically unstable. We developed a novel
factorization method for analyzing image
sequences that avoided the instability by using
singular value decomposition of the observation data.
The method was limited to telephoto lenses due to
its use of the orthographic projection model.

We (Poelman and Kanade) have recently developed
a new factorization method that uses a
paraperspective projection model [Poelman and Kanade
92]. Paraperspective correctly models several
aspects of real camera image projection which
orthography fails to account for, including the
change in the image size of an object as it moves
towards or away from the camera, and the changing
angle from which an object is viewed as the
object translates across the field of view. These
properties allow the method to be used in a much
wider range of situations, and allow the recovery
of the distance from the camera to the object. Yet
the paraperspective projection model, like
orthographic projection, can be described by linear
equations. This allows us to recover the shape and
motion in an efficient and robust manner similar to
the original factorization method. The method has
been tested on both synthetic and real data, and
will enable reliable terrain mapping from a moving
vehicle with a single camera.
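
The difference between the projection models can be made concrete with a
small sketch. It uses the usual geometric definition of paraperspective in
camera coordinates: a point is first slid along the direction of the ray to
the object centroid until it reaches the centroid's depth, and is then scaled
by focal length over that depth. This is an illustration under those
assumptions, not the authors' code; the function names and the default focal
length of 1.0 are hypothetical.

```python
def perspective(p, focal=1.0):
    """Full perspective projection of a camera-frame point (x, y, z)."""
    x, y, z = p
    return (focal * x / z, focal * y / z)


def orthographic(p):
    """Orthographic projection: depth is simply dropped."""
    x, y, _ = p
    return (x, y)


def paraperspective(p, centroid, focal=1.0):
    """Paraperspective projection about an object centroid.

    Step 1: move the point parallel to the camera-to-centroid ray until its
            depth equals the centroid depth cz.
    Step 2: scale by focal / cz (perspective projection of that plane).
    Both steps are linear in (x, y, z), so the model stays linear."""
    x, y, z = p
    cx, cy, cz = centroid
    t = (z - cz) / cz          # relative depth offset from the centroid
    xp = x - t * cx
    yp = y - t * cy
    return (focal * xp / cz, focal * yp / cz)
```

For an object of fixed width, the paraperspective image width halves when the
centroid depth doubles, while the orthographic width does not change. For
points near the centroid the paraperspective image stays close to the full
perspective image.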


2.4.  Camera  and  Lens  Modeling 

Although the basic optics of lenses and cameras is
known in theory, most models do not account for
the aberrations encountered in practice. At CMU,
in the Calibrated Imaging Laboratory, we (Willson,
Xiong, and Shafer) are developing new methods for
modeling, calibration, and control of automated
zoom lenses. Our recent work has led to advances
in the calibration of image center and in developing
3D shape recovery from lens focus.

Nearly all methods for machine vision assume that
the image center is the point of intersection of the
camera's optical axis with the camera's sensing
plane. In fact, there are many possible definitions
of image center, and in real lenses most do not have
the same coordinates. We have identified 16
different ways to define "image center", and
developed a taxonomy of image centers
based on the number of different camera settings
used and on the type of measurements that are
made during calibration [Willson and Shafer 93].
By using the proper image center for each image
property that we are trying to model and by
calibrating the image centers over the appropriate
ranges of lens parameters, we have improved the
precision of a standard (Tsai-Lenz) calibration
from 0.23±0.10 pixels RMS error to 0.06±0.04
pixels.

Based on our new lens models, we have developed
new methods that dramatically improve 3D shape
recovery from lens focus and defocus [Xiong and
Shafer 93]. In the range-from-focus task, we obtain
an accuracy of 1 part in 1000 at distances of 1.2 m,
which is a fivefold improvement over previously
published results for this task. The improvement
comes from smoothing the criterion function and
fitting a polynomial curve to it in the vicinity of the
peak value, where noise becomes the limiting factor
to precision. More significantly, for
range-from-defocus, we made two improvements - we use an
iterative method to overcome the effects of window
size on the calculation, and we developed an
improved blur model in terms of motor control
variables rather than abstract optical idealizations,
allowing more accurate calibration. Taking just
two images with differing focus, we have obtained
dense depth maps with a precision better than 1
part in 200, compared to results of 1 part in 75
reported in the literature. With this precision,
depth-from-defocus is becoming a viable complement
to more established techniques for 3D shape
recovery such as stereo vision.
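
The peak-refinement step for range-from-focus can be illustrated as follows.
This is a minimal sketch, not the published method. It assumes uniformly
spaced focus-motor positions, uses a moving average as the smoothing step,
and uses a quadratic through three samples as the simplest polynomial fit.
The function names and the sample values are hypothetical.

```python
def smooth(values, radius=1):
    """Moving-average smoothing; windows are truncated at both ends."""
    out = []
    for i in range(len(values)):
        lo = max(0, i - radius)
        hi = min(len(values), i + radius + 1)
        out.append(sum(values[lo:hi]) / (hi - lo))
    return out


def refine_peak(positions, values):
    """Fit a parabola through the largest sample and its two neighbours
    (uniform spacing assumed) and return the position of its vertex."""
    i = max(range(len(values)), key=lambda k: values[k])
    if i == 0 or i == len(values) - 1:
        return positions[i]            # no neighbours on one side
    left, mid, right = values[i - 1], values[i], values[i + 1]
    curvature = left - 2.0 * mid + right
    if curvature == 0.0:
        return positions[i]            # flat top: keep the sample position
    step = positions[i + 1] - positions[i]
    return positions[i] + 0.5 * step * (left - right) / curvature


def focus_peak(positions, criterion, radius=1):
    """Smooth the focus criterion, then refine its peak to sub-step precision."""
    return refine_peak(positions, smooth(criterion, radius))
```

The refined peak position falls between motor steps, which is what allows a
precision finer than the sampling interval. Converting that position to a
distance requires a calibrated lens model, which is outside this sketch.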




3.  Computational  VLSI  Range  Sensor 

While vision software is useful for long-range 3D
shape recovery, at short distances, hardware can be
used to accomplish the purpose more rapidly and
reliably. This requires computational sensors that
place computing power on the sensing chip itself to
perform the range computation without the bottleneck
of data transmission to a CPU. The resulting
sensors will be useful for robotic inspection,
manipulation, and control in real time.

At CMU, we (Gruss, Tada, and Kanade) have
developed such a VLSI range-image sensor based
on the light-stripe triangulation technique, which
has proven to be robust as well as amenable to
hardware implementation. Our sensor (Figure 3)
produces 1000 frames of range data per second,
which is two orders of magnitude faster than
conventional light-stripe sensors [Gruss et al. 1992].


Figure 3: This chip measures a 32x32 image of
range data, 1000 times per second

The chip consists of an array of photosensitive
cells which independently determine when they
see light from the stripe reflected back by objects
in the scene. Working in parallel, the 32x32 array
of cells acquires a 1,024 pixel range image in one
millisecond. The accuracy and repeatability of
each pixel have been measured to be within 0.5 mm
at 500 mm distances (0.1%).
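
Because each cell only has to record when the stripe passes, range follows
from simple triangulation. The sketch below is a planar illustration under
stated assumptions, not the chip's actual circuitry or calibration: the
camera is a pinhole at the origin looking along +z, the stripe projector
sits at distance `baseline` along +x, and the stripe angle (measured from
the baseline) grows linearly with time. All names and numbers are
hypothetical.

```python
import math


def stripe_angle(t, theta_start, sweep_rate):
    """Angle of the light plane at time t for a constant-rate sweep."""
    return theta_start + sweep_rate * t


def range_from_timestamp(u, t, focal, baseline, theta_start, sweep_rate):
    """Depth z seen by the cell at image coordinate u that detected the
    stripe at time t.

    Camera ray:   x = (u / focal) * z
    Light plane:  x = baseline - z / tan(theta)
    Intersection: z = baseline * focal / (u + focal / tan(theta))"""
    theta = stripe_angle(t, theta_start, sweep_rate)
    return baseline * focal / (u + focal / math.tan(theta))


def timestamp_for_point(x, z, baseline, theta_start, sweep_rate):
    """Forward simulation: time at which the stripe hits scene point (x, z)."""
    theta = math.atan2(z, baseline - x)
    return (theta - theta_start) / sweep_rate
```

Every cell can evaluate its own timestamp independently, which is why the
whole array can produce a range image in a single sweep of the stripe.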

This sensor, the second version of our design, has
cells that are 40% smaller than our first design,
giving an increase from 28x32 to 32x32 cells. In
addition, a true peak detector has replaced the
thresholding circuit previously used to identify the
light stripe. The new peak detection scheme has
two important advantages. First, the new design is
more sensitive to the stripe, which allows the sensor
to operate in the presence of bright indoor
ambient lighting, and the increased sensitivity
permits its use on a wider variety of objects. Second,
the range sensor chip now provides
reflectance information as a by-product of the new
peak detection process. The reflectance image is
read from the chip along with acquired range data,
and pixels of the reflectance image are perfectly
aligned with corresponding pixels in the range
image. The reflectance data is useful for object
recognition, and we also use it for efficient calibration
of the VLSI sensor itself.

One  of  the  distinguishing  features  of  this  research 
is  the  very  practical  nature  of  the  problem  that  has 
been  solved  -  our  sensor  technology  provides  the 
high  frame  rates  required  by  the  most  demanding 
autonomous  robotic  systems.  The  advantages  of 
VLSI  computational  sensing  have  been  advocated 
by  many,  but  few  practical  sensors  of  this  type 
have  been  developed.  We  now  plan  to  deploy  these 
systems  for  research  in  robotic  applications  within 
the  DARPA  and  NSF  communities. 

4.  Parallel  Vision 

In addition to sensor hardware, researchers at
CMU are using parallel computing to speed up
vision software for practical application. One focus
of our parallel vision efforts (Webb) has been the
development of languages for easy but efficient
programming of image processing operations. This
led to Adapt, which covers both local and global
operations. Recently, Adapt was released commercially
by Intel Corporation for the iWarp computer.
Plans are in place to support Adapt on future parallel
computers from Intel, as part of the effort to
support current iWarp users. Based on our experience
with Adapt, we have developed a new parallel
FORTRAN with a "Do&Merge" loop construct
which allows the programmer to describe a parallel
program at two levels (similar to Adapt), one
describing an operation to be performed in parallel
across an iteration range, and another describing
how to combine the independently computed
results. This Do&Merge loop is being incorporated
into the CMU FORTRAN compiler for easy generation
of efficient image processing programs.
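
The two-level structure of a Do&Merge loop can be sketched as follows. This
is an illustration of the idea in Python rather than the parallel FORTRAN
construct itself, and it runs the "workers" one after another instead of in
parallel. The names do_and_merge, row_histogram, add_histograms, and
image_histogram are hypothetical, and the merge operation is assumed to be
associative and commutative.

```python
def do_and_merge(indices, do, merge, num_workers=4):
    """Apply `do` to every index and combine the results with `merge`.

    The iteration range is dealt out to `num_workers` simulated workers.
    Each worker folds its own results, and the per-worker partial results
    are then merged. Because the grouping differs from a sequential loop,
    `merge` must be associative and commutative."""
    indices = list(indices)
    partials = []
    for w in range(num_workers):
        chunk = indices[w::num_workers]
        if not chunk:
            continue
        result = do(chunk[0])
        for i in chunk[1:]:
            result = merge(result, do(i))
        partials.append(result)
    if not partials:
        raise ValueError("empty iteration range")
    total = partials[0]
    for partial in partials[1:]:
        total = merge(total, partial)
    return total


def row_histogram(image, row, levels):
    """The 'Do' part: histogram of a single image row."""
    hist = [0] * levels
    for value in image[row]:
        hist[value] += 1
    return hist


def add_histograms(a, b):
    """The 'Merge' part: combine two independently computed histograms."""
    return [x + y for x, y in zip(a, b)]


def image_histogram(image, levels, num_workers=4):
    return do_and_merge(range(len(image)),
                        lambda r: row_histogram(image, r, levels),
                        add_histograms, num_workers)
```

A global operation such as a histogram fits this pattern naturally: the
per-row work is independent, and only the merge step needs to know how
partial results combine.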

The Adapt language has been used to develop an
efficient implementation of Kanade-Okutomi
multi-baseline stereo vision [Webb 93]. The iWarp
implementation of Adapt allows processing of
three 240x256 images to recover sixteen disparity
levels. Plans are underway to demonstrate real-time
performance by using an Ironics framegrabber
for the camera interface; we expect to process
10 stereo pairs per second. This passive 3D vision
system will be tremendously useful in a wide variety
of robotic applications, including SSV, UGV,
Air Vehicle, and industrial applications.

5.  Vision  for  Object  Recognition  and 
Manipulation 

Robotic manipulation of objects is one of the most
important applications of autonomous systems. At
CMU, we have developed a new representation for
recognizing curved objects, expanded our ability to
automatically generate vision programs with the
Vision Algorithm Compiler (VAC), and developed
a method for learning assembly tasks by observing
a human do the task.

5.1.  Surface  Modeling  for  Recognition 

Recognizing curved objects is important for
manipulating and inspecting equipment, vehicle
parts, and manufactured items. CMU researchers
(Delingette, Hebert, and Ikeuchi) have developed a
novel representation of curved objects called the
Simplex Angle Image that allows matching even
when one of the objects is partly blocked from
view [Delingette et al. 93].

To compute the SAI, we begin with dense 3D
range data. We pose an ellipsoidal mesh of points
that surrounds the object, and by a method based
on deformable surfaces, we shrink the mesh to fit
the surface data closely and in a way that is unique,
regardless of the object's angle of orientation. At
each node on this mesh, we compute the "simplex
angle", which expresses the 3D curvature of the
surface between this node and its neighbors on the
mesh. Each simplex angle is paired with the surface
normal at that point, and mapped onto the
point of the unit sphere corresponding to the normal.
The result, called the Simplex Angle Image
(SAI), is a spherical representation of the object.
The key feature of this representation is that two
instances of the same object in two different poses
have the same SAI up to a rotation of the unit
sphere. Using this approach, we have demonstrated
the recognition of 3-D curved objects in range
images of complex scenes with multiple objects,
and we have also used the SAI to piece together
multiple views of a complex 3D object such as a
hand.

5.2.  Vision  Algorithm  Compiler 

The Vision Algorithm Compiler (VAC) is a method
developed at CMU to replace the expensive
custom-building of vision software with an automated
vision programming system. Provided with symbolic
descriptions of the objects, sensor, and task,
the VAC generates a vision program that can be
rapidly executed on-line to perform the task. This
approach will greatly reduce the cost of constructing
vision systems for robotic manipulation and
assembly. To prepare the VAC for deployment, we
have made recent advances in two problem areas:
planning multiple observations to resolve ambiguities,
and developing an efficient algorithm for
object recognition for large object databases.

We (Ikeuchi, Wheeler and Gremban) have developed
techniques to utilize sensor motions to
acquire observations to resolve ambiguous
interpretations of an object's pose. The solution utilizes
a resolution tree specifying the motions from the
initial vantage point that will reduce the ambiguity
until a unique pose can be resolved from the
observations. The resolution tree is created off-line by a
traditional planner that utilizes knowledge of the
aspect classes and the spatial extent of the aspects.
New data representations were developed to facilitate
this planning operation. We have conducted
experiments in recognizing specular objects and
finger-gap sensing, showing the utility of multiple
observations for resolving pose ambiguity.

We have developed a novel algorithm for object
recognition in range images that is efficient for
large model databases [Wheeler and Ikeuchi 93].
Our algorithm performs optimal selection of
hypothesized views to model, and for each one, the
visible image features are generated through simulation
of the imaging and feature extraction process.
In these simulated views, the correspondence
of model features and image features is known.
Using these correspondences, statistics are accumulated
to produce conditional probability distributions
of the extracted features given the visibility
of each model feature in the scene using Markov
Random Fields and a Highest Confidence First
estimation technique. This method has proven
effective in substantially reducing the number of
hypotheses to be verified while choosing accurate
hypotheses. Because of its ability to search through
large sets of possible hypotheses, this method
allows us to automatically generate vision programs
for large sets of objects and also for partly
occluded objects.

5.3.  Learning  From  Observation 

We (Ikeuchi, Kang, Suehiro, Kawade) have been
working on a task programming approach called
Assembly Plan from Observation (APO), which
will enable the robot system to observe a human
perform a task, understand it, and perform the
same task with minimum human intervention. In
this approach, the human provides the intelligence
in choosing the initial hand (end-effector) trajectory,
the grasping strategy, and the hand-object
interaction by directly acting them out. This
approach helps to alleviate the problems of symbolic
path planning, grasp synthesis, and task
specification.

Recently, we have developed a method of APO for
objects with curved surfaces [Ikeuchi et al. 93].
Each surface patch is categorized according to the
signs of its Gaussian and mean curvatures. In our
current implementation, the human operator
demonstrates a task one step at a time to the system.
Each task is captured in intensity and range
images. The intensity images (sampled at a regular
interval) are analyzed to detect the next meaningful
action of the human operator, while the range
images are used to recognize objects and locate the
hand in the scene. A significant brightness difference
between consecutive intensity images signals
the occurrence of a meaningful action, and triggers
the range finder to capture the range image of the
scene. The system roughly locates the grasping
points from the last two images by image subtraction,
superquadric fitting of the distal portions of
the thumb and index finger, and determination of
the intersection points between the superquadrics
and the grasped object. The contact transition is
recognized based on the before- and after-task
range images; the appropriate task model is determined
from the contact relationship (Figure 4) and


determined from "before" and "after" contacts


ters. This, coupled with mechanical properties such
as bolt-like and nut-like motions, enables the robot
to replicate tasks such as picking and placing, and
screwing a bolt into a hole.
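
The categorization of surface patches by the signs of their Gaussian and
mean curvatures, mentioned above, can be sketched as a small lookup. The
text does not list the categories, so the eight labels below follow a
commonly used naming for the sign combinations and are an assumption. The
sign convention (negative mean curvature with positive Gaussian curvature is
a peak) depends on the choice of surface normal and is also an assumption,
as are the function names and the tolerance value.

```python
def sign(value, eps):
    """Return -1, 0, or +1, treating |value| <= eps as zero."""
    if value > eps:
        return 1
    if value < -eps:
        return -1
    return 0


# Keys are (sign of mean curvature H, sign of Gaussian curvature K).
SURFACE_TYPES = {
    (-1, 1): "peak",
    (-1, 0): "ridge",
    (-1, -1): "saddle ridge",
    (0, 0): "flat",
    (0, -1): "minimal surface",
    (1, 1): "pit",
    (1, 0): "valley",
    (1, -1): "saddle valley",
}


def classify_patch(k1, k2, eps=1e-6):
    """Classify a patch from its two principal curvatures.

    Gaussian curvature is K = k1 * k2 and mean curvature is
    H = (k1 + k2) / 2. The combination H = 0, K > 0 cannot occur for real
    principal curvatures, so it is reported as undefined."""
    gaussian = k1 * k2
    mean = 0.5 * (k1 + k2)
    key = (sign(mean, eps), sign(gaussian, eps))
    return SURFACE_TYPES.get(key, "undefined")
```

In practice the curvatures would be estimated from the range image, and the
tolerance would be chosen to match the noise level of the range finder.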

Grasp recognition is central to automatic learning
of a grasping task. We (Kang and Ikeuchi) have
developed a representation called the contact web
in conjunction with a grasp taxonomy to identify a
grasp (Figure 5) [Kang and Ikeuchi ^]. The grasp
is represented by a grasp abstraction hierarchy
which comprises high-level (type of grasp),
intermediate-level (finger groups), and low-level
(locations and joint angles) information.


Figure 5: Recognizing a grasp (a) Range image of
hand (b) Hand object (c) Solid model (d) Grasp

Our method for grasp recognition detects all
phases of the grasping operation. Given a temporal
image sequence of a grasping task, the pregrasp
phase is first inferred approximately (using parameters
such as speed, grip aperture, and approach
polygon area) until the grasp itself has been temporally
located in the sequence and identified.
Preliminary work has indicated that we can temporally
locate the static grasp phase within the sequence by
analyzing both the speed and approach polygon
area profiles. The identification of the grasp will
either strengthen or weaken the pregrasp phase
hypothesis. In addition, it constrains the types of
manipulation that can be applied to the object. Our
system for recognizing grasping tasks comprises a
CyberGlove hand tracker (with a Polhemus
device), a monochrome camera, and a range finder.

6. Vision for Robot Vehicles

Mobile robots are vital for reconnaissance, exploration,
and all missions to be carried out in remote
locations. Navigation by GPS and dead reckoning
can be used to tell where the vehicle is, but to
guide it through terrain or to reach a target, visual
sensing is needed. This is challenging because of
the need for reliability in a complex, ever-changing
outdoor environment. CMU has been a leader in
vision-guided robot vehicle development, and has
several major vehicle programs: the Navlab/UGV
effort for wheeled land vehicles, the Ambler for
legged locomotion in rough terrain, and a new
autonomous helicopter.

6.1.  Navlab  and  UGV 

CMU is a key member of the DARPA Unmanned
Ground Vehicle (UGV) program. Based on our
Navlab autonomous van and our Navlab II
HMMWV, we are providing basic mobility for the
Demo II effort, including perception, planning,
vehicle control and modeling, and human-computer
interaction. We have recently demonstrated
cross-country navigation for 5 km; autonomous
parallel parking (including finding the parking
space); and a mission that includes driving on dirt
roads, avoiding obstacles, following a map, driving
off road, and stopping at a designated landmark.
Our software is being transferred to Martin Marietta
for integration into the UGV testbed vehicle.

We  have  made  advances  in  several  areas  to  achieve 
these  goals: 

Multiple types of roadway: ALVINN (Pomerleau)
is our neural-net based road following program,
which learns to drive from five minutes of
observing the human driver's control of the vehicle.
Recently, we have extended ALVINN to drive
on a wide variety of roads, using information from
several different networks, each trained for an
individual road type (Figure 6) [Pomerleau 92].
Different networks are selected based on the match
between the input scene and its encoding as
represented in each network. MANIAC (Jochem,
Pomerleau, and Thorpe) does not select an individual
network, but instead uses an additional neural
network which looks at the outputs of the individual
nets [Jochem et al. 93]. This top-level net can
then use input from each of the lower-level nets to
produce a steering response which may be superior
to any of the responses from the individual nets.

Obstacle location and avoidance: GANESHA
(Langer, Hebert, and Thorpe) uses several sonar
sensors placed around the front of the vehicle for
obstacle avoidance using an occupancy grid
representation centered on the moving vehicle. It has
also been used for parallel parking (Figure 7)
[Langer and Thorpe 92], and the moving occupancy
grid representation has been used to integrate
stereo vision and laser range data as well.

Intersections: YARF (Kluge and Thorpe), our system
for driving by tracking lane markings, can now
detect and navigate intersections as well as roadway
stretches [Kluge and Thorpe 93]. This will
allow autonomous traversal of road networks.

Teleoperation: STRIPE (Kay and Thorpe) is a
method of semi-autonomous teleoperation of a
vehicle which allows it to accurately traverse hilly
terrain while communicating with the operator
across a very low bandwidth link [Kay and Thorpe
93]. The operator plots the vehicle's chosen trajectory
based on a single 2-D image, and the transformation
of 2-D image points to 3-D world points is
done in real time on the vehicle.

Automatic convoying: RACCOON (Sukthankar)
is a vision system that tracks taillights for
car-following at night, building a map in real time of the
lead vehicle's position for accurate control
[Sukthankar 92]. It has been demonstrated on the
NAVLAB II at 32 km/h on a winding road.


Figure 6: ALVINN has been trained for several roadways: dirt road, bicycle path, two-lane highway

Figure 7: Automated parallel parking: (a) Approach (b) Detect gap (c) Prepare to enter (d) Roadway


Landmark recognition: Landmark recognition is
also important for both roadway and off-road
navigation, and we (Hebert) have a new approach
based on surface matching with range data that has
been integrated on the Navlab to register the vehicle
position with the map [Hebert 92].
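
The vehicle-centered occupancy grid described for GANESHA above can be
illustrated with a small sketch. It is not the GANESHA implementation. It
assumes a square grid in vehicle coordinates, vehicle motion by whole cells
with a fixed heading (rotation is ignored for brevity), and simple hit
counting per cell. The class and method names are hypothetical.

```python
import math


class MovingOccupancyGrid:
    """Occupancy grid that stays centered on the vehicle at cell (0, 0)."""

    def __init__(self, half_size_cells, cell_size):
        self.half = half_size_cells   # grid spans -half..+half cells per axis
        self.cell = cell_size         # cell edge length in meters
        self.counts = {}              # (ix, iy) -> number of sensor hits

    def _index(self, x, y):
        """Nearest cell index for a point given in vehicle coordinates."""
        return (math.floor(x / self.cell + 0.5),
                math.floor(y / self.cell + 0.5))

    def _inside(self, ix, iy):
        return abs(ix) <= self.half and abs(iy) <= self.half

    def add_obstacle(self, x, y):
        """Record one range reading (for example a sonar echo) at (x, y)."""
        ix, iy = self._index(x, y)
        if self._inside(ix, iy):
            self.counts[(ix, iy)] = self.counts.get((ix, iy), 0) + 1

    def move_vehicle(self, dx_cells, dy_cells):
        """Vehicle translates by whole cells; stored obstacles shift the
        opposite way, and those leaving the grid are forgotten."""
        shifted = {}
        for (ix, iy), hits in self.counts.items():
            nx, ny = ix - dx_cells, iy - dy_cells
            if self._inside(nx, ny):
                shifted[(nx, ny)] = hits
        self.counts = shifted

    def occupied(self, x, y, min_hits=1):
        return self.counts.get(self._index(x, y), 0) >= min_hits
```

Keeping the grid in vehicle coordinates means readings from different
sensors, taken at different times, can be accumulated in one place as long
as each is expressed relative to the current vehicle position.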

6.2.  Ambler  for  Planetary  Exploration 

In its fourth year of operation, the Ambler walking
robot established new world records for long-term
autonomous walking, both outdoors and indoors.
With all computing, sensing, power, and telemetry
on-board, the robot is completely self-reliant. To
date, the Ambler has walked autonomously a total
of over 4 km, much of it over rugged, difficult
terrain.

We (Krotkov) have extended the Ambler terrain
mapping system to operate reliably and continuously
under a wide variety of environmental conditions
in natural, outdoor environments [Krotkov et
al. 93]. The fielded system has been thoroughly
tested in numerous walking experiments, processing
tens of thousands of range images and tens of
millions of terrain elevation points. In a single run,
the perception system acquired 1200 range images
and built 4700 terrain elevation maps containing a
total of 2.6 million elevation points.

The extensions include using feedback from leg
contact with the terrain to increase the accuracy of
the elevation maps, aggressive preprocessing of
range sensor data, periodic recalibrations to
compensate for long-term sensor drift, and extensive
error detection and recovery procedures to respond
to hardware and memory management errors. We
also developed a new fractal-based method for
recovering the terrain map from range data.

6.3. Autonomous Vision-Guided Helicopter

We (Amidi and Kanade) are also developing a
vision-guided autonomous helicopter. At present,
we are researching visual feedback for close and
precise helicopter hovering near a known object of
interest to perform inspection tasks. Our control
scheme is based on a linear helicopter control
model which is updated in real-time by a fuzzy
controller. The helicopter model is the basis for
image feature detection and tracking as well as
helicopter control. We have been testing our control
ideas on a model electric helicopter using an
indoor 6-degrees-of-freedom (6-DOF) testbed that
provides both ground truth data and controlled test
environments. We plan to upgrade to a mid-size
helicopter (Yamaha R50, 2.6-meter long, 20 kg
payload) with a larger 6-DOF stand for both indoor
and outdoor experiments. Finally, we plan to perform
free flight experiments using a camera and
on-board sensors to fly the helicopter while performing
a real-world task.

7. Vision for Human-Computer
Interaction

It is now being recognized by DARPA and others
that one of the biggest barriers to the effective
application of computer technology is the difficulty
of communicating between the human and the
computer. The Image Understanding group at
CMU has found several ways that machine vision
can contribute to improving this communication to
impact all uses of computers, from word and data
processing to gesture input and even for improving
the study of Human-Computer Interaction (HCI)
itself. Our current focus areas are in gesture input,
perception of the user's face, tracking where the
user is looking, and aids for surgery.

7.1.  The  Vision  Dataglove 

Gesture input is useful for tasks such as robotic
manipulation, directing the attention of a robot
vehicle, and providing input for map-based mission
planning and coordination. We (Rehg and
Kanade) are developing the "Vision Dataglove",
a system for model-based visual tracking of human
arm and hand motion. By exploiting geometric and
kinematic models of the human hand and arm, we
can estimate its motion from a sequence of intensity
images. We model the hand as a collection of
16 rigid segments (12 individual finger segments, 3
thumb segments and one segment for the palm).
The vision dataglove has potential applications in
man-machine interfaces and teleoperation.

7.2.  Face  Perception 

The second focus of our work on vision for HCI is
in perception of the human face. We (Rander and
Kanade) have demonstrated a system for tracking
specific facial features such as the eyes, nose and
mouth. The system builds a multi-resolution image
pyramid from a digitized image of a person's face.
It uses coarse templates to localize the person's
face within the lowest resolution image in the pyramid.
It then searches the corresponding region of
the higher resolution images for smaller facial features
such as the eyes, nose and mouth. Constraints
imposed by the positions of features in the previous
image and by a geometric model of facial feature
relationships allow the system to limit its
search and achieve near real-time cycle rates
(currently about 5 Hz).
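
The coarse-to-fine search described above can be sketched with a two-level
pyramid. This is a simplified illustration, not the demonstrated system. It
searches for a single template at both levels, uses 2x2 averaging to build
the coarse level, and scores matches by sum of squared differences. The
function names and the search margin are hypothetical.

```python
def downsample(image):
    """Halve the resolution by averaging each 2x2 block of pixels."""
    rows, cols = len(image) // 2, len(image[0]) // 2
    return [[(image[2 * r][2 * c] + image[2 * r][2 * c + 1] +
              image[2 * r + 1][2 * c] + image[2 * r + 1][2 * c + 1]) / 4.0
             for c in range(cols)] for r in range(rows)]


def ssd_at(image, template, top, left):
    """Sum of squared differences with the template placed at (top, left)."""
    total = 0.0
    for r, row in enumerate(template):
        for c, value in enumerate(row):
            diff = image[top + r][left + c] - value
            total += diff * diff
    return total


def best_match(image, template, rows, cols):
    """Best (top, left) among the candidate rows and columns that keep the
    template fully inside the image."""
    max_top = len(image) - len(template)
    max_left = len(image[0]) - len(template[0])
    best = None
    for top in rows:
        for left in cols:
            if 0 <= top <= max_top and 0 <= left <= max_left:
                score = ssd_at(image, template, top, left)
                if best is None or score < best[0]:
                    best = (score, top, left)
    return best[1], best[2]


def coarse_to_fine(image, template, margin=2):
    """Search the whole coarse level, then only a small window of the fine
    level around the position the coarse match points to."""
    coarse_image = downsample(image)
    coarse_template = downsample(template)
    ct, cl = best_match(coarse_image, coarse_template,
                        range(len(coarse_image)), range(len(coarse_image[0])))
    rows = range(2 * ct - margin, 2 * ct + margin + 1)
    cols = range(2 * cl - margin, 2 * cl + margin + 1)
    return best_match(image, template, rows, cols)
```

The full-resolution search touches only a few candidate positions, which is
the kind of saving that makes near real-time cycle rates plausible.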

7.3.  Tracking  the  User’s  Gaze 

One of the ways machine vision can aid HCI is by
providing a hands-free replacement for a pointing
device. One such system we (Pomerleau and Baluja)
are developing is a non-intrusive gaze tracker
based on artificial neural networks. Once the positions
of the eyes are located in the video image, the
gaze tracker extracts a small window centered on
the right eye of the person, and provides it as input
to a neural network. The network is trained to
determine where on the computer screen the person
is looking from the appearance of his eye in the
input image. By exploiting the unique characteristics
of an individual's eye appearance, it is able to
estimate the location of a person's gaze to within
approximately 1 degree (about the size of a 6 letter
word viewed in a normal font from a comfortable
distance from the screen). This level of accuracy is
comparable to that of the most expensive commercially
available vision-based eye trackers. Unlike
conventional eye-trackers, it does not require the
user's head to be fixed in a frame - the user simply
sits normally in a chair. The gaze tracker is useful
as a rapid, hands-off pointing device, and also for
studying the process of human-computer interaction
itself, to improve interface designs.

7.4.  Vision-Aided  Laparoscopic  Surgery 

Laparoscopic surgery is a minimally invasive surgical
technique which involves low trauma,
reduced risk of infection, and less post-operative
pain than conventional open surgery. Unlike open
surgery of the digestive system, which involves a
large incision, laparoscopy is performed by special
instruments inserted through small holes cut into
the abdomen. The interior of the patient is imaged
by a small camera mounted on a special instrument,
called a laparoscope, and displayed on a
standard video monitor.

We (Gibson and Kanade) are addressing two of the
fundamental limitations of current laparoscopic
surgery: (1) the laparoscope camera does not provide
three-dimensional depth perception, and (2)
misalignment of the camera view can greatly
increase the complexity of hand-eye coordination
for the surgeon. We are developing a prototype
system that uses a pair of video cameras to image
the operative field. This stereo view will allow the
surgeon to perceive three-dimensional depth.
Electronic sensors will be fixed to both the
laparoscope and the surgeon to help align the camera
view.

Abstract

This  report  summarizes  progress  in  image 
understanding  research  at  the  University  of 
Massachusetts  over  the  past  year.  Many  of  the 
individual  efforts  discussed  in  this  paper  are 
further developed in other papers in these
proceedings. The summary is organized into
several  areas: 

1.  Mobile  Robot  Navigation 

2.  Motion  Analysis 

3. Interpretation of Static Scenes

4.  Image  Understanding  Architecture 

5.  RADIUS  Image  Exploitation 

The  research  program  in  computer  vision  at 
UMass  has  as  one  of  its  goals  the  integration  of  a 
diverse  set  of  research  efforts  into  a  system  that  is 
ultimately  intended  to  achieve  real-time  image 
interpretation in a variety of vision applications.

1.  Mobile  Robot  Navigation 

1.1. Automated Model Acquisition and
Extension

The focus of the UMass mobile robot navigation
project is robust landmark-based navigation, with
an emphasis on automated model acquisition and model
extension. Thus, for navigation in unmodelled or
sparsely modelled environments, our general
scenario would involve the initial acquisition of
prominent visual features that can serve as
landmarks. This initial phase of partial model
acquisition is necessary because there are few
situations where a model of a complex outdoor
scene will be available a priori. Once a sparse
model is available, the vehicle position and
orientation (i.e. pose) can be recovered by
recognizing landmarks. The model extension
phase involves tracking new unmodelled features
(points and/or lines), and using the landmarks and
partial model to determine the camera pose for
triangulation of the new features and incorporation
into the 3D model.
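
The triangulation step of the model extension phase can be
sketched as follows. This is a minimal illustration and not the
UMass implementation: it assumes the camera centers and the
viewing rays toward a tracked feature are already known in a
common world frame (from the recovered poses), and it uses the
simple midpoint method. The function name is hypothetical.

```python
def triangulate_midpoint(c1, d1, c2, d2):
    """Return the midpoint of the shortest segment joining two 3D rays.

    c1, c2: camera centers; d1, d2: ray directions through the tracked
    feature (all 3-tuples expressed in the same world frame).
    """
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    w = tuple(a - b for a, b in zip(c1, c2))
    a, b, c = dot(d1, d1), dot(d1, d2), dot(d2, d2)
    d, e = dot(d1, w), dot(d2, w)
    denom = a * c - b * b
    if abs(denom) < 1e-12:
        raise ValueError("rays are (nearly) parallel; depth is unobservable")
    s = (b * e - c * d) / denom   # parameter of the closest point on ray 1
    t = (a * e - b * d) / denom   # parameter of the closest point on ray 2
    p1 = tuple(ci + s * di for ci, di in zip(c1, d1))
    p2 = tuple(ci + t * di for ci, di in zip(c2, d2))
    return tuple((x + y) / 2.0 for x, y in zip(p1, p2))

# Hypothetical example: one feature seen from two known camera positions.
point = triangulate_midpoint((0.0, 0.0, 0.0), (1.0, 2.0, 10.0),
                             (1.0, 0.0, 0.0), (0.0, 2.0, 10.0))
```

With noisy rays the two closest points no longer coincide, and
the length of the segment between them gives a rough indication
of how well the feature is localized.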

Most of the algorithms have been described in
previous IUW proceedings and the general vision
literature [Beveridge 92, Kumar 92, Sawhney 92,
93]. These algorithms have been shown to be very
accurate in many indoor experiments using a
camera mounted on a mobile robot and on a
moving robot arm. One new experiment that
integrated several components involved the
detection of shallow structures: aggregations
of line features that can be approximated in an
image sequence as a frontal planar surface. The
3D positions of these features served as the
acquired model, with a depth error of less than 4%.
As motion of the camera continues, the model is
extended with depth information on other tracked
points, with depth errors of less than 2%.

1.2. Status of the UMass Mobile Perception
Laboratory (MPL)

1.2.1.  Physical  Description 

The UMass Mobile Perception Laboratory (MPL)
is based on a significantly modified HMMWV.
The design of the overall system includes actuators
and encoders for the throttle, steering column and
brakes that closely match those being used by
CMU, controlled by 68020s in a 6u VME cage.
The low-level control software for controlling
speed and steering angle will also be the same as
that of CMU. The modifications and component
installation were performed by RedZone, Inc., a
Pittsburgh-based firm specializing in custom
robotics, and were completed at the beginning of
February 1993.

Electrical power is supplied by a 10 kW diesel
generator, whose output is split into two 5 kW
circuits. The first circuit is conditioned and
backed by a 5 kW uninterruptible power supply
(UPS) system, and is used to supply power to all
sensitive electronic equipment. The second circuit
is not conditioned and is used to power the air
conditioners. Both circuits are attached to a
shore-power hook-up that provides an alternative
power source to the on-board generator.

The physical lay-out of equipment was designed to

1) provide for two on-board programmer stations,

2) minimize destructive modifications to the body
of the vehicle, and

3) keep the center of gravity as far forward as
possible, in order to minimize stress on the
suspension system.

The first programmer station is located in the
HMMWV's passenger seat, with a 17" color
X-terminal fixed to the metal platform between the
passenger's and driver's seats. The second
programmer station is located behind and slightly
above the driver, and includes a car seat and
mounting brackets for both an SGI color terminal and
a small SONY monitor for viewing raw TV signals.

The back of the vehicle is filled with equipment.
On the driver's side of the vehicle, behind the
second programmer station, is all of the equipment
associated with providing power. On the
passenger's side there are four enclosed,
air-conditioned 19" computer frames for the on-board
computer systems. The first frame will hold the
6u VME cage for throttle, brake and steering
controllers and a second 6u VME cage for holding
digitizers, image frame stores and a Datacube
MaxVideo20. The second computing frame will
contain a 9u cage for the Silicon Graphics
four-node multiprocessor, as well as the SGI's disk
drives, power supply and (removable) tape drive.
The third frame is reserved for the Image
Understanding Architecture (IUA). The fourth
frame is for future additions, including video
recorders for collecting data and recording
experiments. Together, the four frames take up
the length of the vehicle's bed, as do the
programmer station, UPS cage and generator on
the left side.

1.2.2. Sensor Configuration

The vehicle's sensor package includes a Staget,
which is a rotating stabilized platform being
supplied to the UMass and CMU vehicles by
TACOM. The UMass Staget is mounted on a
level platform located at the center of the roof of
the cab. We are planning to put two CCD color
cameras on the Staget, one with a wide angle lens
and the other with a telephoto lens. The first will
be used to locate landmarks in the larger scene,
and the second will be used for landmark matching
and accurate pose refinement. The Staget will also
contain a FLIR sensor. The Staget's hardware is
mounted above the driver's head in the enclosure
originally occupied by the HMMWV's NBC
system. Forward of the Staget, at the edge of the
cab's roof, is a long (5" by 12" by 12") rectangular
enclosure with a glass front and hinged roof for
forward-looking stereo cameras.

1.2.3. Software Environment

MPL  is  an  experimental  laboratory  for  testing  and 
integrating  different  approaches  to  problems  in 
autonomous navigation, including, but not limited
to, landmark-based navigation, obstacle detection
and  avoidance,  model  acquisition,  and  road 
following.  It  is  therefore  important  that  MPL  have 
a  software  environment  where  multiple  visual 
modules,  addressing  different  subtasks,  can  be 
easily  integrated,  and  where  researchers  can 
quickly  experiment  with  different  combinations 
and  parameterizations  of  those  modules.  At  the 
same  time,  MPL's  software  environment  must  be 
efficient  enough  to  meet  the  demands  of  real-time 
navigation  research. 

The  need  to  balance  between  flexibility  and 
efficiency  has  led  us  to  design  a  software 
environment  with  two  major  components:  the 
ISR3  in-memory  data  store,  and  a  graphical 
programming  interface  adapted  from  Khoros. 
ISR3  is  the  glue  that  binds  independent  visual 
modules  together  [Draper  93a].  It  is  an  in-memory 
database  that  allows  users  to  define  structures  for 
storing  visual  data,  such  as  images,  lines  and 
surfaces.  ISR3  then  serves  as  a  buffer,  so  that,  for 
example,  lines  produced  by  one  module  can  be 
used  by  another,  even  if  the  second  module  is  run 
later  or  on  a  different  processor  than  the  first. 
ISR3 also provides modules with efficient spatial
access  routines  for  visual  data,  and  protects  data 
from  being  simultaneously  modified  by  two  or 
more  concurrent  processes.  The  graphical 
programming  interface  allows  programmers  to 
easily  sequence  modules  and  modify  their 
parameters. 
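
The role played by an in-memory store of this kind can be
illustrated with a toy example. The sketch below is not ISR3 and
does not reflect its actual interface: the class name, method
names and token layout are all hypothetical, and the spatial
query is a plain linear scan rather than an efficient index. It
only shows the three ideas from the text: typed visual tokens,
buffering between modules, and protection against simultaneous
modification.

```python
import threading
from collections import defaultdict

class VisualStore:
    """Toy in-memory store for typed visual tokens (hypothetical API)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._tokens = defaultdict(list)   # token type -> list of records

    def put(self, kind, bbox, data):
        """Store a token; bbox is (xmin, ymin, xmax, ymax) in image coordinates."""
        with self._lock:                   # only one thread modifies at a time
            self._tokens[kind].append((bbox, data))

    def query(self, kind, window):
        """Return the data of all tokens of `kind` whose bbox overlaps `window`."""
        wx0, wy0, wx1, wy1 = window
        with self._lock:
            return [data for (x0, y0, x1, y1), data in self._tokens[kind]
                    if x0 <= wx1 and x1 >= wx0 and y0 <= wy1 and y1 >= wy0]

store = VisualStore()
# A line-extraction module writes its results ...
store.put("line", (10, 10, 50, 12), {"id": 1})
store.put("line", (200, 80, 260, 150), {"id": 2})
# ... and a later module reads only the lines near its region of interest.
nearby = store.query("line", (0, 0, 100, 100))
```

Because the producer and the consumer only share the store, the
second module can run later, or in another thread, without
knowing which module produced the lines.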

1.2.4.  Navigation  System 

A preliminary version of a behavior-based
system for determining vehicle pose from known
landmarks has been designed. It is assumed that
pose estimates and associated covariance (error)
estimates are returned asynchronously from several
subsystems (GPS, INS, landmarks, and dead
reckoning). These estimates are continually
combined via a Kalman filter into a single pose
estimate (and associated covariance matrix
estimate) and stored in a vehicle state vector. The
vehicle pose error is continually monitored in a
simple loop which branches to a behavior selection
strategy when the vehicle pose error exceeds a
preset threshold.

The system also contains a video image frame
buffer and STAGET control subsystem. This
system maintains image and pose temporal
histories (time-stamped images and corresponding
pose estimates) in a fixed-length first-in last-out
queue. This information is available to the
remainder of the system. The STAGET control
interface permits the STAGET to be repositioned
relative to the vehicle and maintains information
about the various STAGET parameters and
conditions, including information about the current
lens aperture and focal length.
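
The combination of asynchronous pose estimates and the error
monitoring loop described in this section can be sketched in
one dimension. This is a simplified illustration under stated
assumptions, not the MPL navigation code: the state is a single
directly observed coordinate, there is no prediction step or
process noise, and all numbers, names and the threshold are
hypothetical. A real system would carry a full pose vector and
covariance matrix.

```python
def fuse(estimate, variance, measurement, meas_variance):
    """One Kalman measurement update for a directly observed scalar state."""
    gain = variance / (variance + meas_variance)
    new_estimate = estimate + gain * (measurement - estimate)
    new_variance = (1.0 - gain) * variance
    return new_estimate, new_variance

def needs_new_behavior(variance, threshold):
    """Monitoring check: has the pose uncertainty exceeded the preset limit?"""
    return variance > threshold

# Hypothetical east-position reports (meters) with variances (m^2),
# processed in whatever order they arrive.
x, p = 100.0, 25.0                 # prior, e.g. from dead reckoning
for z, r in [(104.0, 16.0),        # e.g. a GPS fix
             (101.0, 1.0)]:        # e.g. a landmark-based pose
    x, p = fuse(x, p, z, r)

if needs_new_behavior(p, threshold=4.0):
    print("pose error too large: select a landmark acquisition behavior")
```

The fused variance is always smaller than that of any single
source, so the threshold test fires only when no sufficiently
accurate estimate has arrived recently.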

All  the  landmark  matching  and  pose  refinement 
algorithms  have  been  tested  extensively,  although 
to  a  great  extent  only  in  indoor  domains.  A  large 
portion  of  the  original  LISP  has  been  ported  to  C. 
The  plan  for  the  coming  year  of  research  is  to 
develop  the  following  behaviors:  road  following, 
obstacle  avoidance,  landmark  detection,  landmark 
tracking,  and  model  extension. 

Initially, two types of landmark processing
behavior will be specified. The first behavior for
landmark tracking assumes that a landmark (or set
of landmarks) is currently being tracked via the
STAGET and all that is necessary is that the
vehicle pose be recomputed from the tracked
landmarks. However, there are computational
tradeoffs as a function of the speed of the vehicle,
and the distance and number of landmarks. Thus,
not all landmarks may be tracked frame by frame.

The  second  landmark  navigation  behavior  assumes 
that  no  landmarks  are  currently  being  tracked  and 
therefore  a  new  landmark  must  be  acquired.  This 
will  involve  access  to  a  stored  3D  model  of  the 
campus  environment  (which  initially  has  been 
constructed  a  priori)  in  order  to  control  the  Staget 
and  window  on  subimages  via  the  Staget. 
However,  the  availability  and  density  of  landmarks 
will  vary  significantly  in  different  areas  of  the  test 
environment,  and  therefore  model  extension  will 
be  a  necessary  goal.  Ultimately  we  seek  to 
demonstrate  that  an  accurate  3D  model  of  the 
environment  can  be  acquired  via  exploration  in  a 
purely  bottom-up  manner,  while  carrying  out 
independent  goal-oriented  navigation  tasks. 

1.3. Qualitative Navigation via Image-Based
Homing 

If  the  world  changes  or  the  robot  fails  to  recognize 
a  landmark,  the  robot's  perception  of  the  world 
will  not  correspond  to  its  current  map  of  the  world. 
However,  there  is  ambiguity  in  whether  the  errors 
are  in  its  perception  or  its  map,  and  if  the  latter,  it 
must  update  its  map. 

Pinette [Pinette 91] has been developing a
principled approach to automatic map construction
and maintenance. In place of the usual
construction of a geometric map, snapshots of the
world at selected target locations along the route
are stored as the robot's knowledge of that path.
By noting places where a set of memorized routes
intersect, a topological "road map" of routes and
junctions is represented. To retrace a stored
route, a qualitative homing algorithm based on
purely local visual servoing is employed to home
between successive target locations along the
route. This homing algorithm uses no geometric
model or positional information; rather, it servos
directly on the stored image for a target location,
choosing headings that reduce the difference
between features of the current bearings and those
in the target snapshot. A "consistency-filtering"
algorithm has been developed for handling
incorrectly matched landmark features [Pinette
92]. It is shown that this algorithm guarantees
reliable homing as long as more than two-thirds of
the landmarks are correctly identified.

A  very  robust  implementation  of  a  robot 
navigation  system  has  been  developed  using 
image-based  homing  with  a  spherical  mirror  for 
encoding  a  360  degree  view  at  each  target 
location.  This  navigation  system  has  been 
implemented as part of an indoor manufacturing
automation  application  domain.  It  is  not  yet  clear 
whether  these  techniques  are  directly  applicable  to 
unconstrained  outdoor  domains  and  large-scale 
space. 

2.  Motion  Analysis 

2.1. Multi-Frame Structure from Motion

In robot navigation a model of the environment
needs to be reconstructed for various applications,
including path planning, obstacle avoidance and
determining where the robot is located.
Traditionally, the model was acquired using two
images (two-frame Structure from Motion) but the
acquired models were unreliable and inaccurate.
Generally,  research  has  shifted  to  using  several 
frames  (multi-frame  Structure  from  Motion) 
instead  of  just  two  frames.  However,  almost  none 
of  the  reported  multi-frame  algorithms  have 
produced  accurate  and  stable  reconstructions  for 
general  robot  motion.  The  main  reason  seems  to 
be  that  the  primary  source  of  error  in  the 
reconstruction  -  the  error  in  the  underlying  motion 
-  has  been  mostly  ignored.  Intuitively,  if  a 
reconstruction  of  the  scene  is  made  up  of  points, 
this  motion  error  affects  each  reconstructed  point 
in  a  systematic  way.  For  example,  if  the 
translation  of  the  robot  is  erroneous  in  a  certain 
direction,  all  the  reconstructed  points  would  be 
shifted along the same direction.

Recently, Thomas [Thomas 93a, b] has
mathematically isolated the effect of the motion
error (as correlations in the structure error) and has
shown theoretically that including these
correlations in the computation can dramatically
improve existing multi-frame Structure from
Motion techniques. In several experiments on our
indoor robot, the environmental depths of points
from 15 to 50 feet away from the camera (and for
which ground truth data was available) were
reconstructed with errors in the 1-3% range. In
one further experiment, the multi-frame
full-correlation algorithm was first used to create a
model (a set of points) of an indoor hallway from
several initial frames of image data. This model
was then used to compute the pose of the robot
over subsequent frames using Kumar's pose
recovery algorithm. The estimated robot pose and
actual robot position in the hallway differed by a
maximum of three to four inches over a 12.8 foot
path.
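
A toy model illustrates why the correlations in the structure
error matter. This is not the algorithm of [Thomas 93a, b]; it
simply assumes that every reconstructed depth carries one
shared error from the motion estimate plus an independent
measurement error. All names and variances are hypothetical.

```python
def structure_covariance(n_points, motion_var, noise_var):
    """Covariance of depth errors when all points share one motion error.

    Off-diagonal entries equal motion_var: the shared error moves every
    point together. The diagonal adds each point's independent noise.
    """
    return [[motion_var + (noise_var if i == j else 0.0)
             for j in range(n_points)]
            for i in range(n_points)]

def variance_of_difference(cov, i, j):
    """Variance of (depth_i - depth_j), using the full covariance."""
    return cov[i][i] + cov[j][j] - 2.0 * cov[i][j]

cov = structure_covariance(3, motion_var=4.0, noise_var=0.25)
with_correlation = variance_of_difference(cov, 0, 1)   # shared error cancels
ignoring_correlation = cov[0][0] + cov[1][1]           # treats errors as independent
```

In this model the relative depth of two points is known far
more precisely than a computation that ignores the
off-diagonal terms would suggest, which is the kind of
information a full-correlation method can exploit.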

2.2. Recovering Affine Transforms from Image
Sequences 

Deformations  due  to  relative  motion  between  an 
observer  and  an  object  may  be  used  to  infer  3-D 
structure.  Up  to  first  order  these  deformations  can 
be  written  in  terms  of  an  affine  transform.  The 
recovery  of  an  affine  approximation  to  image 
deformation  has  recently  been  the  focus  of  a  large 
amount  of  research,  and  has  found  application  in 
such  disparate  areas  of  computer  vision  as  image 
stabilization, optical flow computation and
segmentation, structure from motion, stereo,
texture, and obstacle avoidance.

Manmatha  [Manmatha  93]  has  developed  a 
technique  for  measuring  the  affine  transform 
locally  between  two  image  patches  using  weighted 
moments  of  brightness.  Unlike  previous  methods, 
this  technique  correctly  handles  the  problem  of 
finding  the  correspondence  between  deformed 
image  patches,  as  is  necessary  for  a  correct 
computation  of  the  affine  transform.  It  is  capable 
of  determining  affine  transforms  of  arbitrary  size, 
whereas  most  previous  approaches  are  limited  to 
small  transforms.  It  is  first  shown  that  the 
moments  of  image  patches  are  related  through 
functions  of  affine  transforms.  Finding  the 
weighted  moments  is  equivalent  (for  the  purposes 
of  measuring  the  affine  transform)  to  filtering  the 
images with Gaussians and derivatives of
Gaussians. In the special case where the affine
transform can be written as a scale change and an
in-plane rotation, the zeroth and first moment
equations are solved for the scale. In experiments
on synthetic and real images for this case, the scale
was recovered robustly and shown to give reliable
depth estimates. Work is continuing on extending
the basic techniques to the general case.
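
The link between moments and scale can be illustrated with a
deliberately simplified case. The sketch below is not the
method of [Manmatha 93], which uses Gaussian-weighted moments
and both the zeroth and first moment equations. It uses only
the unweighted zeroth moment of a whole synthetic pattern: if
the second image is I2(x) = I1(sRx) for a scale s and an
in-plane rotation R, a change of variables gives M0' = M0 / s^2,
since det(sR) = s^2 in two dimensions. The pattern, grid and
names are assumptions made for the example.

```python
import math

def zeroth_moment(image, half_width=6.0, step=0.1):
    """Riemann-sum approximation of the integral of image(x, y) over a square."""
    n = int(round(2 * half_width / step))
    total = 0.0
    for i in range(n + 1):
        x = -half_width + i * step
        for j in range(n + 1):
            y = -half_width + j * step
            total += image(x, y)
    return total * step * step

def blob(x, y):
    """Synthetic brightness pattern: an isotropic Gaussian blob."""
    return math.exp(-(x * x + y * y) / 2.0)

true_scale = 1.5

def scaled_blob(x, y):
    """Second view: the same pattern under a pure scale change, I2(x) = I1(s x)."""
    return blob(true_scale * x, true_scale * y)

# M0' = M0 / s^2, so the scale follows from the ratio of zeroth moments.
m0 = zeroth_moment(blob)
m0_scaled = zeroth_moment(scaled_blob)
recovered_scale = math.sqrt(m0 / m0_scaled)
```

An in-plane rotation would leave both moments unchanged, which
is why the zeroth moment alone isolates the scale in this
special case.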

2.3. Multi-Sensor Dextrous Manipulation

Grupen and Weiss [Grupen 93] have continued
their work on a multi-sensor approach to dextrous
manipulation. The goal of this project is the
integration of sensing and control for the task of
finding a stable grasp configuration for an
unknown object. A subgoal is the integration of
visual and haptic (proprioceptive) sensory data to
incrementally build a model of the object. This
approach uses knowledge of the task and the
accuracy and completeness of the model to control
the sensing actions.

The system consists of a camera mounted on one
robot and the Utah/MIT hand mounted on another.
The system calibration or identification problem
involves computing the transformation from the
coordinate system defined by the manipulator
robot to the coordinate system defined by the
camera robot. The pose determination algorithm
of Kumar and Hanson [Kumar 92] has been
adapted for this purpose. As the manipulator robot
moves, known feature points are tracked. Given
the kinematics of this robot, the pose of the camera
with respect to the coordinate frame of the
manipulator robot is computed and incrementally
refined using iterative, extended Kalman filtering.
Experiments were performed to demonstrate that
the accuracy of the filtering algorithm was
comparable to that of smoothing using a least
squares fit with all of the data, yet the computation
time was much less. An additional feature of the
method is that the kinematics of the camera robot
can be computed at the same time.
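
The relationship between incremental filtering and a batch
least squares fit can be seen in the simplest linear case. The
sketch below is only an analogy to the calibration problem
above, which is nonlinear and uses an extended Kalman filter:
it estimates a single constant from noisy readings, with
hypothetical data and names, and shows that the recursive
estimate agrees with the batch fit while processing one
measurement at a time.

```python
def batch_least_squares(measurements):
    """Least squares estimate of a constant from all data at once (the mean)."""
    return sum(measurements) / len(measurements)

def recursive_estimate(measurements, meas_variance=1.0, prior_variance=1e9):
    """Kalman filter for a constant scalar state, one sample at a time."""
    estimate, variance = 0.0, prior_variance   # nearly uninformative prior
    for z in measurements:
        gain = variance / (variance + meas_variance)
        estimate += gain * (z - estimate)
        # Equivalent to (1 - gain) * variance, written to avoid cancellation.
        variance = variance * meas_variance / (variance + meas_variance)
    return estimate

# Hypothetical noisy readings of one pose parameter.
readings = [2.1, 1.9, 2.05, 1.95, 2.2, 1.8]
batch = batch_least_squares(readings)
recursive = recursive_estimate(readings)
```

The recursive form never stores or revisits old measurements,
which is the source of the reduced computation time in this
linear setting.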

Grupen  and  Huber  [Huber  92]  have  obtained  3D 
surface  points  from  the  Utah/MIT  hand  without 
the  use  of  tactile  sensors.  The  measurements  used 
are  posture,  velocities,  and  torques.  This  will  be 
integrated  with  the  measurements  obtained  from 
the  camera  sensor. 

2.4.  Shape  Recovery  from  Occluding  Contours 

Recovering  the  shape  of  an  object  from  two  views 
(e.g.  stereo)  fails  at  occluding  contours  of  smooth 
objects  because  the  extremal  contours  are  view 
dependent.  For  three  or  more  views,  shape 
recovery  is  possible,  and  several  algorithms  have 
recently  been  developed  for  this  purpose.  Szeliski 
and  Weiss  [Szeliski  93]  have  developed  a  new 
approach  to  the  multiframe  shape  recovery 
problem which does not depend on differential
measurements in the image, which may be noise
sensitive. Instead, a linear smoother is used to
optimally combine all of the measurements
available at the contours (and other edges) that are
tracked through the set of images. This allows the
extraction  of  a  robust  and  dense  estimate  of 
surface  shape  and  the  integration  of  shape 
information  from  both  surface  markings  and 
occluding  contours.  The  results  provide  an 
extremely  promising  path  for  recovery  of  3D 
shape  models  in  an  industrial  setting  where  the 
motion  is  known. 

3.  Interpretation  of  Static  Scenes 

3.1.  Learning  3D  Recognition  Strategies 

Most  knowledge-directed  vision  systems  are 
tailored  to  recognize  a  fixed  set  of  objects  within  a 
known  context.  Generally,  the  programmer  or 
knowledge engineer who constructs them begins
with  an  intuitive  notion  of  how  each  object  might 
be  recognized,  a  notion  which  is  refined  by  trial- 
and-error.  Unfortunately,  human  engineering  is 
not  cost-effective  for  many  real-world 
applications.  Moreover,  there  is  no  way  to  ensure 
the  validity  of  hand-crafted  systems.  Worst  of  all, 
when  the  domain  is  changed,  the  systems  often 
have  to  be  rebuilt  from  scratch. 

The  Schema  Learning  System  (SLS)  [Draper  92, 
93b]  automates  the  construction  of  knowledge- 
directed  recognition  strategies.  Starting  from  a 
knowledge  base  of  visual  procedures  and  object 
models,  SLS  learns  robust  strategies  for  locating 
landmarks  in  images  and  recovering  their  positions 
and  orientations,  if  necessary.  Each  strategy  is 
specialized  to  a  landmark,  taking  advantage  of  its 
most  distinctive  characteristics,  whether  in  terms 
of  color,  shape,  or  contextual  relations,  to  quickly 
focus  its  attention  on  the  landmark  and  recover  its 
pose. Furthermore, because SLS learns from
experience by a strict generalization algorithm, it
is possible to predict both the expected costs and
the expected error rates (due to a lemma by
Valiant) of the strategies it develops.

3.2.  Figural  Completion  from  Principles  of 
Perceptual  Organization 

Figural  completion  is  the  preattentive  ability  of  the 
human  visual  system  to  build  complete  and 
topologically  valid  representations  of 
environmental  surfaces  from  the  fragmentary 
evidence  available  in  cluttered  scenes.  A 
description  of  a  grouping  system  developed  by 
Williams,  employing  a  two-stage  process  of 
completion  hypothesis  and  combinatorial 
optimization,  appeared  in  a  previous  workshop 
proceedings  [Williams  90].  Preliminary 
experimental  results  were  also  reported.  Since  that 
time  there  has  been  significant  progress  in  two 
major areas. First, the mathematical basis for the
grouping constraints employed in the optimization
stage has been clearly elucidated. This has
allowed a proof of the necessity and sufficiency of
the grouping constraints for scenes composed of
flat embeddings of orientable surfaces with
boundary. Second, a more advanced grouping
system  which  uses  cubic  Bezier  splines  of  least 
energy  to  model  the  shape  of  perceptual 
completions  has  been  implemented.  The  new 
system  is  demonstrated  on  a  number  of  figures 
from  the  visual  psychology  literature  which  are 
beyond  the  capability  of  the  old  system. 

3.3. Perceptual Organization of Curvilinear
Structure 

During  the  past  year,  Dolan  has  continued  his 
work  on  curvilinear  grouping  [Dolan  92].  A 
SIMD  implementation  of  the  curvilinear  grouping 
system  has  been  developed,  along  with  a 
simplified,  distributed  representation  of  curves  for 
use in the CAAPP. The integration of multiple
grouping processes, in particular curvilinear and
area grouping, is currently being examined.
Many  of  these  ideas  are  being  incorporated  in  a 
general  grouping  module  for  KBVision,  which 
will  facilitate  research  and  experimentation  with 
many  diverse  forms  of  grouping. 

3.4.  Stochastic  Projective  Geometry 

The  use  of  projective  invariants  for  object 
recognition  and  scene  reconstruction  has  been  the 
subject  of  intense  interest  in  the  image 
understanding  community  over  the  past  few  years. 
Although classic projective geometry was
developed with mathematically precise objects in
mind,  practical  applications  must  deal  with 
errorful  measurements  extracted  from  real  image 
sensors.  A  more  robust  form  of  projective 
geometry  is  needed,  one  that  allows  for  possible 
imprecision  in  its  geometric  primitives.  In  his 
Ph.D.  thesis  [Collins  93],  Collins  represents  and 
manipulates  uncertain  geometric  objects  using 
probability  distributions  in  projective  space, 
allowing  valid  geometric  constructions  to  be 
carried  out  via  statistical  inference.  The  result  is  a 
methodology  for  scene  reconstruction  based  on  the 
principles  of  projective  geometry,  yet  also  dealing 
with  uncertainty  at  a  basic  level.  The  effectiveness 
of  this  framework  has  been  demonstrated  on 
several  geometric  problems,  including  the 
derivation  of  3D  line  and  plane  orientations  from  a 
single  image  using  vanishing  point  analysis,  the 
extraction  of  a  planar  patch  scene  model  using 
stereo  line  correspondences,  and  the  reconstruction 
of  planar  surface  structure  using  multiple  images 
taken from unknown viewpoints by uncalibrated
cameras.

More specifically, Collins shows that projective
N-space can be visualized as the surface of a unit
sphere in (N+1)-dimensional Euclidean space.
Each point in projective space is represented as a
pair of opposing or antipodal points on the sphere.
By the identification of projective space with the
unit  sphere,  antipodally  symmetric  probability 
distributions  on  the  sphere  may  be  interpreted  as 
probability  distributions  over  the  points  of 
projective  space,  and  standard  constructions  of 
projective  geometry  can  then  be  augmented  by 
statistical  inferences  on  the  sphere.  Probability 
densities  defined  in  this  way  can  also  be  used  for 
representing  uncertainty  in  unit  vectors, 
orientations,  and  the  space  of  3D  rotations  (via 
unit  quaternions). 
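
The identification of projective points with antipodal pairs
on the unit sphere can be sketched for the projective plane
(N = 2). The code below covers only the deterministic
geometry and says nothing about the probability distributions
used in [Collins 93]; the function names are hypothetical.

```python
import math

def to_sphere(h):
    """Normalize homogeneous coordinates to a point on the unit sphere."""
    norm = math.sqrt(sum(c * c for c in h))
    if norm == 0.0:
        raise ValueError("the zero vector is not a projective point")
    return tuple(c / norm for c in h)

def same_projective_point(h1, h2, tol=1e-9):
    """True if h1 and h2 differ only by a nonzero scale, i.e. they map to
    the same or to antipodal points on the sphere."""
    u, v = to_sphere(h1), to_sphere(h2)
    return abs(abs(sum(a * b for a, b in zip(u, v))) - 1.0) < tol

def join(p, q):
    """Line through two points of the projective plane (their cross
    product), itself returned as a unit vector."""
    return to_sphere((p[1] * q[2] - p[2] * q[1],
                      p[2] * q[0] - p[0] * q[2],
                      p[0] * q[1] - p[1] * q[0]))

# (1, 2, 1) and (-3, -6, -3) are antipodal on the sphere: the same image point.
assert same_projective_point((1, 2, 1), (-3, -6, -3))
# The line through the image points (0, 0) and (1, 1) also contains (2, 2).
line = join((0, 0, 1), (1, 1, 1))
```

Because points and lines of the projective plane are both unit
vectors in this representation, one family of antipodally
symmetric distributions on the sphere can describe
uncertainty in either.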

3.5. Shape from Shading

Oliensis' previous work on shape from shading
[Oliensis 92] has been extended in a number of
ways. First, while our earlier work usually
assumed that the illumination was from the
direction of the camera, the shape reconstruction
algorithms and convergence proofs have been
extended more recently to the case of illumination
from any direction [Oliensis 93a]. As before,
these algorithms are provably and monotonically
convergent, and (in many cases) can be shown to
converge to the correct surface. Moreover, it has
been shown that a whole family of algorithms
could be developed, and that all would give
equivalent surface reconstructions. This is
convenient since some of the algorithms are better
for theoretical analysis while others are more
efficient in practice. The uniqueness proofs for the
surface given the shaded image, and the corollary
that regularization is not necessary for shape from
shading, have also been extended.
Experimentation with these algorithms on
synthetic and real images shows that they are fast
and robust, taking less than 10 seconds on a
DECstation 5000 for a 200 x 200 real image.

These  algorithms  still  require  that  a  small  amount 
of  information  on  the  surface  be  provided,  namely: 
1)  a  list  of  those  singular  points  (the  brightest 
image  points)  corresponding  to  local  minima  of 
the  surface  height  (as  opposed  to  the  other 
possibilities  of  a  local  maximum  or  a  saddle 
point);  and  2)  the  heights  of  these  singular  points. 
However,  in  a  second  extension  of  previous  work 
[Oliensis  93b],  Oliensis  has  developed  a  new 
algorithm  that  is  capable  of  determining  this 
information  automatically,  and  thus  can 
reconstruct  a  general  surface  from  shading  with  no 
a priori information on the surface. In
experimental tests on complex synthetic images,
this algorithm has produced good surface
reconstructions over most of the image. For 128
x 128 images, the reconstruction takes less than 30
seconds on a DECstation 5000. Moreover, the
algorithm appears noise resistant, giving good
reconstructions even in the extreme case of an
added pixel noise of 10%. It appears that it will
also be possible to prove the convergence of this
algorithm to the correct surface in the limit of
perfect resolution.

All  algorithms  thus  far  have  assumed  that  the 
imaged  surface  was  matte.  Even  with  this 
restriction,  the  algorithms  are  potentially  useful  in 
controlled  industrial  or  research  applications.  At 
UMass these algorithms will be ported to the
robotics laboratory environment, and used in
combination with other means of shape sensing
and recovery to aid in research in grasping
partially modeled or unmodeled objects. Further extensions
include adapting the current algorithms to the
realistic case of a partially specular surface. With
this extension, shape from shading could become
practical for a variety of applications.

4. Image Understanding Architecture (IUA)
Overview 

Work on the IUA [Weems, 1993] has advanced in
three areas in the preceding year: compilers and
system software, hardware and architecture, and
applications and algorithms. The IUA is a tightly
coupled, heterogeneous parallel processor being
developed by UMass, Hughes Research Labs, and
Amerinex Artificial Intelligence (AAI) under
DARPA funding. It is intended to support real-time
knowledge-based vision applications and
research by providing three distinct parallel
processors in a single architecture: a fine-grained
SIMD/Multi-associative array for low-level vision,
a medium-grained SPMD array for intermediate-level
symbolic vision, and a coarse-grained
multiprocessor for high-level, knowledge-based
processing. A proof-of-concept prototype of the
IUA was constructed under a previous effort and
the current work is directed at developing a second
generation of the system with enhanced
performance and the ability to be fielded in the
DARPA Unmanned Ground Vehicles (UGV)
program.

4.1. IUA Compilers and System Software

AAI has completed development of the C++ class
library for the low level of the IUA. The class
library defines a set of image plane types upon
which parallel operations may be performed.
Work at AAI includes the incorporation of
additional optimization code into the GNU C++
compiler so that image planes are treated more like
first-class objects in C++. An automated test
system has also been developed for the machine's
microcode library, to facilitate regression testing
of new releases. For the intermediate-level
processor, basic operating system support,
multitasking, and messaging have been
implemented on a TMS320C30 Single Board
Computer (SBC), and recently these were
transported to another SBC with two TMS320C40
processors that are configured to simulate the
intermediate level of the IUA. A debugger has
also been implemented for the intermediate level.
Work is now under way to transport the
KBVision™ system to the IUA.

UMass has implemented a version of the Apply
language for the low-level processor of the second
generation IUA. The compiler generates code
compatible with the C++ class library. It permits
us to easily import image processing operations
written for the CMU Warp or Intel iWarp
machines.

4.2. IUA Hardware Status

The prototype IUA has been running at Hughes for
most of the last year. Under the prototype
development contract, only a very simple
controller was built to demonstrate the basic
functionality of the processor arrays. It was never
intended that the prototype controller be fully
programmable. However, Hughes and Amerinex
AI invested additional effort to develop software
that allows C++ code for the second generation to
execute on the prototype hardware. Because of the
nature of the controller, instructions can only be
issued at VME bus rates to the array, which is
significantly slower than the array can accept
them. However, it does permit demonstration of
the functionality of the array hardware on real
applications. Hughes has since implemented the
low-level portion of the DARPA IU Benchmark,
an SDI application, and an ATR application on the
prototype.

The custom chips used in the IUA have been
fabricated and are undergoing testing. Each chip
contains 256 bit-serial processors with on-chip
cache, and has roughly 600,000 transistors. The
system's array control unit, backplane, and chassis
have been built and tested. Processor and memory
boards are currently under construction. The I/O
subsystem for the machine has also been designed,
and will support the equivalent of 20 simultaneous
sensor inputs at 512 x 512 x 8-bit resolution with
automatic mapping onto the processor
virtualization scheme used for the low level, with
almost no latency. The I/O subsystem will also
support the selection of multiple regions of interest
from an image. Hughes has indicated that the first
machine should be assembled by the end of
February 1993.

Work has already begun on the analysis and
design for the third generation IUA. UMass has
developed a system for capturing traces of
programs written in the C++ class library as they
execute on an abstract parallel machine. The
traces are then fed to a simulation system that
models hardware architectures with different
features and parameters. The system allows us to
gather real performance data for different
architectural configurations, and to analyze the
data statistically. The performance data will then
be contrasted with cost estimates for the different
configurations to produce a specification for the
third generation IUA.

4.3. IUA Applications and Algorithms

The low-level processor of the IUA is a square
mesh of processing elements, augmented with a
second (reconfigurable) mesh, called the Coterie
Network. This network allows the mesh to be
partitioned, for example, into areas corresponding
to regions in an image. One particularly useful
operation is the ability to enumerate elements
within a partition or to summarize (reduce) the
information in a partition to a single value. The
parallel prefix operation is the general form of this
type of operation. It is especially desirable to be
able to carry out parallel prefix in all partitions at
once, i.e. to perform a multi-prefix operation
[Herbordt, 1992]. An algorithm has been
developed for multi-prefix that is significantly
faster than alternatives using general purpose
routing in the mesh.
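
As an illustration of what a multi-prefix operation computes (not of how the Coterie Network computes it), the following is a minimal sequential sketch in Python. The function name, the use of a running sum as the prefix operator, and the flat index order standing in for the enumeration order within a region are all assumptions of this sketch; the parallel algorithm of [Herbordt, 1992] is not reproduced here.

```python
def multi_prefix_sum(values, labels):
    """Inclusive prefix sum computed independently within each partition.

    values[i] is the datum held by element i; labels[i] names the partition
    (for example, an image region) that element i belongs to. Elements are
    visited in index order, which stands in for the enumeration order
    within a region. This is a sequential reference, not a parallel
    implementation.
    """
    running = {}   # partition label -> running total seen so far
    prefix = []
    for value, label in zip(values, labels):
        running[label] = running.get(label, 0) + value
        prefix.append(running[label])
    # After the pass, `running` holds the reduction (total) of each partition.
    return prefix, running


values = [3, 1, 4, 1, 5, 9, 2, 6]
labels = ["a", "a", "b", "a", "b", "b", "a", "b"]
prefix, totals = multi_prefix_sum(values, labels)
print(prefix)   # per-element prefix sums, restarted in each partition
print(totals)   # one reduced value per partition
```

Passing a value of 1 for every element turns the prefix into an enumeration of the elements within each partition, and the final running totals give the per-partition reduction described above.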

As recommended by the DARPA IU Benchmark
Workshop participants, much of the benchmark
[Weems, 1988, 19^] has been recoded as a set of
library  routines  which  are  called  by  the  core  of  the 
benchmark.  We  have  also  begun  developing  the 
second  level  benchmark,  which  will  incorporate 
tracking  of  moving  objects  over  a  sequence  of 
images.  The  goal  of  the  new  benchmark  is  to  test 
system  performance  over  a  longer  period  of  time 
so  that,  for  example,  caches  and  page  tables  will 
be  filled.  The  benchmark  will  also  explore  I/O 
and  real-time  capabilities  of  the  systems  under 
test,  and  involve  more  high-level  processing. 

UMass has developed a parallel algorithm for the
IUA that computes a dense depth map for a scene
from a pair of images taken by a moving sensor
[Dutta 93]. The algorithm has an average error of
about 8 percent in depth, as computed from
randomly sampling points corresponding to
objects in the scene with known distances from 21
to 76 feet from the camera. The experiments were
done with fairly large displacements (four feet of
forward motion between the images) so that a
large (41 x 41 pixel) search window was required
to establish correspondences, resulting in 1681
image-to-image correlations being performed. In
simulations of the second generation IUA, it was
determined that the execution time will be about
0.54 seconds, of which 0.53 seconds is taken up
solely by the correlations. We are thus looking
into approaches in which an estimate of the motion
is available or in which a series of frames with
smaller displacements can be used (allowing the
search window to be constrained).
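
The arithmetic behind the search-window cost can be sketched as follows. The helper names are hypothetical, the window is assumed to be square and centered on the pixel (so a 41-pixel side corresponds to displacements of up to 20 pixels), and the timing estimate assumes that correlation time is proportional to the number of correlations; none of these assumptions is stated in the report itself.

```python
def window_side(max_displacement):
    """Side length, in pixels, of a square search window that covers
    displacements of up to max_displacement pixels in each direction
    (assumes a window centered on the pixel)."""
    return 2 * max_displacement + 1


def correlation_count(side):
    """One image-to-image correlation per candidate offset in the window."""
    return side * side


def scaled_correlation_time(side, ref_side=41, ref_seconds=0.53):
    """Correlation time for a different window size, ASSUMING time is
    proportional to the number of correlations (an assumption of this
    sketch, anchored to the reported 0.53 seconds for a 41 x 41 window)."""
    return ref_seconds * correlation_count(side) / correlation_count(ref_side)


print(correlation_count(window_side(20)))       # 41 x 41 window
print(correlation_count(window_side(5)))        # 11 x 11 window
print(scaled_correlation_time(window_side(5)))  # estimated seconds, 11 x 11
```

Under these assumptions the number of correlations grows with the square of the maximum displacement, which is why constraining the search window is attractive.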

4.4.  Line  Extraction 

UMass has also developed a parallel algorithm for
extracting straight lines from an image. Using the
second generation IUA simulator, the algorithm
executes in as little as 31 milliseconds for images
that map to the array with a 1:1 virtualization ratio.
We are currently evaluating the quality of the
results, but a preliminary examination indicates
that the algorithm gives very consistent lines over
sequences of images, which is an important
attribute in the support of algorithms that use line
tracking.

5.  Image  Exploitation  under  RADIUS 

UMass is developing mechanisms for site model
acquisition, extension and refinement [Collins
93a] based on technology that has already proven
effective in the mobile robotics domain.
Automatically acquiring the initial 3D site models
from scratch is a challenging problem that will be
the focus of future research. Our current work
assumes that a partial model of the site is provided
a priori by the image analyst. Our model-based
refinement  and  extension  algorithms  are  then 
applied  to  automatically  correct  inaccuracies  in  the 
initial  site  models,  and  extend  them  to  include 
previously  unmodeled  cultural  features  (buildings, 
roads,  etc.)  based  on  information  from  new 
images. 

Rather  than  building  a  turn-key  system,  UMass 
will be delivering a set of modules for performing
specific  tasks  of  direct  benefit  to  the  image 
analyst.  The  following  is  a  list  of  the  early 
deliverable  modules  that  are  currently  being 
evaluated  on  the  model  board  test  imagery 
supplied  to  the  research  community. 

1.  Feature  Extraction  Module.  This  module 
condenses  the  vast  amount  of  information  in  each 
image  into  a  manageable  set  of  symbolic 
descriptions. Two straight line extraction
algorithms are being evaluated: the Burns
algorithm based on fitting planar patches to the
underlying image intensity surface, and the Boldt
algorithm for hierarchical geometric edge
grouping. Also under development are routines
for  extracting  curved  lines,  and  for  locating 
dihedral  and  trihedral  junctions  to  subpixel 
accuracy. 


2.  Model  Matching  Module.  Given  the 
approximate pose (location and orientation) of the
sensor,  a  partial  3D  wireframe  site  model,  and  a 
set  of  extracted  straight  lines,  the  best  match  of  the 
projected  3D  model  to  the  line  data  will  be  found 
using  a  novel  model  matching  algorithm  due  to 
Beveridge. 

3.  Model  Extension  Module.  Given  a  partial 
model  and  a  set  of  model-data  feature 
correspondences  over  multiple  images,  the  site 
model  will  be  extended  to  include  further 
unmodeled  features.  Two  techniques  are  being 
evaluated.  The  first  is  based  on  recovering  the 
camera pose using a robust pose estimation
technique  due  to  Kumar.  This  algorithm  is 
effective  even  when  significant  numbers  of  feature 
correspondences  are  in  error.  Using  the  computed 
pose  for  multiple  images  from  multiple 
viewpoints,  the  3D  positions  of  unmodelled 
features  are  found  by  triangulation.  A  second 
approach  is  based  on  direct  estimation  of  the  3D  to 
2D  projective  transformation  relating  model 
features  to  image  features.  The  benefits  of  this 
approach  are  that  multiframe  triangulation  can  still 
be  performed  without  first  solving  for  camera 
pose,  and  without  relying  on  accurate  knowledge 
of  the  internal  camera  parameters. 

4.  Vanishing  Point  Module.  Vanishing  point 
analysis  is  a  flexible  tool  for  geometric  reasoning 
in  cultural  domains.  Among  its  many  uses  are  the 
determination of 3D line and plane orientations,
refinement  of  extracted  linear  features  based  on 
convergence  constraints,  pose  estimation,  and 
camera  calibration.  An  efficient  vanishing  point 
detection  and  estimation  algorithm  due  to  Collins 
and  Weiss  is  being  evaluated. 
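
The geometric core of vanishing point analysis, as used by the Vanishing Point Module above, can be sketched with homogeneous coordinates: the line through two image points is their cross product, and two image lines meet at the cross product of the lines. This is a minimal two-line sketch with hypothetical function names; it is not the Collins and Weiss detection and estimation algorithm, which must cope with many noisy lines.

```python
def cross(a, b):
    """Cross product of two 3-vectors given as tuples."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def line_through(p, q):
    """Homogeneous image line through two image points (x, y)."""
    return cross((p[0], p[1], 1.0), (q[0], q[1], 1.0))


def intersection(l1, l2, eps=1e-12):
    """Intersection of two homogeneous lines, or None when the lines are
    parallel in the image (vanishing point at infinity)."""
    x, y, w = cross(l1, l2)
    if abs(w) < eps:
        return None
    return (x / w, y / w)


# Two image segments, assumed to be projections of parallel 3D lines.
edge_a = line_through((0.0, 0.0), (50.0, 25.0))
edge_b = line_through((0.0, 100.0), (50.0, 75.0))
print(intersection(edge_a, edge_b))   # estimated vanishing point
```

With more than two lines, a least-squares or statistical estimate over all pairwise constraints would be needed; that step is outside this sketch.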

In  addition  to  developing  new  techniques  for 
automatically  acquiring  initial  site  models,  new 
research  will  investigate  statistical  techniques  for 
applying  projective  invariants  to  the  modeling 
process  to  accurately  derive  structure  without 
explicit  camera  models  or  knowledge  of 
viewpoint.  Initial  experiments  in  this  direction 
have  yielded  promising  results.  Other  encouraging 
results  have  been  obtained  regarding  the  difficult 
problem  of  image  to  image  registration.  A 
technique  based  on  vanishing  point  analysis 
[Collins 93b] allows an oblique aerial view to be
rectified  (unwarped)  to  present  a  simulated  vertical 
view,  allowing  full  perspective  aerial  images  to  be 
registered  with  a  computationally  tractable  affine 
matching  approach. 
ABSTRACT

Our program in Image Understanding has maintained
a primary focus on issues in object recognition, especially
the problems of selection, indexing, saliency computation
and integration of visual cues, and a secondary focus on
navigation. We have also continued our work on the
computation and the use of low level visual cues such as
motion, stereo, color and texture, on analog VLSI circuits
and on learning.

1  Introduction 

Image Understanding research at the MIT AI Lab has
continued along a range of fronts, from low level processing,
such as stereo, motion, color and texture analysis,
through intermediate stages of integration of visual information,
to higher level tasks such as object recognition
and navigation. This report summarizes our main recent
accomplishments in these areas. As is usual in these reports,
we refer interested readers to other publications
for more details.

2  Object  Recognition 

Because it has been one of our central focal points, we
begin with our recent work in object recognition. In approaching
the problem of recognizing objects from noisy
images of cluttered scenes, we have found it convenient
to separate out several different aspects of the problem:

•  Selection: Given a large set of image features from
a cluttered scene, select (or group) subsets likely
to have come from single objects, and use a rank
ordering to place the most salient ones first.

•  Indexing: Given one of these image feature subsets,
select a small set of object models from the
library that are likely to match the data.

•  Matching: Given a data feature subset and an
object model, determine if there is a legal transformation
that would carry the model into a pose in
the image that is consistent with the data, possibly
by finding a matching between data and model features.
It is often useful to separate this stage into
two subproblems:

-  Hypothesize possible solutions, using minimal
model and image information.

-  Verify such hypotheses, using more detailed
information.

We  will  describe  our  recent  work  in  each  of  these  areas. 

2.1  Selection and Attention

We have argued for some time that robust and efficient
solutions to the selection (or grouping) problem are essential
to practical recognition systems. Earlier work,
using both formal analysis and experimental studies [22;
13; 27], has shown that the complexity of many approaches
to recognition is dramatically reduced if decent
selection is provided, and that the false positive/false
negative rates for such methods are also significantly
improved with good selection.

One advantage of focusing on the issue of selection for
recognition is that it provides constraints on the requirements
of early processing stages. For example, cues such
as color or texture are often considered from the viewpoint
of extracting object surface measurements, which
requires that one account for illumination and other
scene effects in inverting the image measurements to obtain
object parameters. If one simply wants to use these
cues to separate regions of an image likely to have come
from a single object, much less stringent requirements
are placed on the task, leading to simpler and hopefully
more robust algorithms.

Towards this end, Tanveer Syeda-Mahmood has recently
completed a Ph.D. thesis [46] that explores the
role of cues such as color and texture in selection for
recognition. She does this by developing and implementing
a computational model of visual attention, which
serves as a general purpose selection mechanism in a
recognition system.

The approach supports two modes of attentional behavior,
namely attracted attention and pay-attention
modes. The attracted attention mode of behavior is
spontaneous and is commonly exhibited by an unbiased
observer (i.e., with no a priori intentions) when some
object or some aspect of the scene attracts his/her attention,
while the latter is a more deliberate behavior
exhibited by an observer looking at a scene with a priori
goals (such as the task of recognizing an object, say) and
hence paying attention to only those objects/aspects of
a scene that are relevant to the goal.

Briefly, the model suggests that the scene represented
by the image be processed by a set of interacting feature
detectors that generate a hierarchy of maps, representing
features such as brightness, color, texture, depth, grouping
of edges, and others such as shape, size, symmetry,
etc. The feature maps are then processed by filters incorporating
strategies for selecting distinctive regions in
these maps. The choice of these strategies is guided by a
central control mechanism that combines top-down task
level and a priori information with the bottom-up information
derived from the features, to demonstrate either
mode of attentional behavior as desired. Finally, an arbiter
module housing another set of strategies selects the
most significant features across the feature maps, which
can then be used in an object recognition system.

A system implementing the computational model described
here was built using three features: color, texture,
and parallel-line-groups. The respective feature
maps were built, and the selection filters for finding distinctive
regions in these maps have been developed. In
addition, a version of the arbiter module to combine the
saliency information from the various features has been
built.

Because we are interested mainly in separating regions
likely to have come from a single object, we do not need
to exactly recover object parameters such as body color.
Rather, we can focus on methods that describe the color
image as consisting of perceptually different colored regions.
This can be done by focusing on the components
of a color signal that are most relevant to human categorization
of colors (e.g. saturation and brightness, but
not hue). Syeda has developed such a method of perceptual
categorization of a color-space, which supports
fast color region segmentation. A color saliency map was
then built which used a color saliency measure that emphasized
attributes that are also salient in human color
perception. The key point is that such a saliency measure
serves to highlight regions of interest for a recognition
system.

The texture feature map was generated by regarding
the image as being generated by a space-limited stationary
stochastic process. Here, the segmentation of the
textured image was obtained by a comparison of the linear
prediction spectra of adjacent windowed regions of
the image. Properties such as the relative distribution
of dark and bright blobs were then used to judge
the distinctiveness of a region. This was used to generate
the texture saliency map.

Lastly, the parallel-line-groups feature map highlighted
groups of closely-spaced parallel lines in an edge
image. It has been found that some texture information
can be modeled this way. For example, printed letters on
a surface (such as a bottle) appear as a bunch of closely
spaced parallel lines when passed through an edge detector.
Similarly, some types of wooden tables show this
type of texture in an image.

These feature maps can then be combined to isolate
regions of an image for analysis by a recognition system
(in our implementation this was a combination of a tree
search algorithm and a linear combinations approach).
The feature maps and their associated saliency maps can
be driven in a pay-attention mode, in which the color
and texture information in the model (extracted using
the feature maps described earlier) is used to build a description
of the object-model. This description was then
used to design strategies for the selection filters. This
involved developing new algorithms for finding instances
of regions in the image satisfying object-model color and
texture descriptions. Such regions are then passed to
the recognition system for analysis. Experimental results
show that the methods drastically reduce the complexity
of the recognition process by rejecting clutter from
consideration. The system can also be driven in attracted
attention mode, in which the most salient portions of the
scene are analyzed first, again reducing the complexity of
the recognition stage. An example is shown in Figure 1.


Figure 1: Example of attentional selection for recognition.
Top left: Image used to create the model. Middle left:
Scene. Bottom left: Selected salient color regions. Top center:
Model showing corner and line features. Middle center:
Edge image of scene, showing features. Bottom center:
Salient features. Top right and middle right: Matched features
found by algorithm. Bottom right: Alignment of model
with image.

The key results here are a framework for combining
sensory information to support recognition, the use of
an attention mechanism to select targets for recognition,
and novel methods for handling texture and color information.

2.2  Indexing

Even  if  we  can  isolate  key  regions  of  an  image,  we  still 
need  to  know  what  objects  may  be  present.  In  some 
tasks,  we  are  looking  only  for  a  specific  target,  or  a  small 
number  of  targets,  in  which  case  model-driven  selection 
can  be  used  to  isolate  regions  of  interest.  In  other  cases, 
we  need  to  consider  large  libraries  of  objects,  in  which 
case  we  need  some  means  of  using  selected  image  features 
to identify subsets of the library as candidate models.
Cues  such  as  color  and  texture,  discussed  above,  can  help 
with  this  object  indexing.  In  general,  however,  other 
information  is  also  needed. 




David Jacobs, as part of his recently completed Ph.D.
thesis [27], has shown that simple aspects of an object's
shape can often be used to efficiently index objects in
a library. Jacobs' method views the indexing problem
as asking: can one compactly represent the space of all
images that an object model can create, given the type
of projection model used? If one can, then to handle
the indexing problem, we can precompute each model's
manifold of possible images in an image space. At recognition
time, we can compute the associated representation
of the current image, use this to access image space,
and retrieve all models that could have caused it. The
key question is whether one can actually represent all
possible images of a model in an efficient way. Jacobs
has shown, perhaps surprisingly, that in several nontrivial
cases, one can.

The results are summarized as follows. We assume
that a 3D object is transformed by an arbitrary affine
transformation, followed by a scaled orthographic projection.
For the case of 3D points, this is equivalent to
applying an arbitrary 3 x 3 matrix to the points, then
translating them, then projecting them. To describe the
images that a model can produce under this class of
transformations, we first define image space, then determine
the shapes of the model manifolds. Jacobs argues
for using affine coordinates to represent image space. In
particular, if one selects three ordered point features to
establish a basis, then all other points can be written in
terms of coordinates with respect to this basis: that is, if
$q_1, q_2, \ldots, q_n$ denote the image points, and if

$$o = q_1, \quad u = q_2 - q_1, \quad v = q_3 - q_1$$

are the affine basis, then all other points can be represented
by coordinates $(\alpha_i, \beta_i)$ by

$$q_i = o + \alpha_i u + \beta_i v.$$

These coordinates are invariant to any affine transformation,
and hence an image is uniquely identified by

$$o, u, v, (\alpha_4, \beta_4), (\alpha_5, \beta_5), \ldots$$

It turns out that the first three vectors do not provide
any information about whether a scene could produce
this image, so we use only the $(\alpha_i, \beta_i)$ parameters to represent
an image. Thus, an image with $n$ ordered points
maps to a point in a $2(n - 3)$ dimensional space, which
can be divided into two orthogonal $n - 3$ dimensional
spaces, one for the $\alpha$ coordinates and one for the $\beta$ coordinates.
The advantage of doing this, as Jacobs has
shown, is that the set of all images that a model of $n$
ordered points could produce is simply a pair of parallel
lines, one in each space. In this case, indexing simply
says, given an image basis, compute the affine coordinates
of the points, then find all model lines that pass
through the associated point in $\alpha$-$\beta$ space. One must
allow for uncertainty in the measurements, but this can
be shown to be easily handled and simply expands the
image points to small regions in the $\alpha$ and $\beta$ spaces [21].
An  example  of  the  indexing  system  correctly  retrieving 
candidate  models  is  shown  in  Figure  2. 
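
The affine coordinates described above, and their invariance under an affine transformation of the image points, can be checked with a short Python sketch. The function names and the example transformation are hypothetical, and the sketch covers only the computation of the image-side coordinates, not the model lines or the index lookup of Jacobs' method.

```python
def affine_coords(points):
    """Affine coordinates (alpha_i, beta_i) of points[3:], using the first
    three points as the ordered basis: o = q1, u = q2 - q1, v = q3 - q1.
    Each remaining point satisfies q_i = o + alpha_i * u + beta_i * v."""
    (ox, oy), (x2, y2), (x3, y3) = points[:3]
    ux, uy = x2 - ox, y2 - oy
    vx, vy = x3 - ox, y3 - oy
    det = ux * vy - uy * vx
    if det == 0:
        raise ValueError("basis points are collinear")
    coords = []
    for x, y in points[3:]:
        dx, dy = x - ox, y - oy
        # Solve [u v] [alpha, beta]^T = q_i - o by Cramer's rule.
        alpha = (dx * vy - dy * vx) / det
        beta = (ux * dy - uy * dx) / det
        coords.append((alpha, beta))
    return coords


def apply_affine(points, a, b, c, d, tx, ty):
    """Apply the 2D affine map (x, y) -> (a*x + b*y + tx, c*x + d*y + ty)."""
    return [(a * x + b * y + tx, c * x + d * y + ty) for x, y in points]


image = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (2.0, 3.0), (0.5, 0.25)]
warped = apply_affine(image, 2.0, 1.0, -1.0, 3.0, 5.0, -2.0)
print(affine_coords(image))
print(affine_coords(warped))   # the same coordinates, up to rounding
```

Because the two sets of coordinates agree, a lookup keyed on them does not depend on which affine view of the points was observed, which is the property the indexing scheme relies on.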

Extensions of this approach to deal with other types
of features are discussed in an article by Jacobs in these
proceedings. We note as an aside that considering this
model of linear transformations has proved fruitful in
other problems, such as the linear combinations approach
to recognition [48] and in dealing with affine
structure from motion [28; 41].

The key result here is a very efficient method for handling
indexing for some classes of objects, as well as a
novel framework for investigating the interactions between
object structure and image projections.

2.3  Matching 

A central part of recognition, once we have found subsets
of image features of interest and sets of models
of interest, is to determine whether the image features
are in fact consistent interpretations of the model features.
Over the past several years, we have developed
a variety of different approaches to this problem, including
Interpretation Tree Search [22], Alignment [24;
25] and Linear Combinations [48]. Here we report on
some new alternatives to these approaches, as well as
improvements on these approaches.

2.3.1  Making  Alignment  Robust 

The  alignment  approach  to  recognition  [24;  25;  6]  pro¬ 
ceeds  by  matching  a  small  set  (typically  3)  of  image  fea¬ 
tures  to  model  features,  using  this  match  to  determine 
the  associated  transformation  of  the  object  (modeled  as 
a  weak  perspective  transformation),  and  projecting  the 
remaining  model  features  into  the  image  for  verification. 
In the original system, uncertainty in the image measurements
was dealt with in a somewhat ad hoc manner.
Recently [21] we have shown how to analyze the effects
of that uncertainty on the set of possible transformations.
Alter [1] has extended that work to supplement
alignment approaches with a verification stage that is
guaranteed to be correct. In particular, he shows that
using a bounded error model on the image features, one
can compute the range of image positions for all other
model features, both for planar and solid objects, and
for point and line features. This allows one to exactly
specify the range of image positions over which to search
for matching features, so that one will not miss any supporting
evidence, while at the same time keeping the
chances of false matches minimal. One can further extend
this approach by adding a Bayesian analysis of the
actual matching regions, so that one can determine the
likelihood of each verified match actually being correct.
This allows one to determine the best regions in which
to search for features, by determining those most likely
to contribute to a correct interpretation. An example of
the image search regions is shown in Figure 3. The method
has also been extended to line features,
and results show, as expected, that lines are considerably
more powerful as verification features than points.
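The align-then-verify loop described above can be illustrated with a minimal Python sketch. It is not the system of [1]: it assumes planar point features, replaces the weak perspective transformation with a 2D affine map fitted to three correspondences, and uses one fixed search radius in place of the propagated uncertainty regions. The function names and the sample data are illustrative assumptions.

```python
from math import hypot

def det3(a):
    """Determinant of a 3x3 matrix given as nested lists."""
    return (a[0][0] * (a[1][1] * a[2][2] - a[1][2] * a[2][1])
            - a[0][1] * (a[1][0] * a[2][2] - a[1][2] * a[2][0])
            + a[0][2] * (a[1][0] * a[2][1] - a[1][1] * a[2][0]))

def solve3(m, r):
    """Solve the 3x3 linear system m @ x = r by Cramer's rule."""
    d = det3(m)
    if abs(d) < 1e-12:
        raise ValueError("model triple is (nearly) collinear")
    out = []
    for k in range(3):
        mk = [row[:] for row in m]
        for i in range(3):
            mk[i][k] = r[i]          # replace column k by the right-hand side
        out.append(det3(mk) / d)
    return out

def affine_from_triple(model3, image3):
    """2D affine map taking three model points onto three image points."""
    m = [[x, y, 1.0] for x, y in model3]
    row_x = solve3(m, [u for u, _ in image3])   # u = a*x + b*y + c
    row_y = solve3(m, [v for _, v in image3])   # v = d*x + e*y + f
    return row_x, row_y

def project(t, pt):
    """Apply the affine map t to a model point."""
    (a, b, c), (d, e, f) = t
    x, y = pt
    return a * x + b * y + c, d * x + e * y + f

def verify(t, model, image, radius):
    """Count model points whose projection has an image point within radius."""
    hits = 0
    for mp in model:
        px, py = project(t, mp)
        if any(hypot(px - u, py - v) <= radius for u, v in image):
            hits += 1
    return hits

# Hypothetical data: the image is the model rotated by 90 degrees and
# translated by (10, 20), plus one clutter point.
model = [(0, 0), (4, 0), (0, 3), (4, 3), (2, 5)]
image = [(-y + 10.0, x + 20.0) for x, y in model] + [(50.0, 50.0)]
t = affine_from_triple(model[:3], image[:3])
score = verify(t, model, image, radius=0.5)
```

In a full system the fixed `radius` would be replaced by a per-feature search region derived from the error bounds on the three matched features, which is the point of the analysis described above.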

A related result concerns the models of sensor uncertainty
used both in the analysis of recognition methods
and in the derivation of verification and likelihood techniques.
Most of our earlier work has been based on a
bounded error model of sensor uncertainty. Karen Sarachik has
been working on the problem of estimating the effects of
sensor noise on the problem of model based object recognition,
for other classes of uncertainty. For the analysis
it is assumed that the location of a point feature in an
image is corrupted by noise, which is modeled as a two
dimensional Gaussian distribution, and the presence of
the model in the image is posed as a binary decision
problem. The positional uncertainties of the point features
are traced through the recognition algorithm, resulting
in analytic expressions for the confidence level of
the algorithm's decision. Until now the analysis has been
completed only for one algorithm, geometric hashing as
introduced by Wolfson, Lamdan, Schwartz and Hummel.
Using this technique it is possible to explicitly compare
the expected performance of different recognition algorithms
and noise models, a useful tool for the domain of
multi-sensor fusion.

2.3.2  Pose  Space  Analysis 

In earlier proceedings, we have reported on our work
in developing recognition algorithms that work directly
in the space of possible poses of an object, rather than in
the space of feature correspondences. In the ideal case of
perfect sensor data, one can simply search over all possible
pairings of model and image features, compute the
associated transformation and vote for that transformation
in pose space, a la Hough transforms. When uncertainty
is allowed in the measurements, however, one
must be more careful about voting for the entire volume
of transformations consistent with the pairing of a noisy
sensor measurement and a model feature, and this increases
the demand on searching fine tessellations of the
pose space.
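A minimal Python sketch of voting for a whole volume of transformations, rather than a single one, is given below. It is a simplification and not one of the systems reported here: the pose is restricted to a 2D translation, the uncertainty region is taken to be a square of half-width `eps`, and the grid spacing `cell`, the function name, and the data are assumptions made for illustration.

```python
from collections import Counter
from math import floor

def vote_translations(model, image, eps, cell):
    """Accumulate votes over a grid on 2D translation space.

    Each (model, image) pairing votes once for every grid cell that overlaps
    the square of half-width eps centred on the translation image - model.
    """
    votes = Counter()
    for mx, my in model:
        for ix, iy in image:
            tx, ty = ix - mx, iy - my
            gx_lo, gx_hi = floor((tx - eps) / cell), floor((tx + eps) / cell)
            gy_lo, gy_hi = floor((ty - eps) / cell), floor((ty + eps) / cell)
            for gx in range(gx_lo, gx_hi + 1):
                for gy in range(gy_lo, gy_hi + 1):
                    votes[(gx, gy)] += 1
    return votes

# Hypothetical data: the model shifted by (7, -3) with small bounded noise,
# plus two clutter points.
model = [(0.0, 0.0), (10.0, 0.0), (0.0, 6.0), (7.0, 9.0)]
noise = [(0.1, -0.1), (-0.2, 0.05), (0.0, 0.2), (0.15, 0.0)]
image = [(mx + 7.0 + nx, my - 3.0 + ny)
         for (mx, my), (nx, ny) in zip(model, noise)]
image += [(40.0, 40.0), (-20.0, 15.0)]

votes = vote_translations(model, image, eps=0.3, cell=0.5)
true_cell = (floor(7.0 / 0.5), floor(-3.0 / 0.5))   # cell containing (7, -3)
```

The cost of this scheme grows with the number of cells each pairing must touch, which is the pressure toward fine tessellations noted above.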

Todd Cass has recently completed a Ph.D. thesis that
presents an elegant way around this problem, by exploiting
the geometry of pose space directly. Cass [10]
has provided a formulation of the problem in which one
can develop a polynomial-time algorithm that guarantees
finding all feasible interpretations of the data, modulo
uncertainty, in terms of the model. The approach is
based  on  representing  the  model  and  the  sensory  data 
in  terms  of  local  geometric  features  such  as  vertices  and 
line  segments.  He  assumes  bounds  on  the  uncertainty 
in  the  position  or  orientation  of  the  data  features  due 
to  sensor  error.  He  then  shows  that  there  are  only  a 
polynomial  number  of  quantitatively  different  transfor¬ 
mations  that  align  the  model  and  the  data  modulo  error. 
Object  localization  is  accomplished  using  a  polynomial¬ 
time  search  through  the  set  of  all  model  transformations 
to  find  those  that  align  large  subsets  of  model  and  data 
features  within  the  uncertainty  bounds. 

Intuitively,  this  approach  can  be  considered  as  follows. 
For  each  pairing  of  a  data  and  model  feature,  there  is 
a  set  of  transformations  that  will  align  the  model  fea¬ 
tures  within  the  uncertainty  region  about  the  data  fea¬ 
ture.  This  set  of  transformations  carves  out  a  volume 
in  pose  space.  If  we  consider  all  pairings  of  data  and 
model  features,  we  get  a  set  of  such  volumes,  and  we 
are  interested  in  finding  points  in  the  pose  space  con¬ 
tained  within  the  intersection  of  a  large  number  of  such 
volumes. One could find such points by simply sampling
points in pose space at some fine spacing, a method used
earlier by Cass in implementing a very fast recognition
scheme on the Connection Machine. It turns out, however,
that one can efficiently find such volumes by decoupling
the search over the full pose space into a coupled
search over the translational components and a second
search over the rotational components. Moreover, one
can use the structure of these geometric arrangements to
find very efficient, polynomial-time algorithms for finding
the boundaries of these pose-space volumes. Cass
has extended his earlier work to allow for unknown scale
factors and unknown uncertainty values, and has used results
from computational geometry to provide efficient
algorithms for exploring pose space.

2.3.3  RAST  algorithm 

An alternative to Cass' approach to analyzing pose
spaces is the RAST algorithm (Recognition by Adaptive
Subdivision of Transformation Space; [8]). The RAST
algorithm solves bounded error recognition problems efficiently.

Bounded error recognition is one of the most commonly
used formulations of the visual object recognition
problem and has proven its utility in a number of
practical systems (for further references and related results,
see, for example, [4], [22]). The simplest form of
the bounded error recognition problem is the following:
given a set of model features (points in R^2 or R^3) and a
set of image features (points in R^2), find maximal subsets
of image and model features that can be brought
into correspondence under given error bounds using rigid
body transformations.

The recognition algorithm is based on the idea of carrying
out matching with variable sized error bounds: if,
for a given set of transformations, a good match cannot
be found for large error bounds, then matches with
smaller error bounds need not be examined. The RAST
algorithm uses particularly convenient representations
for sets of transformations that make it simple to implement,
efficient, and flexible in the kinds of features
and similarity measures that can be used with it.
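The pruning idea can be sketched in a few lines of Python. The sketch below is written in the spirit of adaptive subdivision and is not the implementation of [8]: it assumes point features, translations only, a max-norm error bound, and a best-first search order. All names and data are hypothetical.

```python
import heapq

def upper_bound(model, image, cx, cy, r, eps):
    """Number of model points that some translation in the box could match.

    The box is the square of half-width r centred on (cx, cy).  A model point
    can be matched within eps by some translation in the box only if an image
    point lies within eps + r (max-norm) of the point shifted by the centre.
    """
    slack = eps + r
    count = 0
    for mx, my in model:
        px, py = mx + cx, my + cy
        if any(abs(ix - px) <= slack and abs(iy - py) <= slack
               for ix, iy in image):
            count += 1
    return count

def rast_translation(model, image, eps, box, min_r=0.01):
    """Best-first adaptive subdivision of a square box of translations."""
    cx, cy, r = box
    heap = [(-upper_bound(model, image, cx, cy, r, eps), r, cx, cy)]
    while heap:
        neg_bound, r, cx, cy = heapq.heappop(heap)
        if r <= min_r:
            # No remaining box has a larger bound, so report this one.
            return (cx, cy), -neg_bound
        h = r / 2.0
        for dx in (-h, h):
            for dy in (-h, h):
                b = upper_bound(model, image, cx + dx, cy + dy, h, eps)
                if b > 0:   # boxes that cannot match anything are dropped
                    heapq.heappush(heap, (-b, h, cx + dx, cy + dy))
    return None, 0

# Hypothetical data: the model shifted by (7, -3) with bounded noise and clutter.
model = [(0.0, 0.0), (10.0, 0.0), (0.0, 6.0), (7.0, 9.0)]
noise = [(0.1, -0.1), (-0.2, 0.05), (0.0, 0.2), (0.15, 0.0)]
image = [(mx + 7.0 + nx, my - 3.0 + ny)
         for (mx, my), (nx, ny) in zip(model, noise)]
image += [(40.0, 40.0), (-20.0, 15.0)]

best_t, best_count = rast_translation(model, image, eps=0.3,
                                      box=(0.0, 0.0, 16.0))
```

A box is evaluated with the enlarged bound `eps + r`; a box that scores poorly even under that enlarged bound never gets subdivided, so its smaller-error matches are never examined. The returned count is exact up to the tolerance `min_r`.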

So far, we have applied the RAST algorithm to 2D
recognition problems that involve a very large number
(thousands) of very simple image features (short line
segments). In such applications, the RAST algorithm
is found to be faster than alternative methods (recognition
by alignment, Hough transform, search, or correlation).
Actual applications using the RAST algorithm
include the prototype of a view-based 3D object recognition
system and a system for handwritten optical character
recognition (OCR). Examples are shown in Figure 1.

2.3.4  View-Based  Representations 

For curved objects, both the visibility and the location
of object parts/features in the image vary in a complicated
way with the viewpoint. This has made the development
of efficient 3D object recognition algorithms
difficult.

In order to avoid these difficulties, many 3D object
recognition systems have heuristically used view-based
representations, i.e., representations that encode object
properties and shape from a large number of different
viewpoints.

However, using view-based representations only solves
the 3D object recognition problem approximately. In
order to understand the nature and significance of this
approximation, we have formalized the notion of view-based
representations and established error and complexity
bounds on the performance of recognition systems
that are based on view-based representations [9].

The theoretical results suggest that, for the purposes
of object recognition from 2D images, view-based
representations are good approximations to true 3D
shape representation. Furthermore, we have established
model-independent upper bounds on the number
of views needed in order to represent a model in a view-based
system. Finally, a complexity analysis suggests
that view-based recognition can be carried out more efficiently
than 3D shape-based recognition.

These  theoretical  results  are  supported  by  a  number 
of  simulations  and  experiments  on  real  images. 

2.3.5 Statistical Recognition

An  alternative  approach  to  recognition  is  to  treat  it 
as  a  problem  of  optimal  estimation.  Sandy  Wells  has 
recently  completed  a  Ph.D.  thesis  [53]  that  develops  and 
tests  a  framework  for  statistical  object  recognition. 

To  formalize  this,  let  the  image  that  is  to  be  analyzed 
be  represented  by  a  set  of  v-dimensional  point  features 

The model to be matched is also described by a set of
point features; these are represented by real matrices:

$$M = \{M_1, M_2, \ldots, M_m\}$$

To  solve  the  recognition  problem  we  need  to  find  a 
legal  match  between  image  features  and  model  features. 
Here,  legal  means  that  there  is  some  physically  meaning¬ 
ful  way  of  positioning  the  model  in  the  scene  so  that  it 
would  produce  image  features  similar  to  those  actually 
observed.  We  can  treat  this  as  an  optimization  problem, 
wherein we seek to estimate two sets of parameters: the
correspondences between image and model features, and
the pose of the model instance in the image. The correspondences
are described by an interpretation vector

$$\Gamma = [\Gamma_1 \; \Gamma_2 \; \ldots \; \Gamma_n], \qquad \Gamma_i \in M \cup \{\bot\}$$

Here $\Gamma_i = M_j$ means that image feature $i$ corresponds
to model feature $j$, and $\Gamma_i = \bot$ means that image feature
$i$ is due to the background.

The pose of the model instance in the image, $\beta$, is
a real vector. An associated projection function $P$ is
defined; $P$ maps model features into the v-dimensional image
according to the model's pose.

Our goal is to obtain estimates of the correspondences
and pose by maximizing the posterior density with respect
to $\Gamma$ and $\beta$, as follows:

$$\hat{\Gamma}, \hat{\beta} = \arg\max_{\Gamma, \beta} \; p(\Gamma, \beta \mid \Theta)$$

where $\Theta$ denotes the image features.

In other words, we want to find the assignment of
model features to image features, and the related pose
(position and orientation) of the object that optimally
accounts for the observed data. We can treat this as an
estimation problem, and use Bayes' rule to calculate the
a-posteriori probability density of $\Gamma$ and $\beta$:

where $C$ is a normalizing constant independent of $\Gamma$ and
$\beta$. Then we seek estimates for $\Gamma$ and $\beta$ that optimize
this objective function.
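For orientation, Bayes' rule applied to this estimation problem can be written out as below. This is a sketch under the notation used here, with $\Theta$ standing for the image features; identifying the normalizing constant with $p(\Theta)$ and placing it in the denominator are assumptions, since any factor independent of $\Gamma$ and $\beta$ plays the same role in the optimization.

```latex
p(\Gamma, \beta \mid \Theta)
  = \frac{p(\Theta \mid \Gamma, \beta)\; p(\Gamma, \beta)}{C},
\qquad C = p(\Theta)
```

Because $C$ does not depend on $\Gamma$ or $\beta$, maximizing the posterior is equivalent to maximizing the product of the sensor model $p(\Theta \mid \Gamma, \beta)$ and the prior $p(\Gamma, \beta)$.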

Note that we couple the effects of the object's pose directly
into the matching problem. There are, of course,
some sensor features that are not directly pose related,
such as the fractal dimension of a region. These features
can also be incorporated into the matching process, either
as priors on the correspondence, or as filters that
remove some correspondences directly.

To solve this optimization problem, we need several
things. First, we need to model the image features, $\Theta$. This can be done
by a careful physical modelling of the sensor, by taking
into consideration the effects of noise on the transduction
process, and providing careful models of the distribution
of uncertainty about the measured values. Such
models can be derived for widely varying sensors, other
than visual, and we are in the process of applying this
approach to LADAR and SAR sensors.

As  an  example,  one  simple  model  is  to  assume  that 
the  probability  density  function  on  features  is  uniform 
for features arising from the background, and is normally
distributed  about  their  predicted  locations  in  the  image 
for  matched  features.  Of  course,  this  is  a  simple  model. 
For  some  types  of  features,  we  have  more  explicit  mod¬ 
els  of  the  distribution  of  the  feature,  which  will  simply 
replace  the  variance  of  the  normal  distribution. 

Second,  we  need  to  model  the  probability  of  an  inter¬ 
pretation  (or  matching  of  features)  and  the  probability 
of  a  pose.  One  simple  method  is  the  following.  The 
probability  that  a  feature  belongs  to  the  background  is 
$B$; the remaining probability is uniformly distributed for
correspondences  to  the  m  model  features. 

Our  simple  model  also  assumes  that  prior  informa¬ 
tion  on  the  pose  is  a  normal  density.  With  this  choice 
for the form of the pose prior, the system is closed in the
sense that the resulting pose estimate will also be normal.
This is convenient for coarse-to-fine techniques (or
multi-resolution methods), in which we use coarse data
to  get  an  initial  estimate,  then  refine  this  by  focusing  on 
subportions  of  finer  resolution  data. 

If  little  is  known  about  the  pose  a-priori,  the  prior 
may  be  made  quite  broad.  This  is  expected  to  often 
be the case. Note, however, that better models would
incorporate additional information about the scene. For
example, if we know the parameters of the ground plane,
and the target is known to be in a stable position on that
plane, we should be able to incorporate this knowledge
into better priors on the pose parameters. For example,
if range information is also known, then only two of the
six parameters of pose are completely unknown. The
others can be constrained from such additional information,
thereby reducing the complexity of the search for a
match.

If we assume independence of the correspondences and
the pose (before the image is seen), the composite prior
is a straightforward product of the prior distribution on
pose and the probability distributions on matched and
unmatched features. Thus, in our simple example, we
can describe the probability of a pose and a correspondence
in terms of measurable quantities in the data.

Given this, we need efficient methods for finding optimal
estimates for the parameters of interest. We can
choose those estimates that maximize the a-posteriori
probability (MAP), by maximizing the posterior density
with respect to the matched features and the pose.
But to find such estimates, we need efficient methods for
searching the objective function.

To handle this search process, Wells has considered
several approaches including beam search through a tree
of interpretations [51], and posterior marginal pose estimation
[52]. The latter is motivated by the observation
that in tree searches of the objective function
of MAP model matching, hypotheses having "[illegible]"
matches scored poorly in the objective function. The
implication was that summing posterior probability over
all matches (at a specific pose) might provide a good
pose evaluator. This has proven to be the case in the
experiment described in [51].

The essence of posterior marginal pose estimation is to
choose the pose that maximizes the posterior probability
density of the pose, given an image:

$$\hat{\beta} = \arg\max_{\beta} \; p(\beta \mid \Theta)$$

The posterior probability density of the pose is computed
from the joint posterior probability on pose and match,
by taking the marginal over the possible matches:

$$p(\beta \mid \Theta) = \sum_{\Gamma} p(\Gamma, \beta \mid \Theta)$$

Using Bayes' rule, the posterior marginal may be isolated
as a function of the priors described above, and this
leads to a convenient objective function for optimization.
One can utilize the EM algorithm to provide an efficient
method for optimizing this objective function, thereby
leading to solutions to the pose problem. A more complete
description is given in the paper by Wells in these
proceedings.
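To make the role of the EM algorithm concrete, the following Python sketch iterates EM for the simple model described earlier: uniform density for background features, normal density about the predicted location for matched features, background probability `bg_prob`, and the remaining probability spread uniformly over the model features. It is an illustration under strong assumptions and not Wells' implementation: the pose is a 2D translation, the pose prior is taken to be flat, the covariance is isotropic with standard deviation `sigma`, and `area` (the image area defining the uniform density) and the data are hypothetical.

```python
from math import exp, pi

def em_translation(model, image, t0, sigma, bg_prob, area, iters=30):
    """EM iterations for a 2D translation under the simple mixture model."""
    tx, ty = t0
    m = len(model)
    norm = 1.0 / (2.0 * pi * sigma ** 2)       # Gaussian normalisation in 2D
    for _ in range(iters):
        sw = sx = sy = 0.0
        for ix, iy in image:
            # E-step: posterior weight of each candidate match for this
            # image feature, with the background as the competing cause.
            g = []
            for mx, my in model:
                dx, dy = ix - mx - tx, iy - my - ty
                g.append((1.0 - bg_prob) / m * norm
                         * exp(-(dx * dx + dy * dy) / (2.0 * sigma ** 2)))
            denom = bg_prob / area + sum(g)
            for (mx, my), gj in zip(model, g):
                w = gj / denom
                sw += w
                sx += w * (ix - mx)
                sy += w * (iy - my)
        if sw == 0.0:
            break                              # no feature supports any match
        # M-step: the weighted least-squares translation.
        tx, ty = sx / sw, sy / sw
    return tx, ty

# Hypothetical data: the model shifted by (7, -3) with small noise and clutter.
model = [(0.0, 0.0), (10.0, 0.0), (0.0, 6.0), (7.0, 9.0)]
noise = [(0.1, -0.1), (-0.2, 0.05), (0.0, 0.2), (0.15, 0.0)]
image = [(mx + 7.0 + nx, my - 3.0 + ny)
         for (mx, my), (nx, ny) in zip(model, noise)]
image += [(40.0, 40.0), (-20.0, 15.0)]

tx, ty = em_translation(model, image, t0=(6.0, -2.0), sigma=0.5,
                        bg_prob=0.5, area=100.0 * 100.0)
```

EM is a local method, so the starting pose `t0` must lie within a few `sigma` of the answer; this is one reason a coarse-to-fine schedule is attractive.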

Wells has applied the method to several cases, including
2D point features, 2D oriented range features and linear
3D projection models. An example of recognition from
visible image features is shown in Figure 5.

A second example shows the method applied to a very
different type of imagery. This work was done in conjunction
with Group 53 of Lincoln Labs, directed by
A. Gschwendtner. In this example, the features were
oriented-range features, consisting of fragments of image
edge contours, augmented with a vector pointing in the
direction normal to a range discontinuity, with length reflecting
the inverse range at the discontinuity. Two sets
of features were prepared, the "model features", and the
"image features".

The object model features were derived from a synthetic
range image of an M35 truck that was created
using the ray tracing program associated with the BRL
CAD Package [16]. The ray tracer was modified to produce
range images instead of shaded images. The synthetic
range image appears in the upper left of Figure 6.

In order to simulate a laser radar, the synthetic range
image described above was corrupted with simulated
laser radar sensor noise, using a sensor noise model
that is described by Shapiro, Reinhold, and Park [40].
In this noise model, measured ranges are either valid
or anomalous. Valid measurements are normally distributed,
and anomalous measurements are uniformly
distributed. The corrupted range image appears in Figure 6
on the right. To simulate post sensor processing,
the corrupted image was "restored" via a statistical
restoration method of Menon and Wells [31]. The
restored range image appears in the lower position of
Figure 6.

Oriented range features were extracted from the synthetic
range image, for use as model features, and from
the restored range image; the latter are called the noisy features.
The features were extracted from the range images
in the following manner. Range discontinuities were located
by thresholding neighboring pixels, yielding range
discontinuity curves. These curves were then segmented
into approximately 20-pixel-long segments via a process
of line segment approximation. The line segments (each
representing a fragment of a range discontinuity curve)
were then converted into oriented range features in the
following manner. The X and Y coordinates of the
feature were obtained from the mean of the pixel coordinates.
The normal vector to the pixels was obtained
via least-squares line fitting. The range to the feature
was estimated by taking the mean of the pixel ranges
on the near side of the discontinuity. This information
was packaged into an oriented-range feature. The model
features are shown in the fourth image of Figure 6. Each
line segment represents one oriented-range feature; the
ticks on the segments indicate the near side of the range
discontinuity. There are 113 such object features.
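The conversion of one segment into an oriented-range feature can be sketched as follows. The Python below is a hedged illustration, not the code used in the experiment: it fits the line by total least squares, takes the length of the normal vector to be exactly the inverse of the mean near-side range, and uses a hypothetical `near_hint` point to decide which way the normal points, since the sign convention is not stated here.

```python
from math import atan2, cos, sin

def oriented_range_feature(pixels, near_ranges, near_hint):
    """Build (position, scaled normal, range) from one segment's pixels."""
    n = len(pixels)
    # Feature position: mean of the pixel coordinates.
    cx = sum(x for x, _ in pixels) / n
    cy = sum(y for _, y in pixels) / n
    # Scatter terms for the least-squares line through the pixels.
    sxx = sum((x - cx) ** 2 for x, _ in pixels)
    syy = sum((y - cy) ** 2 for _, y in pixels)
    sxy = sum((x - cx) * (y - cy) for x, y in pixels)
    theta = 0.5 * atan2(2.0 * sxy, sxx - syy)   # direction of the fitted line
    nx, ny = -sin(theta), cos(theta)            # unit normal to that line
    # Assumption: orient the normal toward the hinted near side.
    if (near_hint[0] - cx) * nx + (near_hint[1] - cy) * ny < 0:
        nx, ny = -nx, -ny
    # Range: mean of the pixel ranges on the near side of the discontinuity.
    rng = sum(near_ranges) / len(near_ranges)
    return (cx, cy), (nx / rng, ny / rng), rng

# Hypothetical 20-pixel horizontal segment at y = 5, near side below it.
pixels = [(float(x), 5.0) for x in range(20)]
pos, vec, rng = oriented_range_feature(pixels, [50.0] * 20,
                                       near_hint=(10.0, 0.0))
```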

The noisy features, derived from the restored range
image, appear in the fifth image of Figure 6. There are 62
noisy features. Some features have been lost due to the
corruption and restoration of the range image. The set of
image features was prepared from the noisy features by
randomly deleting half of the features, transforming the
survivors according to a test pose, and adding sufficient
randomly generated features so that 1/8 of the features
are due to the object. The 248 image features appear in
the sixth image of Figure 6.

Using this data, the EM algorithm was run in a multi-resolution
manner, and Figure 7 shows the convergence
of the algorithm to the correct pose.

2.4  Projective  Structure  and  Recognition 

In classic projective geometry of 3D space, projective
structure is typically defined by three cross-ratios using
five reference points (tetrahedron of reference and a unit
point) [39; 32] or, equivalently, by a tetrad of homogeneous
coordinates. With such projective structure one
can reconstruct the scene up to an unknown projective
transformation in 3D projective space, or equivalently
the camera coordinate frame may undergo an affine transformation
in space followed by an arbitrary projective
transformation of the image plane (taking a "picture"
of the image). A projective framework does not make a
distinction between orthographic and perspective views
and does not require camera calibration (internal camera
parameters are folded into the affine transformation
of the camera coordinate frame).

Figure 6: Top row: Synthetic range image, noisy range image, and restored range image. Bottom row: Model features, noisy
features, and image features.

Our approach has several characteristics: first, our
definition of projective structure relies on four scene reference
points and the camera's center (instead of five
scene reference points), and is defined using a single
cross-ratio (rather than three), which means that it
is obtained by some invariant function of projective coordinates.
Second, the proposed projective invariant allows
us to reconstruct the scene up to an unknown 3D
projective transformation and, in addition, is particularly
useful for recognition; that is, reconstruction of
any third view becomes very simple in this framework.
Thirdly, we make use of the epipoles to compute the
projective invariant using only linear image-based computations
([illegible] propose other similar techniques for
using the epipoles in a linear reconstruction of homogeneous
or non-homogeneous coordinates). Finally, we do
not recover the camera transformation in the course of
reconstruction, but instead recover the projective transformations
of two faces on the tetrahedron of reference.
Taken together, projective structure can be computed
from two images, perspective or orthographic, using an
uncalibrated camera. The computation requires the correspondences
arising from four reference points and the
epipoles - the latter can be recovered by a number of
methods using six to eight corresponding points [illegible].

Key results here are that the structure invariant is
useful for 3D reconstruction up to an unknown 3D projective
transformation of the object, and for purposes
of recognition via alignment by creating an equivalence
class for different views of the same object. In other
words, we can "re-project" an object onto any novel view
given two model views in full correspondence and a small
number of corresponding points between the novel view
and the model views.

3  Algebraic  Functions  for  3D  Visual 
Recognition 

The geometric relation between different views of a 3D
object can be used to recover the change in viewing
geometry across views and to recover the object's 3D
structure. The work on projective structure demonstrates
that in a projective framework one can define
and recover a structure invariant that is sufficient for
uniquely reconstructing the object with respect to a
frame of reference consisting of four scene points and the
camera's center. The location of the reference frame in
space is generally unknown, and therefore the object's
structure can be reconstructed up to an unknown 3D
projective transformation in space. This allows us to
work in a framework that does not make a distinction
between orthographic and perspective views and does
not require internal camera calibration.

The  geometric  relation  between  objects  and  their 
views  can  also  be  used  for  purposes  of  recognition.  In 
this  case  one  is  generally  not  interested  in  recovering  ob¬ 
ject  structure  from  multiple  views  but  instead  in  being 
able  to  predict  the  appearance  of  a  novel  view  from  a 
small  number  of  example  views  (“model”  views)  given  a 
small  number  of  corresponding  points  between  the  novel 
view  and  the  model  views.  The  projective  structure  in¬ 
variant  can  also  be  used  for  this  purpose  (see  Figure  ??) 
but it is more desirable to achieve the same result directly
without going through the computation of structure
(metric or non-metric) and without the reconstruction
of camera geometry (transformation parameters or
epipolar geometry).

In what is still ongoing research, we derive algebraic
relations between image coordinates across three
views (two model views and a novel view). We show
here three results: first, the general result is that image
coordinates across three views (perspective or orthographic)
are related by a small number of third-order
algebraic functions each having 11 parameters
that can be recovered by linear methods. Second, if
the  two  model  views  are  known  to  be  orthographic, 
then  the  algebraic  functions  reduce  to  second-order  ones 
with  only  7  parameters.  Thirdly,  if  all  three  views 
are known to be orthographic, then the functions reduce
further to first-order ones with only 4 parameters.
The latter is identical to the result derived by [48;
illegible], known as "the linear combination of views", and thus
the  first  two  results  can  be  viewed  as  an  extension  of  the 
linear  combination  of  views  to  perspective. 

In  a  projective  framework,  five  reference  points  (a 
tetrahedron  and  a  unit  point)  are  used  for  construct¬ 
ing  a  coordinate  system  of  3D  space  ([49],  for  example). 
A projection of a point P onto a plane with respect to
an arbitrarily positioned center of projection (COP) and
arbitrarily positioned image plane can be achieved by
first mapping the reference frame such that one of the
tetrahedron's vertices is aligned with the desired location
of the COP, and three other vertices are coplanar
with the desired image plane (in projective space five
points in general position can be uniquely mapped onto
any other configuration of five points in general position).
The point P is then projected onto the face of
the tetrahedron opposite to the COP (in homogeneous
coordinate representation of space, this is achieved by
an orthographic projection, see Figure 9). If we assume
that there exists at least one configuration of the reference
frame in which three of the reference points do
not intersect any of the scene points, then it is not difficult
to show (but not shown here) that the space of
images we can get out of this framework is no more
than perspective and orthographic images of the scene,
and images of images of the scene, produced by a pinhole
camera in which the camera's coordinate frame is
allowed to undergo arbitrary affine transformations in
space (rather than similarity transforms used in metric
structure-from-motion models).


Let the affine coordinates of a point P be taken
with respect to four of the reference points. If the fifth
reference point (taken to be the COP) is not at infinity,
then the observed image coordinates (x, y) can be described
by an affine change of coordinates followed by a
2D projective transformation:


for some fixed matrix $A_1$ and vector $v$ and some scale
factor $z$ (in a metric framework $z$ would correspond to
"depth"). In the case of an orthographic projection
(COP at infinity and only 2D affine transformations of
the image are allowed), we have:


for some 2D affine transformation $B_1$ (third row is
(0,0,1)) and an ideal vector (third coordinate
is zero) [28; 41]. We can use these two equations to
describe  the  transformation  between  image  coordinates 
in  two  views  across  four  cases:  two  perspective  views, 
two  orthographic  views,  a  perspective  to  orthographic 
case,  and  an  orthographic  to  perspective  case.  This  is 
described  below: 


$$\rho' p' = \rho A p + v \qquad (1)$$

where $\rho = z$, $\rho' = z'$, and $A$, $v$ are general for the
perspective-to-perspective case; $\rho = \rho' =$ [illegible], $A$ is a 2D
affine transformation and $v_3 = 0$ for the orthographic-to-orthographic
case; $\rho$, $\rho'$ take the values [illegible], the third row of $A$ is
(0,0,0) and $v_3 = 1$ for the perspective-to-orthographic
case; $\rho$, $\rho'$ take the values [illegible], and $A$, $v$ are general for the
orthographic-to-perspective case. Similarly, the image
coordinates $(x'', y'')$ of a third view satisfy the following
relation to the first view:

$$\rho'' p'' = \rho B p + u \qquad (2)$$

Note  that  p  remains  fixed  regardless  whether  the  third 
view  is  perspective  or  orthographic.  The  algebraic  func¬ 
tions  of  image  coordinates  across  three  views  can  be  de¬ 
rived  by  first  eliminating  p'.p"  and  then  isolating  p. 

I'l  —  x’ct  uj  —  y'v^  jl'i'i  —  j'r. 

^  x'aj  ■  p  —  a\  ■  p  y'a^  ■  p  —  ■  p  x'aj  ■  p  —  y‘ti\  ;> 

where  01.03,03  are  the  row  vectors  of  .4.  and  p  - 
(x.y.  1),  p'  =  (x'.y',  1).  Similarly,  from  e(|uation  2  we 
obtain 

t!  n  n  n 

U|  —  X  li3  1*2  —  .V  U.3  y  Ul  —  X  U2 

i"bi  •  p  -  fci  ■  p  y"b-\  p  -  62  •  p  i"b2  p  -  y"hi  p 

where  5), 63. 63  are  the  row  vectors  of  B.  ami  />"  = 
{x".y",  1).  These  two  equations  lead  to  nine  alge¬ 
braic  functions  of  image  coordinates  across  three  view> 
For example, the first two terms lead to the function
$F(x'', x', x, y) = 0$ of the form:

$x''(v_1 b_3 \cdot p - u_3 a_1 \cdot p) + x'' x'(u_3 a_3 \cdot p - v_3 b_3 \cdot p) + x'(v_3 b_1 \cdot p - u_1 a_3 \cdot p) + (u_1 a_1 \cdot p - v_1 b_1 \cdot p) = 0. \qquad (3)$

Note that the bracketed terms are first-order polynomials
of x and y with fixed coefficients (depending only on
parameters of camera transformations). In other words,
Equation 3 is a third-order algebraic function of the form:

$x''(\alpha_1 x + \alpha_2 y + \alpha_3) + x'' x'(\alpha_4 x + \alpha_5 y + \alpha_6) + x'(\alpha_7 x + \alpha_8 y + \alpha_9) + \alpha_{10} x + \alpha_{11} y + \alpha_{12} = 0, \qquad (4)$

where the coefficients $\alpha_j$, $j = 1, \ldots, 12$ are fixed constants.
Note that the constants can be recovered by
linear methods by observing 11 corresponding points
across the three views (more than 11 points can be
used for a least-squares solution). Then, for any additional
point (x, y) whose correspondence in the second
image is known (x', y'), we can recover the corresponding
x-component x'' in the third view by substitution
in Equation 4. In a similar fashion we can recover the
y-component y'' by using one of the other functions, for
example:

$y''(\beta_1 x + \beta_2 y + \beta_3) + y'' y'(\beta_4 x + \beta_5 y + \beta_6) + y'(\beta_7 x + \beta_8 y + \beta_9) + \beta_{10} x + \beta_{11} y + \beta_{12} = 0. \qquad (5)$

The solution for x'', y'' is unique provided that $v_1$, $v_3$ do
not vanish simultaneously, or $u_1$, $u_3$ do not vanish simultaneously.
These singular cases apply only to the two
functions above, and one can show that from the nine
functions we can always find two that are not singular
under any viewing transformation that takes place
between the three views. The process of generating a
novel view can be easily accomplished without the need
to explicitly recover structure, camera transformation or
epipolar geometry, at the price of using more than
the minimal eight points that are required by less direct
methods.
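
As a rough illustration of the procedure just described, the following Python sketch (using NumPy) recovers the twelve coefficients of Equation 4 up to scale from corresponding points and then predicts x'' for a new point by substitution. The function names and the use of an SVD as the linear solver are assumptions made for illustration; the text states only that linear methods are used.

```python
import numpy as np


def design_row(x, y, xp, xpp):
    """One row of the homogeneous system for Equation 4: row @ alpha = 0."""
    q = np.array([x, y, 1.0])
    return np.concatenate([xpp * q, xpp * xp * q, xp * q, q])


def fit_coefficients(pts, xps, xpps):
    """Estimate alpha_1..alpha_12 (up to scale) from >= 11 correspondences.

    pts  : (n, 2) array of (x, y) in the first view
    xps  : (n,) x' coordinates in the second view
    xpps : (n,) x'' coordinates in the third view
    """
    rows = [design_row(x, y, xp, xpp)
            for (x, y), xp, xpp in zip(pts, xps, xpps)]
    # The right singular vector with the smallest singular value is the
    # least-squares solution of rows @ alpha = 0 subject to |alpha| = 1.
    _, _, vt = np.linalg.svd(np.asarray(rows))
    return vt[-1]


def predict_xpp(alpha, x, y, xp):
    """Solve Equation 4 for x'' given (x, y) in view one and x' in view two."""
    q = np.array([x, y, 1.0])
    denominator = alpha[0:3] @ q + xp * (alpha[3:6] @ q)
    numerator = xp * (alpha[6:9] @ q) + alpha[9:12] @ q
    return -numerator / denominator
```

The same two functions, applied to y' and y'' with Equation 5, would give the y-component; the prediction is undefined where the denominator vanishes, which corresponds to the singular cases mentioned above.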

The algebraic functions derived so far are general
in the sense that the scene is allowed to undergo a general
3D projective transformation in space. Reduced
lower order functions can be derived under more restricted
situations. For example, the third order component
of these functions vanishes when $v_3 = u_3 = 0$
(see Equation 3). This situation arises, for example,
when the views are taken by a camera moving along
a base line perpendicular to the optical axis. One observes,
as a result, that this situation is intrinsically
more stable (errors in correspondence multiply to a
second-order rather than to a third-order) than the general
case, an observation experimentally made by [7;
5].

Other results can be obtained by assuming that some
of the views are orthographic. This is especially important
in the context of achieving recognition via alignment:
since the two model views are taken only once
(and offline), we may as well use a tele-lens for producing
orthographic views. In this case we substitute $v_3 = 0$
and $a_3 \cdot p = 1$ in Equation 3 and obtain a second-order
function with only 7 free parameters which has the form:

$x''(\alpha_1 x + \alpha_2 y + \alpha_3) + \alpha_4 x'' x' + \alpha_5 x' + \alpha_6 x + \alpha_7 y + \alpha_8 = 0, \qquad (6)$

for some fixed constants $\alpha_j$, $j = 1, \ldots, 8$. As a result, we
can generate novel views (perspective and orthographic)
by observing only 7 corresponding points across the three
views. This result is both direct (avoiding structure and
motion) and requires fewer than eight points (the minimum
under general conditions using linear methods). For instance,
with other tools we do not have an easy way of
making use of the fact that the two model views are orthographic,
because the determination of the epipoles
and epipolar geometry between a perspective and an orthographic
view still requires eight points in general.

Finally, it is easy to see what happens when all three
views are orthographic. In this case we have also $u_3 = 0$
and $b_3 \cdot p = 1$, and thus Equation 3 reduces to a first-order
function, with only 4 free parameters, of the form:

$\alpha_1 x'' + \alpha_2 x' + \alpha_3 x + \alpha_4 y + \alpha_5 = 0, \qquad (7)$

for some fixed constants $\alpha_j$, $j = 1, \ldots, 5$. This is identical
to the “linear combination of views” result [48;
33], stating that under the orthographic assumption an
arbitrary view can be generated by applying certain linear
combinations to the image coordinates of two model
views.
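
The all-orthographic case can be sketched in a few lines of Python with NumPy. The sketch below assumes that the coefficient of x'' in Equation 7 is non-zero, so that x'' can be written directly as a linear combination of x', x, y and a constant; the function names are hypothetical and the least-squares solver is an illustrative choice.

```python
import numpy as np


def fit_linear_combination(pts, xps, xpps):
    """Fit x'' = c1*x' + c2*x + c3*y + c4 from >= 4 corresponding points.

    pts  : (n, 2) array of (x, y) in the first model view
    xps  : (n,) x' coordinates in the second model view
    xpps : (n,) x'' coordinates in the novel view
    """
    pts = np.asarray(pts, dtype=float)
    design = np.column_stack([xps, pts[:, 0], pts[:, 1], np.ones(len(pts))])
    coeffs, *_ = np.linalg.lstsq(design, np.asarray(xpps, dtype=float), rcond=None)
    return coeffs


def predict_xpp_orthographic(coeffs, x, y, xp):
    """Predict x'' for a further point from its coordinates in the two model views."""
    return coeffs @ np.array([xp, x, y, 1.0])
```

With more than four points the same call returns the least-squares fit, which is how noisy correspondences would normally be handled.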

To summarize, we have shown that it is possible to
represent views as a function of image coordinates of two
other views. In the general projective case, the image
coordinates of three views are connected via third-order
algebraic functions with 11 free parameters. More restrictive
cases (but applicable in the context of visual recognition)
reduce these functions to second-order with 7 free
parameters and to first-order with 4 free parameters, depending
on whether two or all the views are assumed
to be orthographic. The immediate applications of these
results are in the context of visual recognition via alignment
(especially the 7-point result), but other applications
may also be possible. For example, the general result
(Equation 4) may be useful in the context of model-based
image compression. In this case the number of
corresponding points required for reconstructing novel
views is not of critical importance whereas robustness
and simplicity are more of a concern. The 22 parameters
required for reconstructing a novel view can be recovered
by many points in a least-squares fashion, but
the receiver eventually requires only the parameters and
not the corresponding points.

3.1 Recognition and Structure from one 2D
Model View

According  to  the  1.5  views  theorem  [33:  fi]  recogni¬ 
tion  of  a  specific  3D  object  (defined  in  terms  of  pointwise 
features)  from  a  novel  2D  view  can  be  achieved  from  at 
least two 2D model views (in the database, for each
object, for orthographic projection). Poggio and Vetter
studied how recognition can be achieved from a single
2D model view. The basic idea is to exploit transformations
that are specific for the object class corresponding
to the object - and that may be known a priori or may
be learned from views of other “prototypical” objects of
the same class - to generate new model views from the
only one available. Their work divides into two distinct
parts. In the first part, they discuss how to exploit prior
knowledge of an object’s symmetry. They prove that for
any bilaterally symmetric 3D object one non-accidental
2D model view is sufficient for recognition. They also
prove that for bilaterally symmetric objects the correspondence
of four points between two views determines
the correspondence of all other points. Symmetries of
higher order allow the recovery of structure from one
2D view. In the second part of their work, Poggio and
Vetter study a very simple type of object class called
linear object classes. Linear transformations can be
learned exactly from a small set of examples in the case
of linear object classes and used to produce new views of
an object from a single view. More recently Vetter, Poggio
and Buelthoff have provided psychophysical support
for the hypothesis that the human visual system exploits
symmetry of 3D objects for better generalization from a
few views.

3.2  Face  Recognition:  Features  versus 
Templates 

Over the last twenty years several different techniques
have been proposed for computer recognition of human
faces. Poggio in collaboration with R. Brunelli at IRST
compared  two  simple  but  general  strategies  on  a  common 
database  (frontal  images  of  faces  of  47  people,  26  males 
and  21  females,  four  images  per  person).  They  have  de¬ 
veloped  and  implemented  two  new  algorithms,  the  first 
one  based  on  the  computation  of  a  set  of  geometrical 
features,  such  as  nose  width  and  length,  mouth  position 
and  chin  shape,  and  the  second  one  based  on  almost- 
grey-level  template  matching.  The  results  obtained  on 
the  testing  sets,  about  90%  correct  recognition  using  geo¬ 
metrical  features  and  perfect  recognition  using  template 
matching,  favour  their  implementation  of  the  template 
matching  approach.  Present  work  aims  to  extend  the 
system  to  deal  with  arbitrary  poses  and  expressions  of 
the  face. 

4  Navigation 

Complementary to our work on object recognition, we
have also investigated issues and methods in navigation.
One such method has built directly on our earlier work
in recognition by Linear Combinations, and is reported
in a separate article by Basri in these proceedings.

A  second  approach  to  navigation  has  been  part  of  a 
broader  research  project,  executed  by  Ian  Horswill.  The 
problem  under  consideration  is  the  development  of  sim¬ 
ple  real-time  vision  algorithms  suitable  for  low-cost  com¬ 
puter  systems  such  as  personal  computers.  The  specific 
goals  are  to  develop  (a)  simple  vision  algorithms  useful 
for  problems  such  as  robot  navigation  and  interacting 
with  people,  (b)  a  theoretical  framework  for  analyzing 
such  systems,  and  (c)  a  concrete  implementation  demon¬ 
strating  the  algorithms  in  a  robot  which  gives  primitive 
“tours”  of  the  seventh  floor  of  the  laboratory. 

This  follows  from  the  motivation  of  making  vision 
cheap  as  a  necessary  part  of  making  it  useful.  For  vi¬ 
sion  technology  to  be  used  routinely  in  construction  and 
manufacturing  equipment,  consumer  electronics  prod¬ 
ucts,  automobiles,  etc.,  both  design  costs  and  unit  costs 
must be brought down to levels comparable with the
product into which they are to be incorporated. Million
dollar supercomputers, or even twenty thousand
dollar workstations, are simply inappropriate for mass-produced
products intended to cost less than the computer.

Horswill’s  approach  is  one  of  situated  agents,  whereby 
vision  systems  can  be  made  much  simpler  and  cheaper 
by  specializing  them  to  a  specific  task  and  environment. 
A  task-specific  system  need  only  extract  the  specific  in¬ 
formation  needed  to  perform  the  task.  As  well,  a  task 
provides  performance  constraints  that  can  simplify  the 
design  process  by  allowing  the  use  of  approximate  so¬ 
lutions  which  might  not  be  appropriate  for  all  tasks.  A 
system  specialized  to  its  environment  can  take  advantage 
of  domain  knowledge  which  can  simplify  computational 
problems.  For  example,  a  complete  stereo  system  can 
sometimes  be  replaced  by  a  system  which  uses  height  in 
the  image  plane  to  determine  rough  distance  from  the 
agent. 
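
To make the last example concrete, here is a minimal Python sketch of distance from image height. It is not taken from Horswill's system: it assumes a pinhole camera with a horizontal optical axis at a known height above a flat floor, image rows numbered downward, and a pixel known to lie on the floor. All parameter names are hypothetical.

```python
def ground_distance(row, horizon_row, focal_length_px, camera_height):
    """Rough distance along the floor to the point imaged at `row`.

    Under the assumptions above, similar triangles give
        distance = focal_length_px * camera_height / (row - horizon_row).
    """
    offset = row - horizon_row
    if offset <= 0:
        raise ValueError("pixel is at or above the horizon, so it is not on the floor")
    return focal_length_px * camera_height / offset
```

Pixels lower in the image (larger `row`) map to shorter distances, which is the only property a rough obstacle detector needs.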

A  critical  problem  in  developing  these  systems  is  the 
reusability  of  components.  It  is  important  that  we  be 
able  to  apply  experience  gained  in  designing  one  system 
to  the  design  of  other  systems.  For  this  reason,  tasks 
and  environments  must  be  analyzed  at  a  theoretical  level 
so  as  to  make  explicit  the  ways  in  which  they  simplify 
computational  problems. 

Low  cost  task-oriented  vision  systems  could  signifi¬ 
cantly  improve  the  performance  and  flexibility  of  au¬ 
tonomous  systems  and  reduce  their  cost.  Such  systems 
have  applications  in  transportation,  surveillance,  con¬ 
struction,  manufacturing,  space  applications,  and  haz¬ 
ardous  operations.  Low  cost  vision  technology  could  also 
be  extremely  valuable  in  consumer  electronics.  Track¬ 
ing  systems  could  be  incorporated  into  camcorders  or 
surveillance  systems.  Face  recognition,  person  detection, 
stereo and motion algorithms could be incorporated into
intelligent  nightscopes  and  binoculars.  Low  unit  costs 
would  be  particularly  important  in  this  area. 

To  date,  we  have  developed  systems  for  tracking  and 
following  moving  objects,  detecting  obstacles,  proximity 
detection using stereo (see Figures 10 and 11), following
corridors, and recognizing nods and shakes of the head.
All  systems  run  in  real  time  on  inexpensive  computers 
such  as  Macintoshes.  All  systems  use  very  low  resolution 
processing  (64  x  48  or  less).  A  number  of  optimizations 
are  shared  between  two  or  more  systems,  suggesting  that 
some  amount  of  recycling  of  design  time  is  possible. 

The  particular  prototype  system  is  an  indoor  naviga¬ 
tion  system  for  a  mobile  robot  that  runs  at  1.5  frames 
per  second  on  stock  hardware.  A  computer  equivalent 
to the one on board the robot can be purchased for $3-4K.
At present, the robot is capable of following corridors,
avoiding obstacles, recognizing corridor junctions,
navigating from point to point, and detecting the presence
of people. The corridor follower is extremely well
tested, having seen hundreds of hours of service. The
other capabilities are newer and have not yet been fully
evaluated. An article by Horswill in these proceedings
further describes the approach.
5 Early Vision Modules

Although  a  primary  focus  has  been  on  recognition  and 
navigation,  we  continue  to  develop  methods  for  early 
processing  of  visual  information. 

5.1 Optical Flow from 1D Correlation:
Application to a simple Time-To-Crash Detector

Ancona (CSATA, Italy) and Poggio have shown that a
new technique exploiting 1D correlation of 2D or even
1D patches between successive frames may be sufficient
to compute a satisfactory estimation of the optical flow
field. The algorithm is well-suited to VLSI implementations
and a patent application is being filed by MIT
and CSATA. The sparse measurements provided by the
technique can be used to compute qualitative properties
of the flow for a number of different visual tasks. In
particular, they also showed how to combine the
1D correlation technique with a scheme for detecting expansion
or rotation [37] in a simple algorithm which also
suggests interesting biological implications. The algorithm
provides a rough estimate of time-to-crash. It was
tested with good results on real image sequences.

5.2 Sensor Integration

Clay Thompson has recently completed a Ph.D. thesis
that considers the problem of fusing two computer vision
methods, using variational methods. The example algorithms
solve the photo-topography problem; that is, the
algorithms  seek  to  determine  planet  topography  given 
two  images  taken  from  two  different  locations  with  two 
different  lighting  conditions.  The  algorithms  each  em¬ 
ploy  a  single  cost  function  that  combines  the  computer 
vision  methods  of  shape-from-shading  and  stereo  in  dif¬ 
ferent  ways.  The  algorithms  are  closely  coupled  and  take 
into  account  all  the  constraints  of  the  photo-topography 
problem.  One  such  algorithm,  the  r-only  algorithm,  can 
accurately  and  robustly  estimate  the  height  of  a  surface 
from  two  given  images. 

5.3  Simple  Region  Features 

Feature-based  methods  for  recovering  the  motion  of  a 
camera  from  a  sequence  of  images  have  suffered  from  the 
inability  of  the  early  vision  processes  to  provide  dense, 
robust  features.  Typically,  the  features,  such  as  Canny 
edges,  are  very  sparse  in  each  image.  Furthermore,  most 
features are unreliable in the sense that they often disappear
from one frame to the next. Similarly, sporadic
features often appear that are not associated with any
object in the scene. To address these problems, Ron
Chaney has developed a framework for early vision processing
that leads to a dense, robust set of features. The
early vision framework is based on the interpretation of
the Laplacian of Gaussian (LoG) filter as a matched filter
for features of a particular size as well as an edge locator.
An object in the image that has roughly the optimum
width associated with a particular LoG filter will typically
be nearly surrounded by the zero-crossings of the
LoG filter. Hence, a naive approach would be to take
the regions bounded by the zero-crossings of the LoG
filter as the set of features. Of course, the regions associated
with nearby objects in the image tend to merge
or blend together. To alleviate this problem, we introduce
a stable, robust decomposition of such regions into
their salient subparts. These subregions, called simple
region features, serve as the feature set for higher level
processing. The decomposition is based on the medial
axis skeleton of the region. Each subregion corresponds
to a portion of a branch of the skeleton; each branch is
divided at positions where the distance from the skeleton
to the bounding contour is minimized. To facilitate the
computation of the decomposition, a novel scale-space
is introduced for contours and the medial axis skeleton.
The scale-space is parametric with the complexity of
the contour or the skeleton. The complexity measure of
the skeleton is the number of branches. A related complexity
measure of a contour is the number of extrema
of curvature of the contour. This leads to a complexity
scale-space for the region decomposition. The result of
the early vision process is the set of simple region features
for each frame. These features are dense, stable,
and robust. To demonstrate the utility of the early vision
process, we present a relatively simple motion and
structure-from-motion algorithm based on tracking simple
region features at multiple resolutions of the LoG
filter.

5.4  Calibration 

Computing relative orientation is an important problem
for calculating depth from binocular stereo and for determining
general camera motion. Lisa Dron has been
exploring methods for establishing the complete design
of a small, autonomous system with specialized VLSI
hardware for computing relative orientation in real-time
[14]. Such a system would be suitable for mounting on
mobile or remote platforms that cannot be tethered to
a computer and for which the size, weight and power
consumption of the components are critical factors.

There are two parts to this work. The first is theoretical
and involves developing and adapting algorithms
for finding point correspondences and solving the motion
equations which are robust as well as simple enough to
be easily implemented in hardware. The second part is
engineering and involves the design, fabrication and test
of prototype chips for the specialized processors which
will be used to find the point correspondences. Two separate
processors are needed: one which computes a binary
edge map from the input image data, and the other
which determines translational offsets between patches
of the edge maps from two different images. Fabrication
of these circuits is done through MOSIS.

In support of such work, Dron has already developed
the edge detection algorithm known as the multi-scale
veto (MSV) method [15]. During the past year, she
has completed the design of a two-dimensional processor
which combines CCD and CMOS technology to implement
the MSV algorithm. A test chip containing a
4x4 two-dimensional array has been fabricated and is
currently being tested. Algorithms have been developed
both to perform matching with the binary edge signals
produced by the MSV chip, and to solve the motion
equations with a least-squares method suitable for implementation
on a programmable digital microprocessor. In
addition, Dron has developed a least-squares algorithm
to determine the internal camera calibration parameters,
which are required in order to compute motion from a set
of point matches, using a sequence of images for which
the translational motion is known. A preliminary design,
comprising both analog and digital components,
has been completed for the second processor, which will
compute point correspondences from the edge maps, and
a set of test structures which will form the basis of the
matching circuit has been sent out for fabrication.

5.5  Other  Calibration  Methods 

In related work, Gideon Stein has developed a simple
method  for  internal  camera  calibration  for  computer  vi¬ 
sion  systems.  It  is  intended  for  use  with  medium  to  wide 
angle  camera  lenses.  With  modification  it  can  be  used 
for  longer  focal  lengths.  This  method  is  based  on  track¬ 
ing  image  features  through  a  sequence  of  images  while 
the  camera  undergoes  pure  rotation.  This  method  does 
not  require  a  special  calibration  object.  The  location  of 
the  features  relative  to  the  camera  or  to  each  other  need 
not  be  known.  It  is  only  required  that  the  features  can  be 
located  accurately  in  the  image.  This  method  can  there¬ 
fore  be  used  both  for  laboratory  calibration  and  for  self 
calibration  in  autonomous  robots  working  in  unstruc¬ 
tured  environments.  The  method  works  when  features 
can  be  located  to  single  pixel  accuracy  but  subpixel  ac¬ 
curacy  should  be  used  if  available. 

In  the  basic  method  the  camera  is  mounted  on  a  ro¬ 
tary  stage  so  that  the  angle  of  rotation  can  be  measured 
accurately  and  the  axis  of  rotation  is  constant.  A  set  of 
image  pairs  is  used  with  various  angular  displacements. 
If  the  internal  camera  parameters  and  axis  of  rotation 
were  known  one  could  predict  where  the  feature  points 
from  one  image  will  appear  in  the  second  image  of  the 
pair. If there is an error in the internal camera parameters
the features in the second image will not coincide
with the feature locations computed using the first image.
One can then perform a nonlinear search for camera
parameters that minimize the sum of distances between
the feature points in the second image in each pair and those
computed from the first image in each pair, summed over
all the pairs.
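
The search described above can be sketched in Python with NumPy under strong simplifying assumptions that are not part of Stein's method: the only unknown is the focal length (in pixels), the principal point is at the image origin, there is no lens distortion, the rotation is about the vertical axis, and a brute-force scan over candidate values stands in for the nonlinear search. All function names are hypothetical.

```python
import numpy as np


def predict_after_rotation(pts, f, angle):
    """Predict where features move after a pure rotation about the vertical axis."""
    c, s = np.cos(angle), np.sin(angle)
    x, y = pts[:, 0], pts[:, 1]
    # Viewing rays (x, y, f) are rotated about the y axis and re-projected.
    xr = c * x + s * f
    zr = -s * x + c * f
    return np.column_stack([f * xr / zr, f * y / zr])


def calibration_cost(f, image_pairs):
    """Sum of distances between observed and predicted features over all pairs."""
    total = 0.0
    for pts1, pts2, angle in image_pairs:
        predicted = predict_after_rotation(pts1, f, angle)
        total += np.linalg.norm(predicted - pts2, axis=1).sum()
    return total


def estimate_focal_length(image_pairs, candidates):
    """Return the candidate focal length with the smallest calibration cost."""
    costs = [calibration_cost(f, image_pairs) for f in candidates]
    return candidates[int(np.argmin(costs))]
```

Each element of `image_pairs` is a tuple `(pts1, pts2, angle)` holding the feature positions in the two images and the measured rotation in radians; a fuller version would add the aspect ratio, principal point and distortion terms to the parameter vector and use a proper nonlinear optimizer.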

The  need  to  accurately  measure  the  angular  displace¬ 
ments  can  be  eliminated  by  rotating  the  camera  through 
a  complete  circle  while  taking  an  overlapping  sequence 
of  images  and  using  the  constraint  that  the  sum  of  the 
angles must equal 360 degrees.

The  closer  the  feature  objects  are  located  to  the  cam¬ 
era  the  more  important  it  is  that  the  camera  does  not 
undergo  any  translation  during  the  rotation.  A  method 
is  described  which  enables  one  to  ensure  that  the  axis 
of  rotation  passes  sufficiently  close  to  the  center  of  pro¬ 
jection  (or  front  nodal  point  in  a  thick  lens)  to  obtain 
accurate  results. 

Stein  shows  that  by  constraining  the  possible  motions 
of  the  camera  in  a  simple  manner  it  is  possible  to  devise  a 
robust  calibration  technique  that  works  in  practice  with 
real  images.  Experimental  results  show  that  focal  length, 
aspect ratio and lens distortion parameters can be found
to  within  a  fraction  of  a  percent.  The  location  of  the 


principal  point  and  the  location  of  the  center  of  radial 
distortion  can  each  be  found  to  within  a  few  pixels. 

In  addition  to  the  first  method  a  second  method  of 
calibration  is  presented.  This  method  uses  simple  geo¬ 
metric  objects  such  as  spheres  and  straight  lines  to  find, 
first  the  aspect  ratio,  then  the  lens  distortion  parameters 
and finally the principal point and focal length. Calibration
is performed using both methods and the results
compared.

6  Other  Topics 

Two other recently completed theses, reported in detail
in earlier reports, are Steve White’s work in highly accurate
representations for early vision, especially edges and
stereo disparities [54], and Subirana's work on recognition
and representation of flexible objects [45].

6.1  Median  Window  Filtering  for 
Multi-dimensional  Image  Attributes 

Median  window  filtering  is  a  simple  non-linear  tech¬ 
nique  for  reducing  image  noise  while  preserving  sharp 
discontinuities.  It  works  by  replacing  the  current  value 
of each image pixel with the median value of the pixel's
local neighborhood. Although the technique has been
extensively used for smoothing scalar image data like
grey-level intensities, little is known about median
filtering in multi-dimensional data domains like image
color, image texture and motion fields. Perhaps this is
because the sample median is an ill-defined concept for
multi-dimensional quantities. Recently, Sung has proposed
a novel interpretation of the median concept for
multi-dimensional metric spaces. The interpretation
follows from a mathematical property of the scalar median,
and the basic idea is to similarly define the multi-dimensional
median as the sample member that minimizes
a mean absolute error term. Sung implemented
a multi-dimensional median filtering algorithm for color
images and showed that the operation indeed preserves
edges while reducing noise. He has also mathematically
derived that in the best case, the smoothing performance
of multi-dimensional median filtering is comparable to
that of local averaging. More recently, he has also developed
algorithms for approximating multi-dimensional
medians that run in linear time with respect to data dimension
and sample size.
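
The definition above can be illustrated with a short Python sketch using NumPy. It picks the sample member with the smallest total Euclidean distance to all other samples; the choice of Euclidean distance and the function name are assumptions, and this brute-force version is quadratic in the sample size, so it does not reproduce Sung's linear-time approximations.

```python
import numpy as np


def vector_median(samples):
    """Return the sample member minimizing the summed distance to all samples.

    samples : (n, d) array, e.g. the RGB values inside a filter window.
    """
    samples = np.asarray(samples, dtype=float)
    # Pairwise Euclidean distances between all sample members.
    diffs = samples[:, None, :] - samples[None, :, :]
    dists = np.linalg.norm(diffs, axis=2)
    return samples[np.argmin(dists.sum(axis=1))]
```

For one-dimensional data with an odd number of samples this reduces to the ordinary median, which is the scalar property the interpretation builds on.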

6.2  Statistical  Uniformity  Based  Region 
Finding 

Over the past twenty to thirty years, a number of different
techniques have been proposed for segmenting images
into piecewise uniform regions. Like many other
computer vision tasks, most of these techniques contain
at least a few thresholds and operating parameters whose
values are crucial for producing reasonable results. Often,
however, these values are determined either empirically
or by guess. Kah Kay Sung has explored an alternative
approach to the threshold selection problem for a simple
but fairly general class of uniformity based region finding
paradigms. He proposed a statistical formulation
for uniformity based region finding as a series of “confidence”
and “significance” tests, where each test roughly
corresponds to a decision procedure in the original region
finding paradigm. The main advantage of his approach
is that it replaces typical region finding thresholds and
parameters with a new set of confidence and significance
thresholds and parameters that gives greater insight into
the system’s pertinent characteristics. A color region algorithm,
based on his formulation, was implemented on
the parallel Connection Machine.

7  Learning 

Under separate contract, Tomaso Poggio and colleagues
have been developing techniques for the application of
learning methods to vision problems. In particular,
building on extensive earlier work by Poggio and collaborators
on the use of Generalized Radial Basis Functions,
they have been developing learning methods for use in
object recognition and computer graphics.

8  VLSI 

Under separate contract, Berthold Horn and colleagues
have continued to develop VLSI implementations of
low level visual algorithms.

8.1  The  Focus  of  Expansion  Chip 

The  problem  that  this  chip  solves  is  that  of  computing 
the  direction  towards  which  a  camera  is  moving,  based 
on  the  time-varying  image  it  receives.  There  is  no  re¬ 
striction  on  the  shapes  of  the  surfaces  in  the  environ¬ 
ment;  only  an  assumption  that  the  imaged  surfaces  have 
some  texture,  that  is,  spatial  variations  in  reflectance. 
It  is  also  assumed  that  the  camera  is  stabilized  so  that 
there  is  no  rotational  motion. 

Once  the  translational  motion  has  been  determined, 
it  is  possible  to  estimate  distances  to  points  in  the  scene 
being  imaged.  While  there  is  an  ambiguity  in  scale,  since 
multiplying  both  distances  and  speed  by  some  constant 
factor  does  not  change  the  time-varying  image,  it  is  pos¬ 
sible  to  estimate  the  ratio  of  distance  to  speed.  This 
allows  one  to  estimate  the  time-to-collision  between  the 
camera  and  objects  in  the  scene. 

Applications  for  such  a  device  include  systems  warn¬ 
ing  of  imminent  collision,  obstacle  avoidance  in  mobile 
robotics,  and  aids  for  the  blind. 

The  projection  of  the  translational  motion  vector  into 
the  image  is  called  the  focus  of  expansion  (FOE).  It  is  the 
image  of  the  point  towards  which  the  camera  is  moving, 
and  the  point  from  which  other  image  points  appear  to 
be  receding. 

The method used is based on least squares analysis -
that is, find the point in the image that best fits the
observed  time  variations  in  brightness.  The  quantity 
minimized  is  the  sum  of  squares  of  the  differences  at 
every  picture  cell  between  the  observed  time  variation  of 
brightness  and  that  predicted,  given  the  assumed  posi¬ 
tion  of  the  FOE  and  the  observed  spatial  variations  of 
brightness. 

The minimization is not straightforward, because the
relationship between the brightness derivatives depends
on distance to the surface being imaged, and that distance
is not only unknown but varies from picture cell
to picture cell. It turns out that so-called stationary
points, where brightness is constant (instantaneously),
play a critical role. If there were no measurement errors,
quantization effects or noise, then the FOE would be
at the intersection of the tangents to the iso-brightness
contours at these stationary points.

In practice, image brightness derivatives are hard to
estimate accurately given that the image itself is quite
noisy. Hence the intersections of tangents from different
stationary points may be quite scattered. Reliable
results  can  nevertheless  be  obtained  if  the  image  con¬ 
tains  many  stationary  points  and  the  point  is  found  that 
has  the  least  weighted  sum  of  squares  of  perpendicular 
distances  from  the  tangents  at  the  stationary  points. 
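
The weighted least-squares step described above can be sketched in floating point as follows. This is an illustration only, not a model of the analog circuit: it assumes the stationary points and their spatial brightness gradients have already been found, and the weights (which the text does not specify) are supplied by the caller. The function name is hypothetical.

```python
import numpy as np

def estimate_foe(points, gradients, weights=None):
    """Weighted least-squares FOE from stationary points.

    points    : (N, 2) positions where brightness is momentarily constant
    gradients : (N, 2) spatial brightness gradients (E_x, E_y) there
    weights   : optional (N,) non-negative weights (default: all ones)
    """
    p = np.asarray(points, dtype=float)
    g = np.asarray(gradients, dtype=float)
    w = np.ones(len(p)) if weights is None else np.asarray(weights, dtype=float)

    # The gradient is normal to the iso-brightness contour, so the unit
    # normal of each tangent line is n = grad / |grad|.
    n = g / np.linalg.norm(g, axis=1, keepdims=True)

    # Perpendicular distance from a candidate f to tangent i: n_i . (f - p_i).
    # Minimizing sum_i w_i (n_i . (f - p_i))^2 gives 2x2 normal equations
    #   (sum_i w_i n_i n_i^T) f = sum_i w_i n_i (n_i . p_i).
    A = np.einsum("i,ij,ik->jk", w, n, n)
    b = np.einsum("i,ij,i->j", w, n, np.einsum("ij,ij->i", n, p))
    return np.linalg.solve(A, b)
```

The system is solvable only when the tangents are not all parallel, which is one reason many stationary points with varied gradient directions are needed.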

This method was chosen from amongst a group of
competing approaches by considering both simulation
results of these methods and constraints of what can
reasonably be built in analog VLSI.

The amount of computation for every picture cell (including
a number of multiplications) is such that it is not
feasible today to perform the task in a totally unclocked
manner with the processing done at each picture cell.
Instead, a row-parallel scheme has been decided upon,
where each row of the image has a single processor.

The first chip has been made by MOSIS and tested.
Minor revisions are being made.

8.2  System  for  recovering  motion  with  respect 
to  a  planar  surface 

A  problem  in  motion  vision  that  is  somewhat  more  dif¬ 
ficult  than  that  of  recovering  the  focus  of  expansion  is 
that  of  recovering  both  translational  and  rotational  com¬ 
ponents  of  motion  of  a  camera  from  the  time-varying 
image.  Presently  there  is  no  simple  robust  method  for 
solving  this  problem  in  general,  but  methods  are  known 
in  the  special  case  that  the  surface  being  viewed  is  pla¬ 
nar. 

Applications for such a system include landing a vehicle
on a planar surface and station keeping of a submersible
vehicle above the ocean floor. Also, such a system
could be used to recover the motion of a person by
aiming a camera at the flat ground in front of the person.
The motion estimates obtained from such a downward-looking
camera could then be used to interpret the
time-varying image from a second camera aimed directly
forward. The resulting system could be an aid for the
blind that warns them of obstacles, even those that do
not have a support directly below, such as signs hanging
from beams supported off on the side.

Here also the proposed method involves a least squares
approach, although it is now considerably more complex
than in the case of simple translational motion. It is
known, for example, that there is an ambiguity in that
two quite different motions, and correspondingly different
surface orientations, can yield the same time-varying image.

Detailed design will have to await the results of extensive
simulations.

8.3 Analog Circuits

John Harris and Prof. Poggio are studying analog implementations
of vision and learning algorithms. They are
interested in analog models because these models provide
a novel mechanism for understanding and developing
algorithms. Experimentation with these continuous-time
nonlinear circuits facilitates algorithm intuition and
leads to fundamental insights. Powerful analog algorithms
thus developed will prove useful even if a researcher
is limited to simulating the analog hardware on
a digital computer. In addition, biology has motivated
some of the circuits and, conversely, some of the VLSI
modules may help develop a better intuition for solutions
that biology has found for the same class of early vision
problems.

A real-world vision system must be adaptive in order
to operate in an unconstrained environment. The system
must be smart enough to deal with such nonidealities
as changing light conditions or slight variations between
components. For example, the thresholds for detection
of edges in edge detectors should dynamically change
with the brightness of objects in the scene. Or the appropriate
space constant of a resistive network could be
dynamically determined by an estimate of the noise in
the input signal. Analog hardware allows for adaptation
in many instances by relying on basic physics to perform
the necessary computations. One specific project under
implementation is the time-to-contact motion sensor
proposed by Poggio (1991) and Poggio & Ancona (1992).
This sensor combines the outputs of 1D motion correlation
sensors to produce an estimate of the amount of time
until crash (assuming constant velocity). The small, low-power
implementation will be useful as a crash-warning
sensor for robots or automobiles.

Figure 7: Example of recognition using statistical optimization,
based on matching oriented range features. The figure
shows the convergence of the pose of the model object to the
correct position in the data of Figure 6.


Figure 8: Projective structure of a scene point P is defined
with respect to four reference points P1, ..., P4 and the center
of projection O of the first camera position. The camera’s
center serves as the unit point in the projective frame
of reference instead of a fifth scene point. The cross-ratio,
denoted by α_p, of the four points P, P, P, O uniquely fixes P
with respect to the frame of reference. The cross-ratio can be
computed from the projections of P, P, P, O onto the second
image plane. The projection of O is the epipole v', which can
be computed from eight corresponding points [18]; the other
projections p', p' can be recovered using the projections of the
four reference points and the corresponding epipoles v, v'. Finally,
since α_p is invariant it can be used to predict the image
location of P on any third view given the correspondences of
the four reference points and the location of epipoles between
the first and third view.



Figure 9: The points O, U, V, W, T provide a reference
frame for constructing a coordinate system for 3D projective
space. In a homogeneous coordinate representation,
the projection of a point onto a face of the tetrahedron
of reference is achieved by orthographic projection
in coordinate space. For example, the projection of P,
whose homogeneous coordinates are (x, y, z, t), onto the
face UVW has the coordinates (x, y, z, 0).

Figure 10: Stereo Proximity Detector. Performance of the
system in an unmodified office environment. “Grey-scale” is
the original sampled image from the left camera, “left” and
“right” are the respective images overlaid with derived edges,
“matches” is the set of edges matched between images at
disparity 2, and “stereo” is the left grey-scale image overlaid
with matched edges.

Figure 11: Depth-tuning curves and peak S/N ratios for
various objects. S/N ratios for objects are measured as the
ratio of peak response to that object over the noise level (12
pixels). Images are 64 x 48 pixels; cameras have 110 degree
fields of view with a baseline of 65mm. Objects IDH, JLS,
and TK are people. Each object was tested in nine positions
of varying depth for 20 trials each. The largest variance was
6.9 pixels. Readings past 7 feet are entirely matching noise.
The noise level was measured as the largest such noise value
observed over all trials for all data points. The background
was a cluttered office.

Abstract

This is an overview of research in image understanding
at Columbia’s Center for Research
in Intelligent Systems since the 1992 Image Understanding
Workshop. It reviews our work on
the following topics:

1.  Physics-Based  Vision 

(a)  Generalization  of  Lambertian  Model 

(b)  Polarization-Based  Techniques 

2.  Recognition  and  Learning 

(a)  Reflectance-Based  Recognition 

(b)  Object  Recognition  using  BSP  trees 

(c)  Shape  Recovery 

(d)  Learning  and  Recognition  from  Appear¬ 
ance 

3.  Sensor  and  Illumination  Planning 

(a)  Planning  in  Active  Environments 

(b)  Illumination  Planning 

(c)  Modeling  Uncertainty 

(d)  IUE  sensors

4.  Qualitative  Vision 

(a)  Alignment  using  Uncalibrated  Cameras 

(b)  Topological  Navigation 

(c)  Qualitative  Spatial  Description 

5.  Real-Time  Vision 

(a)  Real-Time  Polarization  Computation 

(b)  Real-Time  Image  Warping 

(c)  Real-Time  Tracking 

(d)  Real-Time  Detail-Preserving  Smoothing 

As you can see, our research covers the full span
of computer vision, from low-level image processing
to complete systems integration of vision and
robotics. Here we present only appetizers; for
the full course the hungry reader should consult
the research papers referenced in this overview.

This work was supported in part by DARPA contract DACA-76-92-C-007.
Numerous other agencies and companies have
also supported parts of this research.


1  Physics-Based  Vision 

The  topic  of  physics-based  vision  is  enjoying  a 
resurgence  in  the  field.  Over  the  past  year  re¬ 
searchers  at  Columbia’s  CRIS  lab  have  made 
some  important  contributions  in  this  area.  Much 
of  this  work  might  be  characterized  as  getting 
back  to  basics:  we  have  revisited  fundamental 
related  work  and  examined  the  assumptions  and 
validity  of  the  models.  The  results  -  summa¬ 
rized  here  and  detailed  in  other  papers  in  these 
proceedings  -  may  be  a  little  surprising. 

1.1  A  Generalization  of  the 
Lambertian  Model 

The  Lambertian  assumption  is  one  of  the  most 
widely  used  assumptions  in  machine  vision.  We 
have  shown  analytically  as  well  as  experimen¬ 
tally  that  rough  surfaces,  even  when  locally  Lam¬ 
bertian,  are  non-Lambertian  in  reflectance.  The 
paper  [Oren  and  Nayar,  1993]  details  the  devel¬ 
opment  of  a  predictive  model  explaining  these 
results.  This  work  may  shed  light  on  an  age  old 
question:  why  the  moon,  which  has  a  diffuse  sur¬ 
face  and  is  spherical,  appears  “flat.” 

Image  brightness  values  are  closely  related  to 
the  reflectance  properties  of  points  in  the  scene. 
Hence,  accurate  reflectance  models  are  funda¬ 
mental  to  the  advancement  of  machine  vision. 
Recently, Nayar et al. [Nayar et al., 1991] proposed
a reflectance framework for machine vision
that  has  three  primary  components:  the  diffuse 
lobe,  the  specular  lobe,  and  the  specular  spike. 
The  emphasis  of  their  work  was  the  analysis  of 
the  specular  components  rather  than  the  diffuse 
component.  The  diffuse  lobe  was  assumed  to  be 
Lambertian  as  this  model  is  simple  and  does  rea¬ 
sonably  well  in  approximating  reflection  from  a 
wide range of matte surfaces. A surface with
Lambertian reflectance appears equally bright
from all directions. This model for diffuse reflection
is one of the most widely used assumptions
in machine vision. It is used explicitly in the case
of shape recovery techniques such as shape from
shading and photometric stereo. It is also implicitly
used in the solution of the correspondence
problem by vision techniques such as stereo vision
and motion detection.

Figure 1: (a) Real image, (b) Lambertian model, (c) proposed model.
Real image of a cylindrical clay vase compared with images rendered using the Lambertian
reflectance model and the proposed diffuse reflectance model. The vase is illuminated by a point
source from the viewing direction.

For  several  real-world  objects,  however,  the 
Lambertian  model  can  prove  to  be  a  poor  and 
inadequate  approximation  to  the  diffuse  compo¬ 
nent.  In  the  areas  of  machine  vision,  remote 
sensing,  and  computer  graphics,  each  picture  el¬ 
ement  (pixel)  can  represent  a  surface  area  with 
substantial  roughness.  Though  the  Lambertian 
assumption  is  often  reasonable  when  looking  at 
a  small  planar  surface  element,  the  roughness 
of  the  total  surface  covered  by  a  pixel  causes  it 
to behave in a non-Lambertian manner. This
deviation  from  Lambertian  reflectance  is  signifi¬ 
cant  for  very  rough  surfaces,  and  increases  with 
the  angle  of  incidence.  We  have  developed  a 
comprehensive  model  that  predicts  reflectance 
from  rough  diffuse  surfaces,  and  conducted  sev¬ 
eral  experiments  that  support  the  model  [Oren 
and  Nayar,  1993],  [Oren  and  Nayar,  1992].  The 
proposed  model  takes  into  account  complex  geo¬ 
metrical  effects  such  as  masking,  shadowing,  and 
interreflections  between  points  on  the  rough  sur¬ 
face.  It  may  be  viewed  as  a  generalization  of  the 
Lambertian  reflectance  model. 
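
To illustrate the kind of deviation described above, the sketch below compares Lambertian shading with the widely used simplified closed form of the Oren-Nayar rough-surface model. That closed form, the roughness value `sigma`, and the function names are assumptions brought in for illustration; they are not taken from this overview, and the comprehensive model in [Oren and Nayar, 1993] also accounts for interreflections, which the sketch omits.

```python
import numpy as np

def lambertian(theta_i, albedo=1.0):
    """Radiance of a Lambertian patch for unit source irradiance."""
    return (albedo / np.pi) * np.cos(theta_i)

def rough_diffuse(theta_i, theta_r, dphi, sigma, albedo=1.0):
    """Simplified Oren-Nayar radiance (assumed form, see the text above).

    theta_i, theta_r : incidence and viewing angles from the mean normal
    dphi             : azimuth difference between source and viewer
    sigma            : roughness, std. dev. of facet slope angle (radians)
    """
    s2 = sigma ** 2
    a = 1.0 - 0.5 * s2 / (s2 + 0.33)
    b = 0.45 * s2 / (s2 + 0.09)
    alpha = np.maximum(theta_i, theta_r)
    beta = np.minimum(theta_i, theta_r)
    return (albedo / np.pi) * np.cos(theta_i) * (
        a + b * np.maximum(0.0, np.cos(dphi)) * np.sin(alpha) * np.tan(beta)
    )

# Cross-section of a cylinder lit from the viewing direction,
# so theta_i equals theta_r and the azimuth difference is zero.
theta = np.radians([0.0, 30.0, 60.0, 80.0])
lam = lambertian(theta) / lambertian(0.0)
rough = (rough_diffuse(theta, theta, 0.0, sigma=0.5)
         / rough_diffuse(0.0, 0.0, 0.0, sigma=0.5))
```

With `sigma = 0` the rough-surface expression reduces to the Lambertian one. For `sigma > 0` the normalized brightness in `rough` falls off more slowly away from the center than the cosine law in `lam`, which is the flatter appearance discussed in the text.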


Figure  1(a)  shows  the  image  of  a  cylindrical  clay 
vase  with  a  rough  surface.  This  image  was  ob¬ 
tained  using  a  CCD  camera.  The  vase  is  illu¬ 
minated  by  a  single  light  source  from  the  sensor 
direction.  Figure  1(b)  shows  a  rendered  image 
that  is  generated  using  the  known  geometry  of 
the  vase  and  the  Lambertian  model.  Clearly, 
the  real  vase  appears  much  flatter,  with  less 
brightness  variation  along  its  cross-section,  than 
the  Lambertian  vase.  Figure  1(c)  shows  a  ren¬ 
dered  image  of  the  vase  generated  using  the  pro¬ 
posed  reflectance  model.  Note  that  the  proposed 
model  does  very  well  in  predicting  the  appear¬ 
ance  of  the  vase.  Figure  2  compares  the  bright¬ 
ness  values  along  the  cross-section  of  the  three 
vase  images  shown  in  Figure  1.  It  is  interest¬ 
ing  to  note  that  the  brightness  of  the  real  vase 
remains  nearly  constant  over  most  of  the  cross- 
section  and  drops  quickly  to  zero  very  close  to 
the  limbs.  The  proposed  model  does  very  well  in 
predicting  this  behavior,  while  the  Lambertian 
model  produces  large  brightness  errors.  We  note 
that  in  graphics  rendering  of  diffuse  objects,  it  is 
general  practice  (and  an  acknowledged  hack)  to 
add  an  “ambient  light  source”  which  modifies  the 
reflectance  function  producing  “flatter”  render¬ 
ings  of  objects  and  making  them  look  less  harsh. 
The proposed model can provide that desired behavior,
and does so on a rigorous foundation.


Figure 2: Comparison between the image brightness
along the cross-section of the three vase images shown
in Figure 1.

1.2  Polarization-Based  Techniques 

In the past three IUW proceedings, Columbia
has presented reports discussing the benefits of
polarization techniques in machine vision. Basic
theory and numerous applications have been
addressed. There have been, however, technical
assumptions which may not have been apparent
on first reading. Some of these assumptions
turned out to be quite strong, and removing
them proved more than a simple extension
of the previous work.

1.2.1  Integration  of  Color  and 
Polarization 

Two  serious  assumptions  were  associated  with 
the  problem  of  separation  of  diffuse  and  specu¬ 
lar  components  of  an  image.  These  assumptions 
restricted  previous  polarization-based  work  to 
dealing  with  compact  (i.e.  small)  highlights  over 
regions  with  constant  diffuse  components  and 
constant  material  composition.  Previous  work 
on  this  problem  using  color  information  made 
similar  restrictions.  The  separation  problem  is 
important  since  many  image  understanding  algo¬ 
rithms,  such  as  shape  from  shading,  stereo,  pho¬ 
tometric  stereo,  and  motion  analysis,  fail  when 
there  are  significant  highlights  and  specular  in¬ 
terreflections. 

The new approach uses color and polarization information
to remove the restrictions of compact
highlights and constant diffuse components. In
Figure 3, we see a gray-scale version of a color
image of a mug with a significant highlight, and
the same image after removal of that highlight. More details, as
well as the color images from three experimental
scenes, can be found in these proceedings [Nayar
et al., 1993].


Figure 3: Top is the image of a cup with a
strong highlight. Bottom is the diffuse component
of the cup as computed with the integrated
color/polarization approach. (These grayscale
images were converted from color images; see
[Nayar et al., 1993] for color versions of this and
other examples of separation.)

The new approach is not just an application of
previous work in three color bands with a simple
combination of the results. Rather, it started
from the fundamental characterization of specular
highlights and diffuse reflection and derived new
constraints in color space, which can be used to
determine the diffuse reflection at a point.

We have tested the new technique in experiments
using both a color camera and a monochrome
camera with color filters. The scenes include a
simple geometric shape (a cylindrical cup) with
strong surface markings and a strong primary
highlight (see Figure 3), and more complex scenes
with strong interreflections (see [Nayar et al.,
1993]). The results of the experiments not only
verify the approach as feasible, but show a considerable
advance. The scenes considered include
examples that would not be computable
using previously existing algorithms. There are
still a few assumptions to be overcome before
the separation is widely useful: it does not handle
near-normal reflections or highlights that are
nearly the same color as the object. Still, it represents
an important advancement in the state
of the art.

Unfortunately, the only available analysis of the
separation quality is qualitative, subjective interpretation.
Over the next year we will use
this algorithm as a preprocessor to stereo, shape
from shading and photometric stereo. Even with
the aforementioned limitations, we expect that it
will significantly help in these application areas.
These methods recover depth or shape information,
and we will use the resulting surface
“quality metrics” as a quantitative measure of
the impact of separation.

1.2.2 Polarization for the UGV:
Stepping Outdoors

Other strong assumptions in our previous polarization
work came from our exclusive use of
indoor environments. In support of the UGV
program we have teamed up with L. Wolff at
Johns Hopkins University and L. Matthies at
JPL to address the problem of material/scene
classification in outdoor scenes. Two things
make this quite challenging: dealing
with the complexities of the outdoor world and
collecting the necessary data.

The problems in outdoor scenes are many. First,
natural skylight is partially polarized, and reflections
of it are thus more complex. Second,
outdoor scenes containing vegetation have complex
shapes with significant internal structure
(consider the potential number of reflections between
the leaves of a tree). Yet we generally
view these from a distance at which most internal
structure is within a pixel. As mentioned
above, even for something as simple as Lambertian
reflectance, complex sub-pixel surface geometries
(i.e., roughness) can create formidable
deviations from the model.


Figure  4:  Image  of  a  scene  near  JPL  containing 
water,  vegetation,  bare  soil,  and  a  building. 


Figure 5: Percent polarization for the scene. [0% - 30%] -> [255 - 0]


Figure 6: Polarization phase image for the scene
in Figure 4. [0° - 180°] -> [255 - 0]


The approach being explored by the joint effort
is twofold: to venture forth and gather data for
analysis, and to find ways to use additional information,
e.g., vehicle pose and multi-spectral
information. We have already gathered data for
a number of scenes at 6 different wavelengths
and are analyzing the data. Figures 4-6 show
some of the polarization information obtained
from the 650nm wavelength band for a scene near
JPL. Shown are the visible image, the percent
polarization and the polarization phase. The
scene contains a body of water (in front) with
a dirt/rock bank. Above the bank are various
forms of vegetation, some bare soil, and in the
background is a building. The percent polarization
(Figure 5) clearly delineates the water from
the bank, but the reflections of some of the vegetation
are considerably more complex. The direct
view of the vegetation is also very detailed.
The phase information has complexities rivaling
those of the intensity image. We clearly have information;
it is just not clear how to use it. The
joint  research  effort  is  looking  at  how  we  can 
exploit  known  geometry  (we  can  model  some  of 
the  partial  polarization  effects  of  skylight,  if  we 
know  the  time,  viewing  direction,  local  weather 
conditions,  and  vehicle  pose)  and  multi-spectral 
information. 
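
Percent polarization and phase images of the kind discussed above can be computed from a few images taken through a linear polarizer. The sketch below assumes three polarizer orientations (0, 45, and 90 degrees) and the standard sinusoidal model of transmitted intensity. The function names, the three-angle setup, and the clipping in the gray-level mapping are illustrative assumptions, not a description of the acquisition system used for the JPL data.

```python
import numpy as np

def polarization_from_three(i0, i45, i90):
    """Percent polarization and phase from images at 0, 45 and 90 degrees.

    Assumed model: I(a) = 0.5 * (s0 + s1*cos(2a) + s2*sin(2a))
    for polarizer angle a.
    """
    i0, i45, i90 = (np.asarray(x, dtype=float) for x in (i0, i45, i90))
    s0 = i0 + i90                  # total intensity
    s1 = i0 - i90
    s2 = 2.0 * i45 - s0
    safe_s0 = np.where(s0 > 0, s0, 1.0)   # avoid dividing by zero
    percent = np.where(s0 > 0, 100.0 * np.hypot(s1, s2) / safe_s0, 0.0)
    phase = np.degrees(0.5 * np.arctan2(s2, s1)) % 180.0   # in [0, 180)
    return percent, phase

def to_gray(values, lo, hi):
    """Map [lo, hi] linearly onto gray levels [255, 0], clipping outside."""
    t = np.clip((np.asarray(values, dtype=float) - lo) / (hi - lo), 0.0, 1.0)
    return np.round(255.0 * (1.0 - t)).astype(np.uint8)
```

For display, `to_gray(percent, 0.0, 30.0)` and `to_gray(phase, 0.0, 180.0)` give inverted gray-level mappings of the kind quoted in the captions of Figures 5 and 6.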

A  final  problem  in  the  use  of  polarization  for  the 
UGV  project  is  that  of  data  acquisition:  how 
to  get  the  polarization  information  on  a  moving 
vehicle.  That  is  discussed  in  the  section  of  the 
overview  on  real-time  vision. 


2  Object  Recognition,  Shape 
Recovery  and  Learning 

Three  key  problems  in  high-level  vision  are 
Shape  Recovery,  Object  Recognition  and  Learn¬ 
ing  of  Object  Models.  This  section  overviews 
Columbia’s  recent  and  ongoing  work  in  these  key 
areas. 

2.1  Reflectance-Based  Recognition 

Object  recognition  has  been  an  active  area  of 
machine  vision  research  for  the  past  two  decades. 
The  traditional  approach  has  been  to  recover  ge¬ 
ometric  features  from  images  and  then  use  these 
features  to  hypothesize  and  verify  the  existence 
of  three-dimensional  objects  in  the  image.  Edges 
and  vertices  are  examples  of  geometric  features 
often  used  by  recognition  systems.  During  im¬ 
age  formation,  however,  a  substantial  amount  of 
information  is  lost  regarding  the  geometry  of  the 
scene.  Hence,  geometric  features  are  not  always 
adequate  for  robust  recognition  of  objects.  In 
the  past,  little  attention  has  been  given  to  the 
use  of  other  properties  of  objects  for  recogni¬ 
tion.  In  addition  to  its  geometry,  an  object  may 
be  characterized  by  physical  properties  such  as 
reflectance,  roughness,  and  material  type. 

An  efficient  algorithm  has  been  developed  for 
computing  the  reflectance  of  regions  in  a  scene, 
with  respect  to  their  backgrounds,  from  a  sin¬ 
gle  image  [Nayar  and  Bolle,  1993a]  [Nayar  and 
Bolle,  1993b].  The  result  is  a  physical  property 
of  each  scene  region  that  is  invariant  to  the  in¬ 
tensity  and  direction  of  illumination.  This  pho¬ 
tometric  invariant,  referred  to  as  the  reflectance 
ratio,  provides  valuable  information  for  recogni¬ 
tion  tasks.  We  have  used  the  reflectance  ratio 
invariant  to  recognize  objects  from  a  single  im¬ 
age  [Nayar  and  Bolle,  1993a].  This  approach  is 
very  effective  in  the  case  of  man-made  objects 
that  have  printed  characters  and  pictures.  Each 
object  is  assumed  to  have  a  set  of  regions,  each 
with  constant  reflectance.  The  reflectance  ratio 
and  center  of  each  region  are  used  to  represent 
the object. Algorithms based on an indexing
scheme have been developed for recognition and
pose estimation of objects from a single image.
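
Why a ratio of neighboring intensities is invariant to illumination can be seen in a small numerical sketch. It assumes two adjacent regions that share the same surface normal and illumination, so that each image intensity is the region's reflectance times a common factor. The normalized form (I1 - I2)/(I1 + I2) used here is an assumption chosen to keep the value bounded; see [Nayar and Bolle, 1993a] for the definition actually used.

```python
def reflectance_ratio(i1, i2):
    """Bounded ratio of two neighboring intensities, in [-1, 1].

    If i1 = k * rho1 and i2 = k * rho2 for a shared illumination and
    geometry factor k, then k cancels and the result depends only on
    the reflectances rho1 and rho2.
    """
    total = i1 + i2
    if total == 0:
        return 0.0  # both regions black: the ratio is undefined, return 0
    return (i1 - i2) / total

# Same pair of reflectances under dim and bright illumination.
rho_region, rho_background = 0.8, 0.2
dim = reflectance_ratio(10.0 * rho_region, 10.0 * rho_background)
bright = reflectance_ratio(250.0 * rho_region, 250.0 * rho_background)
```

Both `dim` and `bright` come out to the same value, since only the common factor changed.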

Experimental results are presented in [Nayar and
Bolle, 1993a] for realistic scenes with occlusions,
shadows, and illumination variations. Figure 7
shows object acquisition and recognition results
obtained using the reflectance-based recognition
algorithm. The image shown in Figure 7a is used
for learning the object model. Three reflectance
regions are used to form an index and are indicated
by the triangle. Other regions on the object
are included in the entry of a hash table to
be used for object verification and pose estimation.
The centroids of these verification regions
are indicated by black boxes in Figure 7a. Figure
7b shows a scene with several objects. The
index triangle is detected in the scene image and
is shown as a triangle in Figure 7b. This provides
a hypothesis for the object and its pose in
the scene image. This hypothesis is verified by
projecting other regions in the model image to
the scene image.


Figure 7: Model acquisition and object recognition
results obtained using the reflectance-based
recognition algorithm.


2.2  Object  Recognition  using  BSP 
Trees 

We  recently  began  investigating  the  application 
of  Binary  Space  Partitioning  (BSP)  trees  to  the 
domain  of  object  recognition.  A  BSP  tree  is  a 
general  method  of  partitioning  an  n-dimensional 
space by a set of (n-1)-dimensional hyperplanes.
Its tree structure allows very efficient algorithms
to be developed; it is also compact and numerically
robust. BSP trees have proven their utility
in  3-D  modeling,  graphics  and  image  processing, 
and  many  of  the  properties  that  allow  BSP  trees 
to  perform  well  in  these  tasks  are  of  primary  im¬ 
portance  when  considering  the  object  recogni¬ 
tion  problem. 
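
As a small illustration of the data structure itself, the sketch below builds a BSP tree for a unit square in two dimensions and classifies points by walking from the root to a leaf. The class and function names are hypothetical, and the sketch shows only plain point location; it does not attempt the augmented trees or the recognition system described in this section.

```python
from dataclasses import dataclass
from typing import Optional, Sequence

@dataclass
class BSPNode:
    """Internal node: hyperplane normal . x = offset. Leaf: label is set."""
    normal: Optional[Sequence[float]] = None
    offset: float = 0.0
    front: Optional["BSPNode"] = None  # side where normal . x >= offset
    back: Optional["BSPNode"] = None   # side where normal . x <  offset
    label: Optional[str] = None

def locate(node: BSPNode, x: Sequence[float]) -> str:
    """Walk from the root to the leaf cell that contains point x."""
    while node.label is None:
        side = sum(n * xi for n, xi in zip(node.normal, x)) - node.offset
        node = node.front if side >= 0.0 else node.back
    return node.label

# Unit square [0,1] x [0,1], cut out of the plane by four lines
# (lines are the 1-dimensional hyperplanes of 2-dimensional space).
outside = BSPNode(label="out")
square = BSPNode(
    normal=(1, 0), offset=0.0, back=outside,            # x >= 0
    front=BSPNode(
        normal=(-1, 0), offset=-1.0, back=outside,      # x <= 1
        front=BSPNode(
            normal=(0, 1), offset=0.0, back=outside,    # y >= 0
            front=BSPNode(
                normal=(0, -1), offset=-1.0, back=outside,  # y <= 1
                front=BSPNode(label="in"),
            ),
        ),
    ),
)
```

Each query costs one dot product per tree level, which is the source of the efficiency mentioned above.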

The  object  recognition  system  we  are  developing 
uses  augmented  BSP  trees,  generated  from  CAD 
models,  to  model  the  objects  being  examined. 
Data is acquired from an image, a rangefinder,
or a tactile sensor, and the system then uses it to
find matching features in the BSP tree models.
The  tree  structure  of  the  model  will  allow  the 
system  to  quickly  obtain  a  correlation  between 
the  sensed  and  modeled  features.  Models  will 
also  be  used  to  guide  the  sensing  process  to¬ 
wards  features  that  are  more  discriminating  over 
those  that  are  less  so.  The  output,  which  con¬ 
sists  of  the  type  of  object,  its  position  and  orien¬ 
tation,  may  then  be  used  to  direct  further  sen¬ 
sor  planning  or  tasks  such  as  manipulation  and 
path  planning.  In  addition,  a  recognized  object 
no  longer  needs  to  be  associated  with  the  sensed 
data  and  may  be  described  by  its  model,  which 
may  be  viewed  as  a  compression  of  the  sensed 
data. 

2.3  Shape  Recovery 

In  the  past  few  years  we  have  been  developing 
and  analyzing  a  range  of  different  methods  for 
shape  recovery.  Most  of  these  projects  contin¬ 
ued  this  year,  with  incremental  progress  and  a 
few  publications  with  more  experimentation  or 
analysis.  The  basic  concepts  have  been  covered 
in past years. We report only the recent publications:

•  Shape-From-Focus [Nayar, 1992a], [Nayar, 1992b]

•  Recovery of TORI from Range Images [Kjeldsen and Kender, 1992]

•  Energy-Based Segmentation [Boult and Lerner, 1991], [Lerner, 1993]

•  Shape from Shadows [Yang and Kender, 1993]

•  PROVER [O’Donnell and Boult, 1991]

•  Symmetry Analysis, SHGC Modeling and Recovery [Gross and Boult, 1992]

This  year  we  have  begun  research  using  dynamic 
models  combining  previous  work  on  splines,  gen¬ 
eralized  cylinders  and  superquadrics. 

2.4  Learning  and  Recognition  of 
Objects  from  Appearance 

A technique is being developed for automatically
learning object models for recognition and pose
estimation [Murase and Nayar, 1993]. In contrast
to the traditional approach, we formulate
the recognition problem as one of matching appearance
rather than shape. The appearance of
an object in a two-dimensional image depends
on its shape, reflectance properties, pose in the
scene, and the illumination conditions. While
shape and reflectance are intrinsic properties of
an object and are constant, pose and illumination
vary from scene to scene. We have proposed
a new compact representation of object appearance
that is parameterized by pose and illumination.
For each object of interest, a large set of
images is obtained by automatically varying pose
and illumination. This large image set is compressed
to obtain a low-dimensional subspace,
called the eigenspace, in which the object is represented
as a hypersurface. Given an unknown
input image, the recognition system projects the
image onto the eigenspace. The object is recognized
based on the hypersurface it lies on. The
exact position of the projection on the hypersurface
determines the object’s pose in the image.

Experiments have been conducted using several
objects with complex appearance characteristics
[Murase and Nayar, 1993], [Murase and Nayar,
1992]. Fig. 8a shows an input image of the object
(car) whose parametric hypersurface representation
is shown in Fig. 8b. The representation
is parameterized by object pose (θ1) and
illumination direction (θ2). The hypersurface is
actually  8-dimensional  but  only  the  three  most 
prominent  dimensions  are  displayed  in  Fig.  8b. 
The  input  image  in  Fig.  8a  is  projected  onto 


the  eigenspace  and  is  seen  to  lie  on  the  para¬ 
metric  hypersurface  of  the  object.  The  location 
of  the  point  on  the  hypersurface  determines  the 
object’s  pose  in  the  image.  The  performance  of 
the  recognition  and  pose  estimation  algorithms 
is  studied  using  over  a  thousand  input  images 
of  the  sample  objects  [Murase  and  Nayar,  1992]. 
The  sensitivity  of  recognition  to  the  number  of 
eigenspace  dimensions,  and  the  number  of  learn¬ 
ing  samples,  is  analyzed.  For  the  objects  used, 
appearance  representation  in  eigenspaces  with 
fewer than 10 dimensions produces very accurate
recognition  results  with  an  average  pose  estima¬ 
tion  error  of  0.5  degrees.  These  results  suggest 
the  proposed  appearance  representation  to  be  a 
valuable  tool  for  a  variety  of  machine  vision  ap¬ 
plications. 
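
The learning and recognition steps described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the authors' implementation: the eigenspace is taken from a singular value decomposition of the mean-subtracted image set, each object's hypersurface is kept as a discrete set of sampled points rather than an interpolated surface, and the function names (`build_eigenspace`, `project`, `recognize`) are hypothetical.

```python
import numpy as np

def build_eigenspace(images, k):
    """images: (n_samples, n_pixels) array, one learning image per row.
    Returns the mean image and the k leading eigenvectors (as rows)."""
    mean = images.mean(axis=0)
    # The SVD of the centered image set gives the principal directions
    # without forming the large pixel-by-pixel covariance matrix.
    _, _, vt = np.linalg.svd(images - mean, full_matrices=False)
    return mean, vt[:k]

def project(image, mean, basis):
    """Coordinates of one image in the k-dimensional eigenspace."""
    return basis @ (image - mean)

def recognize(image, mean, basis, manifolds):
    """manifolds: {name: (points, poses)}, where points is an (m, k) array of
    samples of the object's hypersurface and poses[i] is the pose of sample i.
    Returns (name, pose) of the closest stored sample (assumed discretization)."""
    z = project(image, mean, basis)
    best = None
    for name, (points, poses) in manifolds.items():
        d = np.linalg.norm(points - z, axis=1)
        i = int(np.argmin(d))
        if best is None or d[i] < best[0]:
            best = (d[i], name, poses[i])
    return best[1], best[2]
```

In this sketch the pose resolution is limited by the sampling of the learning set; a finer estimate would require interpolating between the stored samples.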




Figure  8:  (a)  An  input  image,  (b)  The  input  image  is 
mapped  to  a  point  in  eigenspace.  The  location  of  the 
point  on  the  parametric  hypersurface  representation 
of  the  object  determines  its  pose  in  the  input  image. 


3  Sensor  and  Illumination 
Planning/Modeling 

This  section  overviews  two  important  aspects  of 
using  sensors.  The  first  is  the  planning  of  param¬ 
eters  for  sensors  and  illuminators,  e.g.,  illumi¬ 
nation  placement,  sensor  placement,  sensor  lens 
settings, etc. The second area is the modeling
of sensors and uncertainty in robotic/vision systems.
These research projects go to the heart
of a key problem in vision: how we can reason
about IU systems and plan to have them work,
rather than getting them to work by fiddling
around with the sensor positions, light position,
etc.




3.1  Sensor  Planning  in  Active 
Environments 

A  goal  of  robotics  has  been  to  develop  intelli¬ 
gent  robots  which  are  capable  of  planning  their 
own  actions.  These  actions  are  often  guided  by 
sensors,  which  provide  noisy,  incomplete,  and 
sometimes  inaccurate  data.  Researchers  accept 
this  as  a  necessary  evil  of  sensors  and  develop 
algorithms  which  try  to  extract  as  much  infor¬ 
mation  as  possible  from  sensor  data.  However, 
less  attention  has  been  focused  on  planning  sens¬ 
ing  strategies  which  yield  more  accurate  data. 
The  Machine  Vision  Planning  (MVP)  system 
developed  at  Columbia  University  attacks  this 
problem  by  integrating  sensor,  object,  and  mo¬ 
tion  models,  along  with  task-level  information  to 
plan  appropriate  sensor  locations  and  settings, 
see [Abrams et al., 1993a, Timcenko et al., 1993,
Abrams et al., 1993b].

Given  a  CAD  description  of  an  object  and  its 
environment,  a  model  of  a  vision  sensor,  plus 
a  specification  of  the  features  to  be  viewed, 
MVP  generates  a  camera  location,  orientation, 
and  lens  settings  (focus-ring  adjustment,  focal 
length, aperture) which ensure a robust view of
object  features.  In  this  context,  a  robust  view 
implies  a  view  which  is  unobstructed,  in  focus, 
properly  magnified,  and  well-centered  within  the 
field-of-view.  In  addition,  MVP  attempts  to  find 
a  viewpoint  with  as  much  margin  for  error  in  all 
parameters  as  possible. 

We  have  added  moving  environment  models  to 
MVP  and  are  exploring  methods  of  extending 
MVP  to  plan  viewpoints  in  an  active  environ¬ 
ment.  The  first  approach,  currently  limited  to 
the  case  of  moving  obstacles  (i.e.  the  target,  or 
features  to  be  viewed,  are  stationary),  is  to  sweep 
the  model  of  all  moving  objects  along  their  tra¬ 
jectories  and  to  plan  around  the  swept  volumes, 
as  opposed  to  the  actual  objects.  A  temporal 
interval  search  is  used  in  conjunction  with  the 
swept  volumes  to  find  large  time  intervals  for 
which  one  robust  viewpoint  can  be  used.  This 
approach  has  been  implemented  in  simulation 
and  experiments  are  being  carried  out  in  our 
robotics  laboratory. 
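
The idea of searching for large time intervals that a single viewpoint can cover can be illustrated with a small sketch. It assumes, hypothetically, that the swept-volume test has already been reduced to a table saying, for each candidate viewpoint and each discrete time step, whether the view is clear; the greedy strategy and the names `longest_clear_interval` and `plan_viewpoints` are illustrative choices, not a description of the MVP code.

```python
def longest_clear_interval(visible):
    """visible: list of booleans, one per time step, True when the viewpoint
    is not blocked at that step. Returns (start, end) of the longest clear
    run, with end exclusive, or None if the viewpoint is never clear."""
    best = None
    start = None
    for t, ok in enumerate(list(visible) + [False]):  # sentinel closes the last run
        if ok and start is None:
            start = t
        elif not ok and start is not None:
            if best is None or t - start > best[1] - best[0]:
                best = (start, t)
            start = None
    return best

def plan_viewpoints(visibility):
    """visibility: {viewpoint_name: [bool per time step]}.
    Greedy cover of the timeline: at each time, pick the viewpoint that
    stays clear the longest. Returns a list of (start, end, name) tuples;
    name is None for steps where no candidate is clear."""
    horizon = len(next(iter(visibility.values())))
    plan, t = [], 0
    while t < horizon:
        best_name, best_end = None, t
        for name, vis in visibility.items():
            end = t
            while end < horizon and vis[end]:
                end += 1
            if end > best_end:
                best_name, best_end = name, end
        if best_name is None:
            plan.append((t, t + 1, None))  # gap: nothing is clear at step t
            t += 1
        else:
            plan.append((t, best_end, best_name))
            t = best_end
    return plan
```

Choosing the longest-lasting viewpoint at each step keeps the number of camera moves small, which matches the stated preference for large intervals served by one robust viewpoint.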

The  lab  setup  involves  two  robot  arms.  The  first 
arm  has  a  camera  mounted  on  its  end-effector 
that  can  be  positioned  anywhere  in  the  environ¬ 


ment  in  order  to  monitor  a  robotic  task.  The 
second  arm  is  given  a  particular  task  such  as  a 
pick-and-place  or  welding  operation.  The  goal  is 
to  have  the  MVP  system  plan  viewpoints  which 
allow  us  to  monitor  the  task,  while  avoiding  oc¬ 
clusion  which  may  arise  due  to  the  motion  of 
the  second  robot.  Future  research  will  concen¬ 
trate  on  planning  continuous  paths  that  link  the 
viewpoints  together,  and  continuing  to  create  a 
fully  automated  environment  for  active  sensing 
tasks [Timcenko et al., 1993].

3.2  Illumination  Planning 

Using  the  parametric  eigenspace  representation 
(described in Section 2.4 above) we have
developed  a  new  approach  to  the  problem  of 
illumination  planning,  see  [Murase  and  Nayar, 
1993].  Given  a  set  of  objects,  the  goal  is  to 
determine  the  illumination  direction  for  which 
the  objects  are  most  distinguishable  in  appear¬ 
ance  from  each  other.  Correlation  is  used  as 
a  measure  of  the  similarity  in  the  appearance 
of  objects.  For  each  object,  a  large  number 
of images are automatically obtained by varying
pose and illumination direction. The image sets
of all the objects together constitute the planning
set. A parametric eigenspace is computed
using  the  planning  image  set.  For  each  illumina¬ 
tion  direction,  objects  are  represented  as  hyper¬ 
curves  in  the  eigenspace.  The  minimum  distance 
between  the  hypercurves  of  two  objects  repre¬ 
sents  the  similarity  between  the  objects  in  the 
correlation  sense.  The  optimal  source  direction 
is  one  that  maximizes  the  shortest  distance  be¬ 
tween  object  hypercurves  in  eigenspace.  This 
illumination  planning  method  has  the  following 
advantages  over  previous  approaches: 

(a)  It  does  not  rely  on  geometric  (CAD)  models 
of  the  objects  of  interest.  It  uses  only  bright¬ 
ness  images  to  accomplish  the  planning  task. 

(b)  No  assumptions  are  made  with  respect  to  the 
reflectance  properties  of  the  objects. 

(c)  The  optimal  source  direction  produced  by 
the  illumination  planner  is  pose  invariant;  it 
is  optimal  over  all  object  poses. 
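
The selection rule described in this section (maximize, over source directions, the shortest distance between object hypercurves) can be sketched as follows. The sketch assumes each hypercurve is available as a finite set of eigenspace points sampled over pose, and that at least two objects are given; the data layout and the names `min_curve_distance` and `best_illumination` are hypothetical.

```python
import numpy as np

def min_curve_distance(curve_a, curve_b):
    """Smallest Euclidean distance between two sampled hypercurves,
    each given as an (n_poses, k) array of eigenspace points."""
    diff = curve_a[:, None, :] - curve_b[None, :, :]
    return float(np.sqrt((diff ** 2).sum(axis=2)).min())

def best_illumination(curves):
    """curves: {direction: {object_name: (n_poses, k) array}} with at least
    two objects per direction. Returns the direction whose closest pair of
    object curves is farthest apart, and that worst-case separation."""
    best_dir, best_sep = None, -1.0
    for direction, objects in curves.items():
        names = sorted(objects)
        sep = min(
            min_curve_distance(objects[a], objects[b])
            for i, a in enumerate(names)
            for b in names[i + 1:]
        )
        if sep > best_sep:
            best_dir, best_sep = direction, sep
    return best_dir, best_sep
```

Because the minimum is taken over all sampled poses of both objects, the chosen direction is the best worst case over pose, which is the sense in which the result is pose invariant.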

3.3  Modeling  Uncertainty 

In [Timcenko and Allen, 1992, Timcenko and
Allen, 1993b] we offer a new method for modeling
uncertainties that exist in a robotic system,
based on stochastic differential equations. The
benefit  of  using  such  a  model  is  that  we  are  then 
able  to  capture  in  an  analytical  structure  some 
key  points  underlying  robot  motion:  the  abil¬ 
ity  to  properly  express  uncertainty  within  the 
motion  descriptions,  and  the  dynamic,  chang¬ 
ing  nature  of  the  task  and  its  constraints.  The 
goal  of  this  research  is  to  exploit  the  smooth,  dif¬ 
ferentiable  topological  structure  of  configuration 
space  and  populate  it  with  mathematical  entities 
that  lead  to  plans  as  solutions  of  certain  differ¬ 
ential  equations. 

We  have  performed  experiments  that  attempt  to 
quantify  the  uncertainty  in  robotic  motion  con¬ 
trol  and  show  how  it  can  be  used  within  our 
model.  The  statistical  justifiability  of  the  pro¬ 
posed  model  indicates  that  it  resembles  the  real 
nature  of  the  random  phenomena  that  govern 
the  system  quite  well.  More  importantly,  the 
method  offers  a  way  of  estimating  the  variance 
of  different  types  of  uncertainties,  thus  answering 
questions  about  both  the  qualitative  and  quan¬ 
titative  nature  of  uncertainty. 

Also  of  interest  is  defining  methods  for  time  pa¬ 
rameterization  of  robot  trajectories  so  that  the 
robotic  system  achieves  favorable  performance 
despite  the  inevitable  presence  of  uncertainties 
[Timcenko  and  Allen,  1993a].  In  order  to  develop 
these  methods,  we  need  detailed  understand¬ 
ing  of  the  sources  of  random  phenomena  in  the 
system,  as  well  as  comprehensive  mathematical 
models  of  those  phenomena.  That  understand¬ 
ing  can  lead  us  to  a  mathematically  tractable 
formulation  of  motion  planning  in  the  form  of  a 
constrained  optimization  problem. 

We  believe  that  this  is  a  fruitful  research  direc¬ 
tion.  It  opens  a  wide  spectrum  of  questions. 
Some  of  them  are:  1)  dealing  with  non-constant 
environment  uncertainty,  2)  the  robustness  of 
obtained  plans  with  respect  to  modeling  errors, 
3)  the  numerical  complexity  of  computing  ap¬ 
proximations  of  globally  optimal  plans,  4)  ex¬ 
perimentation  with  different  types  of  difficulty 
indices  and  different  types  of  constraint  func¬ 
tions,  such  as  mathematical  expectations  instead 
of  success  probabilities.  We  hope  to  address 
some of these problems in the future.


3.4  Sensors and the IUE

The DARPA Image Understanding Environments
specification was recently released. (See
also the papers on the IUE in these proceedings
and [Mundy et al., 1992]. Unofficial
copies of both the IUE Overview Document
and the IUE Class Definition Document can
be FTPed from cs.columbia.edu in the directory
/pub/vision/iue.) A major part of this document
was the specification of sensors and related
objects.  Our  group  was  instrumental  in  this  part 
of  the  specification,  drawing  partially  from  our 
experiences  in  the  PROVER  project.  The  design 
incorporates  both  sensor  data  characteristics  and 
sensor  uncertainty  modeling.  This  past  year  we 
have been prototyping some of the IUE sensor
objects,  providing  feedback  on  the  design.  Our 
experiences  should  be  summarized  in  a  technical 
report  later  this  year. 

4  Qualitative  Vision 

In  Columbia’s  CRIS  lab  we  have  initiated 
projects in the area of qualitative vision:
projects where the goals of vision are not measurements
but qualitatively described goals, such
as object alignment without calibration, navigation
with topological directions, or determining
the applicability of certain prepositions (such as
near, far, next to) to describe the relationship
of objects in a scene. In this section, we review
our  recent  work  in  these  areas.  It  is  interest¬ 
ing  to  note  that  while  the  title  of  the  section  is 
“qualitative  vision,”  all  the  projects  have  a  sig¬ 
nificant  quantitative  component.  For  example, 
if  we  did  topological  navigation  without  a  quan¬ 
titative  error  analysis,  it  would  be  difficult  for  us 
to  measure  research  progress. 

4.1  Alignment  using  an  Uncalibrated 
Camera  System 

This  new  technique  takes  the  typical  mapping 
from  3-D  positions  to  image  coordinates,  and  in¬ 
stead  of  finding  this  mapping,  it  recovers  a  prop¬ 
erty  of  the  image  coordinates  without  calibrat¬ 
ing  the  camera  location.  This  method  exploits 
the  fact  that  a  known  movement  in  the  camera 
system  can  result  in  useful  motion  information 
in  the  image  system  without  knowing  the  exact 
calibration  between  the  systems. 
Figure 9: Experimental setup for uncalibrated
alignment 


Figure  10:  Work  table  with  “U”  shaped  part  to 
be  tracked 

In  this  method,  an  active  camera  system 
mounted  on  a  robotic  arm  can  maintain  an  arbi¬ 
trary,  geometric  relationship  with  an  object  sys¬ 
tem,  and  as  a  result  of  certain  operations,  the 
vision  system  can  “calibrate”  itself  to  or  “can 
define  its  location  with  respect  to”  the  unknown 
camera system. The novelty of our technique
arises  from  the  fact  that  our  system  “calibrates” 


Figure  11:  The  image-space  projections  of  the 
ellipses  formed  by  tracking  the  object  over  the 
first  11  positions  moved  to  by  the  robot. 

itself  while  performing  the  task  of  moving  to  the 
goal  position  and  without  computing  the  true 
location  of  the  camera  system. 

Given  an  alignment  task  (e.g.  insertion  of  a  part 
in  an  assembly),  we  first  select  designated  fea¬ 
tures  on  the  part  to  be  inserted.  The  robot  then 
rotates  the  camera  around  its  rotational  axis,  R. 
If  the  only  movement  in  the  robot-camera  sys¬ 
tem  is  the  rotation,  the  part  features  will  trace 
out  a  conic  section,  an  ellipse  under  certain  con¬ 
ditions.  By  noting  the  changes  in  these  ellipti¬ 
cal  parameters  as  the  camera  system  moves  (and 
computing  the  ellipse’s  projected  area),  we  can 
recover  the  alignment  condition.  We  determine 
the  object’s  position  based  on  the  fact  that  if  an 
object  lies  on  the  axis  around  which  a  camera 
system  moves,  the  object  will  rotate  but  will  not 
translate  in  the  camera  system.  This  fact  allows 
us to set up a control structure for closed-loop
servoing  to  the  alignment  position  even  though 
we  have  no  knowledge  of  the  camera  calibration 
information,  see  [Yoshimi  and  Allen,  1993]. 

Figure  9  shows  the  setup  of  the  camera  mounted 
on  the  arm.  Figure  10  shows  a  workspace  with 
a part to be tracked for alignment; the goal is to
insert a pin in the hole of a “U”-shaped flange. Figure
11 shows the generated ellipses from tracking
the  hole  during  the  camera’s  rotation,  and  the 
convergence  of  the  ellipse  data  as  the  alignment 
succeeds. 

4.2  Topological  Navigation 

Navigation in an unstructured environment is
generally carried out in a topological manner
(with directions like “turn right when you see
the isolated yellow house, take a left at the second
light,” etc.) rather than using metric directions
like “travel 1.345 miles, turn right, go 3.12
miles, turn left”. This past year we have continued
our work in this area, working on an analytic
(quantitative) model of the error behavior
of  our  system.  In  [Park  and  Kender,  1992]  we  de¬ 
scribe  a  purely  topological  method  for  navigation 
in  a  large  unstructured  environment  that  con¬ 
tains  featureless  objects,  using  qualitative  non¬ 
metric  information  such  as  “isolated”  landmarks 
and  “trajectories,”  which  we  define.  The  map- 
maker  and  the  navigator  are  implemented  using 
an  IBM  7575  SCARA  robot  arm,  PIPE,  and  two 
cameras.  The  navigational  environment  consists 
of  a  flat  plane  with  identical  objects  populated 
randomly  but  densely  on  it.  First,  given  a  start¬ 
ing  position  and  a  goal  position  the  map-maker 
module  observes  the  environment  and  generates 
a  “custom  map”  that  describes,  in  a  non-metric 
language,  how  to  get  from  the  starting  position 
to  the  goal  position  efficiently  and  reliably.  The 
accuracy  and  the  cost  of  the  directional  instruc¬ 
tions  are  analyzed,  then  demonstrated  by  the 
navigator  by  following  the  commands  in  the  cus¬ 
tom  map. 

The  errors  in  denoting  isolated  objects  as  land¬ 
marks  were  analytically  determined,  and  com¬ 
bined  with  an  analytic  determination  of  errors  in 
using  landmark  pairs  to  denote  parkway-crossing 
directions.  These  two  error  measures  are  then 
treated  as  a  single  measure  of  topological  good¬ 
ness.  Since  both  error  measures  are  parame¬ 
terized  by  a  model  of  sensor  error  (the  stan¬ 
dard  deviation  of  landmark  exact  placement), 
it  was  illuminating  to  simulate  the  variations  in 
paths  produced  under  increasing  error,  partic¬ 
ularly  in  cluttered  environments.  As  error  in¬ 
creases,  crowded  areas  are  avoided.  For  more 
details  see  [Park  and  Kender,  1993]  in  these  pro¬ 
ceedings. 

4.3  Qualitative  Spatial  Description 

In  document  understanding,  the  relation  be¬ 
tween  image  understanding  and  language  be¬ 
comes  very  important.  We  have  been  looking  at 
the  relationship  between  objects  within  an  image 
and  the  automatic  determination  of  prepositions 
to  describe  them,  see  [Abella  and  Kender,  1993]. 
In  the  area  of  spatial  descriptions,  the  problem  of 


Figure  12:  An  image  for  experimentation  with 
recovery  of  spatial  descriptions.  See  text  for  de¬ 
tails. 


using  prepositions  of  place,  such  as  “near,”  “in,” 
or  “next  to,”  was  formalized  in  a  translation- 
independent,  rotation-independent,  and  scale- 
independent  way.  Using  prior  research  on  the 
descriptions  of  quadrilaterals,  the  method  al¬ 
lows  fuzzy,  probabilistic  estimates  on  how  ac¬ 
curately  a  given  preposition  describes  the  two- 
dimensional  relationship  of  objects. 

We will use the image shown in figure 12 to illustrate
several “recovered” prepositions. Each
object  has  been  numbered  to  ease  its  reference. 
Internally,  objects  in  the  system  are  represented 
using  moments  through  second  order.  Currently 
the  system  accepts  a  spatial  preposition  and  dis¬ 
plays  all  those  objects  that  satisfy  the  prepo¬ 
sition  inequalities.  The  system  also  accepts  as 
input  two  objects  along  with  a  preposition  and 
it  outputs  how  well  those  two  objects  meet  the 
given preposition (the value of the membership function μ for a given σ).
All  intuitively  obvious  relations  between  objects 
are  discovered  by  the  system,  e.g.  objects  1  and 
3  are  next  to  each  other,  4  is  in  5,  6  next  to  7, 
etc. 

An interesting case, and one that demonstrates
the effects of fuzzification, is that of supplying
object 2 and object 6 along with the preposition
aligned. Fuzzification is accomplished by blurring
the object with a Gaussian distribution with
a standard deviation of σ. With no fuzzification
the system finds that 2 and 6 are not aligned.
However, if we allow a certain amount of fuzzification,
with say σ = 0.03, the value of μ_aligned ≈
0.8. This value indicates that they may be sufficiently
aligned to be regarded as such (which we
actually see in the image!), depending on how
much leeway we wish to allow. The dependency
of μ_aligned on σ is shown in figure 13. From this
graph we see that the value of the membership
function significantly deteriorates for large values
of σ. This simply means that the amount
of induced uncertainty is so large that the objects
cease to possess their original features (such
as orientation in this case). This also indicates
what the maximal acceptable value for σ should
be, in this case, σ < 0.1.

Figure 13: The dependency of μ_aligned(2,6) as a
function of σ

Figure 14: The dependency of μ_near(2,6) and
μ_far(2,6) as a function of σ

Another  interesting  case  is  that  of  supplying  ob¬ 
ject  2  and  object  6  along  with  the  preposition 
near or far. Neither satisfies the inequalities precisely.
However, if we again allow for fuzzification,
we get a most interesting result, as shown
in figure 14. We observe that although we
cannot say for certain that object 2 and object 6
are  either  near  or  far,  we  can  say  that  they  are 
somewhat  near  or  somewhat  far.  How  we  decide 
which  of  the  two  to  use  can  be  seen  in  figure  14. 
If  we  examine  the  slopes  of  the  two  curves  we 
see that for small values of σ the slope for far
is  steeper  than  that  for  near.  Therefore  it  would 
seem  more  appropriate  to  say  that  2  is  somewhat 
far  from  6  as  opposed  to  2  is  somewhat  near  to 
6. 

5  Real-Time  Vision 

While  not  really  a  “vision  topic”  in  itself,  doing 
vision  in  real-time  often  means  crafting  clever 
algorithms  to  fit  the  hardware  and  timing  con¬ 
straints.  The  topics  in  this  section  are  all  related 
to  some  research  topic  above,  but  are  separated 
out  because  of  their  real-time  nature. 

5.1  Real-Time  Polarization 
Computation 

A  problem  in  the  use  of  polarization  for  the  UGV 
project  is  that  of  data  acquisition:  how  to  get 
the  polarization  information  on  a  moving  vehicle. 
Before  we  can  develop  significant  real-time  po¬ 
larization  computation  algorithms,  we  need  the 
data.  Previous  algorithms  at  Columbia  used  a 
rotating  filter  in  front  of  a  single  camera.  While 
sufficient  for  indoor  inspection-type  tasks,  it  is 
not  acceptable  for  vehicles.  In  the  past  year  we 
have  had  two  projects  addressing  the  real-time 
polarization  computation  problem:  one  based  on 
beam-splitting,  and  one  on  image  warping.  We 
have  already  implemented  and  tested  the  algo¬ 
rithms  for  polarization  computation;  it  is  just  a 
question  of  getting  multiple  images,  registered  in 
space  and  time,  which  measure  different  polar¬ 
ization  states. 
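
As an illustration of what can be computed once registered images of different polarization states are available, the sketch below uses a common textbook formulation rather than the specific algorithms referred to above. It assumes three images taken through linear polarizers at 0, 45 and 90 degrees (the text does not specify which states are measured), and the function name `linear_polarization` is hypothetical.

```python
import numpy as np

def linear_polarization(i0, i45, i90, eps=1e-12):
    """Per-pixel linear polarization from three registered images taken
    through linear polarizers at 0, 45 and 90 degrees (assumed setup).
    Returns total intensity, degree of linear polarization, and the
    polarization angle in radians."""
    i0, i45, i90 = (np.asarray(a, dtype=float) for a in (i0, i45, i90))
    s0 = i0 + i90            # total intensity
    s1 = i0 - i90            # 0 vs. 90 degree component
    s2 = 2.0 * i45 - s0      # 45 vs. 135 degree component
    degree = np.sqrt(s1 ** 2 + s2 ** 2) / np.maximum(s0, eps)
    angle = 0.5 * np.arctan2(s2, s1)
    return s0, degree, angle
```

The arithmetic is purely per-pixel, which is why the difficulty lies in acquiring the registered images and not in the computation itself.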

5.1.1  Polarization  Preserving 
Beam-splitter  (PPBS) 

We  have  designed  and  built  a  “polarization  pre¬ 
serving  beam  splitter,”  and  are  in  the  process  of 
calibrating  it.  (Actually,  it  does  not  totally  pre¬ 
serve  polarization,  but  the  effects  introduced  can 
be  compensated  for  via  calibration.)  We  have 
found  that  with  its  20  degree  field  of  view  and 
beam-splitting  external  to  the  camera  optics, 
precise pixel alignment is difficult to achieve
and, more importantly, virtually impossible to
maintain.  We  were  also  not  able  to  achieve  it 
simultaneously  over  the  entire  image  in  differ¬ 
ent  spectral  bands.  Our  conclusion  was  that 
we  would  work  with  this  approximate  alignment, 
calibrate  the  different  distortions,  and  then  warp 
the  images  to  achieve  alignment.  This  is  sim¬ 
ilar  to  the  approach  we  used  last  year  to  deal 
with  chromatic  aberration  correction  in  regular 
lenses  [Boult  and  Wolberg,  1992a].  The  cali¬ 
bration  process  is  now  being  implemented.  The 
PPBS is intended to allow internal experimentation
and will be used as a comparison for real-time
warping-based polarization computations.
The PPBS is not suitable for use on the UGV.

5.1.2  Polarization Computation by
Image-Warping

A  second  approach  for  real-time  polarization, 
currently  under  investigation,  is  the  use  of  real¬ 
time  image  warping  to  align  images  taken  with 
separate  cameras  and  then  apply  the  compu¬ 
tations.  For  small  baselines  (1-2  inches)  and 
small  focal  length  lenses  (6-8mm)  the  disparity- 
distortions  are  rather  small.  The  real  issue  is 
how  accurately  we  can  warp  the  image  data, 
which  is  addressed  in  the  next  section. 

5.2  Real-Time Image Warping

Our  previous  work  in  chromatic  aberration  cor¬ 
rection  ([Boult  and  Wolberg,  1992a,  Boult  and 
Wolberg,  1992b])  and  our  ongoing  work  in  real¬ 
time  polarization  algorithm  development,  have 
demonstrated  the  need  for  high-quality  image 
warping.  While  we  have  developed  a  serial  al¬ 
gorithm  for  this,  which  requires  1-2  seconds  on 
a  standard  workstation,  the  real-time  polariza¬ 
tion  work  requires  real-time  warping.  The  lens 
correction  would  also  be  more  easily  used  if  it 
were  run  in  real-time  as  part  of  the  image  ac¬ 
quisition  process.  In  the  past  year  we  have  been 
developing  and  testing  a  near  real-time,  high- 
quality  image  warping  algorithm.  The  algorithm 
was  designed  for  the  PIPE  image  processor,  and 
currently runs at 10 Hz. Multiple versions can
be run in parallel to allow 30 Hz processing with
a  1/10  second  pipeline  delay.  It  uses  separable 
image  reconstruction  filters  (up  to  9x9)  and  pre¬ 
computed  distortion  tables.  We  are  currently 
testing  it  using  the  image  reconstruction  filters 


developed  under  our  previous  DARPA  contract, 
see  [Boult  and  Wolberg,  1993].  In  addition,  we 
are  developing  the  calibration  processes  to  de¬ 
termine  the  spatial  warp  needed  for  different  ap¬ 
plications.  Related  work  is  also  under  investiga¬ 
tion,  with  no  significant  results  yet,  on  flexible 
multi-modal registration which uses explicit sensor
models and a flexible matching process.
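
Warping with precomputed distortion tables can be illustrated with a minimal sketch. It is not the PIPE implementation: it assumes the tables store, for every output pixel, the fractional source position to sample, and it substitutes simple bilinear interpolation for the larger separable reconstruction filters mentioned above. The function name `warp_with_table` is hypothetical.

```python
import numpy as np

def warp_with_table(image, src_rows, src_cols):
    """Resample `image` so that output pixel (r, c) takes its value from the
    (generally fractional) source position (src_rows[r, c], src_cols[r, c]).
    Bilinear interpolation is applied separably: one linear blend along
    columns, then one along rows. Positions are clamped to the image."""
    image = np.asarray(image, dtype=float)
    h, w = image.shape
    rows = np.clip(src_rows, 0, h - 1)
    cols = np.clip(src_cols, 0, w - 1)
    r0 = np.floor(rows).astype(int)
    c0 = np.floor(cols).astype(int)
    r1 = np.minimum(r0 + 1, h - 1)
    c1 = np.minimum(c0 + 1, w - 1)
    fr = rows - r0
    fc = cols - c0
    top = image[r0, c0] * (1 - fc) + image[r0, c1] * fc
    bottom = image[r1, c0] * (1 - fc) + image[r1, c1] * fc
    return top * (1 - fr) + bottom * fr
```

Since the tables depend only on the calibration and not on the image content, they can be computed once and reused for every frame, leaving only table lookups and weighted sums in the per-frame path.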

5.3  Real-Time  Tracking 

An  important  use  of  real-time  vision  is  for  track¬ 
ing  moving  objects.  In  this  area  we  have  two 
projects,  one  on  tracking  objects  for  grasping  by 
an  arm,  and  the  other  on  tracking  for  surveil¬ 
lance. 

5.3.1  Real-Time  Tracking  and 
Grasping 

Research  in  real-time  motion  tracking  and  grasp¬ 
ing  has  succeeded  in  laboratory  demonstrations 
of  tracking  and  grasping  moving  objects.  The 
focus of this work is to achieve a high level of interaction
between a real-time vision system capable
of tracking moving objects in 3-D and a
robot  arm  equipped  with  a  dextrous  hand  that 
can  be  used  to  pick  up  a  moving  object.  The  sys¬ 
tem  we  have  built  addresses  three  distinct  prob¬ 
lems  in  robotic  hand-eye  coordination  for  grasp¬ 
ing  moving  objects:  fast  computation  of  3-D  mo¬ 
tion  parameters  from  vision,  predictive  control 
of  a  moving  robotic  arm  to  track  a  moving  ob¬ 
ject,  and  grasp  planning.  The  system  is  able 
to  operate  at  approximately  human  arm  move¬ 
ment  rates,  and  has  been  tested  with  a  moving 
model  train  which  is  tracked,  stably  grasped,  and 
picked  up  by  the  system  (see  Figure  15).  The 
system uses a real-time motion stereo computation
and a novel probabilistic filter [Allen et al.,
1992, Allen et al., in press]. We have also published
our results on unifying image-flow computations
in an estimation-theoretic framework
[Singh  and  Allen,  1992]. 

5.3.2  Surveillance  Mode  Tracking 

Related  ongoing  real-time  tracking  work  for 
surveillance  is  joint  work  with,  and  partially 
funded  by,  Texas  Instruments.  The  approach 
builds  upon  some  of  Columbia’s  recent  work  in 
real-time  tracking,  but  uses  the  DATACUBE 
MV20  processor  as  opposed  to  the  PIPE.  This 
Figure 15: Intercepting, grasping and picking up
the  object 


change  in  real-time  hardware,  plus  the  pan¬ 
ning  action  of  the  camera,  necessitated  a  mod¬ 
erately  different  approach.  We  are  currently  ex¬ 
panding  the  approach  to  be  applied  with  a  sec¬ 
ond  generation  FLIR,  and  expect  to  incorporate 
both  sensor  models  and  “target”  motion  mod¬ 
els.  This  year’s  project  will  provide  new  informa¬ 
tion  which  should  be  useful  in  Columbia’s  own 
tracking  research,  as  well  as  input  for  our  work 
on  Sensor  Modeling.  This  technology  transfer  is 
thus  bi-directional. 


5.4  Real-Time  Detail-Preserving 
Smoothing 

This past year we also completed a formal description
of our ongoing work on G-neighbor-based
processing, see [Boult et al., 1992].
G-neighbor algorithms use a modification of the
usual 8-connected neighborhood, including only
pixels that differ by less than a fixed amount
or by less than a fixed ratio. These signal-dependent
neighborhoods are then used in traditional
processing for detail-preserving smoothing
or signal-dependent morphology. In [Boult et
al., 1992] we present a qualitative analysis of the
usefulness of G-neighbor-based detail-preserving
smoothing, comparing it to four previously existing
algorithms for detail-preserving smoothing.
The G-neighbor approach is significantly cheaper
and easily parallelized. The results of the qualitative
comparison were surprisingly good: the
new algorithm was as good as or better than the
comparison algorithms. The quantitative analysis
of these algorithms is still ongoing, and should
be completed in the next year.

The  algorithm  has  been  implemented  both 
on  a  workstation,  under  KBVision,  and  on 
our  PIPE.  The  real-time  system  computes  G- 
neighborhoods  in  a  single  frame-time  and  per¬ 
forms  1  iteration  of  smoothing  per  frame.  For 
comparison,  an  example  of  the  algorithm  and  a 
few  comparison  algorithms  can  be  found  in  fig¬ 
ure  16. 
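
The neighborhood rule described in this section can be sketched directly. The version below is a plain illustration, not the KBVision or PIPE code: it implements only the fixed-amount variant (not the fixed-ratio one), replicates border pixels at the image edge, and uses the hypothetical name `g_neighbor_smooth`.

```python
import numpy as np

def g_neighbor_smooth(image, threshold, iterations=1):
    """Average each pixel with those of its 8 neighbours whose grey level
    differs from it by less than `threshold`; the centre pixel always
    counts. Border pixels are replicated outward (an assumption)."""
    out = np.asarray(image, dtype=float).copy()
    h, w = out.shape
    offsets = [(dr, dc) for dr in (-1, 0, 1) for dc in (-1, 0, 1)
               if (dr, dc) != (0, 0)]
    for _ in range(iterations):
        padded = np.pad(out, 1, mode="edge")
        total = out.copy()
        count = np.ones_like(out)
        for dr, dc in offsets:
            # Shifted view holding, for every pixel, one particular neighbour.
            neigh = padded[1 + dr:1 + dr + h, 1 + dc:1 + dc + w]
            keep = np.abs(neigh - out) < threshold
            total += np.where(keep, neigh, 0.0)
            count += keep
        out = total / count
    return out
```

Neighbours across a strong edge fail the difference test and are left out of the average, so noise within a region is reduced while the edge itself is untouched. Each pixel's result depends only on its own 3x3 window, which is what makes the method easy to parallelize.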
Abstract

This paper summarizes the USC Image Understanding
research projects and provides references to more detailed
sources of information. Our work has focused on
the topics of 3-D vision (including range data processing,
stereo, shape from contour and object recognition),
aerial image analysis, motion analysis (including 3-D
motion and structure estimation, visual guidance for
mobile robots, and an integrated motion system), and
parallel processing (including mapping algorithms onto
specific or flexible architectures, and processor-time
trade-offs).

1  INTRODUCTION 

This paper summarizes our research projects over the
last year. Much of this work is described in detail in the
other papers in these proceedings, with this overview giving
only a brief description of the detailed efforts. Work
that is less complete will be described in somewhat
more detail in this overview since there is no corresponding
paper in the proceedings.

Our research covers a broad range of separate tasks
in image understanding, but the different tasks are highly
inter-related and share many common techniques.
The four major task areas are three-dimensional vision,
motion analysis, parallel processing and aerial image
analysis, which is largely supported under the RADIUS
program. This introduction will briefly describe the different
research projects.

Three-Dimensional  Vision 

Three-dimensional  vision  is  needed  for  many  tasks  in¬ 
cluding  those  of  manufacturing,  robotics  and  outdoor 
object recognition. To achieve these goals, we must
develop a representation formalism that is rich enough to

* This research was supported in part by the Advanced Research
Projects Agency of the Department of Defense and was monitored
by the Air Force Office of Scientific Research under Contract No.
F49620-90-C-0078. Some work was supported under the RADIUS
program under a sub-contract from Hughes Aircraft Co. The United
States Government is authorized to reproduce and distribute reprints
for governmental purposes notwithstanding any copyright notation
hereon.


describe complex objects in terms of both volumes and
surfaces and which allows us to convert between them.
We must also develop techniques to compute these
descriptions from real data, which includes shadows and
noise, and we must recognize such objects from a large
database of objects. Three-dimensional vision is the
largest research area funded under this contract. Within
that area, we have a variety of projects:

•  Description  of  3-D  objects:  We  are  studying  the 
problem  of  generating  3-D  surface  descriptions  from 
range  data  using  the  concept  of  deformable  models. 
This  work  is  described  in  [Liao  &  Medioni  1993].  A 
second project using range data is exploring the
integration of surface descriptions from different
viewpoints and is discussed in detail in [Chen & Medioni
1993]. A third effort addresses the problem of
recovering segmented, hierarchical volumetric descriptions
from range data.

• Perceptual Grouping: Most high level vision
algorithms require perfect data as input, but it is
impossible to generate such features with low level
algorithms such as edge detectors. We are working on
bridging this gap by transforming an edge image into
a saliency map. This approach uses a non-iterative
method based on a field associated with each edge.
This field encodes the notions of simplicity, curvature
constancy and co-curvilinearity. A full report on
this effort is given in [Guy & Medioni 1993].

• 3-D Shape from Monocular Images: In this project,
we are developing techniques for inferring 3-D shape
descriptions given only a single image of the scene.
We have developed techniques to use our earlier
theoretical results on real images where contours are
likely to be fragmented and distracting contours such
as markings and shadows are present. We report on
these efforts in [Zerroug & Nevatia 1993a and b].

Motion  Analysis 

We  have  a  number  of  projects  in  the  area  of  motion  anal¬ 
ysis,  with  autonomous  navigation  providing  the  context 
for most of the work, though these techniques have a
much  broader  utility. 

•  Integrated  system  for  Motion:  We  have  developed  an 
integrated  system  that  includes  hierarchical  feature 
extraction  and  matching  and  feedback  of  3-D  motion 
estimation  to  the  feature  matching  process.  The  sys¬ 
tem  is  able  to  tolerate  errors  and  differences  in  the  fea¬ 
ture  extraction  and  matching  process  by  removing 
these inconsistent feature points from the later
analysis. This is described in [Kim & Price 1993].

• Mobile Platform: We use the domain of autonomous
navigation to unify our motion work. To this end we
have a small project in vision based navigation with a
trinocular  stereo  system  for  reliable  3-D  descriptions 
of  the  environment.  The  recent  results  for  this  effort 
are  briefly  given  later  in  Section  3.  More  details  can 
be  found  in  [Kim  &  Nevatia  1993]. 

Aerial  Image  Analysis 

Our  work  in  aerial  image  analysis  consists  of  two  major 
components.  First  is  the  transfer  of  technology  funded 
by  DARPA  to  the  RADIUS  program.  We  also  continue 
on  our  long  range  effort  of  analyzing  complex  cultural 
domains.  Our  recent  work  in  extraction  of  buildings  is 
given  in  [Huertas  et  al.  1993]  and  [Chung  &  Nevatia 
1992b].  In  the  domain  of  large  commercial  airports  we 
have  shown  good  results  on  the  detection  of  runways 
and taxiways [Huertas et al. 1990]. Recently we have
ported this system from the older Symbolics version to
the Sun platform and assisted another DARPA funded
group at USC in the reimplementation of these
algorithms on a Prolog machine. This airport analysis work
also forms the basis for a new effort in using the Loom
knowledge  representation  system  for  vision  tasks. 

Knowledge  Based  Vision 

We  have  begun  a  project  in  using  a  standard  knowledge 
representation system (the Loom system developed at
USC-ISI). This system will be applied to our earlier
airport analysis system. This work is briefly described in
Section  5. 

Parallel  Processing 

We are investigating parallel implementations of
various vision algorithms developed in our group and
elsewhere. We have studied algorithms for stereo and image
matching, and graph algorithms. This involves
implementing such algorithms on existing architectures. The
recent work on implementing scalable data parallel
geometric hashing for matching is discussed in more
detail in [Khokhar & Prasanna 1993]. In previous work,
we  have  also  explored  the  advantages  of  using  flexible 
architectures  [Reinhart  1991]. 


2  THREE-DIMENSIONAL  VISION 

2.1 Description of 3-D Objects

2.1.1  Integration  from  Multiple  Views 

We have developed systems for building models from
unregistered multiple range images [Parvin & Medioni
1992, Chen & Medioni 1992]. The latter system
integrates views at the triangulated surface level rather than
at the pixel level. A triangulated surface model can
represent a variety of solid objects and, in theory, at any
resolution. Such models are not ideal representations for
high level vision tasks, such as recognition, because,
first, the representation is still low level and, second, it
is sensitive to many parameters and therefore unstable.
However, we think it is a good intermediate
representation for integration and for building high level
descriptions through surface interpolation from triangulation.
Figure 1 shows the multi-view integration result for a
complex object.


(a)  Original  intensity  image 


(b) Rendered result (c) Rendered result

Figure 1 Three dimensional reconstruction from
several  views.  Two  views  of  the  resulting 
structure  for  the  object  "tooth." 

2.1.2  Deformable  Models 

A second project in range analysis involves the use of
deformable surfaces to generate a 3-D approximation of
range data. This work, performed by C. Liao and G.
Medioni, builds on our earlier work in “B-Snakes.” The
user provides an initial simple surface, such as a cube,
which is subject to internal forces (describing implicit
continuity properties such as tension and bending) and
external forces which attract it toward the data points.
The problem is cast in terms of energy minimization.
We solve this non-convex optimization problem by
using the well known Powell algorithm, which guarantees
convergence to a (possibly local) extremum and does
not require gradient information. The variables are the
positions of the control points. The number of control
points is adaptively controlled. This methodology leads
to a reasonable complexity and good numerical
stability. We also provide a novel solution to the problem of
subdividing a patch when the fit is bad. We show results
on real range images to illustrate the applicability of our
approach. The advantages of this approach are that it
provides a compact representation of the approximated
data, and lends itself to applications such as non-rigid
motion tracking and object recognition. Currently, our
algorithm gives only a C^0 continuous analytical
description of the data, but due to the flexibility of our adaptive
approach it should be easy to upgrade to C^1 or C^2.
Figure 2 shows an example of the extraction of the
surface description from the range data.
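The energy formulation above can be illustrated with a small sketch. It is a 2-D toy, not the method of [Liao & Medioni 1993]: a closed polygon stands in for the deformable surface, the energy terms and the weight `alpha` are assumptions chosen for illustration, and a plain coordinate-wise pattern search replaces the Powell algorithm. Like Powell's method, this search needs no gradient information and may stop at a local minimum.

```python
import math

def energy(ctrl, data, alpha=0.1):
    """Energy of a closed polygon `ctrl` with respect to `data` points.

    Both terms are illustrative assumptions, not the paper's definitions.
    """
    n = len(ctrl)
    # Internal (tension) term: sum of squared polygon edge lengths.
    internal = sum(
        (ctrl[i][0] - ctrl[(i + 1) % n][0]) ** 2
        + (ctrl[i][1] - ctrl[(i + 1) % n][1]) ** 2
        for i in range(n)
    )
    # External term: each data point pulls on its nearest control point.
    external = sum(
        min((px - cx) ** 2 + (py - cy) ** 2 for cx, cy in ctrl)
        for px, py in data
    )
    return alpha * internal + external

def coordinate_search(ctrl, data, step=0.5, min_step=1e-3):
    """Derivative-free descent over the control point coordinates."""
    ctrl = [list(p) for p in ctrl]
    best = energy(ctrl, data)
    while step > min_step:
        improved = False
        for p in ctrl:
            for axis in (0, 1):
                for delta in (step, -step):
                    p[axis] += delta
                    e = energy(ctrl, data)
                    if e < best:
                        best, improved = e, True
                    else:
                        p[axis] -= delta  # undo a move that did not help
        if not improved:
            step /= 2.0  # refine the search once no move helps
    return [tuple(p) for p in ctrl], best

# Toy data: 16 samples on a circle of radius 2, and a small initial square.
data = [(2 * math.cos(2 * math.pi * k / 16), 2 * math.sin(2 * math.pi * k / 16))
        for k in range(16)]
start = [(0.5, 0.5), (-0.5, 0.5), (-0.5, -0.5), (0.5, -0.5)]
fitted, final_energy = coordinate_search(start, data)
```

The variables being optimized are the control point positions, as in the text. Adaptive insertion of control points and patch subdivision are not modelled here.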


(a)  (b) 


Figure  2  (a)  shows  one  view  of  a  head  with  45524 
sampled  points,  and  the  initial  cube  with  8 
control  points  and  12  triangles,  (b)  and  (c)  are 
the  fitting  surface  with  259  control  points  and 
514  triangles,  (d)  shows  the  shaded  surfaces 
after  the  third  subdivision 


2.1.3  Segmented  Volumetric  3-D  Descriptions 
We address the problem of recovering segmented
hierarchical volumetric descriptions of three dimensional
shapes. In earlier work [Rom & Medioni 1992] [Rom
& Medioni 1993], we have suggested a method (using
SLS) for obtaining hierarchical axial descriptions of
planar shapes, together with a decomposition of the
shapes into their parts. Unfortunately, it is not
straightforward to extend these methods to handle three
dimensional shapes. This is because in three dimensional
space the SAT and SLS axes are, in general, not curves,
but surfaces, leading to unnatural descriptions
[Nackman 1985].

In this current work, performed by H. Rom and G.
Medioni, we restrict ourselves to three types of parts:
Convex blobs (or Ovoids, borrowing the terminology
from Koenderink [Koenderink 1990]), Straight
Homogeneous Generalized Cylinders (SHGCs [Shafer
1983]), and Planar Right Constant GCs (PRCGCs
[Ulupinar 1991], planar axis and constant cross section).
These components cover many of the man-made
objects encountered on a normal basis. We suggest the use
of properties of the parabolic curves (zero crossings of
the Gaussian curvature) for recovering the cross
sections and axes of the different parts. We advocate the
use of the parabolic curves over the often used
occluding contours, which are unstable in range data. We will
assume that the shapes are C^2 continuous (i.e. the
curvature is defined everywhere). We do not want to assume,
as several authors do, that the parts are cut along a cross
section or that a cross section is visible. Furthermore,
we will not assume the existence of any discontinuity
edges between parts. We believe that the case of parts
joined discontinuously is the limiting case of the more
general continuous case which we address.

Given the 3-D surface data, either from a CAD
model, or from registered range images [Chen &
Medioni 1992], or from a single range image, we first recover
the parabolic curves on the surface. This requires the
evaluation of the sign of the Gaussian curvature of the
surface patches. It has been shown that this process is
stable and reliable [Besl & Jain 1986] [Ponce & Brady
1987] [Fan et al. 1989]. The parabolic curves could be
either on the surface of the individual parts, or on the
border of the “glue” between parts. Note that, due to the
transversality principle [Guillemin & Pollack 1974],
there is almost always an anticlastic (negative Gaussian
curvature) region between convex parts when they are
joined. The parabolic curves on the parts we consider
could be either meridians or cross sections of the SHGC
and PRCGC parts (this has been shown for SHGCs
[Ponce et al. 1989] and we have proven it for PRCGCs).
Using simple tests we can hypothesize (or in many
cases determine) the role of each parabolic curve. We can
therefore segment the object into parts, and based on the
(a) Original edges


(b)  Saliency  map 


Figure  3  Steps  in  extracting  the  most  salient 
features  from  an  edge  image. 


containing Straight, Homogeneous Generalized
Cylinders (SHGCs). The image may contain multiple,
occluding objects and the objects may have surface
markings. In working with real images, we must deal with
problems of fragmented boundaries and many
additional boundaries due to markings, shadows, highlights and
noise. We use the expected properties of the desired
contours to separate the two sets of properties and to
complete the broken boundaries. Figure 4 shows results
on one example. Figure 4(a) shows the image, (b) the
edges detected in it. Figure 4(c) shows the three
reconstructed SHGCs from two viewpoints each. Further
details of this process are given in another paper in these
proceedings [Zerroug & Nevatia 1993a].

In  continuation  of  this  work,  we  are  also  studying 
the  class  of  curved  generalized  cylinders  with  circular 
but  changing  cross-sections.  In  this  case,  we  are  unable 
to  find  invariants  for  the  visible  boundaries.  However, 
we  are  able  to  find  good  quasi-invariants  that  show  that 
commonly used ribbon descriptions are, in fact, stable
for  such  objects.  Our  future  work  will  focus  on  com¬ 
pound  objects  that  combine  a  number  of  primitives  that 
we  have  analyzed  in  the  past. 


properties of the specific parts, we can recover the axis
of  the  parts  from  the  meridians  and  cross  sections. 
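The recovery of parabolic curves rests on the sign of the Gaussian curvature. The sketch below is one way to evaluate that sign, assuming the surface is available as a height field z(x, y) (for example, a single range image) and using central finite differences. The function names, the step `h`, and the tolerance `eps` are illustrative choices, not taken from the cited work. For a height field, K = (z_xx z_yy - z_xy^2) / (1 + z_x^2 + z_y^2)^2.

```python
import math

def gaussian_curvature(z, x, y, h=1e-2):
    """Gaussian curvature of the height field z(x, y) by central differences."""
    zx = (z(x + h, y) - z(x - h, y)) / (2 * h)
    zy = (z(x, y + h) - z(x, y - h)) / (2 * h)
    zxx = (z(x + h, y) - 2 * z(x, y) + z(x - h, y)) / h ** 2
    zyy = (z(x, y + h) - 2 * z(x, y) + z(x, y - h)) / h ** 2
    zxy = (z(x + h, y + h) - z(x + h, y - h)
           - z(x - h, y + h) + z(x - h, y - h)) / (4 * h ** 2)
    return (zxx * zyy - zxy ** 2) / (1 + zx ** 2 + zy ** 2) ** 2

def classify(k, eps=1e-6):
    """Label a surface point by the sign of its Gaussian curvature."""
    if k > eps:
        return "elliptic"      # convex or concave region
    if k < -eps:
        return "hyperbolic"    # anticlastic (saddle-like) region
    return "parabolic"         # zero crossing: candidate parabolic curve

def bump(x, y):
    """Toy surface: a smooth bump whose flanks are saddle-like."""
    return math.exp(-(x * x + y * y))
```

On the toy bump, points near the peak are elliptic and points on the flank are hyperbolic, so a parabolic curve (here a circle) separates the two regions. Tracing such zero crossings over a real surface and assigning each curve its role is beyond this sketch.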

One  problem  which  remains  is  that  some  parts  can¬ 
not  be  found  until  some  other  parts  are  removed.  As  in 
[Rom  &  Medioni  1992]  and  [Rom  &  Medioni  1993],  we 
take a hierarchical strategy, in which, at each step, well
defined parts are described and removed. Once these
parts are removed, the next level parts can now be
described. This process is efficient and produces a
decomposition of the shape into its intuitive parts with a stable
axial  description  of  these  parts. 

2.2  Perceptual  Grouping 

We have started to develop a system for perceptual
grouping based on the properties of collinearity and
co-curvilinearity of partial contours [Guy & Medioni
1993]. The implementation is obtained by initiating an
oriented vector field at each site detected by a low level
detector. Structures which satisfy our constraints
reinforce each other, and dominate other arrangements.
Figure 3 shows a typical input, and the salient structures
detected.

2.3  Shape  Analysis  from  Monocular  Images. 

We have continued our effort in understanding how
to infer shape from monocular images using contours.
First we developed a theory of invariances of projected
contours and how they can be used to infer 3-D shapes
of certain classes of surfaces [Ulupinar & Nevatia
1990, Ulupinar & Nevatia 1992, Ulupinar 1991]. In
recent work, we have developed a system for generating
volumetric 3-D shape descriptions from real images


(c)  Different  views  of  each  extracted  object 


Figure  4  Generation  of  3-D  SHGC  descriptions 
from  edges  extracted  from  an  image. 
3 MOTION ANALYSIS
We  have  advocated  the  use  of  multi-frame  analysis  for 
several  years.  In  previous  work,  we  developed  a  motion 
and  structure  estimation  system  based  on  so  called  chro- 
nogeneous  motion,  which  includes  uniform  acceleration 
and  constant  angular  velocity  rotation  and  translation  as 
special cases [Franzen 1991, Franzen 1992].

3.1  Integrated  Motion  System 
In  recent  work  using  the  motion  estimation  system,  we 
have  been  building  an  integrated  system  that  includes 
hierarchical  feature  extraction  and  matching  and  feed¬ 
back  of  the  Franzen  3-D  motion  estimation  results  to  the 
feature matching process [Kim & Price 1993]. This
enables the system to tolerate errors and differences in the
feature  extraction  and  matching  processes  by  removing 
these  inconsistent  feature  points  from  the  later  analysis. 
Figure  5  shows  the  first  and  last  frames  of  an  indoor  im¬ 
age  sequence  (acquired  from  the  University  of  Massa¬ 
chusetts)  along  with  the  reconstructed  image-plane 
trajectories  based  on  the  3-D  analysis  and  the  recon¬ 
structed  top-down  view  of  the  scene. 
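The feedback idea, removing feature matches that disagree with the estimated motion, can be sketched in a much reduced form. The example below is a 2-D stand-in and not the system of [Kim & Price 1993]: the "motion estimate" is only a single dominant image translation (the component-wise median displacement), and the threshold `max_residual` is an assumed parameter.

```python
from statistics import median

def filter_matches(matches, max_residual=2.0):
    """Drop matches inconsistent with the dominant image translation.

    Each match is ((x0, y0), (x1, y1)): a feature position in two frames.
    Returns the consistent matches and the estimated translation.
    """
    dx = median(q[0] - p[0] for p, q in matches)
    dy = median(q[1] - p[1] for p, q in matches)
    kept = []
    for p, q in matches:
        rx = (q[0] - p[0]) - dx
        ry = (q[1] - p[1]) - dy
        if (rx * rx + ry * ry) ** 0.5 <= max_residual:
            kept.append((p, q))
    return kept, (dx, dy)

# Five consistent matches (displacement near (3, 1)) and one mismatch.
matches = [
    ((10.0, 10.0), (13.0, 11.0)),
    ((40.0, 25.0), (43.2, 26.1)),
    ((70.0, 60.0), (72.9, 60.8)),
    ((15.0, 80.0), (18.1, 81.0)),
    ((55.0, 45.0), (58.0, 46.2)),
    ((30.0, 30.0), (50.0, 15.0)),  # inconsistent match
]
kept, translation = filter_matches(matches)
```

In the full system the consistency test would be made against the 3-D motion and structure estimate rather than a 2-D translation. The structure is the same: estimate, measure residuals, discard outliers, and continue with the remaining points.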


(a)  First  image  (b)  Last  image 


(c)  Image  plane  view  (d)  Overhead  view 

Figure  5  Images  and  reconstructed  results  for  the 
cone  sequence. 


3.2  Mobile  Platform 

We have continued with our robot project using
trinocular imagery for guidance. We are investigating robot
navigation for situations where only generic maps are
available, with one of the tasks being the generation of
more complete maps. The visual navigation uses three
views to improve the performance of the stereo system,
both in speed and accuracy of the matching. Rather than
producing a complete depth map, we are concerned only
with producing a “squeezed 3-D map” that shows
corridor walls and obstacles in the hallway. Figure 6 shows
the three input images for a hallway scene along with
the resulting 3-D map and the planned path.

(b) Depth map (c) Planned Path

Figure 6 Trinocular vision results for mobile
robot.

4  AERIAL  IMAGE  ANALYSIS 
Most of our effort in this area is in support of the
RADIUS program and is largely funded under a different
contract, though much of the basic technology has derived
from our basic IU research effort. We have focused on
the problem of 3-D object detection and description,
particularly of buildings. Various RADIUS
experiments with image analysts clearly illustrate the central
role of buildings in a site.

4.1  Stereo  Calibration 

In work by M. Bejanin and G. Medioni, we have
considered the problem of finding the geometric
transformation between two perspective views of the same scene
and the determination of epipolar lines, given a set of
matched control points. Many methods exist, and they
fall into two groups: linear methods and non-linear methods
[Horn 1991]. Linear methods work with matrix
operations, using the essential matrix that was introduced by
Longuet-Higgins in 1981 [Hartley 1992]
[Longuet-Higgins 1981]. We have evaluated the performance of both
types of methods using aerial images. We have also
produced extensive comparisons between the two types of
methods by testing them on synthetic random data. The
main result of this comparative study is that the
non-linear method seems to be more robust and reliable in the
presence of noise. Also, the linear method has many
degenerate cases where the transformation cannot be
computed; among these is the “perfect” stereo case where
both optical axes are parallel and perpendicular to the
baseline. Finally, the result obtained on aerial scenes is
not very accurate, as these scenes are nearly planar.
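For readers unfamiliar with the essential matrix, the following sketch shows the constraint that the linear methods exploit. It assumes calibrated cameras with normalized image coordinates and the convention X2 = R X1 + t between the two camera frames, so that E = [t]x R and conjugate points satisfy x2^T E x1 = 0. The function names and the test values are illustrative and are not taken from the work described above.

```python
import math

def cross_matrix(t):
    """Skew-symmetric matrix [t]x such that [t]x v = t x v."""
    tx, ty, tz = t
    return [[0.0, -tz, ty], [tz, 0.0, -tx], [-ty, tx, 0.0]]

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def matvec(a, v):
    return [sum(a[i][k] * v[k] for k in range(3)) for i in range(3)]

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def rotation_y(theta):
    """Rotation by `theta` radians about the camera y-axis."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]]

def essential_matrix(rotation, translation):
    """E = [t]x R for the relative pose X2 = R X1 + t."""
    return matmul(cross_matrix(translation), rotation)

def normalized(point):
    """Perspective projection onto the plane z = 1."""
    return [point[0] / point[2], point[1] / point[2], 1.0]

def epipolar_residual(E, X, rotation, translation):
    """x2^T E x1 for a 3-D point X given in the first camera frame."""
    x1 = normalized(X)
    X2 = [a + b for a, b in zip(matvec(rotation, X), translation)]
    x2 = normalized(X2)
    return dot(x2, matvec(E, x1))

R = rotation_y(0.1)
t = [1.0, 0.0, 0.2]
E = essential_matrix(R, t)
residual = epipolar_residual(E, [0.3, -0.4, 5.0], R, t)
```

The vector E x1 gives the coefficients of the epipolar line in the second image on which the conjugate point must lie. Estimating E from noisy control points, which is the subject of the comparison above, is not shown.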

After the transformation between the two images
has been computed from any of these methods, we show
that it is possible to obtain, from the initial views, two
images that are in parallel epipolar geometry. This is
done by applying the rotation to the first image plane.
This is equivalent to reprojecting the image, by keeping
the same position in space for the principal point, but
simply changing the direction of the principal ray. After
having transformed the first image, our two images will
lie on parallel image planes, but the scale will not
necessarily be the same (that is to say, the baseline is not
necessarily parallel to both image planes). Therefore we
still need to scale the first image to the size of the second
image. We first transform the image coordinate system
in both images such that the new x-axis and the new
x'-axis are parallel to the baseline; we are then able to use
the control points to scale the first image: since we want
to get collinear epipolar images, two conjugate points
should lie on the same line (in this new coordinate
system), and so have identical y-coordinates. The results
for real aerial images of the Ft. Hood site are shown in
Figure 7 (flight parameters were used for computing the
results).
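The final scaling step can be sketched as a small least-squares fit. This is an illustration under assumptions rather than the procedure used for Figure 7: it takes the y-coordinates of matched control points after both coordinate systems have been aligned with the baseline, and fits a scale `s` and an offset `b` so that s * y1 + b matches y2. The offset term is an addition made here for generality; the text mentions only a scale.

```python
def fit_row_scale(y_first, y_second):
    """Least-squares (s, b) such that s * y_first + b approximates y_second."""
    n = len(y_first)
    mean1 = sum(y_first) / n
    mean2 = sum(y_second) / n
    var = sum((a - mean1) ** 2 for a in y_first)
    cov = sum((a - mean1) * (b - mean2) for a, b in zip(y_first, y_second))
    s = cov / var
    return s, mean2 - s * mean1

def rescale_rows(y_first, s, b):
    """Map first-image rows so conjugate points share a row."""
    return [s * y + b for y in y_first]

# Synthetic control points: the second image is 1.25x larger, shifted by 3.
y1 = [10.0, 40.0, 75.0, 120.0]
y2 = [1.25 * v + 3.0 for v in y1]
s, b = fit_row_scale(y1, y2)
```

With noisy control points the fit averages out the error. With exact data, as here, it recovers the scale and offset used to generate them.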

4.2  Stereo  for  Buildings 

The  problem  of  building  detection  and  description  is 
difficult  for  a  number  of  reasons.  Aerial  images  tend  to 
be  highly  complex  with  even  simple  buildings  having 
many  architectural  details,  surrounding  trees  and  vehi¬ 
cles,  nearby  and  aligned  roads  etc.  These  cause  the  low- 
level  segmentation  to  produce  highly  fragmented  re¬ 
sults.  Also,  3-D  information  is  not  explicit  in  2-D  imag¬ 
es,  but  must  be  inferred  somehow.  We  have  developed 
two systems, one using stereo analysis, the other using
shadow  analysis  to  overcome  both  problems  (of  frag¬ 
mentation  and  3-D  description). 

Availability of stereo makes the task of inferring
3-D easier. However, we must still solve the problem of
correspondence, a difficult one in this context, and infer
surfaces from the partial stereo matches. In our system,
we use scene features for correspondence. Our current
system uses junction and line matches, though the ability to
match higher level features can also be incorporated.
Surfaces are inferred from junction and line matches by


(a)  Original  Aerial  Images 


(b)  Left  result  image  (c)  Right  result  image 


Figure 7 Two original images of the Ft. Hood
area  (top),  and  registered  and  right 
transformed  images  in  collinear  epipolar 
geometry  (bottom). 


forming co-planar clusters and tracing structures among
them. Our system can work with both overhead and
“oblique” views. An example is shown in Figure 8.
Note that the system produces a flat roof (as opposed to
wavy surfaces typically found by stereo systems) and is
able to detect the hole in the roof correctly. More
examples and a detailed description of this system may be
found in [Chung & Nevatia 1992b] and [Chung 1992].
4.3  Use  of  Groupings  and  Shadows 
Another system we have developed uses only a single
image, but uses shadows to verify the presence of a building
and estimate its height. This system is currently
designed for overhead views; we are in the process of
extending it for oblique views. Basically, the approach is
to use perceptual grouping to hypothesize likely
building-like structures. We assume that the buildings are
rectangular or compositions thereof. Thus, the
hypothesis generation phase consists of forming rectangular
hypotheses from fragmented line segments. We select
among these hypotheses based on some geometrical
analysis of overlap, containment and strength of the
evidence forming the hypotheses. The selected hypotheses
are then verified by examining whether they cast
shadows in the appropriate ways. Figure 9 shows an
example. Note that this building contains a number of parallel
structures on top of the roof, making the hypothesis
formation and selection process particularly difficult.
(a) Original image (left) (b) Matched lines
and junctions


(c) Two extracted
surfaces


(d)  Rendered  view  of 
extracted  model 


Figure  8  Analysis  of  buildings  using  stereo.  The 
hierarchical  matching  uses  lines  and  junctions 
to  extract  a  set  of  surfaces,  which  are  then 
shown  as  a  rendered  model. 


Nonetheless, our system produces good results. More
examples and a more complete description are given in
another paper in these proceedings [Huertas et al.
1993].
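The height estimate from a shadow can be illustrated with the standard flat-ground relation h = L tan(e), where L is the shadow length on the ground and e is the sun elevation angle. The sketch assumes an overhead (nadir) view, level terrain, and a known ground sample distance. The function and parameter names are hypothetical, and the procedure of [Huertas et al. 1993] may differ.

```python
import math

def building_height(shadow_pixels, metres_per_pixel, sun_elevation_deg):
    """Height of a vertical structure from its shadow on level ground.

    shadow_pixels: shadow length measured along the sun direction, in pixels.
    metres_per_pixel: ground sample distance of the overhead image.
    sun_elevation_deg: sun elevation above the horizon, in degrees.
    """
    shadow_metres = shadow_pixels * metres_per_pixel
    return shadow_metres * math.tan(math.radians(sun_elevation_deg))

# A 20-pixel shadow at 0.5 m/pixel with the sun 45 degrees above the horizon.
height = building_height(20, 0.5, 45.0)
```

A low sun gives long shadows and hence a more precise height for a given pixel error, at the cost of more shadow overlap between neighbouring structures.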

5  KNOWLEDGE-BASED  SYSTEMS 
We  have  begun  a  new  project  in  using  standard  knowl¬ 
edge  representation  technology  for  Image  Understand¬ 
ing.  Loom,  a  knowledge  representation  system 
developed  at  USC-ISI,  provides  a  high  level  program¬ 
ming  interface  for  Lisp  and  an  environment  for  knowl¬ 
edge  based  system  construction  [MacGregor  & 
Burstein 1991]. Since it is built on Lisp, it can be easily
incorporated  into  our  Lisp  based  environment  and 
should  be  compatible  with  the  Image  Understanding 
Environment  developments  (which  did  not  address  the 
knowledge  base  aspects  of  image  analysis).  In  past 
years  we  developed  a  system  that  uses  domain  knowl¬ 
edge  to  simplify  the  task  of  describing  a  scene  [Huertas 
et al. 1990]. This original system encoded the
knowledge about airports in the programs.

We  chose  to  apply  Loom  to  the  airport  analysis 
problem  so  that  we  could  address  the  knowledge  repre¬ 
sentation  issues  separate  from  the  image  understanding 
issues.  This  is  possible  since  the  basic  analysis  pro¬ 
grams  already  exist  and  the  incorporation  of  Loom  into 
the  analysis  can  proceed  in  an  incremental  fashion,  first 
with  the  use  of  Loom  as  a  representation  system  for  the 


(a)  Input  image  (b)  Extracted  linear  features 


(c)  Hypotheses  formed  (d)  Extracted  building 

regions 

Figure  9  Building  detection  using  shadows  for 
verification. Using the lines and junctions, the
buildings are extracted and verified using
shadows. 

generic descriptions of airports and for the particular
instance that we are analyzing. Then we will begin to use
the other descriptive and deductive capacities for
finding and analyzing the runway markings that are used for
the final verification and positioning of the runways.
Loom provides a means to describe the other objects
that may be present in the scene (taxiways, buildings,
aircraft, etc.) and their relations to the other objects.

Our  first  goal  is  to  evaluate  the  capabilities  of  Loom 
for  high  level  Image  Understanding  and  then  to  use 
Loom  for  generic  high  level  descriptions  of  complex 
objects,  and  to  use  these  descriptions  for  building  other 
analysis  systems. 

6  PARALLEL  PROCESSING 
We  have  studied  parallel  implementations  of  several 
high-level  algorithms,  such  as  relaxation  labelling  and 
graph  matching.  Our  recent  work  has  looked  at  the 
problem of geometric hashing, which is used for a
variety of matching problems [Stein & Medioni 1992]. In
earlier  parallel  implementations  the  number  of  proces¬ 
sors  was  independent  of  the  size  of  the  scene  but  de¬ 
pended  on  the  size  of  the  model  database.  In  this  work 
we  have  designed  new  parallel  algorithms  for  both  the 
MasPar  and  Connection  Machine  architectures  which 
improve  on  the  number  of  processors  and  improve  the 
overall  performance.  Details  of  this  work  are  given  in 
[Khokhar & Prasanna 1993].
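As background for the discussion above, here is a minimal serial sketch of geometric hashing for 2-D point sets under similarity transformations. It shows only the basic table-building and voting steps. It is not the parallel MasPar or Connection Machine algorithm of [Khokhar & Prasanna 1993], and the function names, the bin size `cell`, and the toy models are assumptions made for illustration.

```python
from collections import defaultdict
from itertools import permutations

def basis_coords(p, a, b):
    """Coordinates of p in the frame where a maps to (0,0) and b to (1,0)."""
    ex, ey = b[0] - a[0], b[1] - a[1]
    d2 = ex * ex + ey * ey
    px, py = p[0] - a[0], p[1] - a[1]
    return (px * ex + py * ey) / d2, (-px * ey + py * ex) / d2

def quantize(uv, cell=0.1):
    """Hash-table bin for a pair of invariant coordinates."""
    return round(uv[0] / cell), round(uv[1] / cell)

def build_table(models, cell=0.1):
    """Preprocessing: store (model, basis) under every invariant coordinate."""
    table = defaultdict(list)
    for name, pts in models.items():
        for i, j in permutations(range(len(pts)), 2):
            for k, p in enumerate(pts):
                if k not in (i, j):
                    key = quantize(basis_coords(p, pts[i], pts[j]), cell)
                    table[key].append((name, i, j))
    return table

def recognize(table, scene, cell=0.1):
    """Recognition: each scene basis votes; return (best model, votes)."""
    best = (None, 0)
    for a, b in permutations(range(len(scene)), 2):
        votes = defaultdict(int)
        for k, p in enumerate(scene):
            if k not in (a, b):
                key = quantize(basis_coords(p, scene[a], scene[b]), cell)
                for entry in table.get(key, ()):
                    votes[entry] += 1
        for entry, count in votes.items():
            if count > best[1]:
                best = (entry[0], count)
    return best

models = {
    "L": [(0, 0), (4, 0), (4, 1), (1, 1), (1, 3), (0, 3)],
    "square": [(0, 0), (1, 0), (1, 1), (0, 1)],
}
table = build_table(models)
# Scene: the "L" model rotated by 90 degrees, scaled by 2, and translated.
scene = [(-2 * y + 5, 2 * x - 3) for x, y in models["L"]]
result = recognize(table, scene)
```

The table grows with the number of models and basis pairs, while the voting work grows with the number of scene points and scene bases. This is the trade-off that the processor allocation discussed above is concerned with.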
Abstract

Machine Vision Research at the University of
Rochester centers around the concept of active or behavioral
vision. Vision is considered as an interactive process
between a system and its environment, and this
relationship is explicitly invoked to provide constraints to
make machine vision applications tractable. Work in
this area has focussed on four main areas: integration
of visual and motor control, learning for robot skill and
visual model acquisition, task dependent allocation of
computational and physical resources, and parallel
real-time systems integration and support. We also
continue to develop real-time low-level primitives for
integration into larger systems.

1.  The  Laboratory 

The Computer Vision Laboratory at the University of
Rochester is set up to study active vision in a very real
sense - by interacting with the physical world. Robot
hardware includes a binocular head with three external
degrees of freedom, a Utah/MIT hand, and two Puma
robot arms for moving the hand and head.
Computational hardware includes a large collection of
Maxvideo boards for performing real-time image processing,
an eight node Silicon Graphics Multiprocessor, an
array of eight transputer TWS TRAMS, and various
coordinating workstations. Several improvements have
been made to the lab in the last year. The head has
been modified to provide color images and internal
degrees of freedom, including focus and zoom. The
second Puma arm was acquired to give us the
capability to move the head and hand cooperatively. The
Puma arms were converted to run RCCL instead of
VAL. This gives us more flexibility, and provides a
much needed increase in speed. We have obtained an
EXOS exoskeleton for measuring human hand
movements during manipulation. This replaces the
DataGlove as an input device for the Utah hand, and
provides values that are far more stable and repeatable.
We have also founded a Virtual Reality lab for
investigating psychophysical aspects of vision. This lab
contains eye and head trackers that will be incorporated
into a helmet-mounted stereo video system. The
equipment will be used to study human visual strategies
during the performance of various tasks. Recent research
at the lab was described in a dedicated special issue of
the International Journal of Computer Vision [Nelson
1991a and sequel]. The animate/behavioral vision
approach is also described in [Ballard et al. 1992,
Ballard 1991, Mmaff and Brown 1992].

2.  Integration  of  Visual  and  Motor  Control. 

Previous work on low-level gaze control culminated
with a successful integration of vergence and tracking
that permitted the robot head to track an object moving
in a cluttered scene binocularly in real time. The system
combined foveal regions of interest, a binocular
disparity filter, and predictive tracking to obtain the
demonstrated performance [Brown 1991a, Brown and
Coombs 1991, Brown et al. 1992, Coombs and Brown
1992, Grosso and Ballard 1992, Soong and Brown
1991].

A new initiative by Nelson addresses hand-eye coordination
using local linear models [Nelson 1993b]. The
general idea is to relate changes in object appearance to
a set of one-parameter manipulations or “twiddles” by
means of a generalized Jacobian. We represent the
appearance of an object by a vector x of scalar quantities,
which could be the image coordinates of point
features, segment, curve or blob parameters, or even
colors. We also have a vector y of qualitative one-parameter
manipulations or “twiddles”, which could
involve both rigid motions and nonrigid deformations.
For a particular pose of the object, we can describe the
change in appearance of the object under a small manipulation
by a motor-visual Jacobian formed from the
partial derivatives of the feature values with respect to
the various manipulations. A nice property of this
Jacobian is that it can be determined interactively, i.e.,
learned. The robot just picks up an object, looks at it,
makes small manipulations along its basis axes, and
observes the changes in appearance.
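
The learning procedure described above can be sketched as a finite-difference estimate. This is only an illustration under stated assumptions: `observe` is a hypothetical callback standing in for "apply a small manipulation and measure the features", and `toy_observe` is an invented smooth function, not a model of the Rochester hardware.

```python
import numpy as np

def estimate_jacobian(observe, n_manip, step=1e-2):
    """Estimate J[i, j] = d x_i / d y_j by central differences.

    observe(y) is a hypothetical callback: it applies manipulation
    vector y relative to the current pose and returns the feature vector x.
    """
    x0 = np.asarray(observe(np.zeros(n_manip)), dtype=float)
    J = np.zeros((x0.size, n_manip))
    for j in range(n_manip):
        dy = np.zeros(n_manip)
        dy[j] = step                      # twiddle one basis manipulation
        x_plus = np.asarray(observe(dy), dtype=float)
        x_minus = np.asarray(observe(-dy), dtype=float)
        J[:, j] = (x_plus - x_minus) / (2.0 * step)
    return x0, J

def toy_observe(y):
    """Invented stand-in for the robot and camera: three features, two twiddles."""
    return np.array([1.0 + 2.0 * y[0], np.sin(y[1]), y[0] * y[1] + 3.0 * y[1]])

x0, J = estimate_jacobian(toy_observe, n_manip=2)
```

Each column of `J` records how the feature vector responds to one basis manipulation, which is the quantity the text calls the motor-visual Jacobian.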

The motor-visual Jacobian can be used to implement a
tight version of a principal views 3-D recognizer by
storing with each view the motor-visual Jacobian
corresponding to that pose and some set of basis manipulations.
Given a match hypothesis from some flexible
matching process, a difference vector can be computed
giving the discrepancy between the stored
representation and the measured values of the
corresponding features. We then use a well-conditioned
pseudoinverse, computed from the singular value
decomposition of the Jacobian, to find the manipulation
that comes closest (e.g., in a least squares sense) to producing
the observed view. By checking the difference
between this solution and the actual appearance, we
can tell whether what we see is consistent with what is
encoded about how the appearance of the object can
change. This procedure allows an accurate match to be
made without an explicit 3-D model, and using only
visually obtained information.
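
A minimal sketch of the consistency check, assuming the stored view and its Jacobian are available as NumPy arrays. The function names, the relative tolerance used to discard small singular values, and the numbers in the example are all illustrative assumptions rather than details from [Nelson 1993b].

```python
import numpy as np

def truncated_pinv(J, rel_tol=1e-3):
    """Pseudoinverse from the SVD, discarding small singular values."""
    U, s, Vt = np.linalg.svd(J, full_matrices=False)
    keep = s > rel_tol * s[0]
    s_inv = np.zeros_like(s)
    s_inv[keep] = 1.0 / s[keep]
    return Vt.T @ (s_inv[:, None] * U.T)

def match_residual(J, x_stored, x_measured):
    """Least-squares manipulation explaining the difference, plus what it cannot explain."""
    d = x_measured - x_stored               # difference vector
    y = truncated_pinv(J) @ d               # closest manipulation
    residual = float(np.linalg.norm(d - J @ y))
    return y, residual

J = np.array([[2.0, 0.0], [0.0, 1.0], [0.0, 3.0]])
x_stored = np.array([1.0, 0.0, 0.0])

# A view reachable by a small manipulation leaves almost no residual.
y_true = np.array([0.05, -0.02])
y_fit, r_good = match_residual(J, x_stored, x_stored + J @ y_true)

# A view that no manipulation explains leaves a large residual.
_, r_bad = match_residual(J, x_stored, x_stored + np.array([0.0, 0.3, -0.1]))
```

A small residual supports the match hypothesis; a large one indicates that the observed appearance is not consistent with the stored view under any small manipulation.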

The same mechanism used for recognition can be
reversed to visually control manipulations. The basic
idea is to experimentally determine the visual effect of
qualitative manipulations of an object and represent
them, as previously, by a motor-visual Jacobian. If we
know the appearance of the object in the goal condition,
we can compute a difference vector and solve for
the best manipulation vector using the pseudoinverse as
before. Executing this manipulation, we will effect the
goal configuration within the accuracy of the linear
approximation. The primary advantage of the proposed
technique is that it can be used with a manipulator and
visual system for which a good physical model does
not exist.
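
The control use can be sketched in the same way. The example below assumes a fixed Jacobian measured at the starting pose and an invented feature function `toy_observe`. Repeating the step several times is an addition made here for illustration, since a single step reaches the goal only within the accuracy of the linear approximation.

```python
import numpy as np

def toy_observe(y):
    """Hypothetical stand-in for 'apply manipulation y, then measure features'."""
    return np.array([1.0 + 2.0 * y[0], np.sin(y[1]), y[0] * y[1] + 3.0 * y[1]])

# Jacobian at the starting pose, e.g. learned from small test manipulations.
J = np.array([[2.0, 0.0], [0.0, 1.0], [0.0, 3.0]])
J_pinv = np.linalg.pinv(J)

x_goal = toy_observe(np.array([0.2, 0.3]))   # appearance in the goal condition
y = np.zeros(2)                              # accumulated manipulation
errors = []
for _ in range(10):
    d = x_goal - toy_observe(y)              # difference vector
    errors.append(float(np.linalg.norm(d)))
    y = y + J_pinv @ d                       # best manipulation under the linear model

final_error = float(np.linalg.norm(x_goal - toy_observe(y)))
```

No physical model of the arm or camera appears anywhere in the loop; only the measured Jacobian and the observed features are used.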

3.  Learning  Applications 

Learning has become increasingly important in our
research, where we define learning as the process of
obtaining some of the information needed to specify a
complex machine system automatically, either via
interaction with the world, or using a search process to
find internal parameters that produce some more
loosely specified behavior. Along the theoretical axis,
Whitehead finished his work on using variations of
reinforcement learning to acquire execution models for
joint foveal and manipulator control for goal-state
specified tasks in complex environments [Ballard et al.
1992, Ballard and Whitehead 1992]. Ballard and de Sa
continued work on self-teaching variants of competitive
learning [de Sa and Ballard 1992]. The basic idea
is to use the non-flat nature of joint probability conditional
density functions between different sensory
modalities to drive the segregation of classes when the
information in a single modality is not appropriately
biased to drive a clustering process. Brown also pursued
topics in geometric invariance, which allow
recognition models to be automatically acquired
[Brown 1991b, Brown et al. 1992]. On the applications
side, two projects are ongoing. One, by Pook and Ballard,
attempts to derive high-level execution models for
complex manipulation tasks by observing teleoperated
sequences. The other, by Schneider and Brown,
attempts to learn motor control strategies for
parameterizable actions that optimize some cost function.
These are described briefly below.


3.1  Tele-assisted  Manipulation 

Teleoperation, wherein a robot mimics the actions of a
human, offers the means to learn action sequences. If
one records the robot state under teleoperated control
then one can later replay the sequence of moves to
duplicate the performance. The problem with such
replays is that they are open-loop; the controller cannot
respond to even minor changes in the environment.
Pook and Ballard are developing the idea of tele-assistance,
a form of control that combines the learning
abilities of teleoperation with closed-loop autonomous
control, to address the common latency and accuracy
problems of teleoperation [Pook and Ballard 1992a,
1992b, Chu 1992]. For example, as an operator slides a
tool across a surface, a closed-loop controller can
maintain steady robot velocity and contact with the surface
even if the operator action is jerky.

The first step in tele-assistance is to identify primitive
actions as they are performed. Using teleoperation of
the Utah/MIT hand to perform tasks such as flipping
eggs with a spatula, Pook has successfully identified
action primitives such as grasping, carrying, pressing,
and sliding in contact, from characteristic temporal patterns
of joint forces. This was done using learning vector
quantization to generate codebook patterns from a
training set of actions. The codebook patterns were
then used in conjunction with a hidden Markov model
for transitions between actions to identify the action
primitives in the context of a larger task.
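
As a rough illustration of the codebook stage only, the sketch below implements the basic LVQ1 update rule on invented two-dimensional data. The feature vectors, class names, learning rate, and initial codebook are assumptions made for the example; the actual work used temporal patterns of joint forces, and the hidden Markov model stage is not shown.

```python
import numpy as np

def train_lvq1(X, labels, codebook, code_labels, lr=0.1, epochs=20, seed=0):
    """LVQ1: pull the nearest codebook vector toward same-class samples, push it away otherwise."""
    rng = np.random.default_rng(seed)
    codebook = codebook.astype(float).copy()
    for _ in range(epochs):
        for i in rng.permutation(len(X)):
            k = int(np.argmin(np.linalg.norm(codebook - X[i], axis=1)))
            sign = 1.0 if code_labels[k] == labels[i] else -1.0
            codebook[k] += sign * lr * (X[i] - codebook[k])
    return codebook

def classify(x, codebook, code_labels):
    """Label of the nearest codebook vector."""
    return code_labels[int(np.argmin(np.linalg.norm(codebook - x, axis=1)))]

# Invented training set: two well-separated clusters of feature vectors.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal([0.0, 0.0], 0.3, (20, 2)),
               rng.normal([5.0, 5.0], 0.3, (20, 2))])
labels = ["grasp"] * 20 + ["slide"] * 20
code_labels = ["grasp", "slide"]
codebook = train_lvq1(X, labels, np.array([[1.0, 1.0], [4.0, 4.0]]), code_labels)
```

After training, `classify(np.array([0.2, -0.1]), codebook, code_labels)` returns the label of the nearer codebook vector, which is the per-sample evidence a transition model could then smooth over time.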

The second step in tele-assistance is to produce the
closed-loop controls that implement the action primitives.
Our current approach is based on the use of
parameterized behaviors that reduce the redundant
degrees of freedom in the hand to a manageable level
via a task-dependent linking mechanism. We have had
some success already in producing such parameterized
behaviors. Pook and Ballard will continue to pursue
this step over the upcoming year.

3.2  Robot  Skill  Learning 

Schneider and Brown are working on the topic of robot
skill learning [Schneider and Brown 1992]. Skills are
parameterized open-loop behaviors useful for tasks for
which the delay of closed-loop control cannot be
tolerated,  or  for  which  no  intermediate  performance 
information  is  available.  Throwing  a  ball  at  a  target  is 
an  example  of  the  latter.  A  skill  is  termed  “n- 
dimensional”  when  the  task  is  parameterized  by  n 
desired  output  values. 

Robot skill learning can be modeled as a function
approximation problem. The learning algorithm needs
to find a mapping from n desired output values to the
space of possible robot control trajectories. These input
control trajectories may be sequences of joint positions,
velocities, or torques. Typically the mapping from
robot control signal to task output is redundant and the
inverse to be learned should be optimized according to
a cost metric. Function approximation techniques can
be divided into global and local methods.

In  global  approximation  techniques,  a  variable  func¬ 
tional  form  is  specified  in  terms  of  parameters  that 
affect the entire domain of the mapping. The learning
algorithm  must  find  the  parameter  settings  that  optim¬ 
ize  a  cost  function.  Linear  models  are  easy  to  deal 
with,  but  restrictive  when  the  task  is  non-linear,  as  is 
the  case  in  most  robot  control  problems.  We  therefore 
adopted  a  modified  linear  model  in  which  learning  was 
done  in  a  linear  space,  but  a  set  of  non-linear  basis 
functions  was  used  to  convert  points  in  the  output  space 
into  robot  trajectories.  Experiments  with  a  one¬ 
dimensional  throwing  task  (the  single  task  parameter  is 
distance  to  throw)  were  done  in  simulation.  The  results 
showed that global function approximation can be used
for  one-dimensional  throwing.  Performance  is  highly 
dependent  on  choice  of  basis  functions  and  robot  con¬ 
trol  schemes  (joint  velocities,  torques,  or  spring-based 
control).  The  largest  drawback  is  the  number  of  robot 
executions  necessary  to  fit  the  model,  which  increased 
dramatically  with  the  number  of  task  dimensions.  This 
led  us  to  consider  local  function  approximation 
methods. 
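
To make the global approach concrete, here is a toy sketch in which a one-dimensional throwing skill is fit by linear least squares over a small set of non-linear basis functions. The drag-free 45-degree throw model, the choice of basis functions, and all names are assumptions made for illustration; they are not the simulator or the basis functions used in [Schneider and Brown 1992].

```python
import numpy as np

G = 9.81  # gravitational acceleration, m/s^2

def throw(speed):
    """Toy simulator: landing distance for a 45-degree release with no drag."""
    return speed ** 2 / G

def basis(d):
    """Non-linear basis functions of the desired distance (an assumed choice)."""
    d = np.atleast_1d(np.asarray(d, dtype=float))
    return np.stack([np.ones_like(d), np.sqrt(d), d], axis=1)

# "Robot executions": try a range of release speeds and record where the ball lands.
speeds = np.linspace(2.0, 8.0, 12)
distances = throw(speeds)

# Learning is linear in the weights even though the basis functions are not.
weights, *_ = np.linalg.lstsq(basis(distances), speeds, rcond=None)

def speed_for(distance):
    """Control parameter predicted by the fitted global model."""
    return float(basis(distance)[0] @ weights)

landed = throw(speed_for(3.0))
```

The fit works well here because the basis happens to contain the true inverse (a square root), which mirrors the observation in the text that performance depends heavily on the choice of basis functions.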

Canonical  local  function  approximation  methods 
include  neural  networks,  nearest  neighbor  classifiers, 
table  lookup,  and  radial  basis  functions.  Schneider 
used  a  form  of  interpolated  table  lookup  modified  to 
handle  a  high  (18)  dimensional  input  space  efficiently 
to learn a two-dimensional (x and y landing position of
the  ball)  throwing  task  in  simulation.  Experiments  were 
run  with  1-d  and  2-d  throwing  to  compare  the  brute 
force  search  of  standard  table  lookup  and  the  new  algo¬ 
rithm.  The  results  showed  a  55%  to  80%  reduction  in 
the  cost  metric  (includes  accuracy  and  control  effort) 
for  the  same  number  of  robot  executions.  The  experi¬ 
ments  also  showed  that  the  local  method  could  outper¬ 
form  the  global  method  even  when  it  had  the  benefit  of 
good  basis  functions.  An  additional  feature  of  the  new 
algorithm is its ability to automatically extend its range
of  performance  as  it  acquires  the  skill.  Neither  stan¬ 
dard  table  lookup  nor  the  global  method  allows  this. 
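
For comparison, standard interpolated table lookup can be sketched in one dimension as follows. This shows only the baseline local method, not Schneider's modified algorithm for high-dimensional inputs; the `SkillTable` class and the toy throw model are assumptions made for the example.

```python
import numpy as np

G = 9.81  # gravitational acceleration, m/s^2

def throw(speed):
    """Toy simulator: landing distance for a 45-degree release with no drag."""
    return speed ** 2 / G

class SkillTable:
    """Stores (control, outcome) pairs and interpolates a control for a desired outcome."""

    def __init__(self):
        self.controls = []
        self.outcomes = []

    def record(self, control, outcome):
        self.controls.append(control)
        self.outcomes.append(outcome)

    def lookup(self, desired):
        order = np.argsort(self.outcomes)          # np.interp needs increasing x values
        xs = np.asarray(self.outcomes)[order]
        ys = np.asarray(self.controls)[order]
        return float(np.interp(desired, xs, ys))   # clamps outside the recorded range

table = SkillTable()
for speed in np.linspace(2.0, 8.0, 25):            # one table entry per robot execution
    table.record(speed, throw(speed))

landed = throw(table.lookup(3.0))
```

Because `np.interp` clamps to the end points, this baseline cannot reach targets beyond the distances already recorded, which is the limitation of standard table lookup noted above.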

4.  Intelligent  Resource  Allocation 

One of the major driving forces behind research into
active  vision  is  the  need  for  some  sort  of  sophisticated 
control  over  the  allocation  of  resources,  both  physical 
and  computational  [Ballard  and  Brown  1992,  Brown 
1992b].  As  an  example  of  the  first  case,  eyes,  hands, 
and  other  sensors  and  manipulators  generally  can’t  be 
everywhere  at  once,  and  moreover,  may  require  consid¬ 
erable  time  or  energy  to  transfer  between  states.  As  an 
example  of  the  second,  it  is  generally  not  necessary  for 
a  vision  system  to  identify  everything  identifiable  in  a 
scene,  and  is  probably  a  waste  of  resources  to  do  so  in 
most applications. This brings up the issue of visual
attention,  where  resources  are  allocated  on  the  basis  of 
what  is  most  likely  to  benefit  the  task  at  hand.  Two 
current projects address the issue of visual attention in
various ways.


4.1  Resource  Allocation  Using  Bayes  Nets 

Rimey and Brown have addressed the problem of intelligent
resource allocation using a Bayes net formalism
[Rimey  and  Brown  1991,  1992a,  1992b,  1993].  The 
general  problem  is  to  control  a  computer  vision  system 
that  has  a  repertoire  of  actions  so  that  it  achieves  some 
goal  in  minimal  time.  In  particular  we  want  to  accom¬ 
plish  visual  tasks  efficiently  by  using  knowledge  about 
the  scene  domain  and  about  available  visual  and  non¬ 
visual  operators.  Efficiency  comes  from  processing  the 
scene  only  where  necessary,  to  the  level  of  detail 
necessary,  and  with  only  the  necessary  operators. 

TEA-1  is  a  general  purpose  selective  computer  vision 
system  that  attempts  to  accomplish  these  goals  using 
Bayes  nets  for  representation  and  a  maximum  expected 
utility  rule  to  make  decisions  about  what  action  to  take. 
It  is  both  a  prototype  system  and  an  open-ended 
research tool for studying control of selective perception.
The basic constraint assumed for the vision system
is a pointable, multiresolution sensor that cannot
view the whole scene at once. The problem is how best
to utilize this sensor to achieve a particular goal. We
have  performed  extensive  experiments  both  in  simula¬ 
tion  and  in  the  lab. 

A number of modifications have recently been made to
improve the performance of the system. First, actions
have been split into two types: visual actions that process
image data, and camera movement actions. Previously
all camera movements were integrated into each
visual action. The split both reduces the number of cases
that must be considered and decouples the analysis.
Second, all the decision algorithms have been extended
so they can be based on an expected value of sample
information (EVSI) measure. The main advantage of
using an EVSI measure is that values and costs
throughout the system are formally consistent, and in
practice there are fewer coefficients that need to be
adjusted. Third, a new algorithm based on a
brute-force state-space search was added for deciding
which visual and camera movement actions to execute.
In addition, a formal definition of the T-world domain
and a software system for creating and solving T-world
problems have been developed. This enables analysis
of a variety of factors that affect the performance of a
selective perception system.
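
The EVSI measure follows the standard decision-theoretic definition: the expected utility of the best decision after seeing an observation, averaged over observations, minus the expected utility of the best decision made now. The sketch below computes it for a discrete toy problem. The arrays and names are illustrative assumptions and do not reflect how TEA-1 represents its Bayes nets.

```python
import numpy as np

def best_expected_utility(belief, utility):
    """utility[a, s] is the payoff of decision a in state s; returns max over a of E[U | a]."""
    return float(np.max(utility @ belief))

def evsi(prior, likelihood, utility):
    """Expected value of sample information; likelihood[z, s] = P(observation z | state s)."""
    p_z = likelihood @ prior                       # marginal probability of each observation
    value_with_info = 0.0
    for z, pz in enumerate(p_z):
        if pz > 0:
            posterior = likelihood[z] * prior / pz
            value_with_info += pz * best_expected_utility(posterior, utility)
    return value_with_info - best_expected_utility(prior, utility)

# Two scene hypotheses, two final decisions, payoff 1 for a correct decision.
prior = np.array([0.5, 0.5])
utility = np.eye(2)
noisy = np.array([[0.8, 0.2],    # a visual action that is right 80% of the time
                  [0.2, 0.8]])
gain = evsi(prior, noisy, utility)
```

An action would then be worth executing when its EVSI exceeds its cost expressed in the same utility units, which is what keeps values and costs formally consistent.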

4.2  Searching  Cluttered  Areas 

Object search is one area in which animate vision strategies
can have a high payoff [Swain et al. 1992].
When  searching  for  an  object,  it  can  be  advantageous 
to  make  use  of  the  spatial  relationships  in  which  it 
commonly  participates.  Searches  that  do  this,  which 
we  call  indirect  searches,  can  be  modeled  as  two-stage 
processes  that  first  find  an  intermediate  object  that 
commonly  participates  in  a  spatial  relationship  with  the 
target  object,  and  then  look  for  the  target  in  the  res¬ 
tricted  region  specified  by  this  relationship.  Using  this 
model,  Wixson  has  determined  that  over  a  wide  range 
of situations, indirect searches improve efficiency by
factors of 2 to 8 [Wixson and Ballard 1991, Wixson
1992]. To exploit such spatial relationships, it is necessary
to have mechanisms for selecting camera positions.
Since large objects that are suitable intermediate
objects  are  usually  cluttered,  we  need  general-purpose, 
yet  computationally  efficient  mechanisms  for  detecting 
these  occluded  areas  and  bringing  them  into  view. 

Occluding  edges  indicate  areas  that  cannot  be  viewed 
from  the  current  viewpoint,  and  Wixson  believes  that 
sparse  information  about  occluding  edges  can  be  used 
to  construct  simple  but  efficient  search  mechanisms. 
Recently,  he  has  developed  an  algorithm  for  detecting 
occluding  edgels  in  stereo  or  motion  pairs.  The  algo¬ 
rithm  works  by  searching  for  matches,  in  the  right 
image,  for  the  regions  to  the  left  and  right  of  a  left- 
image  edgel.  The  algorithm  is  based  on  that  of  Toh  and 
Forrest  but  extends  that  work  in  several  ways.  First,  it 
adds an algorithm for automatically selecting an
appropriate  size  for  the  correlation  windows  used  to 
detect  the  occlusions.  Second,  it  identifies  a  situation 
in which simply examining the match values is
insufficient to determine whether an edgel is a surface
marking  or  an  occlusion.  Finally,  it  adds  a  post¬ 
processing  step  that  eliminates  some  falsely  detected 
occlusions.  Work  on  procedures  for  searching  clut¬ 
tered  areas  using  this  algorithm  is  currently  underway. 

5.  Parallel  Systems  Support 

Systems support for parallel processing applications in
AI is an ongoing theme at Rochester [Bianchini and
Brown 1992, Marsh et al. 1992, Weems et al. 1991].
Currently, Robert Wisniewski is working on systems
support for real-time parallel intelligent applications.
Interest in designing such applications is growing, but
no good software platform yet exists upon which to
build them. The goal is to design a general system that
allows for the necessary intelligent “hooks” or
information exchange between the high-level
application and the underlying system. Previous work
has developed systems with good information exchange
at the single application level, but general parallel
environments with good interfaces between high and
low levels do not yet exist.

We  are  using  a  parallel  shepherding  application 
developed  on  an  eight  node  SGI  multiprocessor  to 
develop and test our proposed run-time environment.
We  believe  this  application  is  representative  of  AI 
applications  needing  to  function  in  a  real  world 
environment.  The  goal  is  to  keep  as  many  individually 
moving  objects  confined  on  a  table  top  as  possible. 
Different planners using different amounts of time devise
strategies for a robot arm manipulator using the
visual  input  of  an  overhead  camera. 

Our work has examined the effectiveness of our run-time
environment in choosing the appropriate planner
for a particular dynamic internal state of the system.
The run-time environment is made aware of the
application’s goals through a set of shared data
structures. This permits the execution level to maintain
environmental independence while still allowing
information to be propagated throughout the system.
The run-time environment also keeps track of the state
of the underlying processors and system. When the
application indicates it wants to solve a new goal, the
run-time environment combines the information it
maintains about the system (e.g., about deadlines and
accuracy requirements) with knowledge of the structure
of the application to choose an appropriate
planner. Our results indicate that this strategy works
better than having a single monolithic planner.
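
One simple way to picture the planner choice is a rule that picks the highest-quality planner expected to finish before the deadline. The sketch below is a hypothetical illustration only: the `Planner` record, its fields, and the selection rule are assumptions, and the text does not specify how the actual run-time environment makes its choice.

```python
from dataclasses import dataclass

@dataclass
class Planner:
    name: str
    expected_time: float      # seconds needed to produce a plan (assumed estimate)
    expected_quality: float   # higher is better (assumed estimate)

def choose_planner(planners, time_available, min_quality=0.0):
    """Pick the best-quality planner that fits the deadline; otherwise fall back to the fastest."""
    feasible = [p for p in planners
                if p.expected_time <= time_available and p.expected_quality >= min_quality]
    if not feasible:
        return min(planners, key=lambda p: p.expected_time)
    return max(feasible, key=lambda p: p.expected_quality)

planners = [
    Planner("reactive", 0.05, 0.4),
    Planner("greedy", 0.5, 0.7),
    Planner("lookahead", 3.0, 0.9),
]
chosen = choose_planner(planners, time_available=1.0)
```

With one second available the rule selects the middle planner; with more time it would select the slower, higher-quality one.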

6.  Motion  Recognition 

Polana  and  Nelson  have  worked  on  robustly  comput¬ 
able  motion  features  that  can  be  used  directly  as  a 
means  of  recognition.  The  underlying  motivation  is  the 
observation  that,  for  objects  that  typically  move,  it  is 
frequently  easier  to  identify  them  when  they  are  mov¬ 
ing  than  when  they  are  stationary  [Nelson  1991]. 
Specifically, the goal is to design, implement and test a
general  framework  for  recognizing  both  distributed 
motion  activity  on  the  basis  of  temporal  texture  and 
complexly  moving  compact  objects  on  the  basis  of 
their action. This recognition approach contrasts with
the reconstructive approach that has typified most prior
work on motion. The proposed work has practical
applications  in  monitoring  and  surveillance,  and  as  a 
component  of  a  sophisticated  visual  system. 

For  the  first  phase  of  the  project  image  sequences  con¬ 
taining  temporal  textures  were  analyzed.  Real  imagery 
was  used  as  the  prime  source  of  test  data.  The  normal 
flow  field  of  the  motion  between  successive  frames  was 
used  as  the  basis  for  recognizing  the  temporal  textures. 
Several  features  were  extracted  from  the  normal  flow 
field  and  techniques  analogous  to  the  statistical 
methods  of  gray-level  texture  classification  were 
applied  to  successfully  classify  scene  regions  contain¬ 
ing  non-rigid  motion  [Nelson  and  Polana  1992]. 
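
For readers unfamiliar with normal flow, the sketch below computes it from two frames using the brightness constancy relation: the flow component along the image gradient equals the negative temporal derivative divided by the gradient magnitude. The function name and threshold are assumptions, and the texture features that were extracted from the flow field are not shown.

```python
import numpy as np

def normal_flow(frame0, frame1, min_gradient=1e-3):
    """Flow component along the local brightness gradient, in pixels per frame."""
    f0 = np.asarray(frame0, dtype=float)
    f1 = np.asarray(frame1, dtype=float)
    Iy, Ix = np.gradient(0.5 * (f0 + f1))    # spatial derivatives (rows are y, columns are x)
    It = f1 - f0                              # temporal derivative
    mag = np.hypot(Ix, Iy)
    valid = mag > min_gradient                # flow is undefined where the image is flat
    speed = np.zeros_like(mag)
    speed[valid] = -It[valid] / mag[valid]
    u = np.zeros_like(mag)
    v = np.zeros_like(mag)
    u[valid] = speed[valid] * Ix[valid] / mag[valid]
    v[valid] = speed[valid] * Iy[valid] / mag[valid]
    return u, v, speed

# A brightness ramp along x, shifted one pixel in +x between frames.
ramp = np.tile(np.arange(16, dtype=float), (8, 1))
u, v, speed = normal_flow(ramp, ramp - 1.0)
```

Only the gradient-direction component is recovered, which is why normal flow can be computed locally without solving the full optical flow problem.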

For the second phase, image sequences containing a
single periodic activity were analyzed in order to tag
and track objects exhibiting periodic movement. Identifying
such motion is important since it indicates a
situation where a structural classification technique
would be more appropriate than a temporal texture
method. A Fourier transform based technique was
developed  that  successfully  distinguished  periodic 
activities  such  as  walking,  exercising,  rotating 
machinery  etc.  from  non-periodic  motion,  and  tracked 
the  region  of  periodic  activity  against  cluttered  back¬ 
grounds.  Both  stationary  activity,  and  periodic  activity 
resulting  in  translation  of  the  actor  can  be  identified 
[Polana  and  Nelson  1993].  The  technique  can  be  gen¬ 
eralized  to  arbitrarily  moving  objects  exhibiting 
periodic  activity. 
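
A minimal sketch of Fourier-based periodicity detection on a one-dimensional signal, such as a feature measured in a tracked region over time. The score used here (the fraction of non-DC power in the strongest frequency bin) is an assumption chosen for simplicity and is not the measure defined in [Polana and Nelson 1993].

```python
import numpy as np

def periodicity_score(signal):
    """Return (fraction of power in the dominant frequency, cycles per window)."""
    x = np.asarray(signal, dtype=float)
    x = x - x.mean()
    power = np.abs(np.fft.rfft(x)) ** 2
    power = power[1:]                      # ignore the DC term
    total = power.sum()
    if total == 0.0:
        return 0.0, 0
    k = int(np.argmax(power))
    return float(power[k] / total), k + 1

t = np.arange(128)
periodic = np.sin(2.0 * np.pi * 8.0 * t / 128.0)          # eight cycles in the window
noise = np.random.default_rng(0).standard_normal(128)     # non-periodic motion stand-in

score_periodic, cycles = periodicity_score(periodic)
score_noise, _ = periodicity_score(noise)
```

A score near one indicates that a single frequency dominates, as for a repeated activity; broadband signals spread their power over many bins and score low.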

The  current  task  is  recognition  of  periodic  activities 
after  successful  detection  using  the  periodicity  detec¬ 
tion  described  above.  A  combination  of  normal  flow 
field features and Fourier domain features will be used
to characterize the nature of periodicity exhibited by
different activities. These will be applied specifically
to recognize human walking, animal gaits, and periodically
operating machinery. The techniques will also be
examined for their invariance with respect to spatial
scaling  and  viewing  angle. 

7.  Line  Segment  Extraction 

Finding  lineal  features  in  an  image  is  an  important  step 
in  many  object  recognition  and  scene  analysis  pro¬ 
cedures.  Nelson  has  developed  a  new  method  of 
extracting  lineal  features  from  an  image  using  extended 
local  information  to  provide  robustness  and  sensitivity 
[Nelson  1993a].  The  method  utilizes  both  gradient 
magnitude  and  direction  information,  and  incorporates 
explicit  lineal  and  end-stop  terms.  These  terms  are 
combined non-linearly to produce an energy landscape
in which local minima correspond to lineal features
called sticks that can be represented as line segments.
A  gradient  descent  (stick-growing)  process  is  used  to 
find  these  minima. 

More  specifically,  suppose  there  exists  a  matching  cri¬ 
terion  by  which  any  line  segment  can  be  compared 
against  an  image  and  given  a  score.  Then  conceptu¬ 
ally,  a  good  set  of  line  segments  could  be  found  by 
finding,  from  all  possible  segments,  the  one  that  pro¬ 
duces  the  best  score,  nullifying  the  effect  of  the  image 
components  contributing  to  it,  and  repeating,  until 
enough  segments  had  been  found.  The  main  practical 
problems  with  this  method  are  making  it  efficient,  since 
it is clearly impractical to look through all possible segments
multiple times, and designing an appropriate
matching  measure.  The  new  matching  measure  utilizes 
a  non-linear  combination  of  separate  convolutions  with 
line-like  and  end-stop  templates  that  provides  for 
growth  along  a  lineal  feature,  but  stops  the  growth 
when  strong  evidence  of  a  termination  is  encountered. 
This  cannot  be  achieved  with  a  single  convolution 
measure. 

When  compared  against  two  other  methods  of  line  seg¬ 
ment  detection,  one  based  on  edgel  linking,  and  the 
other  on  support  regions  of  similar  gradient  direction, 
the  stick  growing  method  exhibits  improved  gap  cross¬ 
ing  abilities,  and  is  better  able  to  extract  long,  poorly 
defined  features,  especially  in  cluttered  images.  The 
method  gives  sufficiently  good  results  in  images  of 
objects having strong linear features to permit the intermediate
level representation obtained to be used for
recognition. 

References 

Ballard, D.H., “RISC models of visual behaviors,”
invited paper, Proc., 3rd Int'l. Forum on the Frontier
of Telecommunications Technology, Tokyo, November
1991.

Ballard, D.H. and C.M. Brown, “Principles of animate
vision,” Computer Vision, Graphics, and Image Processing
56-IU (Special Issue on Active Vision), 1, 3-21,
July 1992; to appear, Y. Aloimonos (Ed.), Active
Vision.

Ballard,  D.H.,  C.M.  Brown,  and  R.C.  Nelson,  “Image 
understanding  research  at  Rochester,”  DARPA  Image 
Understanding  Workshop,  109-116,  San  Diego,  CA, 
January  1992. 

Ballard,  D.H.,  M.M.  Hayhoe,  F.  Li,  and  S.D.  White- 
head,  “Hand-eye  coordination  during  sequential 
tasks,”  Proc.,  Phil.  Trans.  Royal  Society  of  London  B, 
London,  March  1992. 

Ballard,  D.H.  and  S.D.  Whitehead,  “Learning  visual 
behaviors,”  in  H.  Wechsler  (Ed.).  Neural  Networks  for 
Machine  Perception,  Vol.  2.  Boston,  MA:  Academic 
Press,  1992. 

Bianchini, R. and C.M. Brown, “Parallel genetic algorithms
on distributed-memory architectures,” TR 436,
Computer Science Dept., U. Rochester, August 1992.

Brown, C.M., “An empirical investigation of differential
invariants,” in J.L. Mundy and A.W. Zisserman
(Eds.), Computational Invariants for Vision. Cambridge,
MA: MIT Press, 215-227, 1992a.

Brown,  C.M.,  “Gaze  behaviors  for  robotics,”  invited 
paper,  in  A.  Sood  (Ed.),  Active  Perception  and  Robot 
Vision  (Proc.,  NATO-ASI  Symp.  on  Active  Perception 
and Robot Vision, Maratea, Italy, July 1989).
Springer-Verlag, August 1991a.

Brown, C.M., “Issues in selective perception,” Proc.,
11th IAPR Int’l. Conf. on Pattern Recognition, 21-30,
The  Hague,  IEEE  Computer  Society  Press,  September 
1992b. 

Brown,  C.M.,  “Numerical  evaluation  of  differential 
and semi-differential invariants,” TR 393, Computer
Science  Dept.,  U.  Rochester,  August  1991b. 

Brown, C.M. and D.J. Coombs, “Notes on control with
delay,” TR 387, Computer Science Dept., U. Rochester,
August 1991.

Brown, C.M., D.J. Coombs, and J. Soong, “Real-time
smooth pursuit tracking,” in A. Blake and A. Yuille
(Eds.), Active Vision. Cambridge, MA: MIT Press,
123-136, 1992.

Brown,  C.M.,  H.  Durrant-Whyte,  J.  Leonard,  B.  Rao, 
and  B.  Steer,  “Distributed  data  fusion  using  Kalman 
filtering:  A  robotics  application,”  in  M.A.  Abidi  and 
R.C.  Gonzalez  (Eds.).  Data  Fusion  in  Robotics  and 
Machine Intelligence. Academic Press, 267-310, 1992.

Chu,  Man-Wai,  “Polhemus  coordinates  and 
Polhemus-Puma conversion,” TR 427, Computer Science
Dept., U. Rochester, June 1992.


Coombs, D.J. and C.M. Brown, “Real-time smooth
pursuit  tracking  for  a  moving  binocular  head,”  Proc., 
IEEE  Conf.  on  Computer  Vision  and  Pattern  Recogni¬ 
tion,  23-38,  Champaign,  IL,  June  1992. 

de  Sa,  V.R.  and  D.H.  Ballard,  “Top-down  teaching 
enables  task-relevant  classification  with  competitive 
learning,” Proc., Int'l. Joint Conf. on Neural Networks,
Baltimore,  June  1992. 

Grosso,  E.  and  D.H.  Ballard,  “Head-centered  orienta¬ 
tion  strategies  in  animate  vision,”  TR  442,  Computer 
Science  Dept.,  U.  Rochester,  October  1992. 

Marsh, B.D., C.M. Brown, T.J. LeBlanc, M.L. Scott,
T.G. Becker, P.Ch. Das, J. Karlsson, and C.A. Quiroz,
“Operating system support for animate vision,” Journal
of Parallel and Distributed Computing 15, 2,
103-117, June 1992.

Moraff,  H.  and  C.M.  Brown,  “Vision  as  process:  A 
joint  NSF/ESPRIT  research  project,”  SPIE  Robotics 
and  Machine  Perception  Newsletter,  July  1992. 

Nelson,  R.C.,  “Introduction:  Vision  as  intelligent 
behavior:  An  introduction  to  machine  vision  at  the 
University  of  Rochester,”  Int'l.  J.  of  Computer  Vision 
7, 1 (Special Issue), 5-9, 1991a.

Nelson,  R.C.,  “Qualitative  detection  of  motion  by  a 
moving  observer,”  Int'l.  J.  of  Computer  Vision  7,  1 
(Special Issue), 33-46, 1991b.

Nelson,  R.C.  and  R.  Polana,  “Qualitative  recognition 
of motion using temporal texture,” Computer Vision,
Graphics, and Image Processing 56-IU (Special Issue
on Active Vision), 1, 78-89, July 1992.

Nelson,  R.C.,  “Finding  line  segments  by  stick  grow¬ 
ing,” submitted for journal publication, 1993a.

Nelson,  R.C.,  “Learning  from  Manipulation:  Merging 
Seeing and Doing in Three Dimensions,” TR 426,
Computer  Science  Dept.,  U.  Rochester,  1993b. 

Polana,  R.  and  R.C.  Nelson,  “Detecting  Activities,” 
Submitted  for  Publication,  1993. 

IEEE  Conf.  on  Computer  Vision  and  Pattern  Recogni¬ 
tion,  Urbana,  IL,  June  1992. 

Pook,  P.K.  and  D.H.  Ballard,  “Contextually  dependent 
control  strategies  for  manipulation,”  TR  416,  Com¬ 
puter  Science  Dept.,  U.  Rochester,  April  1992a. 

Pook,  P.K.  and  D.H.  Ballard,  “Dexterous  robotic 
gripper  performs  qualitative  manipulation,”  SPIE 
Robotics  and  Machine  Perception  Newsletter,  July 
1992b. 


Rimey, R.D. and C.M. Brown, “Controlling eye movements
with hidden Markov models,” Int’l. J. of Computer
Vision 7, 1 (Special Issue), 47-65, 1991.

Rimey, R.D. and C.M. Brown, “Studying control of
selective perception using T-World and TEA,” to
appear,  Proc.,  DARPA  Image  Understanding 
Workshop,  1993. 

Rimey,  R.D.  and  C.M.  Brown,  “Task-specific  utility  in 
a general Bayes net vision system,” IEEE Conf. on
Computer Vision and Pattern Recognition, 142-147,
Champaign, IL, June 1992a.

Rimey,  R.D.  and  C.M.  Brown,  “Where  to  look  next 
using  a  Bayes  net:  Incorporating  geometric  relations,” 
Proc., 2nd Eur. Conf. on Computer Vision, 542-550, S.
Margherita Ligure, Italy, May 1992b.

Schneider, J.G. and C.M. Brown, “Robot skill learning
and the effects of basis function choice,” TR 437,
Computer Science Dept., U. Rochester, September
1992.

Soong,  J.  and  C.M.  Brown.  “Inverse  kinematics  and 
gaze  stabilization  for  the  Rochester  robot  head,”  TR 
394,  Computer  Science  Dept,  U.  Rochester,  August 
1991; 

Swain,  MJ.,  L.E.  Wixson,  and  D.H.  Ballard,  “Object 
identification  and  search:  Animate  vision  alternatives  to 
image  interpretation,”  in  P.  Dario,  G.  Sandini,  and  P. 
Aebischer  (Eds.).  Robots  and  Biological  Systems: 
Towards  a  New  Biordcs?  Springer  Verlag,  NATO  ASI 
Series,  1992. 

Weems,  C.,  C.M.  Brown,  J.  Webb,  T.  Poggio,  and  J. 
Kender,  “Parallel  processing  in  the  DARPA  Strategic 
Computing  Vision  program,”  IEEE  Expert  6, 5, 23-38, 
October  1991. 

Wixson,  L.E.,  “Looking  Near  One  Object  for 
Another,”  Proc.,  SPIE,  Intelligent  Robots  and  Com¬ 
puter  Vision  XI:  Algorithms,  Techniques,  and  Active 
Vision  (Boston,  MA),  Vol  1825,  D.P.  Casasent,  Ed., 
159-167,  November  1992. 

Wixson, L.E. and D.H. Ballard, “Learning efficient
sensing  sequences  for  object  search,”  Proc.,  AAAI  Fall 
Symp.  on  Sensory  Aspects  of  Robotic  Intelligence,  Asi- 
lomar,  CA,  November  1991. 




Image  Understanding  Research  at  GE 


J.L.  Mundy* 

Box  8 

G.E.  Corporate  Research  and  Development 
Schenectady,  NY  12309 


Abstract 

Recent progress in image understanding research
at GE is described. The focus of GE’s
program in IU is on the application of geometric
constraint models and geometric invariants to
the recognition and representation of objects,
as well as the development of object-oriented
software environments to support IU research
and applications.

1  Overview 

1.1  An  Emphasis  on  Geometry 

Image  understanding  research  and  applications  at  GE 
are  centered  around  geometric  descriptions  and  geomet¬ 
ric  reasoning  for  representing  and  recognizing  objects. 
We  have  developed  this  geometric  theme  over  the  past 
decade  with  emphasis  on  object  recognition  and  asso¬ 
ciated  approaches  for  representing  objects  to  facilitate 
recognition. 

1.2 Constraint-Based Modeling

The  conventional  approaches  to  object  recognition  have 
been  based  on  fixed  polyhedral  models  or  models  with 
fixed  relationships  between  components.  We  have  been 
developing  a  system  for  representing  broad  categories 
of  objects  by  defining  an  object  in  terms  of  geometric 
relations,  or  constraints.  Any  specific  object  of  the  class 
is  considered  to  be  a  solution  of  the  constraint  system.  In 
most  cases,  there  is  a  continuum  of  solutions  so  a  specific 
object  is  selected  which  satisfies  the  constraints  as  well 
as  optimality  criteria  derived  from  image  features. 

A key application of our constraint-based modeling
system is to reduce the manual effort in the construction
of RADIUS¹ site models. Many building shapes can be
represented by a generic set of constraints, and the
specific geometry of a building can be recovered by fitting
the constraint model to various image views of the
building. Recent progress on this application is described in
section 2.1. A dual-use application of the constraint
modeling system is automated interpretation of X-ray
images for the detection of flaws in jet engine turbine
blades. This application is described elsewhere in these
proceedings [Noble and Mundy, 1993].

*The research reported here is funded in part by DARPA
Contract #MDA972-91-C-0053.

¹Research and Development for Image Understanding
Systems, a joint project by DARPA and ORD.

1.3  Geometric  Invariants 

A significant body of results on the construction of
geometric invariants to projective and affine
transformations has been developed over the past three
years [Mundy and Zisserman, 1992]. An invariant is any
property  of  a  geometric  configuration  which  is  unaffected 
by  viewpoint.  In  current  model-based  vision  approaches, 
it  is  necessary  to  test  each  object  in  the  library  since  the 
specific  properties  of  the  object  can  only  be  exploited  for 
discrimination  after  model  pose  is  determined.  When 
objects  are  described  in  terms  of  invariants  the  resulting 
properties  can  be  used  to  index  large  model  libraries. 
We  have  shown  that  invariant  indexing  leads  to  object 
recognition  cost  which  grows  slowly  with  the  size  of  the 
model  library.  This  efficient  indexing  property  has  been 
exploited  in  three  application  tasks. 
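
The indexing idea described above can be illustrated with a small hash-table sketch. This is only a minimal illustration under our own assumptions, not the GE system: the library contents, the bin width, and the function names `build_index` and `lookup` are hypothetical. Each model is stored under its quantized invariant values, so a measurement is looked up by probing a few bins rather than by testing every model in the library.

```python
import itertools
import math
from collections import defaultdict


def quantize(invariants, bin_width):
    """Map a tuple of invariant values to a tuple of integer bin indices."""
    return tuple(math.floor(v / bin_width) for v in invariants)


def build_index(library, bin_width=0.05):
    """Store each model name under the bin of its invariant vector."""
    index = defaultdict(list)
    for name, invariants in library.items():
        index[quantize(invariants, bin_width)].append(name)
    return index


def lookup(index, measured, bin_width=0.05):
    """Return candidate models in the measured bin and its neighbours.

    Probing the neighbouring bins tolerates measurement noise near a
    bin boundary. The cost depends on the number of bins probed, not
    on the number of models stored.
    """
    centre = quantize(measured, bin_width)
    candidates = set()
    for offset in itertools.product((-1, 0, 1), repeat=len(centre)):
        key = tuple(c + o for c, o in zip(centre, offset))
        candidates.update(index.get(key, []))
    return sorted(candidates)


# Hypothetical library: model name -> (invariant 1, invariant 2).
library = {
    "L_building": (0.78, 0.22),
    "hangar": (0.41, 0.63),
    "tank_farm": (1.35, 0.10),
}
index = build_index(library)
print(lookup(index, (0.776, 0.236)))
```

Candidates returned by such a lookup would still need to be verified against the image, since quantized invariants can collide.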

1.3.1  Automatic  Target  Recognition 

A  key  problem  in  automatic  target  identification  is 
efficient  indexing  of  targets  from  image  features.  This 
indexing  must  be  done  from  a  rather  sparse  and  low- 
resolution  image  segmentation.  It  is  often  the  case  that 
the  resolution  is  too  low  to  permit  robust  indexing  and 
recognition on geometric properties alone. We have
recently developed a concept of integrated geometric and
thermal invariants in conjunction with the Target
Recognition Technology Branch at Wright Labs (AARA).² The
idea is to derive a set of features which are invariant,
not only to the geometric variations due to viewpoint,
but also to the thermal variations due to environmental
and target operational conditions. The benefits of
invariant indexing can be applied to target detection and
classification, even where resolution is limited, since
attributes of the thermal intensity data are taken into
account. The graphs in Figure 1 show the values of several
geometric-thermal invariants computed from a time
series of IR images of a tank. The variation is quite small
compared with the absolute temperature variation of
individual tank components.

²The key contributors at AARA are V. Velten, L.
Westerkamp and M. Gander. Substantial contributions have also
been made by D. Forsyth of the University of Iowa.

Figure 1: Ratios of spatial-thermal integrals, as a
function of time, for the carriage region of a tank. Note that
four of the ratios are effectively constant, and that the
others vary only slightly over time.

1.3.2  Site  Model  Registration 

The main RADIUS assumption is that each site will
be described by a geometric site model which describes
such features as the site perimeter, lines of
communication, functional areas and 3D building structure. This
site model provides a context for IU tools to assist the
image analyst in tasks such as asset accounting, change
detection and scientific analysis. A central task for IU in
the RADIUS program is the registration of existing site
models to new reconnaissance imagery. In our approach,
we represent features from the site model in terms of
geometric invariants and then use these invariants to
recognize key features at the site, such as buildings, roads
and shorelines. This approach is described in more detail
in section 3.1.

1.3.3  3D  Reconstruction  From  Uncalibrated 
Cameras 

Recently, Richard Hartley of GE and Olivier Faugeras
of INRIA independently proved that a structure in 3D
space can be reconstructed up to a 3D projective
transformation from two or more uncalibrated camera
views. For many vision tasks such as object recognition,
it is not necessary to determine the Euclidean structure
of the object since the 3D object can be characterized
by its projective invariants. If desired, the structure can
be transformed to an appropriate Euclidean coordinate
frame if a sufficient number of constraints, such as
distances and orientations, are known a priori about the
object. This observation has led to a new approach to
acquiring and registering models to image data. Of
primary importance is the ability to achieve most IU image
analysis tasks without knowledge of the camera
parameters or viewing conditions and without requiring ground
control points [Hartley, 1993b].

1.4  Object-Oriented  Design 

In  addition  to  our  research  and  applications  of  object 
recognition  and  modeling  techniques,  we  are  involved 
in  a  number  of  projects  for  the  application  of  object- 
oriented  design  to  the  development  of  research  and  ap¬ 
plication  environments. 

1.4.1  RCDE 

Part of the DARPA-sponsored RADIUS project is to
develop a common software environment (RCDE) to
facilitate the exchange of results and to provide a platform
for the demonstration of algorithms which are targeted
at the intelligence exploitation of aerial images. The
Cartographic Modeling Environment (CME), developed
at SRI International by Lynn Quam, has been extended
and interfaced to C++ by GE’s Management and Data
Systems Operation with support from GE-CRD to
become the RCDE [Mundy et al., 1992b]. Extensive
documentation for operation and programming of the RCDE
has also been developed. Currently a number of early
versions of the RCDE are under evaluation and testing at
RADIUS contractor beta sites. The RCDE is currently
being enhanced by SRI and CRD under a task called
THREAD which provides a narrow architecture for
carrying out Model Supported Exploitation (MSE) using the
RCDE. The key IU algorithmic steps in the THREAD
architecture are:

•  Edge  segmentation  using  a  modified  Canny  edge  de¬ 
tector  and  line  segments  extracted  from  the  result¬ 
ing  pixel  chains  using  maximum  curvature. 

•  Line  segment  grouping  based  on  endpoint  proximity 
and  collinearity. 

•  Site model registration using affine model matching
based on vertex-pairs as well as affine and projective
invariants.

•  Linear  feature  extraction  using  image  correlation. 

Other elements of the THREAD architecture are related
to camera modeling, interface to a database, and an
analyst interface. This architecture provides a useful set
of examples for using and integrating new applications
with the RCDE.
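
The line segment grouping step listed above can be sketched as follows. The text states only that grouping is based on endpoint proximity and collinearity. The thresholds, the use of union-find to merge compatible segments, and all function names below are our assumptions for illustration and are not taken from the RCDE or THREAD code.

```python
import math


def direction_angle(seg):
    """Undirected orientation of a segment, in [0, pi)."""
    (x1, y1), (x2, y2) = seg
    return math.atan2(y2 - y1, x2 - x1) % math.pi


def angle_difference(a, b):
    """Smallest difference between two undirected orientations."""
    d = abs(a - b) % math.pi
    return min(d, math.pi - d)


def endpoint_gap(s, t):
    """Smallest distance between an endpoint of s and an endpoint of t."""
    return min(math.dist(p, q) for p in s for q in t)


def compatible(s, t, max_gap, max_angle):
    """Segments are compatible if they nearly touch and are nearly parallel."""
    return (endpoint_gap(s, t) <= max_gap
            and angle_difference(direction_angle(s),
                                 direction_angle(t)) <= max_angle)


def group_segments(segments, max_gap=3.0, max_angle=math.radians(5.0)):
    """Merge compatible segments into groups using union-find."""
    parent = list(range(len(segments)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i in range(len(segments)):
        for j in range(i + 1, len(segments)):
            if compatible(segments[i], segments[j], max_gap, max_angle):
                parent[find(i)] = find(j)

    groups = {}
    for i in range(len(segments)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


segments = [
    ((0, 0), (10, 0)),       # horizontal edge
    ((11, 0.2), (20, 0.3)),  # continues the first edge after a small gap
    ((10, 1), (10, 10)),     # vertical edge meeting at a corner
    ((50, 50), (60, 50)),    # isolated segment
]
print(group_segments(segments))
```

In this sketch the corner segment stays in its own group because it fails the orientation test, even though its endpoint is close to the other two.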

1.4.2 IUE

A major DARPA project has been initiated to provide
a standard software environment for carrying out
image understanding research and applications. The Image
Understanding Environment or IUE has been specified
by a committee of senior IU researchers and is based on an
object-oriented representation of the major data
structures and operations used in IU algorithms. An overview
of the IUE appears elsewhere in these proceedings [Mundy
and Committee, 1993b]. The goal is that the IUE will
become a standard of exchange of IU data and also
provide common interfaces so that new algorithms and other
support code can be freely exchanged between research
institutions. Recently, effort at CRD has been focused
on the development and prototyping of a data exchange
file format. It is envisioned that such a standard will be
of considerable immediate benefit [Mundy and
Committee, 1993a].

2  Constraint-Modeling 

Our underlying approach to constraint modeling is the
observation that all geometric constraints can be
represented in terms of algebraic equations. A constraint
model is thus a system of polynomial equations and
specific model instances correspond to roots of the
polynomial system. The final solution represents a best least
squares fit with respect to both the constraint equations
and the projection of corresponding image features. In
several earlier papers we have provided details of the
representation and have shown results on a number of
solution techniques [Nguyen et al., 1990, Nguyen et al., 1991,
Mundy, 1992].
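
As a hedged illustration of how geometric constraints become polynomial equations, consider a planar outline with vertices $v_i = (x_i, y_i)$. The particular constraints, the projection function $\pi$, the observed image points $u_j$, and the weight $\lambda$ below are notation assumed here for illustration. The earlier papers cited above give the actual representation.

```latex
% Perpendicularity of adjacent edges (v_1,v_2) and (v_2,v_3):
c_1(\mathbf{x}) = (x_2 - x_1)(x_3 - x_2) + (y_2 - y_1)(y_3 - y_2) = 0

% Equal length of opposite edges (v_1,v_2) and (v_3,v_4):
c_2(\mathbf{x}) = (x_2 - x_1)^2 + (y_2 - y_1)^2
                - (x_4 - x_3)^2 - (y_4 - y_3)^2 = 0

% One possible least-squares objective combining the constraint
% residuals with the image-feature residuals:
\hat{\mathbf{x}} = \arg\min_{\mathbf{x}}
    \sum_k c_k(\mathbf{x})^2
    + \lambda \sum_j \left\| \pi(v_j) - u_j \right\|^2
```

Both constraints are polynomials of degree two in the vertex coordinates, so a model instance is a root of the system $c_k(\mathbf{x}) = 0$, and the objective selects one instance from the continuum of roots using the image evidence.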

The  focus  of  this  work  over  the  last  year  has  been  the 
application  of  these  techniques  to  RADIUS  sites  using 
the  test  data  sets  developed  by  ORD.  In  addition  we  have 
extended  the  use  of  constraints  to  curved  2D  primitives 
which  has  enabled  the  representation  of  complex  turbine 
blade  geometries  for  automatic  visual  inspection  [Noble 
and  Mundy,  1993].  This  application  represents  a  nice 
instance  of  dual-use  of  government-funded  technology. 

2.1  RADIUS  Site  Models 

Imagery is now available to support the analysis
and site model construction for two sites: 1) aerial
photography of Fort Hood and 2) an industrial area
modelboard. We have applied the constraint model
system to the representation of road networks as well as
3D building geometry. A first example is illustrated at
the top of Figure 2 which shows a constraint model of
an L-shaped building.

The  operator  roughly  sketches  the  building  and  se¬ 
lects  a  few  correspondences  at  key  vertices  as  appropri¬ 
ate.  The  constraint  system  is  then  solved  to  maintain 
the  correspondences  while  at  the  same  time  minimizing 
the  error  with  respect  to  the  image  feature  locations. 
The  left  image  in  Figure  2  shows  the  initial  model  as 
sketched.  The  final  solution  is  shown  in  the  right  image. 
The  process  assumes  the  existence  of  a  camera  model  for 
a  single  image  of  the  scene.  In  this  example  a  camera 
model  was  calculated  from  ground  control  points  sup¬ 
plied  with  the  RADIUS  modelboard. 

A second 3D example is also shown in Figure 2.
Here we constructed some 3D constraint primitives
and used them to model a composite structure on
the RADIUS modelboard. Again, the procedure was
to roughly position the model in the vicinity of the
actual building and then establish a few correspondences
between the vertices of the constraint primitive and the
image. The initial placement of the model primitives
can be quite loose. For example, one of the primitives
is about 45° away from the true orientation. Note that
the primitives do not have to be very close to the final
shape and size in order to reach convergence. The only
requirement is that the system of constraints established
for the model is consistent with the final building shape.

Example      # Const.   Time       Error
Top Ex.      57         8 sec      .02 pixels
Bottom Ex.   141        66.8 sec   .58 pixels

Table 1: The performance of the constraint modeling
system for the two examples. The computational
complexity for the solver used in these experiments is
[expression missing], where N is the number of constraints.
The times shown are for a SPARC 2.

The  current  version  of  the  modeling  system  is  written 
in  C++  and  the  computational  performance  is  charac¬ 
terized  in  Table  1. 

3  Geometric  Invariants 

An invariant is a property of a set of geometric forms
which does not change with viewpoint. Our premise is
that invariants offer a sound framework for the
representation of objects leading to efficient recognition
algorithms. For the past few years we have worked jointly
with Oxford University to develop and apply geometric
invariants to the problem of object recognition.³ A joint
workshop between DARPA and ESPRIT, “Applications
in Computer Vision,” was held in Reykjavik, Iceland in
April, 1991. The workshop brought together the
leading researchers in invariant theory and applications. A
collection of the papers from the workshop has been
published by MIT Press [Mundy and Zisserman, 1992].
Also in November 1992, a very successful seminar on
invariants was held at Wright Lab under the
sponsorship of the Target Recognition Technology Branch and
AFOSR. The two day seminar provided a tutorial on
invariants presented by J. Mundy and D. Kapur⁴ as well
as a session on quasi-invariants presented by Tom Binford⁵.
The seminar was attended by about 30 participants from
government labs, ATR contractors and universities.

Below  are  some  of  the  highlights  of  our  recent  results 
in  the  application  of  invariants. 

3.1  Site  Model  Registration 

Experiments  have  been  carried  out  to  test  the  perfor¬ 
mance  of  invariants  on  the  RADIUS  task  of  site  model 
registration.  The  approach  exploits  the  use  of  projective 
and  affine  planar  invariants  to  locate  major  building  fea¬ 
tures  in  images.  After  these  features  are  recognized  the 
full  site  model  can  be  aligned  with  the  image  using  stan¬ 
dard  camera  resectioning  software. 

First  we  review  the  basic  invariants  used  in  the  exper¬ 
iments. 


³The individuals from Oxford University involved in the
collaboration are A. Zisserman, D. Forsyth (now at the
University of Iowa) and C. Rothwell.

⁴State University of New York at Albany
⁵Stanford
Figure 2: An example of constraint model building construction. The initial configuration of the model is shown in
the image on the left. The final solution is on the right. Note that the constraints supply information about the
shape of the structures so that only a few image feature point correspondences are needed to fix the shape.
Invariant 1: Five Coplanar Lines.

Given five coplanar homogeneous lines $l_i$, $i \in \{1, \ldots, 5\}$,
two projective invariants are defined:

$$I_{1a} = \frac{|M_{431}|\,|M_{521}|}{|M_{421}|\,|M_{531}|}, \qquad
I_{1b} = \frac{|M_{421}|\,|M_{532}|}{|M_{432}|\,|M_{521}|}$$

where $M_{ijk} = (l_i, l_j, l_k)$ and $|M|$ is the determinant of
$M$. Should any triple of lines become concurrent the
first invariant is undefined. This singular case is common
for polygons where alternate sides are parallel. In these
cases we can only use the second invariant as a shape
descriptor. Using the duality of points and lines the
invariants can also be defined for five coplanar points,
i.e. $M_{ijk} = (p_i, p_j, p_k)$.
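
The two five-line invariants can be checked numerically with a short sketch. The function names and the test lines below are ours and purely illustrative. Lines are given as homogeneous triples. A projective transformation of the plane acts on them as a nonsingular 3x3 matrix, and each line may also be rescaled arbitrarily, so the sketch applies both and compares the results.

```python
def det3(a, b, c):
    """Determinant of the 3x3 matrix whose columns are a, b and c."""
    return (a[0] * (b[1] * c[2] - b[2] * c[1])
            + a[1] * (b[2] * c[0] - b[0] * c[2])
            + a[2] * (b[0] * c[1] - b[1] * c[0]))


def five_line_invariants(lines):
    """Return the two projective invariants of five coplanar lines."""
    l = {i + 1: lines[i] for i in range(5)}  # 1-based, as in the text

    def m(i, j, k):
        return det3(l[i], l[j], l[k])

    first = (m(4, 3, 1) * m(5, 2, 1)) / (m(4, 2, 1) * m(5, 3, 1))
    second = (m(4, 2, 1) * m(5, 3, 2)) / (m(4, 3, 2) * m(5, 2, 1))
    return first, second


def transform_lines(lines, matrix, scales):
    """Apply a 3x3 matrix to each line, then rescale each line."""
    return [
        tuple(s * sum(matrix[r][c] * line[c] for c in range(3))
              for r in range(3))
        for line, s in zip(lines, scales)
    ]


lines = [(1, 0, -1), (0, 1, -2), (1, 1, -5), (2, -1, 3), (1, 3, 1)]
matrix = [[2, 1, 0], [0, 1, 3], [1, 0, 1]]  # nonsingular (determinant 5)
scales = [2.0, -0.5, 3.0, 1.5, -4.0]        # arbitrary nonzero rescalings

print(five_line_invariants(lines))
print(five_line_invariants(transform_lines(lines, matrix, scales)))
```

Each line index appears equally often in the numerator and the denominator of each ratio, so both the determinant of the transformation and the per-line scale factors cancel, which is why the two printed pairs agree up to rounding.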

Invariant  2:  Two  Points  and  Two  Lines 

A single projective invariant can also be derived from
two points and two lines. The invariant is given by a
ratio of various combinations of the algebraic distance
from each point to each line, as follows:

$$I_2 = \frac{(l_1^{\top} p_1)(l_2^{\top} p_2)}{(l_1^{\top} p_2)(l_2^{\top} p_1)}$$

Note that each point and line appear the same number
of times in the numerator and denominator so the
projective scale factors cancel.
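
A minimal numerical check of this invariant is sketched below, with illustrative names and data of our own choosing. Points transform as p' = Hp. Lines transform by the inverse transpose of H, which equals the cofactor matrix of H up to a scale factor that is irrelevant for homogeneous lines, so the sketch uses the cofactor matrix directly.

```python
def dot3(a, b):
    """Dot product of two 3-vectors."""
    return sum(x * y for x, y in zip(a, b))


def cross3(a, b):
    """Cross product of two 3-vectors."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def two_point_two_line_invariant(l1, l2, p1, p2):
    """Ratio of algebraic point-to-line distances."""
    return (dot3(l1, p1) * dot3(l2, p2)) / (dot3(l1, p2) * dot3(l2, p1))


def map_point(h, p):
    """Transform a homogeneous point by the homography h (rows of h)."""
    return tuple(dot3(row, p) for row in h)


def map_line(h, line):
    """Transform a homogeneous line using the cofactor matrix of h."""
    cofactor = (cross3(h[1], h[2]), cross3(h[2], h[0]), cross3(h[0], h[1]))
    return tuple(dot3(row, line) for row in cofactor)


l1, l2 = (1, 0, -1), (0, 1, -2)   # the lines x = 1 and y = 2
p1, p2 = (3, 4, 1), (5, 1, 1)     # two points in homogeneous form
h = ((1.0, 0.2, 3.0), (0.1, 1.0, -2.0), (0.001, 0.002, 1.0))

before = two_point_two_line_invariant(l1, l2, p1, p2)
after = two_point_two_line_invariant(map_line(h, l1), map_line(h, l2),
                                     map_point(h, p1), map_point(h, p2))
print(before, after)
```

Every transformed product of a line and a point picks up the same factor det(H), which cancels between numerator and denominator.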

Experiments 

An example of the use of feature matching based on
projective invariants is shown in Figure 3. A set of
invariants is used to describe the “L” shaped buildings
at Fort Hood. The projective invariants are based on
lines and points extracted from an edge segmentation
of the building. Then other buildings of the same type
in the same image or new images of the site are found
by invariant indexing. Table 2 shows the value of
typical invariants of each type within the same image and
between images. The matches for each building are
indicated by overlaying the segmentation edgels of the model
building on each detected instance. In our experiments,
building (1,1) was taken as the model. Neither
projective nor affine planar transformations account exactly
for the actual RADIUS image characteristics. However,
over a small field of view, such as a building, both of
these models provide reasonable indexing power. That
is, the variance of the invariant keys is small compared to
the difference between invariant values of distinct object
classes.

4  Basic  Research  Results 

GE-CRD,  in  the  context  of  a  number  of  university  collab¬ 
orations,  has  continued  to  advance  the  basic  foundations 
of  object  recognition,  based  on  invariants.  A  key  issue 
in  geometric  invariant  research  is  the  computation  of  in¬ 
variants  for  3D  structures  both  from  a  single  view  and 
multiple  views. 


Figure  3:  An  example  of  feature  matching  using  a  com¬ 
bination  of  invariants.  The  match  was  based  on  features 
within  the  indicated  ROI. 


Building   $I_1$   $I_2$
11         0.770   0.212
12         0.782   0.208
13         0.781   0.221
21         0.776   0.236
22         0.782   0.240
23         0.782   0.230

Table 2: The value of various invariants for the “L”
shaped building in the Fort Hood RADIUS images. The
building label $B_{ij}$ refers to building j in image i, so one
can compare the variation in invariants between copies
of the same building design in the same image and across
images. Only one five-line invariant is available because
three or more of the building edges are typically parallel,
a degenerate case.
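
The spread of the values in Table 2 can be summarized with a few lines of code. This sketch only restates the tabulated numbers. The column names `i1` and `i2` and the helper names are ours, and no values for other object classes are reported here, so the between-class comparison mentioned in the text is not reproduced.

```python
from statistics import mean

# Building label (image, building) -> (first column, second column).
TABLE_2 = {
    (1, 1): (0.770, 0.212),
    (1, 2): (0.782, 0.208),
    (1, 3): (0.781, 0.221),
    (2, 1): (0.776, 0.236),
    (2, 2): (0.782, 0.240),
    (2, 3): (0.782, 0.230),
}


def column(table, index, image=None):
    """Values of one invariant, optionally restricted to one image."""
    return [values[index] for (img, _), values in table.items()
            if image is None or img == image]


def value_range(values):
    """Difference between the largest and smallest value."""
    return max(values) - min(values)


i1 = column(TABLE_2, 0)
i2 = column(TABLE_2, 1)
print("range of first invariant:", round(value_range(i1), 3))
print("range of second invariant:", round(value_range(i2), 3))
print("second invariant, image 1 mean:", round(mean(column(TABLE_2, 1, 1)), 4))
print("second invariant, image 2 mean:", round(mean(column(TABLE_2, 1, 2)), 4))
```

The first column varies by about 0.012 over all six buildings, while the second column shows a small shift between the two images in addition to its within-image variation.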

4.1 Invariants for 3D Structures From a Single
View

In  general,  for  a  single  view,  there  are  no  invariants  for  a 
generic  3D  structure.  However,  if  one  makes  an  assump¬ 
tion  about  the  general  class  of  a  3D  object  it  is  possible 
to  establish  a  framework  of  invariants  which  can  be  re¬ 
liably  computed  from  a  single  view.  Examples  under 
development  are: 

Rotational  Symmetry 

A substantial body of results has been obtained for
this case. It is possible to recover the axis of
symmetry of a rotationally symmetric object from a single view
and to use distinct points on the axis to compute
invariant indices, based on the cross ratio. It is also the case
that in a single view, the symmetrically corresponding
occluding boundaries are within a plane projectivity of
each other. This fact enables the construction of a
continuous invariant signature along the boundaries which
can be used to refine the axis parameters. This work
is described in more detail elsewhere in these
proceedings [Liu et al., 1993].
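
The cross ratio mentioned above is the basic projective invariant of four collinear points. The sketch below is a generic illustration, not the method of [Liu et al., 1993]: points on the axis are described by a scalar position along the line, and the function names and sample values are ours. A projective transformation restricted to a line acts on this position as t -> (at + b) / (ct + d).

```python
def cross_ratio(t1, t2, t3, t4):
    """Cross ratio of four collinear points given by positions on the line."""
    return ((t3 - t1) * (t4 - t2)) / ((t3 - t2) * (t4 - t1))


def projective_map(t, a, b, c, d):
    """One-dimensional projective transformation of a position t."""
    return (a * t + b) / (c * t + d)


points = [0.0, 1.0, 2.0, 3.0]
mapped = [projective_map(t, 2.0, 1.0, 0.5, 3.0) for t in points]

print(cross_ratio(*points))  # cross ratio before the mapping
print(cross_ratio(*mapped))  # agrees with the first value up to rounding
```

Individual distances and distance ratios along the line change under the mapping, while the cross ratio does not, which is what makes it usable as an index.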

4.2  Polyhedral  Invariants 

It can also be shown that invariants can be constructed
for general polyhedral surfaces where the faces have four
or more vertices. This work is based on the earlier theory
of Sugihara [Sugihara, 1986] which analyzed the degrees
of freedom and correctness of image projections of
polyhedra. We have extended his theory to permit the
construction of projective invariants from uncalibrated
camera views of polyhedral solids [Rothwell et al., 1993]. The
restriction to polyhedra is not very severe since a
polyhedral “cage” can be invariantly constructed around a
curved object surface with vertices established at points
of high curvature.

4.3  Multiple  Views 

Recent  work  has  established  that  invariants  can  be  con¬ 
structed  for  arbitrary  3D  structures  from  two  or  more 
views.  This  work  builds  on  previous  developments  by 
Barrett  [Mundy  et  al.,  1992a].  It  is  possible  to  recover 
the  epipolar  structure  of  multiple  image  views  from  fea¬ 
ture  correspondences  between  views  in  terms  of  the  es¬ 
sential  matrix,  Q.  Once  this  matrix  is  available,  the  3D 
geometry  of  an  object  can  be  constructed  up  to  a  pro- 
jectivity  of  space.  It  is  thus  natural  to  describe  objects 
in  terms  of  their  3D  projective  invariants.  For  exam¬ 
ple,  four  lines  in  space  define  two  projective  invariants 
and  six  points  define  three  projective  invariants.  The 
latter  case  is  easy  to  see,  since  five  points  define  a  pro¬ 
jective  coordinate  frame  and  the  three  coordinates  of  the 
remaining  point,  in  this  frame,  are  invariant. 

Two  invariants  can  be  obtained  from  two  views  of  a  set 
of  six  points,  with  four  points  coplanar  and  two  points 
not  on  the  plane.  This  configuration  has  been  exploited 
to  effect  a  projective  transfer  of  shapes  between  views 
without  actually  constructing  the  3D  geometry  of  the 
object  [Demey  et  al.,  1992]. 

We have also extended our work on the extraction of
structure from uncalibrated cameras to incorporate line
features. An algorithm has been demonstrated which
derives the epipolar structure from line features and a
minimum of three uncalibrated camera views [Hartley,
1993a]. In addition, Hartley has shown that two 3D
projective invariants can be extracted from four lines.
This work enables the recognition of objects using only
line features [Hartley, 1993a].
Abstract

According  to  the  American  Heritage  Dictionary, 
the  adjective  “conscious”  means  having  aware¬ 
ness  of  one’s  own  existence  and  environment. 
In  this  paper  we  present  some  highlights  of  dif¬ 
ferent  aspects  and/or  competencies  that  a  Con¬ 
scious  Observer  must  have.  It  has  been  said 
many  times  that  some  of  these  competencies 
are  less  conscious  than  others,  or  less  context, 
task,  and/or  environment  specific  than  others. 
We  certainly  agree  with  this,  and  accordingly 
our  efforts  can  be  categorized  into  at  least  two 
such  cases  or  processing  stages.  The  first  stage 
can be characterized as physics-based
understanding of reflectance (described in Section 1)
and physics-based shape and motion modeling
and estimation (described in Section 2).
During the second stage, we process multisensory
observations  and/or  multiple  features  or  param¬ 
eters  using  statistical  decision  rules  that  will  re¬ 
sult  in  either  physical  action  (obstacle  avoid¬ 
ance,  as  an  example)  or  mental  action.  The 
mental  action  in  this  case  is  the  indexing  stage 
of  recognition  of  objects.  Section  3  describes 
the sensory fusion process and the statistical
decision theory that the sensory fusion process
is based on, while Section 4 shows the
problems and issues with representation, i.e.,
identifying the most statistically salient features for
indexing into an object database to recognize
categories of shapes. There are two common
threads running through our various research
projects. One is that since the image
formation process is physics-based, it is just common
sense to use physics-based models in order to
perform  the  inverse.  The  other  thread  is  the  re¬ 
alization  that  the  scenes  and  the  measurements 
that  we  make  are  noisy,  incomplete  and  vary 
due  to  environmental  effects  and  hence,  com¬ 
mon  sense  again  dictates  that  one  must  use  mul¬ 
tiple  sensory  measurements  and  multiple  repre¬ 
sentations.  The  theory  that  models  these  pro¬ 
cesses  in  a  most  appropriate  way  seems  to  be 
statistical  decision  theory,  which  we  are  apply¬ 
ing. 

1 Physics-Based Preprocessing (R. Bajcsy)

During the last four years we have been
engaged in understanding the interaction of
light (illumination) and surfaces. This
understanding, which is based purely on physical
principles, has led to the development of
several algorithms that enable us to identify, locate
and remove, if desirable, highlights,
interreflections, shading and shadows [Funka-Lea, 1992;
Bajcsy et al., 1990; Lee and Bajcsy, 1992;
Lee, 1991].

The  motivation  for  this  work  comes  from  the  hy¬ 
pothesis  that  highlights,  interreflections,  shad¬ 
ing  and  shadows  often  create  artifacts  in  parti¬ 
tioning  of  the  scene  that  do  not  correspond  to 
the  true  physical  reality,  and  hence  it  is  desir¬ 
able  to  remove  them  prior  to  segmentation  of 
the  scene.  The  difficulty  lies  in  the  fact  that 
the  imaging  sensor  measures  an  integral  of  all 
the above components. The research question
is: what principled constraints and/or other
measurements can one bring to bear in order
to be able to separate the components? Of
course we are not the only group addressing
this problem; currently there is a whole
community working in this area [Boult and Wolff, 1991;
Gershon et al., 1986; Healey and Binford, 1989;
Klinker et al., 1990; Nayar et al., 1991; Shafer,
1985].

At the last Image Understanding workshop
[Bajcsy, 1992] we reported the results describing
identification  of  highlights  from  metallic  objects 
and  surfaces  with  varied  albedo.  In  this  section 
we  shall  concentrate  on  understanding  shadows. 
Shadows  result  from  the  obstruction  of  light 
from  a  source  of  illumination.  As  such,  shadows 
have  two  components;  one  spectral  and  one  geo¬ 
metric.  The  spectral  nature  of  a  shadow  derives 
from  the  characteristics  of  the  light  illuminating 
the  shadow  as  compared  to  the  additional  light 
that  would  illuminate  the  same  area  if  there  was 
no  obstruction.  Hence,  shadows  reveal  them¬ 
selves  as  a  spectral  change  in  radiance  due  to  a 
change  in  the  local  irradiance.  The  geometry  of 
a  shadow  is  determined  by  the  nature  of  the  il¬ 
lumination  obstruction  and  the  scene  geometry. 
A light source may be only partially obstructed.
In fact, for any non-point light source, the outer
portion  of  a  shadow  results  from  the  partial  ob¬ 
struction  of  the  light  source.  This  is  the  penum¬ 
bra  of  the  shadow,  while  the  umbra  is  the  part 
of  the  shadow  where  the  light  is  completely  ob¬ 
structed. 

First, we shall introduce our model of shadows
without other reflectance effects. Let $D$ be the
amount of energy from the illumination source,
measured at a given surface, and let $E$ be the
integral of all other illumination sources in the
environment. Let $I$ represent the light measured
by the camera from one point of view. The
coefficient $\beta$ from the interval $[0, 1]$ indicates the amount
of obstruction, where $\beta = 0$ corresponds to the
umbra of the shadow. The value $\beta = 1$
corresponds to the surface directly lit. The image
values $I$ of the surface in and out of shadow are
proportional to the linear relationship:

$$I \propto \beta D + E.$$

This  relationship  in  turn  is  used  to  recover  a  sin¬ 
gle  surface  directly  lit  and  in  shadow  as  a  single 
image  region.  It  is  also  used  as  an  aid  in  iden¬ 
tifying  the  umbra  and  penumbra  of  a  shadow. 
However,  this  equation  will  not  help  with  dis¬ 
crimination  between  shadow  penumbra  and/or 
shading.  It  also  may  not  apply  for  multiple  light 
sources  and  varied  albedos.  For  these  cases,  one 
needs more constraints. The additional constraints
come from: (i) using a shadow casting
probe; (ii) using spatial segmentation based on
some homogeneity criterion; (iii) using geometric
constraints on where shadows can be cast; or
(iv) using active light cast into the scene. We
enumerate the different cases in Table 1 below.
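
As an illustration only, the linear shadow model can be turned into a per-pixel label. The sketch below assumes the direct term D and the ambient term E are already known for the surface, per color channel; the text does not say how they are estimated. The function names and the tolerance `tol` are hypothetical.

```python
def estimate_beta(pixel, direct, ambient):
    """Least-squares estimate of beta in pixel ~= beta * direct + ambient.

    pixel, direct, ambient are per-channel sequences (e.g. R, G, B).
    The result is clamped to [0, 1], the interval the model allows.
    """
    num = sum(d * (p - a) for p, d, a in zip(pixel, direct, ambient))
    den = sum(d * d for d in direct)
    if den == 0:
        raise ValueError("direct illumination term must be non-zero")
    return max(0.0, min(1.0, num / den))


def label_pixel(beta, tol=0.05):
    """Map beta to a label; tol is an illustrative threshold, not from the text."""
    if beta <= tol:
        return "umbra"
    if beta >= 1.0 - tol:
        return "lit"
    return "penumbra"


direct = (120.0, 110.0, 90.0)   # D: contribution of the obstructed source
ambient = (30.0, 35.0, 50.0)    # E: all other illumination
half_lit = tuple(0.5 * d + e for d, e in zip(direct, ambient))
print(label_pixel(estimate_beta(half_lit, direct, ambient)))  # penumbra
```

As the text notes, such a test cannot by itself separate penumbra from shading, since both produce intermediate values along the same line in color space.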

For  brevity,  we  demonstrate  our  results  on 
one  example  that  uses  color  image  segmenta¬ 
tion.  This  algorithm  is  based  on  three  ideas: 
(i) using line-like color models to take into
account shadow candidate regions; (ii) dovetailing
the processing between color-space and
image-space; and (iii) looking for the best description
of an image in terms of primitive models
via region segmentation. The region growing
process recovers uniform or linear functions
in color space. Region growing is initiated from
seed regions found from the highest peaks in the
color histogram or, if no peaks are found, from
a grid of small regions laid on the image.
The result of this process is shown in Figure 1.
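
The seed-selection step can be sketched as follows. This is a simplified stand-in, not the authors' implementation: a "peak" is taken to be any sufficiently populated bin of a coarse color histogram, and the bin size, count threshold, and grid spacing are hypothetical parameters.

```python
from collections import Counter


def pick_seeds(pixels, width, height, bin_size=32, min_count=4, grid_step=8):
    """Choose region-growing seeds for an image.

    pixels maps (x, y) -> (r, g, b). Returns ("histogram", bins) with the
    most populated color bins first, or ("grid", positions) when no bin
    reaches min_count pixels.
    """
    hist = Counter(tuple(c // bin_size for c in rgb) for rgb in pixels.values())
    peaks = [b for b, n in hist.most_common() if n >= min_count]
    if peaks:
        return "histogram", peaks
    grid = [(x, y)
            for y in range(0, height, grid_step)
            for x in range(0, width, grid_step)]
    return "grid", grid


# A uniformly colored 4 x 4 image yields a single histogram peak.
flat = {(x, y): (200, 180, 40) for x in range(4) for y in range(4)}
print(pick_seeds(flat, 4, 4))  # ('histogram', [(6, 5, 1)])
```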

2  A  Physics-Based  Framework  for 
Shape  and  Nonrigid  Motion  Estimation 
(D.  Metaxas) 

2.1  Introduction 

In the past couple of years we have dealt with
the robust and efficient estimation of shape and
nonrigid  motion  and  addressed  several  related 
Identifying Shadows


Figure 1: Top Left: An image of a road directly lit and in shadow, courtesy of the Carnegie Mellon
Navlab project. Top Right: A gray scale labeling of the umbra and penumbra of the shadows on
the road as recovered by our color image segmentation for shadows. Shadow umbra is indicated
by dark gray, penumbra by light gray, and the road directly lit by white. Bottom: The full color
image segmentation of the original image aimed at recovering single materials directly lit and in
shadow as single image regions. The different regions are indicated by different, suggestive colors.
Black indicates that no region was found at that image position.


Table 1: Shadow Case Study. The columns indicate lighting conditions under which shadows may
be cast. The rows indicate material properties within the scene and whether or not shadows include
an umbra. A 'y' indicates that the case can be handled, while an 'n' indicates that the case cannot
be handled in general by our system. The label Multiple Albedos indicates distinct constant albedos,
while the label Varying Albedos indicates smoothly varying albedos and material properties. The
column in which an observer's environment falls can be determined by examining a shadow actively
cast by the observer.

*  For  this  case,  the  umbra  alone  is  represented  as  a  linear  color  cluster,  as  opposed  to  the  other 
cases  our  system  can  handle,  where  a  single  material  both  in  and  out  of  shadow  is  represented  by 
a  linear  color  cluster. 
difficult problems. We have consequently developed
a physics-based framework for 3D shape
and nonrigid motion modeling which includes:
(i) a new class of dynamic deformable primitives
which combine global and local deformation
parameters; (ii) a systematic approach
based on Lagrangian dynamics and the finite element
method to convert the geometric parameters
of the primitives to dynamic degrees of
freedom; (iii) the development of physics-based
constraints between these deformable primitives
that may be used to track the motions
of complex articulated objects; (iv) a recursive
technique for estimating shape and nonrigid
motion from noise-corrupted data based
on applying Kalman filtering theory to our dynamic
models; and (v) new applications to visual
estimation. In what follows we elaborate
on each of the above technical contributions,
which have also been reported in [Terzopoulos
and Metaxas, 1990; Metaxas and Terzopoulos,
1991a; Metaxas and Terzopoulos, 1991b;
Metaxas and Terzopoulos, 1993].

2.2  New  Deformable  Primitives 

We have created a new family of modeling
primitives  by  developing  a  mathematical  ap¬ 
proach  that  allows  the  combination  of  global 


and  local  deformations.  Our  primitives  include 
global  deformation  parameters  which  represent 
the  salient  shape  features  of  natural  parts  and 
local  deformation  parameters  which  capture 
shape  details.  More  specifically,  we  have  devel¬ 
oped  hybrid  models  whose  underlying  geometric 
structure  allows  the  combination  of  paramet¬ 
ric  models  (superquadrics,  spheres,  cylinders), 
parameterized global deformations (bends, tapers,
twists, shears, etc.) and local spline free-form
deformations. In this way, the descriptive
power of our models is a superset of the
descriptive power of locally deformable models
[Terzopoulos and Witkin, 1988] and globally deformable
models [Pentland and Horowitz, 1991;
Witkin and Welch, 1990]. An important benefit
of the global/local descriptive power of these
models is that it can potentially satisfy the often
conflicting requirements of shape reconstruction
and shape recognition. The local degrees of freedom
of deformable models allow the reconstruction
of fine scale structure and the natural irregularities
of real world data, while the global
degrees  of  freedom  capture  the  salient  features 
of  shape  that  are  innate  to  natural  parts  and 
appropriate  for  matching  against  object  proto¬ 
types. 
2.3 Systematic Formulation of Dynamic
Primitives 

Through the application of Lagrangian mechanics,
we have developed a method to systematically
convert the geometric parameters of the
solid primitive, the global (parameterized) and
local (free-form) deformation parameters, and
the six degrees of freedom of rigid-body motion
into generalized coordinates or dynamic degrees
of freedom. More precisely, our method applies
generally across all well-posed geometric primitives
and deformations, so long as their equations
are differentiable. The resulting Lagrange
equations of motion which describe the dynamic
behavior of our models take the form

$M\ddot{q} + D\dot{q} + Kq = g_q + f_q,$

where $M$, $D$, and $K$ are the mass, damping, and
stiffness matrices, respectively, where $g_q$ are inertial
forces arising from the dynamic coupling
between the local and global degrees of freedom,
and where $f_q(u, t)$ are the generalized external
forces (computed from the forces that the data
exert on the model) associated with the degrees
of freedom $q$ of the model.

The distinguishing feature of our approach is
that  it  combines  the  parameterized  and  free¬ 
form  modeling  paradigms  within  a  single  physi¬ 
cal  model.  Thus  our  models  exhibit  correct  me¬ 
chanical  behaviors  and  their  various  geometric 
parameters  assume  well-defined  physical  mean¬ 
ings  in  relation  to  prescribed  mass  distributions, 
elasticities, and energy dissipation rates. Furthermore,
motivated by the requirements of real
time  vision  applications  we  appropriately  sim¬ 
plify  the  models  and  use  simple  numerical  inte¬ 
gration  techniques  to  achieve  real  time  or  near 
real  time  simulation  rates  on  available  graphics 
workstations. 
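
To make the phrase "simple numerical integration techniques" concrete, here is a minimal sketch of one such scheme, a semi-implicit Euler step for the equations of motion. It assumes diagonal mass, damping, and stiffness matrices (stored as lists) and lumps the right-hand side into a single force vector; both simplifications are ours, and the function name is hypothetical.

```python
def euler_step(q, qdot, mass, damping, stiffness, force, dt):
    """One semi-implicit Euler step of M q'' + D q' + K q = f.

    M, D, K are assumed diagonal and given as lists; force stands for the
    combined right-hand side (inertial plus external generalized forces).
    """
    new_q, new_qdot = [], []
    for qi, vi, m, d, k, f in zip(q, qdot, mass, damping, stiffness, force):
        acc = (f - d * vi - k * qi) / m
        vi = vi + dt * acc            # update the velocity first
        new_qdot.append(vi)
        new_q.append(qi + dt * vi)    # then advance q with the new velocity
    return new_q, new_qdot


# A single damped degree of freedom pulled by a constant force.
q, qdot = [0.0], [0.0]
for _ in range(5000):
    q, qdot = euler_step(q, qdot, [1.0], [2.0], [4.0], [8.0], 0.01)
print(round(q[0], 3))  # settles at force / stiffness = 2.0
```

With a constant data force the degree of freedom comes to rest where the stiffness term balances the force, which is the sense in which the data "deform" the model toward a fit.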

2.4  Physics-Based  Constraints 

To  deal  with  constrained  multipart  objects 
such  as  articulated  anthropomorphic  bodies,  we 
have developed an efficient technique to implement
hard point-to-point constraints between
deformable  primitives.  These  constraints  are 
never  violated,  regardless  of  the  magnitude  of 
the forces experienced by the parts. Attempting
to approximate such constraints with simple,
stiff springs leads to instability. In our approach
[Metaxas, 1992], we compute the constraint
forces using a stabilized Lagrange multiplier
technique [Baumgarte, 1972].
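
For reference, Baumgarte stabilization in its usual textbook form replaces a holonomic constraint $C(q, t) = 0$ by a damped second-order equation in the constraint error. The constants $\alpha$ and $\beta$ are user-chosen stabilization parameters; their values, and the exact form used for the deformable primitives, are not given in this summary.

```latex
\ddot{C} + 2\alpha\,\dot{C} + \beta^{2} C = 0, \qquad \alpha > 0,\ \beta > 0 .
```

Any drift away from $C = 0$ introduced by numerical integration then decays instead of accumulating, which is what allows the point-to-point constraints to hold without resorting to stiff springs.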

2.5  Recursive  Estimation 

We also have exploited the constrained nonrigid
motion synthesis capabilities of our models in
order to estimate shape and motion from incomplete,
noisy observations available sequentially
over time. Applying continuous nonlinear
Kalman filtering theory [Metaxas and
Terzopoulos, 1991b; Metaxas and Terzopoulos,
1993], we have constructed a powerful new recursive
estimator which employs the Lagrange
equations of 3D nonrigid motion as a system
model. We interpret the Kalman filter physically:
the system model continually synthesizes
nonrigid motions in response to generalized
forces  that  arise  from  inconsistencies  between 
its  state  variables  and  the  incoming  observa¬ 
tions.  The  observation  forces  account  formally 
for instantaneous uncertainties in the data. A
Riccati procedure updates an error covariance
matrix which transforms the forces in accordance
with the system dynamics and the prior
observation history. The transformed forces induce
changes in the translational, rotational,
and  deformational  state  variables  of  the  system 
model  to  reduce  the  inconsistencies.  Thus  the 
system  model  synthesizes  nonstationary  shape 
and  motion  estimates  in  response  to  the  visual 
data. 
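
The predict/update cycle can be illustrated with a deliberately reduced example: a discrete-time, scalar Kalman filter for a constant state. This is a stand-in for the continuous nonlinear filter described above, not a version of it; the noise variances `q` and `r` are hypothetical. The innovation `z - x` plays the role of the observation force, and the gain, derived from the error covariance, scales it.

```python
def kalman_step(x, p, z, q=1e-4, r=0.25):
    """One predict/update cycle of a scalar Kalman filter.

    x, p: current state estimate and its error variance
    z:    new observation
    q, r: process and observation noise variances (illustrative values)
    """
    p = p + q                   # predict: uncertainty grows by process noise
    gain = p / (p + r)          # how strongly the innovation is trusted
    x = x + gain * (z - x)      # the innovation pulls the state toward the data
    p = (1.0 - gain) * p        # discrete Riccati update of the error variance
    return x, p


x, p = 0.0, 1.0
for z in [1.1, 0.9, 1.05, 0.95, 1.0]:
    x, p = kalman_step(x, p, z)
print(round(x, 2), round(p, 3))
```

As observations accumulate the variance shrinks, so later data exert progressively weaker corrections on the state.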

2.6  Experiments 

The following experiments demonstrate the application
of the above described techniques to
various data. Fig. 2 illustrates the fitting of a
deformable superquadric to a monocular image
of a pestle, shown in Fig. 2(a). The image is converted
into  a  force  field  that  acts  on  the  model,  de¬ 
forming  it  such  that  it  becomes  consistent  with 
the  occluding  boundary  of  the  pestle  in  the  im¬ 
age.  Fig.  2(b)  shows  the  initial  state  of  the 
deformable  superquadric  displayed  in  wireframe 
projected onto the image. Fig. 2(c) shows an intermediate
step in the fitting process as the image
forces are deforming the model and Fig. 2(d)
shows  the  final  reconstructed  model. 

The following two experiments demonstrate the
performance of our recursive shape and motion
estimator. In the first experiment the estimator
incorporates constrained deformable superquadrics
as Kalman filter system models.
The  figure  illustrates  a  model  composed  of  five 
connected  deformable  superquadrics.  The  esti¬ 
mator  is  applied  to  biomechanical  data  collected 
by  3D  position  sensors  applied  to  the  arms  of  a 
human  subject.  Fig.  3(a)  shows  a  view  of  the  3D 
data  and  the  initial  models.  Fig.  3(b)  shows  an 
intermediate  step  of  the  fitting  process  driven 
by  data  forces  from  the  first  frame  of  the  data 
sequence,  while  Figs.  3(c)  and  (d)  show  differ¬ 
ent  views  of  the  models  fitted  to  the  initial  data. 
Figs.  3(e)  and  (f)  show  intermediate  frames  of 
the  models  tracking  the  nonrigid  motion  of  the 
subject's flexing arms, while Figs. 3(g) and (h)
show  two  views  of  the  final  position  of  the  mod¬ 
els. 

In Fig. 4 we add uniform noise by perturbing the
noiseless values of the 123 motion range data points
by ±5% (with randomly chosen sign), and
we fit a deformable superquadric with 81 nodes.
Figs.  4  (a)  and  (b)  show  two  views  of  the  range 
data  and  the  initial  model.  Fig.  4(c)  shows  an 
intermediate  step  of  the  fitting  process  driven  by 
data  forces  from  the  first  frame  of  the  motion 
sequence.  Figs.  4(d)  and  (e)  show  the  model  fit¬ 
ted  to  the  initial  data,  with  visible  tapering  and 
bending  global  deformations.  Fig.  4(f)  shows  an 
intermediate  frame  of  the  model  tracking  the 
nonrigid  motion  of  the  squash,  while  Figs.  4  (g) 
and  (h)  show  the  final  position  of  the  squash. 
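
The noise model used in this experiment is simple enough to sketch. The version below assumes the ±5% perturbation is applied independently to each scalar value; the text does not say whether it is per coordinate or per point, so that choice and the function name are ours.

```python
import random


def perturb(values, fraction=0.05, rng=None):
    """Scale each value by (1 + s * fraction), with s drawn from {-1, +1}."""
    rng = rng or random.Random()
    return [v * (1.0 + rng.choice((-1.0, 1.0)) * fraction) for v in values]


clean = [10.0, 20.0, 40.0]
noisy = perturb(clean, rng=random.Random(0))  # seeded for repeatability
```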

3  Multisensor  Fusion  (M.  Mintz) 

3.1  Introduction 

The  successful  design  and  operation  of  au¬ 
tonomous  or  partially  autonomous  agents  which 
are  capable  of  traversing  uncertain  terrains  re¬ 
quires  the  application  of  multiple  sensors  for 
tasks  such  as:  reconnaissance,  surveillance,  and 
target  acquisition  and/or  manipulation.  In  ap¬ 
plications  which  include  a  teleoperation  mode, 
there  remains  a  serious  need  for  local  data  re¬ 
duction  and  decision-making  to  avoid  the  costly 
or  impractical  transmission  of  vast  quantities  of 
sensory data to a remote operator. There are
several reasons to include multisensor fusion in
a system design: (i) it allows the designer to


Figure 4: Tracking of fully deformable squash-shaped
object with noise.
combine intrinsically dissimilar data from several
sensors to infer some property or properties
of the environment, which no single sensor could
otherwise obtain; and (ii) it allows the system
designer to build a robust system by using partially
redundant sources of noisy or otherwise
uncertain information.

3.2  Sensor  Fusion  Research  Issues 

The  following  task-related  issues  arise  in  the 
design  and  operation  of  autonomous  systems 
which  employ  multiple  sensors:  (i)  the  value  of 
a  sensor  suite;  (ii)  the  layout,  positioning,  and 
control of sensors (as agents); (iii) the marginal
value of sensor information; the value of sensing
time versus some measure of error reduction,
e.g., statistical efficiency; (iv) the role of sensor
models, as well as a priori models of the
environment; and (v) the calculus or calculi
by which consistent sensor data are determined
and combined.

3.3  Technical  Rationale 

An  important  role  for  active  sensing  is  the 
surveillance  and  sensory  exploration  of  environ¬ 
ments  that  are  characterized  by  significant  a 
priori  uncertainties.  In  addition  to  uncertainty 
in the environment, the sensors themselves exhibit
noisy behavior. While good engineering
practice can reduce certain noise components,
it is impractical if not impossible to eliminate
them completely. Thus, all sensor measurements
are uncertain. However, sensor errors
can be modeled statistically, using both physical
theory and empirical data. In developing
these models, one recognizes that a single distribution
is usually an inadequate description
of sensor noise behavior. It is much more realistic
and much safer to identify an envelope
or class of distributions, one of whose members
could represent the actual statistical behavior
of the given sensor. This use of an uncertainty
class (or equivalently an envelope, set, or neighborhood)
in distribution space protects the system
user against the inevitable unpredictable
changes that occur in sensor behavior. Reasons
for uncertainty in statistical sensor models include:
sporadic interference, drift due to aging,
temperature variations, miscalibration, quantization,
and other significant nonlinearities over
the dynamic range of the sensor.

3.4  Approach 

Our  approach  to  sensor  fusion  employs  statisti¬ 
cal  decision  theory  to  obtain:  (i)  a  robust  test 
of the hypothesis that data from different sensors
are consistent; and (ii) a robust procedure
for combining the data that pass this preliminary
consistency test. Here, robustness refers to
the statistical effectiveness of the decision rules
when the probability distributions of the observation
noise and the a priori position information
associated with the individual sensors are
uncertain. 
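
The two-stage structure (test for consistency, then combine) can be shown with a much-simplified stand-in. The sketch below is not the decision-theoretic procedure: it assumes each scalar reading carries a fixed-size interval of known half-width, takes the a priori information to be a bounding interval, declares the readings consistent when all intervals share a point, and combines them by the midpoint of the common region. All names and the combination rule are hypothetical.

```python
def fuse(readings, half_width, bounds):
    """Interval-intersection consistency check followed by a simple combination.

    readings:   scalar sensor readings of the same quantity
    half_width: fixed half-size of the interval placed around each reading
    bounds:     (low, high) a priori constraint on the quantity
    Returns the midpoint of the common region, or None if the readings
    (together with the prior bounds) are inconsistent.
    """
    low = max(max(z - half_width for z in readings), bounds[0])
    high = min(min(z + half_width for z in readings), bounds[1])
    if low > high:
        return None               # no value is compatible with every sensor
    return 0.5 * (low + high)


print(fuse([0.0, 1.0], 0.2, (-5.0, 5.0)))  # None: the intervals do not overlap
```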

We  have  developed  a  coherent  decision-theoretic 
approach  to  robust  multisensor  fusion  which 
provides  the  means  to  compute  hard  probabilis¬ 
tic  confidence  measures  of  data  consistency  and 
robustly combine consistent sensor data [Kamberova
and Mintz, 1990; McKendall and Mintz,
1992]. Our approach allows the system designer
to explicitly incorporate a priori information
in the form of geometric constraints, and also
make use of set-valued uncertainty class models
which capture the noise behaviors of the various
sensors. This work is particularly important
because: (i) it allows the system designer
to incorporate geometric constraints or information
about the features or parameters of interest
without requiring generally insupportable
assumptions about a priori probabilities, e.g.,
the uniform distribution on the set; (ii) it allows
the system designer to incorporate realistic
sensor noise behavior in the analysis without requiring
the very often insupportable "Gaussian
hypotheses"; and (iii) the sensor noise distribution
may be an element of a nonparametric
set of distributions which are asymmetric, multimodal,
heavy-tailed, and generally nonmonotone
likelihood ratio.

Because  our  methodology  easily  allows  for  the 
accurate  incorporation  of  geometric  constraints, 
we are consequently able to address sensor fusion
tasks in which both wide and narrow field-of-view
sensors are employed. Specifically, we
can  make  robust  confidence  set-based  tests  for 
correspondence  between  features  at  coarse  and 
fine  scales. 

3.5  Current  and  Related  Research 
We have studied the theory and application
of robust fixed-size confidence intervals
as a methodology for robust multisensor fusion.
This work has been delineated in [Kamberova
and Mintz, 1990] and [McKendall and
Mintz, 1992]. Currently, we are implementing
a DARPA-funded multiagent hardware-software
testbed for studying multisensor fusion,
and multiagent communication and cooperation
[Bajcsy et al., 1992]. This testbed is
based on: (i) mobile robotic agents with multiple
sensors and manipulatory capability; and
(ii) mobile observer (sensory) agents.

Our  sensor  fusion  studies  focused  initially  on 
confidence  intervals  as  opposed  to  the  more  gen¬ 
eral  paradigm  of  confidence  sets.  The  basic  dis¬ 
tinction  here  is  between  fusing  data  character¬ 
ized  by  an  uncertain  scalar  parameter  versus 
fusing  data  characterized  by  an  uncertain  vec¬ 
tor  parameter,  of  known  dimension.  While  the 
confidence  set  paradigm  is  more  widely  applica¬ 
ble,  we  initially  chose  to  address  the  confidence 
interval  paradigm,  since  we  were  simultaneously 
interested in addressing the issues of: (i) robustness
to nonparametric uncertainty in the sampling
distribution; and (ii) decision procedures
for small sample sizes. Our research on optimal
and robust fixed-size confidence intervals
has appeared in [Zeytinoglu and Mintz, 1984;
Zeytinoglu and Mintz, 1988].

We  have  also  investigated  the  multivariate  (con¬ 
fidence  set)  paradigm.  The  delineation  of  op¬ 
timal  confidence  sets  with  fixed  geometry  is  a 
very  challenging  problem  when:  (i)  the  a  priori 
knowledge  of  the  uncertain  parameter  vector  is 
not  modeled  by  a  Cartesian  product  of  inter¬ 
vals  (a  hyper-rectangle);  and/or  (ii)  the  noise 
components  in  the  multivariate  observations  are 
not statistically independent. Although it may
be difficult to obtain optimal fixed-geometry
confidence sets, we have obtained some very
promising approximation techniques. These approximation
techniques provide: (i) statistically
efficient fixed-size hyper-rectangular confidence
sets for decision models with hyper-ellipsoidal
parameter sets; and (ii) tight upper
and lower bounds to the optimal confidence coefficients
in the presence of both Gaussian and
non-Gaussian sampling distributions. We have
shown, through numerous examples of these applications,
that the risks of the approximating
procedures are within 0.5% of optimal. These
approximation techniques have been reported in
[Kamberova et al., 1992].

Further,  we  generalize  these  previous  decision- 
theoretic  results  in  two  important  directions: 

(1) We obtain optimal and near-optimal fixed-size
confidence intervals for restricted location
parameters for sampling distributions which do
not possess monotone likelihood ratio. Examples
of this sort of distribution are the Cauchy
distribution, and Gaussian distributions with
heavy-tailed contamination. We derive a class
of nonmonotone almost equalizer rules for this
decision problem. We establish that rules in this
class achieve near-minimax risk. In particular,
in the case of the Cauchy sampling distribution
we show, by example, that the risk is within
0.3% of optimal. We also establish that very
general shift and scale mixtures of Gaussian distributions
have optimal procedures with a very
simple monotone form. Since these Gaussian
mixtures are generally not monotone likelihood
ratio, this suggests that a critical factor which
determines the need to consider nonmonotone
rules is the tail-behavior of the sampling distribution.
We obtain results which delineate
this connection. These results [Kamberova
and Mintz, 1993] extend the work on monotone
procedures [Zeytinoglu and Mintz, 1984;
Zeytinoglu and Mintz, 1988].

(2)  We  obtain  minimax  rules  for  restricted  lo¬ 
cation  parameters  under  symmetric  multilevel 
bowl-shaped  loss  for  symmetric  sampling  distri¬ 
butions  with  monotone  likelihood  ratio.  Multi¬ 
level  bowl-shaped  loss  functions  are  obtained  by 
a  convex  combination  of  n  zero-one  loss  func¬ 
tions  with  given  width  parameters.  Sufficient 
conditions  for  minimax  Bayes  rules  are  derived. 
These  conditions  are  easy  to  check  numerically. 
The  minimax  rules  possess  the  following  struc¬ 
ture:  the  rules  are  continuous  (or  piecewise- 
continuous),  piecewise-linear  functions  with  al¬ 
ternating  segments  of  zero  and  unit  slope. 
These  rules  are  simple  to  compute  numerically. 
Further, we show: (i) how to approximate arbitrary
symmetric bowl-shaped loss functions using
multilevel approximants; and (ii) how to
obtain accurate approximations to the minimax
rules for decision problems with symmetric
bowl-shaped loss functions and restricted parameter
spaces. An outcome of this approximation
study is the result that the minimum
Fisher information prior ($\cos^2$) defines a bounding
envelope for the least favorable prior distributions
when the scale parameter of the sampling
distribution tends to zero. These results
[Kennedy and Mintz, 1993] extend the work
on zero-one loss [Zeytinoglu and Mintz, 1984;
Zeytinoglu and Mintz, 1988], and on the role of
the minimum Fisher information prior [Bickel,
1981].
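
The construction of a multilevel bowl-shaped loss in item (2) can be written out directly. The sketch assumes the usual zero-one convention (no loss when the absolute error is within the width, unit loss otherwise); the function names and the example widths and weights are hypothetical.

```python
def zero_one_loss(error, width):
    """0 if the estimate is within `width` of the true value, 1 otherwise."""
    return 0.0 if abs(error) <= width else 1.0


def multilevel_loss(error, widths, weights):
    """Convex combination of zero-one losses with the given width parameters."""
    if any(w < 0 for w in weights) or abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must be non-negative and sum to 1")
    return sum(w * zero_one_loss(error, e) for w, e in zip(weights, widths))


widths = [1.0, 2.0, 3.0]
weights = [0.5, 0.3, 0.2]
for err in (0.5, 1.5, 2.5, 3.5):
    print(err, multilevel_loss(err, widths, weights))
```

The result is a symmetric staircase that rises with the size of the error, which is why a sufficiently fine set of levels can approximate an arbitrary symmetric bowl-shaped loss.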

4  Object  Recognition  (G.  Provan) 

4.1  Introduction 

Over the past year work has been done in generalizing
existing object recognition capabilities,
building upon previous work done in the
GRASP Lab on recovering superquadric representations
for multi-part objects from dense
range data [Solina, 1987; Gupta and Bajcsy,
1990].  The  primary  contributions  are  princi¬ 
pled  solutions  to  the  difficult  problems  of  su¬ 
perquadric  part  classification  and  model  index¬ 
ing,  and  object  class  recognition.  This  project 
consists of two stages: (i) object database creation,
and (ii) recognition.

4.2  Object  Database  Creation 

Each  object  is  represented  as  a  set  of  su¬ 
perquadric  parts,  and  each  superquadric  part  in 
turn is represented by a superquadric parameter
vector m. In creating an object database,
first, we cluster together similar model parts
to create a reasonable number of prototypical
part classes. Second, we statistically analyze
the parameter-sets to identify the statistically
most significant subset of parameters m' which
distinguish objects (or object classes) from one
another. This is achieved by selecting a small
but highly diagnostic subset of the parameters
or by combining the original parameters to yield
a small number of new, more diagnostic features,
such as height-to-width ratio, squareness,
etc. For any given domain, such distinguishing
keys may be computed using statistical techniques
such as principal components analysis.


The  most  salient  sub-vector  m'  is  used  as 
a  smaller-dimension  initial  index  into  the 
database  during  object  recognition.  This  im¬ 
proves  upon  the  ad  hoc  nature  of  the  fea¬ 
ture/parameter  vectors  used  in  most  recogni¬ 
tion  systems. 
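
One way to pick a "small but highly diagnostic subset" is to score each parameter by how well it separates the part classes. The sketch below swaps in a simple per-parameter Fisher-style ratio (variance of class means over mean within-class variance) in place of the principal components analysis mentioned above; the function names and the toy data are hypothetical.

```python
from statistics import mean, pvariance


def fisher_scores(classes):
    """Score each parameter index by between-class / within-class variance.

    classes maps a class label to a list of equal-length parameter vectors.
    """
    dims = len(next(iter(classes.values()))[0])
    scores = []
    for j in range(dims):
        columns = [[v[j] for v in vecs] for vecs in classes.values()]
        between = pvariance([mean(col) for col in columns])
        within = mean(pvariance(col) for col in columns)
        scores.append(between / within if within > 0 else float("inf"))
    return scores


def select_index(classes, k):
    """Indices of the k highest-scoring parameters, best first."""
    scores = fisher_scores(classes)
    return sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)[:k]


parts = {
    "bowl": [[1.0, 5.0], [1.2, 3.0], [0.8, 4.0]],
    "cup": [[3.0, 4.5], [3.2, 3.5], [2.8, 4.0]],
}
print(select_index(parts, 1))  # [0]: only the first parameter separates the classes
```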

Additional  benefits  of  reduced-dimensionality 
vectors  include  greater  recognition  robustness 
(since many elements of the vector are often just
"noise", and the remaining elements have more
accurate  estimated  mean  and  covariance  matri¬ 
ces  used  in  classification),  faster  search  (fewer 
variables  to  match/search),  smaller  databases, 
and  more  efficient  overall  object  recognition. 

4.3  The  Recognition  Stage 

We  start  with  range  data  from  a  single  rigid 
multi-part  object.  We  are  currently  using  the 
segmentation  algorithms  developed  by  Solina 
and  Gupta,  although  we  hope  to  incorporate  the 
alternative  techniques  developed  by  Metaxas 
[Metaxas, 1992]. Once an initial superquadric
fit  has  been  done,  each  of  the  superquadric 
parts  recovered  from  the  input  is  paired  with 
the  best  matching  prototypical  part  class  using 
precomputed  class  statistics.  The  indexing  keys 
used  are  the  most  distinguishing  parameters  or 
parameter-clusters. 
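
Pairing a recovered part with its best matching prototypical class from precomputed statistics can be sketched as a nearest-class search. The example assumes each class is summarized by a mean vector and per-parameter variances (a diagonal covariance) and uses squared Mahalanobis distance as the match score; the text does not specify the matching rule, so these choices and all names are hypothetical.

```python
def mahalanobis_sq(x, class_mean, class_var):
    """Squared Mahalanobis distance under a diagonal covariance."""
    return sum((xi - mi) ** 2 / vi for xi, mi, vi in zip(x, class_mean, class_var))


def best_class(part, class_stats):
    """Label of the class whose statistics best explain the part's index vector.

    class_stats maps a label to (mean vector, variance vector).
    """
    return min(class_stats, key=lambda c: mahalanobis_sq(part, *class_stats[c]))


stats = {
    "bowl": ([1.0, 0.5], [0.04, 0.01]),
    "cup": ([3.0, 0.9], [0.04, 0.01]),
}
print(best_class([1.1, 0.55], stats))  # bowl
```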

This  matching  process  produces  a  probability 
associated with the data/model compatibility
for a set of object subparts. A formal evidential
approach is then used to compute the probability
of this collection of subparts being a multi-part
object.

This  feature-selection  approach  to  indexing  has 
been  tested  on  a  domain  of  simple  concave 
kitchen utensils (e.g., bowls, cups, pots, etc.).
Linear and quadratic classifiers were trained
and tested on a collection of 64 representative
objects and a set of 3 parameters was identified
as accounting for 98% of the variance, a significant
reduction in the dimensionality of the feature
space. These three parameters were then
used  for  indexing  using  dense  range  data  from 
representative  domain  objects. 

Future  work  involves  a  full  implementation  of 
the  subpart  evidential  combination  scheme,  and 
active  vision  routines  to  cope  with  poor  segmen¬ 
tations  or  low  match  probability.  If  recognition 
is not possible (e.g., due to the viewpoint leading
to an inability to uniquely distinguish the
object), this active vision approach should provide
the capability to use a new range image
(from  a  different  viewpoint)  to  improve  the  seg¬ 
mentation  and/or  re-run  the  recognition  algo¬ 
rithm. 
Abstract

This  paper  presents  an  overview  of  the  research  in 
image understanding (IU) at the University of Illinois
(UI) conducted during 1991-92. During this
period,  our  work  has  been  in  five  areas:  integra¬ 
tion  for  three-dimensional  vision,  motion  analysis, 
analysis  guided  synthesis,  representation  and  navi¬ 
gation,  and  learning  and  recognition.  Work  in  each 
of  these  areas  is  reviewed. 

1  Introduction 

We  review  here  the  research  progress  we  have  made 
since  [33].  This  includes  the  progress  (Secs.  2-5)  on 
previously  ongoing  projects  in  the  four  areas  reported 
in  [33],  as  well  as  work  in  a  new  area  (Sec.  6).  The  first 
area  (Sec.  2)  is  concerned  with  integration  of  multiple 
image  cues  in  performing  image  interpretation.  These 
cues  capture  different  aspects  of  the  three-dimensional 
(3D)  scene  structure,  and  their  integrated  analysis  leads 
to  a  more  robust  inference  about  the  scene  character¬ 
istics  than  possible  from  individual  cues.  The  second 
area  (Sec.  3)  reports  our  work  on  interpretation  of  im¬ 
age  sequences  showing  dynamic  scenes.  Here  we  con¬ 
sider  the  problems  of  detecting  feature  correspondences 
and  estimating  the  3D  motion  parameters  and  surface 
structure  from  feature  correspondences,  in  a  sequence  of 
images  showing  motion  which  is  rigid  or  nonrigid,  and 
motion  specific  or  relatively  general.  The  third  area 
(Sec.  4)  is  concerned  with  analysis  guided  synthesis  of 
scenes  which  we  introduced  in  [33,  24].  Here  the  goal  is 
to  synthesize  images  for  depiction  of  3D  characteristics 
of  the  scene,  using  attributes  recovered  during  inter¬ 
pretation  or  artificial  attributes.  The  use  of  image  at¬ 
tributes  identified  during  3D  recovery  takes  advantage 
Commerce and Community Affairs under grant 90-103.


of  3D  analysis  for  perceptually  realistic  synthesis.  The 
fourth  area  (Sec.  5)  is  concerned  with  different  compo¬ 
nents  of  an  evolving  3D  representation  and  navigation 
system  which  has  the  goal  of  autonomously  acquiring, 
maintaining  and  using  3D  information  about  the  envi¬ 
ronment.  We  have  begun  work  in  the  area  of  learning 
object  recognition  strategies  (Sec.  6).  The  goal  here  is 
to  learn  to  automatically  perform  extraction  and  recog¬ 
nition  of  a  class  of  objects  from  examples  of  such  recog¬ 
nition.  Representative  projects  in  each  of  these  areas 
are  summarized  in  the  following  sections.  To  keep  the 
paper  brief,  we  have  minimized  discussion  of  and  refer¬ 
ences  to  relevant  work  done  by  others.  Such  discussion 
and  references  are  available  in  the  cited  and  other  listed 
publications. 

2  Integration 

Our goal in this area is to perform image interpretation
such that the interpretation simultaneously satisfies
a range of constraints imposed by the image structure
and the model of the scene. To do this, we use
different  computational  processes  each  of  which  carries 
complementary  or  redundant  information  derived  from 
different  image  cues.  Image  interpretation  is  the  result 
of  a  cooperative  computation  that  resolves  conflicts  and 
ambiguities  arising  from  the  individual  processes.  We 
have presented several examples of the integration approach
in previous IU workshops [31, 32, 33]. Here we
summarize  some  recent  work  on  integration. 

2.1  Integrated  Active  Stereo 

The  goal  of  our  continuing  work  on  active  stereo  is  sur¬ 
face  estimation  from  stereo  images  of  large  scenes  having 
large  depth  ranges,  where  it  is  necessary  to  aim  cameras 
in  different  directions  to  fixate  at  different  objects  and 
to  construct  the  global  surface  map  of  the  scene  from 
small  patches.  The  first  stage  of  this  work  involves  sur¬ 
face  reconstruction  of  a  single  object,  having  no  depth 
discontinuities.  It  performs  integration  of  camera  ver- 
gence,  focus,  aperture,  stereo  and  calibration  processes. 
The  second  stage  allows  scenes  containing  arbitrarily 
placed  and  arbitrary  size  objects.  In  this  stage,  a  part 
of the visual field that has not yet been fixated but
has appeared in the peripheral visual field helps determine
the next fixation point and provides coarse (inaccurate)
structural information, to be refined following
future  fixations.  Details  of  these  stages  can  be  found  in 
[31,  32,  33,  1]  and  the  references  cited  therein.  We  have 
now  started  a  study  to  analytically  compare  the  per¬ 
formances  of  the  binocular  cues  of  stereo  and  vergence, 
and  the  monocular  cue  of  focus  for  estimating  scene  sur¬ 
faces  [2].  For  ease  of  analysis,  this  is  done  by  considering 
the  estimation  of  range  of  a  scene  point,  thus  exclud¬ 
ing  the  changes  in  the  range  values  that  would  result 
from  the  use  of  surface  smoothness  constraint  during 
surface  fitting.  The  performance  of  the  individual  cues 
is  evaluated  as  a  function  of  errors  in  their  respective 
parameters.  Two  types  of  errors,  called  systematic  and 
random  errors,  are  identified  for  each  of  the  range  es¬ 
timation  methods.  The  effects  of  random,  quantization 
errors  are  expressed  in  terms  of  the  mean  and  variance 
of  the  resulting  depth  error.  Analytical  expressions  for 
the  effects  of  systematic,  calibration  errors  on  estima¬ 
tion  using  each  cue  are  also  obtained.  Further,  we  have 
developed  a  simplified  approach  to  modeling  the  spatial 
quantization  error  in  axial  stereo  vision  systems  with 
rectangular  pixel  geometry  [56].  The  need  for  simplifi¬ 
cation  arises  due  to  the  lack  of  symmetry  and  spatial  in¬ 
variance  of  the  image  disparity.  Numerical  simulations 
show that the modeling accuracy is adequate for practical
purposes, and point to the underlying complexity
of  the  true  error  distributions.  Such  performance  evalu¬ 
ation  of  the  individual  cues  is  useful  for  identifying  the 
imaging  parameters  whose  control  would  be  most  effec¬ 
tive  in  improving  the  range  estimate,  and  for  devising 
strategies  to  integrate  the  use  of  the  cues  in  order  to 
combine  their  strengths  and  to  overcome  their  individ¬ 
ual  limitations. 
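
As a rough illustration of how a disparity quantization error propagates to range, the following minimal sketch uses the textbook parallel-axis stereo relation Z = fB/d. This relation is an assumption made here for illustration; it is not the model analyzed in [2], nor the axial stereo model of [56]. The function names and the half-pixel quantization bound are hypothetical.

```python
def stereo_range(focal_px, baseline_m, disparity_px):
    """Range of a scene point for parallel-axis stereo: Z = f * B / d."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive")
    return focal_px * baseline_m / disparity_px


def range_error_bounds(focal_px, baseline_m, disparity_px, quant_px=0.5):
    """Signed range errors when the true disparity lies within
    +/- quant_px of the measured (quantized) disparity.

    Returns (error if the point is nearer, error if the point is farther).
    """
    z = stereo_range(focal_px, baseline_m, disparity_px)
    z_near = stereo_range(focal_px, baseline_m, disparity_px + quant_px)
    z_far = stereo_range(focal_px, baseline_m, disparity_px - quant_px)
    return z_near - z, z_far - z


def first_order_range_error(focal_px, baseline_m, disparity_px, quant_px=0.5):
    """First-order approximation |dZ| ~= Z**2 / (f * B) * |dd|."""
    z = stereo_range(focal_px, baseline_m, disparity_px)
    return z * z / (focal_px * baseline_m) * quant_px
```

Under these assumptions the two bounds are not symmetric about zero: the same disparity error produces a larger range error on the far side than on the near side, and the error grows roughly with the square of the range.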

We  have  made  further  progress  [57,  58]  on  the  multi- 
component  blurring  (MCB)  model  presented  in  [33]. 
Unlike  previous  blurring  models,  this  model  can  cap¬ 
ture  emergent  image  details  that  occur  due  to  depth- 
discontinuities  in  the  scene.  It  also  means  that  MCB 
effects  do  not  obey  the  maximum  principle  in  scale- 
space,  and  thus  blurring  can  create  spurious  details.  We 
have  simulated  Pentland’s  depth-from-blur  algorithm  in 
normal  and  MCB  blurring  cases  (MCB  cases  are  not 
tractable by Pentland’s method). Previously we had
suggested  that  human  perception  of  blur  and  depth 
could  be  enhanced  by  the  presence  of  MCB  effects. 
We  now  have  more  supporting  data  from  independent 
psychophysical  studies  concerning  an  interesting  phe¬ 
nomenon  in  human  blur  perception  that  has  some  cor¬ 
relation  to  the  MCB  effects.  We  have  done  extensive 
experiments  on  the  MCB  effects  in  real  images. 

2.2  Integrating  Camera  Panning  and 
Depth  from  Focus  using  a  Novel 
Camera 

In  integrated  active  stereo,  the  cameras  must  scan  large 
visual fields by fixating different parts and, at each fixation,
integrate the information from multiple cues. We
have  now  developed  a  new  approach  which  integrates 
panning,  focusing,  and  range  estimation.  To  experiment 
with  this,  we  have  developed  a  novel  camera  system 
whose  image  plane  tilt  with  respect  to  the  optical  axis 
is  controllable,  and  the  two  common  mechanical  opera¬ 
tions  of  focusing  and  panning  are  replaced  by  panning 
alone.  Consequently,  range  estimation  takes  place  at  the 
speed  of  panning.  Thus,  imaging  geometry  and  optics 
are  exploited  to  eliminate  explicit  sequential  computa¬ 
tion.  Since  the  camera  implements  a  range  from  focus 
approach,  the  resulting  estimates  have  the  advantages 
and  disadvantages  of  any  such  approach.  For  details  of 
this  work,  see  [3]  in  these  proceedings. 

2.3  Integrating  Shape  Estimation  from 
Stereo  and  Shading 

We  have  developed  an  approach  to  the  integration  of 
shape  information  provided  by  stereo  with  that  pro¬ 
vided  by  shading  for  estimating  surface  maps  [4].  Such 
integration  is  facilitated  by  the  use  of  color  images  which 
are  more  easily  segmented  than  gray  level  images.  The 
integrated  system  is  able  to  accurately  obtain  depth  es¬ 
timates  under  a  wider  range  of  conditions  than  either 
stereo  alone  or  shape  from  shading  alone.  Specifically, 
integrating  stereo  and  shape  from  shading  has  several 
advantages.  Stereo  algorithms  cannot  accurately  deter¬ 
mine  depth  for  large  featureless  regions.  Errors  in  sur¬ 
face  shape  are  locally  large  but  are  independently  dis¬ 
tributed  so  that  global  errors  tend  to  be  no  larger  than 
local  errors.  On  the  other  hand,  shape  from  shading 
tends  to  be  locally  accurate  but  can  cause  large  global 
errors,  especially  if  the  boundary  conditions  are  not  well 
known.  In  an  integrated  system,  shape  from  shading 
can  use  small  features  that  would  not  be  resolved  by 
a  stereo  system.  At  the  same  time,  stereo  can  provide 
initial  conditions,  boundary  conditions  and  stability  to 
the  iterative  solution  process  used  in  shape  from  shad¬ 
ing.  It  can  also  provide  the  initial  estimate  of  the  depth 
map  which  is  required  for  estimating  the  light  source 
direction.  We  have  developed  an  approach  to  estima¬ 
tion  of  the  light  source  distribution  to  facilitate  better 
interpretation  of  shading,  and  thereby  more  robust  in¬ 
tegration  of  stereo  and  shading  [5].  Shape  from  shad¬ 
ing  algorithms  are  limited  in  their  applicability  by  the 
assumption  of  overly  simplistic  models  of  the  complex 
light  source  distributions  that  occur  in  the  real  world. 
We  have  obtained  a  more  complete  representation  in 
which  the  lighting  model  makes  use  of  multiple  fixed 
point  sources  located  at  infinity.  Methods  of  estimat¬ 
ing  the  model  parameters  are  developed  and  a  method 
of  estimating  the  surface  shape  given  the  source  distri¬ 
bution  is  presented.  The  shape  estimation  algorithm 
using  multiple  sources  is  based  on  a  new,  single  source 
algorithm  which  is  an  improvement  over  existing  shape 
from  shading  algorithms.  The  source  distribution  al¬ 
gorithm  and  the  generalized  algorithm  for  shape  from 
shading  and  stereo  have  been  applied  to  images  of  real 
and artificial scenes.

2.4  Integrating  Target  Tracking  and 
Coarse-to-fine  Surface  Estimation 

We  have  begun  to  investigate  the  problem  of  tracking 
a  target  object  in  a  complex  visual  environment  un¬ 
der  binocular  viewing,  and  simultaneously  generating 
an  evolving  surface  map  [7,  8].  A  corner  feature  detec¬ 
tor  is  applied  to  each  image  to  locate  point  features.  To 
initiate  the  tracking  process,  we  use  correlation  across 
stereo  images  to  match  the  feature  points  both  in  space 
and in time, thus producing partial 3D trajectories without
prior knowledge of target motion. The unknown
object  motion  is  assumed  to  be  piecewise  smooth.  To 
capture  such  object  motion,  we  use  the  Autoregressive 
Moving  Average  (ARMA)  model  to  fit  the  3D  motion 
trajectory.  The  ARMA  model  is  initialized  by  each  par¬ 
tial  3D  trajectory  obtained,  and  then  the  model  is  used 
to  extend  the  trajectory  to  predict  future  image  plane 
positions  of  feature  points.  The  stereo  correspondences 
between  feature  points  are  established  by  comparing  the 
predicted  and  the  actual  image  locations  of  the  feature 
points. The 3D positions of the feature points are computed
from their image plane locations, the trajectories
are extended, the ARMA models are fitted, and the
camera configurations are modified to achieve tracking.
The  surface  maps  of  the  target  object  and  the  back¬ 
ground  are  constructed  by  fitting  the  3D  feature  points 
with  smooth  surface  patches.  Since  the  target  object  is 
maintained  in  fixation,  the  accuracy  of  the  environment 
range  map  depends  on  the  location  of  the  target  rela¬ 
tive  to  other  surfaces.  This  yields  target  tracking  and 
simultaneous generation of a surface map.
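
The predict-then-match loop described above can be sketched as follows. This is a simplified illustration, not the implementation of [7, 8]: it fits a purely autoregressive AR(2) model to one trajectory coordinate by least squares (the moving-average part of ARMA is omitted), and the gating rule and all function names are hypothetical.

```python
import math


def fit_ar2(series):
    """Least-squares fit of x[t] = a1 * x[t-1] + a2 * x[t-2]."""
    if len(series) < 4:
        raise ValueError("need at least 4 samples")
    s11 = s12 = s22 = b1 = b2 = 0.0
    for t in range(2, len(series)):
        p1, p2, y = series[t - 1], series[t - 2], series[t]
        s11 += p1 * p1
        s12 += p1 * p2
        s22 += p2 * p2
        b1 += p1 * y
        b2 += p2 * y
    det = s11 * s22 - s12 * s12
    if abs(det) < 1e-12:
        raise ValueError("degenerate trajectory; cannot fit AR(2)")
    # Solve the 2x2 normal equations by Cramer's rule.
    a1 = (b1 * s22 - b2 * s12) / det
    a2 = (s11 * b2 - s12 * b1) / det
    return a1, a2


def predict_next(series, coeffs):
    """One-step-ahead prediction from the two most recent samples."""
    a1, a2 = coeffs
    return a1 * series[-1] + a2 * series[-2]


def match_prediction(predicted_xy, detected_xy, gate=5.0):
    """Index of the detected feature nearest the predicted position,
    or None if no detection lies within `gate` pixels."""
    best_index, best_dist = None, gate
    for i, (x, y) in enumerate(detected_xy):
        dist = math.hypot(x - predicted_xy[0], y - predicted_xy[1])
        if dist <= best_dist:
            best_index, best_dist = i, dist
    return best_index
```

In this sketch each coordinate of a trajectory would be fitted separately, and the model would be refitted whenever the trajectory is extended with a newly matched feature point.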

This  as  well  as  our  integrated  active  stereo  algorithms 
[31,  32,  33]  have  been  implemented  on  the  University 
of  Illinois  Active  Vision  System  described  in  [6].  The 
system  employs  two  high-resolution  cameras  for  image 
acquisition  and  is  capable  of  automatically  directing 
movements  of  the  cameras  so  that  camera  positioning 
and  image  acquisition  are  tightly  coupled  with  visual 
processing.  The  system  was  designed  and  developed  in 
1987  as  a  research  tool,  largely  based  on  off-the-shelf 
components.  A  central  workstation  controls  imaging 
parameters,  which  include  five  degrees  of  freedom  for 
camera  positioning  (tilt,  pan,  translation,  and  indepen¬ 
dent  vergence)  and  six  degrees  of  freedom  for  the  control 
of  two  motorized  lenses  (focus,  aperture,  and  zoom).  In 
[6],  we  have  described  the  hardware  of  the  system,  the 
imaging  model,  the  calibration  method  employed  and 
some  of  the  system  software.  A  second  version  of  this 
system  has  recently  been  constructed  and  placed  on  an 
autonomous  vehicle. 

2.5  Integrating  Region  and  Border 
Extraction  for  Image  Structure 
Detection 

We  have  begun  work  on  developing  a  new  transform 
to extract the edge contours and skeletons of image regions
at multiple scales. We have considered the application
of the transform to detecting edge structure. The
transform  is  motivated  by  the  observation  that  linear 
processing  based  approaches,  such  as  convolution  and 
matching,  have  the  fundamental  deficiency  of  using  a 
priori  models  of  edge  geometry.  The  proposed  transform 
avoids  this  limitation  by  letting  the  structure  “emerge,” 
bottom-up,  from  interactions  among  pixels,  in  analogy 
with  statistical  mechanics  and  particle  physics.  The 
transform  involves  global  computations  on  pairs  of  pix¬ 
els  followed  by  vector  integration  of  the  results,  rather 
than  the  commonly  used  scalar,  local,  linear  process¬ 
ing.  An  attraction  force  field  is  computed  over  the  im¬ 
age.  Pixels  belonging  to  the  same  region  are  mutually 
attracted  whereas  those  across  edges  repel  each  other. 
Scale  is  an  integral  parameter  of  the  force  computation. 
The  resulting  groupings  of  pixels  represent  multiscale 
image  structure.  The  properties  desired  in  multiscale 
edge  detection  are  given,  and  it  is  theoretically  and  ex¬ 
perimentally  shown  that  the  transform  possesses  these 
properties. Along with the contours of multiscale regions,
the transform also extracts their skeletons. Preliminary
experimental results with synthetic and real images
demonstrate the above properties of the transform.
Details  of  this  work  are  given  in  a  separate  paper  in 
these  proceedings  [9].  Our  recent  work  on  integration 
of  Gestalt  constraints  for  dot  pattern  grouping  can  be 
found  in  [28]. 


2.6  Computational  Models  of  Integration 

One  formalism  for  modeling  the  process  of  integration 
using  dynamical  systems  is  presented  in  [30,  10].  Ac¬ 
cording  to  this  model,  visual  processing  is  performed 
in  parallel  at  each  location  in  an  image  by  multiple, 
relatively  simple  dynamical  systems.  Multiple  vision 
computations  are  unified  by  many  interacting  dynam¬ 
ical  systems.  For  example,  features  may  be  identified 
with  limit  sets  of  a  multi-attractor  system.  The  position 
of  the  feature  can  be  obtained  by  mapping  the  profile 
of,  say,  the  Laplacian-of-Gaussian  of  the  image  onto  a 
limit  cycle  attractor  where  phase  along  the  limit  cycle 
corresponds to relative image position. Similarly, velocity
information can be recovered by mapping the temporal
derivative of the Laplacian-of-Gaussian operator
onto  a  different  component  of  the  dynamics.  The  design 
and  analysis  of  a  three-dimensional  nonlinear  dynami¬ 
cal  system,  in  which  the  position  and  motion  profiles  of 
an  intensity  edge  are  mapped  onto  a  two-dimensional 
submanifold  of  the  model  dynamics,  are  discussed  in 
[10].  This  demonstrates  a  simple  form  of  integration 
by  embedding  two  inputs  within  a  single  system.  An 
array of such dynamical systems can be used for detecting
spatio-temporal trajectories in the image where
the analog nature of the dynamical systems ensures real-time
performance.

3  Motion  Analysis

The  long-range  goal  of  our  research  in  this  area  is 
the  understanding  of  dynamic  scenes.  We  have  made 
progress in three major subareas: finding feature correspondences
in image sequences, determining rigid motion
parameters  and  surface  structure  from  the  correspon¬ 
dences,  and  analyzing  nonrigid  motion. 

3.1  Detecting  Feature  Correspondences 

Detecting  feature  correspondences  is  difficult  due  to  a 
wide  variety  of  3D  structural  discontinuities  and  occlu¬ 
sions  that  occur  in  real  world  scenes.  A  major  part  of 
our  work  on  this  problem  is  concerned  with  matching 
point  features.  Our  objective  is  to  integrate  identifica¬ 
tion  of  feature  correspondences  with  segmentation  and 
motion  and  structure  estimation.  We  have  begun  to  de¬ 
velop  an  approach  to  detecting  and  segmenting  feature 
trajectories  in  image  sequences  having  multiple  moving 
objects  that  may  have  temporal  discontinuities  in  their 
motion.  One  common  type  of  motion  discontinuity  is 
in  motion  direction,  e.g.  when  an  object  undergoes 
collision.  The  object  surfaces  are  assumed  to  contain 
point  features  that  are  automatically  detected  as  the 
images  are  acquired.  The  process  of  feature  tracking 
begins  when  the  second  frame  is  acquired  and  the  fea¬ 
ture  points  detected.  With  two  frames,  feature  track¬ 
ing  amounts  to  finding  two-view  correspondences.  As 
more  frames  become  available,  tracking  becomes  equiv¬ 
alent  to  extending  the  trajectories  already  determined, 
and  matching  with  feature  points  in  the  next  frame.  In 
both  cases,  appropriate  constraints  are  used  to  compute 
costs  that  are  associated  with  every  candidate  match. 
These  constraints  involve  image  plane  similarity  in  the 
arrangement  of  neighbors  around  the  points,  smooth¬ 
ness  in  the  3-D  motion  of  objects  and  smoothness  in 
the  image  plane  motion  of  the  features.  The  costs  com¬ 
puted  using  the  above  constraints  are  merged  together 
to  obtain  a  single  cost.  This  cost  is  incorporated  into  an 
energy  function  along  with  the  uniqueness  constraints, 
and  this  energy  function  is  minimized  using  a  Hopfield 
network.  The  problem  of  removing  wrong  correspon¬ 
dences  that  may  result  from  local  minima  is  solved  using 
another  Hopfield-like  network.  Details  of  this  work  are 
presented  in  a  separate  paper  in  these  proceedings  [11]. 
In  another  effort,  we  have  considered  feature  extraction 
and matching as special cases of the more general problem
of signal detection [12]. Our earlier work on two-view
matching appears in [35].
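
The role of the merged cost and the uniqueness constraints can be illustrated with a small sketch. It is not the Hopfield network formulation of [11]: the weighted sum used to merge the costs is an assumption (the merging rule is not specified here), and the network minimization is replaced by an exhaustive search over one-to-one assignments, which is practical only for a handful of features. All names are hypothetical.

```python
from itertools import permutations


def merged_cost(costs, weights):
    """Combine per-constraint costs for one candidate match into a single
    cost. A weighted sum is assumed purely for illustration."""
    return sum(w * c for w, c in zip(weights, costs))


def best_one_to_one(cost_matrix):
    """Exhaustively find the one-to-one assignment with minimum total cost.

    cost_matrix[i][j] is the merged cost of matching trajectory i to
    feature j in the next frame; the matrix is assumed square.
    Uniqueness is enforced by searching permutations only.
    """
    n = len(cost_matrix)
    best_perm, best_total = None, float("inf")
    for perm in permutations(range(n)):
        total = sum(cost_matrix[i][perm[i]] for i in range(n))
        if total < best_total:
            best_perm, best_total = perm, total
    return list(best_perm), best_total
```

For the cost matrix `[[1, 2], [1, 5]]`, choosing the cheapest match for each row independently would assign both trajectories to feature 0; the uniqueness constraint instead forces the assignment `[1, 0]`.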

3.2  Rigid  Motion  and  Structure  from 
Image  Sequences 

Our  work  in  this  area  is  concerned  with  estimating  mo¬ 
tion  and  structure  of  a  scene  from  feature  correspon¬ 
dences.  We  are  interested  in  segmentation  of  the  se¬ 
quence  into  distinctly  moving  objects,  as  well  as  in  the 
estimation  of  the  motion  and  structure  of  each  object. 
As stated earlier, our objective has been to integrate detection
of correspondences, segmentation, and motion
and  structure  estimation. 

We  have  developed  sufficient  conditions  for  double  or 
unique  solution  of  the  problem  of  motion  and  structure 
estimation  of  a  rigid  surface  from  pairs  of  monocular  im¬ 
ages  [27].  These  conditions  further  the  understanding  of 
the  uniqueness  problem  of  rigid  motion  estimation.  We 
show  that  5  correspondences  of  noncolinear  points  that 
do  not  lie  on  a  special  type  of  quadratic  curve,  called 
Maybank  Curve,  in  the  image  plane  suffice  to  deter¬ 
mine  a  pure  rotation  uniquely,  and  6  correspondences 
of  points  that  do  not  correspond  to  space  points  lying 
on  a  Maybank  Quadric  suffice  to  determine  a  motion 
with  nonzero  translation  uniquely.  We  show  that  each 
Maybank  quadric  can  sustain  at  most  two  physically 
acceptable  motion  solutions  and  surface  interpretations, 
provided  that  a  sufficient  number  of  correspondences  are 
present.  In  particular,  we  show  that  in  the  plane  motion 
case,  6  correspondences  of  points  that  do  not  lie  on  a 
quadratic  curve  in  the  image  plane  only  admit  the  true 
motion  and  structure  and  their  duals  as  solutions.  We 
discuss  how  noise  affects  the  uniqueness  of  solution  and 
present  a  nonlinear  algorithm  for  estimation  of  motion 
parameters. 

We  have  developed  algorithms  for  estimating  motion 
and  structure  parameters  from  long  monocular  image 
sequences  by  using  the  most  appropriate  of  a  set  of  long 
sequence  motion  models  [13,  14,  15].  We  first  present  a 
new  two-view  motion  algorithm  and  then  extend  it  to 
long sequence motion analysis. The two-view motion
algorithm generally requires 6 pairs of point correspondences
to give a unique solution for the motion parameters.
However,  when  the  points  used  for  correspondences  lie 
on  a  Maybank  Quadric,  the  algorithm  requires  7  pairs 
of  point  correspondences  to  give  all  possible  double  so¬ 
lutions.  Object-centered  motion  representations  and 
models  of  motion  described  by  up  to  the  second  or¬ 
der  polynomials  are  analyzed.  Two  long  sequence  al¬ 
gorithms  are  presented,  one  using  interframe  matches, 
and  the  other  using  point  trajectories.  The  long  se¬ 
quence  algorithms  automatically  find  the  proper  model 
that applies to an image sequence and give the globally
optimal  solution  for  the  motion  and  structure  parame¬ 
ters  under  the  chosen  model.  Since  the  algorithm  does 
not  involve  structure  parameters,  it  contains  fewer  un¬ 
knowns  than  usual  which  makes  it  more  efficient  and 
robust.  Experimental  results  with  several  real  image 
sequences  showing  different  motions  demonstrate  the 
performance  of  the  algorithm. 

We  have  shown  that  the  motion  and  structure  of 
rigidly  moving  objects  can  be  completely  determined 
from  two  monocular  image  sequences  using  only  tem¬ 
poral  matches  [55].  Three  aspects  of  this  scheme  are 
useful: (1) since stereo matching is not necessary, two
cameras  can  view  totally  different  parts  of  the  rigid 
scene;  (2)  as  temporal  disparity  is  usually  significantly 
smaller  than  stereo  disparity,  matching  needs  only  to 
deal with relatively small disparities; and (3) the recoverable
scene structure is defined by the union of the
fields  of  view  of  two  cameras  instead  of  the  intersec¬ 
tion,  and  so  is  much  larger  than  that  of  a  conventional 
stereo  setup.  Experiments  with  synthesized  data  and 
real  world  images  demonstrate  the  feasibility  of  this 
scheme. 

We  have  developed  necessary  and  sufficient  conditions 
for  determining  shape  and  motion  to  within  a  mirror 
uncertainty  from  orthographic  projections  of  any  num¬ 
ber  of  point  trajectories  over  any  number  of  views  [16]. 
We  prove  that  there  are  always  two  sets  of  solutions 
of  shape  and  motion  under  orthographic  projection;  if 
shape S is a solution, so is its mirror image, which
is  symmetric  to  S  about  the  image  plane.  The  neces¬ 
sary  and  sufficient  conditions  for  determining  the  two 
sets  are  associated  with  the  rank  of  the  measurement 
matrix  W.  We  prove  that  if  the  rank  of  W  is  3,  then 
a  necessary  and  sufficient  condition  is  to  be  satisfied  to 
determine  the  solution  to  within  a  mirror  uncertainty. 
If  the  condition  is  not  satisfied,  then  infinitely  many 
solutions  result.  If  the  rank  of  W  is  2  and  the  image 
points  in  at  least  one  view  are  not  colinear  in  the  image 
plane,  then  there  are  two  possibilities:  either  the  mo¬ 
tion  is  around  the  optical  axis  or  the  3-D  points  all  lie 
on  the  same  plane.  In  the  first  case,  the  motion  can  be 
determined  uniquely  but  the  shape  is  not  determined. 
In  the  second  case,  a  necessary  and  sufficient  condition 
is  to  be  satisfied  and  at  least  3  point  trajectories  over  at 
least  3  views  are  needed  to  determine  the  shape  in  each 
view to within a mirror uncertainty. If the rank of W is
2 or 1 and the points in each view are colinear in the image
plane, then the three-dimensional motion problem
reduces  to  a  two  dimensional  motion  problem.  In  this 
case,  a  necessary  and  sufficient  condition  needs  to  be 
satisfied  to  determine  the  shape  and  motion  to  within 
a  mirror  uncertainty.  All  proofs  are  constructive  and 
can  be  used  to  build  a  completely  linear  algorithm  for 
estimating  shape  and  motion  from  point  trajectories. 
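
The rank cases above can be checked numerically on synthetic data. In the sketch below, W is assumed to be the 2F x P matrix of centroid-subtracted orthographic image coordinates (two rows per view, one column per point), in the style of factorization methods; its exact construction is not spelled out here, so this is an assumption. The helper names, point sets and rotation angles are hypothetical.

```python
import math


def rotation_y(angle):
    """Rotation about the vertical (y) axis."""
    c, s = math.cos(angle), math.sin(angle)
    return [[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]]


def rotation_z(angle):
    """Rotation about the optical (z) axis."""
    c, s = math.cos(angle), math.sin(angle)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


def measurement_matrix(points, rotations):
    """Stack centroid-subtracted orthographic projections, 2 rows per view."""
    rows = []
    for rot in rotations:
        for axis in (0, 1):  # image x row, then image y row
            coords = [sum(rot[axis][k] * p[k] for k in range(3)) for p in points]
            mean = sum(coords) / len(coords)
            rows.append([c - mean for c in coords])
    return rows


def numerical_rank(matrix, tol=1e-9):
    """Rank by Gaussian elimination with partial pivoting."""
    m = [row[:] for row in matrix]
    n_rows, n_cols = len(m), len(m[0])
    rank = 0
    for col in range(n_cols):
        if rank == n_rows:
            break
        pivot = max(range(rank, n_rows), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < tol:
            continue  # no usable pivot in this column
        m[rank], m[pivot] = m[pivot], m[rank]
        for r in range(rank + 1, n_rows):
            factor = m[r][col] / m[rank][col]
            for c in range(col, n_cols):
                m[r][c] -= factor * m[rank][c]
        rank += 1
    return rank
```

With non-coplanar points and rotation about the vertical axis, this construction gives rank 3. Coplanar points, or rotation purely about the optical axis, give rank 2, matching the two rank-2 possibilities listed above.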

Our  factorization  based  approach  [33]  to  integrated 
motion  and  structure  estimation  from  orthographic  im¬ 
age  sequences  of  arbitrary  length  and  containing  ar¬ 
bitrary  numbers  of  features  per  frame  is  presented  in 
[29, 17]. Our other recent work on motion and structure
estimation is reported in [34, 39, 36, 21].

3.3  Nonrigid  Motion 

We  have  continued  work  on  interpretation  of  image  se¬ 
quences  showing  nonrigidly  moving  objects  such  as  flu¬ 
ids,  deformable  objects  (human  face),  and  articulated 
objects  (human  body). 

Fluid  motion  is  a  major  focus  of  nonrigid  motion  re¬ 
search.  Unlike  rigid  body  or  elastic  body  motion  anal¬ 
ysis,  fluid  motion  analysis  is  based  on  the  vortex  struc¬ 
tures  in  the  velocity  fields  of  fluid.  Particle  tracking 
velocimetry is one way to obtain, experimentally, velocity
vectors of a turbulent flow at random positions. To further
analyze the motion of the fluid, we need to interpolate
the velocity vectors at regular grid points.


Previously,  we  developed  a  physically-constrained  ro¬ 
bust  interpolation  method.  The  velocity  fields  were 
considered  as  nonrandom  functions,  and  multivariate 
reciprocal  quadratics  were  chosen  as  the  interpolants 
for their nice mathematical properties. We now model
the velocity field as a vector-valued multidimensional
random process where the particle positions are approximated
by a Poisson sampling process [59]. Based on
this  model,  the  velocity  interpolation  problem  becomes 
that  of  interpolation  of  a  multidimensional  random  pro¬ 
cess.  While  considering  the  special  characteristics  of 
fluid  flow,  we  extended  previous  work  on  random  pro¬ 
cess  interpolation  from  scalar-valued  one-dimensional  to 
vector-valued multidimensional for homogeneous turbulence.
An optimum linear filter that minimizes the
mean-square error is derived for this interpolation problem.
The mean-square errors of this interpolation are
analyzed  against  the  fluid  property  as  well  as  the  par¬ 
ticle  density.  Compared  with  previous  interpolation 
methods,  the  interpolants  in  this  work  are  the  correla¬ 
tion  functions  of  turbulent  flows  which  have  clear  phys¬ 
ical  meaning  and  could  be  obtained  by  theoretical  anal¬ 
ysis  or  by  experiments. 
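
For orientation, the generic form of a linear minimum mean-square-error interpolator for a zero-mean, homogeneous, vector-valued random field is given below. This is the standard textbook form, stated here as an assumption; the filter derived in [59] may differ in detail, for example in how the Poisson sampling is treated. The symbols N, A_i and R are notation introduced for this sketch.

```latex
\hat{\mathbf{u}}(\mathbf{x}) = \sum_{i=1}^{N} A_i\, \mathbf{u}(\mathbf{x}_i),
\qquad
\sum_{i=1}^{N} A_i\, R(\mathbf{x}_i - \mathbf{x}_j) = R(\mathbf{x} - \mathbf{x}_j),
\quad j = 1, \dots, N,
\qquad
R(\mathbf{r}) = E\!\left[\mathbf{u}(\mathbf{y} + \mathbf{r})\, \mathbf{u}(\mathbf{y})^{T}\right].
```

Here the x_i are the N particle positions, the A_i are matrix-valued filter weights, and R is the velocity correlation function. The linear system follows from requiring the estimation error to be uncorrelated with every measured velocity, which is why the correlation functions act as the interpolants.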

Human  face  modeling  is  an  important  problem  in  ap¬ 
plications  such  as  videophone,  teleconferencing  and  per¬ 
son  identification.  We  have  developed  a  method  to  ob¬ 
tain  a  standard  3D  wire-frame  model  of  a  person’s  face 
from  only  the  front  and  two  side  views  [60].  The  generic 
face  model  we  have  used  consists  of  a  set  of  connected 
triangular  meshes.  For  each  view  of  the  face,  the  model 
is  fit  to  the  face.  Each  view  supplies  sufficient  depth  in¬ 
formation  to  modify  the  model  parameters.  The  model 
fit  is  then  compared  with  the  real  3D  data  of  the  face 
to  obtain  a  measure  of  relative  error.  For  this,  we  first 
fit  the  face  model  to  the  real  3D  data  by  least-squares 
matching  of  several  key  points,  and  then  find  the  corre¬ 
sponding  points  on  the  3D  data  according  to  the  nodes 
of  the  face  model  to  compute  the  error.  The  results  of 
our  experiments  show  that  this  method  can  generate  a 
realistic 3D face model for a person, which can be further
used for facial motion analysis and expression synthesis.
Such 3D model-based coding differs from conventional
waveform coding in that it makes use of 3D
properties  of  the  objects.  It  uses  an  explicit  3D  model 
for  the  object,  encodes  images  based  on  computer  vision 
techniques  and  recovers  the  original  images  with  com¬ 
puter  graphics  methods.  Since  it  only  transmits  several 
analysis  parameters,  an  extremely  low  bit  rate  of  image 
transmission  can  be  achieved.  We  are  also  continuing 
our  study  of  the  visual  motion  of  human  ambulatory 
patterns. 

4  Analysis  Guided  Synthesis 

In  [33,  24],  we  introduced  a  framework  for  image  synthe¬ 
sis  using  the  information  extracted  during  3D  interpre¬ 
tation.  The  objective  is  identification  and  depiction  of 
image attributes such that the display effectively communicates
the 3D scene structure as seen by an observer
in relative motion to the scene. We have continued and
extended  this  work  [25].  There  are  two  major  aspects 
of  this  research.  First,  it  introduces  the  notion  that 
the  cues  that  contribute  the  most  to  three-dimensional 
interpretation  are  also  the  ones  that  would  yield  the 
most  realistic  synthesis,  thus  suggesting  an  approach 
to analysis guided compression. Second, it presents an
approach  to  recovering  3D  motion  and  structure  param¬ 
eters  from  multiple  cues  present  in  a  monocular  image 
sequence,  such  as  point  features,  optical  flow,  regions, 
lines,  texture  gradient  and  vanishing  line.  The  use  of 
line  features  has  been  recently  incorporated  into  our 
implementation  of  the  approach.  For  concreteness,  this 
work  focuses  on  flight  image  sequences  of  a  planar,  tex¬ 
tured  surface.  The  integration  of  information  in  the 
diverse  cues  is  carried  out  using  optimization.  In  our  re¬ 
cent  work,  the  requirement  that  motion  be  smooth  is  no 
longer  imposed  and  the  vanishing  line  in  each  frame  is 
now  automatically  recognized  from  the  detected  lines  by 
using  the  motion  estimates  from  two  successive  views. 
For  reliable  estimation,  a  sequential  batch  method  is 
used  to  compute  motion  and  structure.  For  synthesis, 
real and/or artificial attributes are shown as a monocular
sequence or as a binocular (stereo) sequence thus
further  highlighting  the  recovered  motion  and  structure 
parameters.  Experiments  have  been  conducted  with 
two  image  sequences,  one  digitized  from  a  commercially 
available  videotape  as  reported  in  [24]  and  a  new  se¬ 
quence  acquired  from  a  laserdisc.  This  second  sequence 
is  more  challenging  to  our  algorithm  since  the  images 
contain  partially  or  completely  occluded  vanishing  lines 
and  there  is  reflection  of  the  ground  in  the  bottom  of 
the  airplane.  The  quality  of  the  images  is  somewhat 
better  than  that  of  the  VHS  tape  used  as  the  source 
of  the  first  sequence,  resulting  in  better  estimates.  Im¬ 
age  compression  ratios  achieved  for  these  sequences  are 
502  and  367  per  frame.  However,  since  the  motion  and 
structure  parameters  do  not  change  significantly  with 
each  new  frame,  the  contents  of  a  frame  can  be  esti¬ 
mated  from  the  structure  computed  from  other  nearby 
frames,  and  the  motion  and  structure  parameters.  Con¬ 
sequently,  only  1  out  of  every  n  frames  may  be  used  (say 
for  transmission  in  a  communication  scenario),  thus  in¬ 
creasing  the  compression  ratios  achieved  to  502n  and 
367n,  respectively.  A  stereo  display  of  the  results  has 
also been developed on a Silicon Graphics workstation,
which  can  be  viewed  using  stereo  glasses.  The  display 
sequence  appears  very  similar  to  the  original  sequence 
in  informal,  monocular  as  well  as  binocular  viewing. 
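
The arithmetic behind the 502n and 367n figures can be made explicit with a short sketch. Only the per-frame ratios 502 and 367 come from the experiments above; the function name and the sample values of n are illustrative assumptions.

```python
def effective_compression_ratio(per_frame_ratio, n):
    """Overall ratio when only 1 out of every n frames is transmitted.

    The skipped frames are assumed to be re-synthesized at the receiver
    from the nearby structure and the motion/structure parameters, so
    they cost nothing extra to send.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    return per_frame_ratio * n


if __name__ == "__main__":
    for ratio in (502, 367):
        for n in (1, 5, 10):  # hypothetical frame-skipping factors
            print(ratio, n, effective_compression_ratio(ratio, n))
```

The sketch ignores the small cost of sending the motion and structure parameters for the skipped frames, which the text treats as negligible.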

The  use  of  domain  specific  (model  based)  cues  in  in¬ 
tegrated  analysis  results  in  identification  of  model  com¬ 
ponents.  In  the  first  sequence,  this  is  demonstrated 
by  the  use  and  identification  of  the  vanishing  line.  In 
the  second  sequence,  the  constraint  that  runway  edges 
are  parallel  to  each  other  as  well  as  to  the  direction  of 
translation  results  in  identification  of  these  edges.  Such 
identification  of  model  components  (in  addition  to  more 
general scene characteristics) can be viewed as “analysis
guided recognition” [26]. Figure 1 shows experimental
results  for  the  second  sequence  in  which  the  runway  is 
recognized  and  then  enhanced  in  the  synthesized  im¬ 
ages. 

5  Representation  and  Navigation 

Our goals in this area remain twofold. First, we
are interested in efficient computation of representations
of shape information such as that acquired by the three-
dimensional estimation algorithms described in previous
sections. Second, we are interested in using the
scene representations for path planning. Our work on
both these problems uses potential fields as a computational
tool. Details of our initial work on the potential
field based approach to path planning are presented in
[20]. In [33] we summarized our potential field based approach
for efficient derivation of the medial axis transform
and the generalized cylinder representations of
a two-dimensional region [18]. Further, we also reviewed
how the skeletons of rigid moving objects are
used in conjunction with a potential field based representation
of free space for solving path planning problems
using closed form expressions for repulsive force. We
have  now  extended  this  work  to  path  planning  for  pla¬ 
nar  robot  arms.  Algorithms  are  developed  for  obtain¬ 
ing  arm  configurations  of  minimum  Newtonian  poten¬ 
tial  by  constraining  the  skeleton  of  the  robot  arm  ac¬ 
cording  to  the  given  path  topology  [19].  We  have  also 
surveyed  the  work  on  gross  motion  planning,  i.e.,  mo¬ 
tion  planning  without  contacts  between  robots  and  ob¬ 
jects,  for  point  robots,  rigid  robots  and  manipulators 
in  stationary,  time-varying,  constrained,  and  movable- 
object  environments  [22].  It  reviews  numerous  research 
results  on  motion  planning  reported  during  1985-1992 
by  researchers  in  various  disciplines  such  as  robotics, 
artificial  intelligence  and  computational  geometry.  It 
presents  a  taxonomy  of  motion  planning  problems  and 
a  classification  of  various  motion  planning  approaches 
which is used to structure the survey. Each type of motion
planning problem is explained and its complexity
is  described.  Relevant  algorithms  are  explained  briefly 
and  their  performances  are  compared  with  other  algo¬ 
rithms.  Future  research  directions  are  suggested  for 
each  motion  planning  problem. 
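
To illustrate the general idea of potential field based path planning, the following is a minimal sketch for a point robot on a grid. It is a generic artificial potential field with greedy descent, not the Newtonian potential formulation or the closed form repulsive force of [19, 20]; the function names, the constants `k_rep` and `d0`, and the grid layout are all assumptions made for illustration.

```python
import math


def potential(p, goal, obstacles, k_rep=1.0, d0=2.0):
    """Attractive distance-to-goal term plus a short-range repulsive term."""
    u = math.dist(p, goal)
    for ob in obstacles:
        d = math.dist(p, ob)
        if d == 0:
            return math.inf          # occupied cell
        if d < d0:                   # repulsion only acts within radius d0
            u += k_rep * (1.0 / d - 1.0 / d0) ** 2
    return u


def descend(start, goal, obstacles, size, max_steps=1000):
    """Greedy descent over the 8 neighbors of each cell.

    Stops at the goal or at a local minimum of the potential.
    """
    path = [start]
    cur = start
    for _ in range(max_steps):
        if cur == goal:
            break
        best, best_u = cur, potential(cur, goal, obstacles)
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                nxt = (cur[0] + dx, cur[1] + dy)
                if 0 <= nxt[0] < size and 0 <= nxt[1] < size:
                    u = potential(nxt, goal, obstacles)
                    if u < best_u:
                        best, best_u = nxt, u
        if best == cur:
            break                    # stuck in a local minimum
        cur = best
        path.append(cur)
    return path


if __name__ == "__main__":
    print(descend((0, 0), (9, 9), [(5, 5)], size=10))
```

Greedy descent of this kind can stall in local minima when obstacles are concave or densely packed, which is one reason the surveyed literture of this period explores many alternative planners.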

6  Learning  and  Recognition 

We  have  begun  work  on  learning  recognition  of  object 
classes,  including  the  segmentation  or  extraction  of  ob¬ 
ject  area  [37,  38].  Learning  involves  automatically  esti¬ 
mating  the  criteria  for  segmentation  and  classification 
in  terms  of  multiscale  structural  constructs,  called  con¬ 
cepts.  These  constructs  or  concepts  are  dynamically 
formed  in  terms  of  basic  image  features  such  as  edges 
from  a  large  number  of  recognition  examples. 

A  framework  called  Cresceptron  is  introduced  for  au¬ 
tomatic  algorithm  design  through  learning  of  concepts 
and rules needed for the development of algorithms,
thus deviating from the traditional mode in which humans
specify the rules which comprise a vision algorithm.
With Cresceptron, the designers need only to
provide  a  good  structure  for  learning,  but  they  are  re¬ 
lieved  of  most  design  details.  Cresceptron  is  tested  on 
the  task  of  visual  recognition:  recognizing  3-D  gen¬ 
eral  objects  from  2-D  photographic  images  of  natural 
scenes,  and  segmenting  the  recognized  objects  from  the 
cluttered  image  background.  Cresceptron  uses  a  hierar¬ 
chical  structure  to  grow  networks  automatically,  adap¬ 
tively  and  incrementally  through  learning.  Each  neural 
plane  in  the  network  hierarchy  gets  automatically  as¬ 
sociated  with  a  different  type  of  concept,  the  concepts 
being  detected  automatically,  and  the  network  grows  by 
creating  new  nodes  and  connections  which  memorize  the 
new  concepts  and  their  context.  Cresceptron  makes  it 
possible to generalize training examples to other perceptually
equivalent items. Segmentation and recognition
are simultaneous. No separate foreground extraction step
is necessary: segmentation is achieved by backtracking the response
of the network down the hierarchy to the image parts
contributing to recognition. Several types of network
structures  have  been  developed,  and  their  properties  are 
studied  in  terms  of  knowledge  recallability,  positional 
invariance,  generalization  power,  discrimination  power 
and  space  complexity.  Experiments  with  a  variety  of 
real-world  images  have  been  performed  to  demonstrate 
the  feasibility  of  learning  in  Cresceptron. 
Figure 1: Analysis guided synthesis of a take-off sequence taken from a commercial laserdisc.
Results are shown for three frames. (a) Input sequence. (b) Synthesized sequence composed
of those image attributes selected during integrated 3D analysis. (c) Same as input sequence
but the recognized runway edges are enhanced by repainting them and placing yellow disks
in them. As results of the integrated 3D analysis, the airplane could be removed from the
image sequence, the occluded scene parts filled in (by interpolating the estimated nearby
structure), and the scene part above the vanishing line colored (blue).
Abstract

Navigation  in  outdoor  terrain  is  difficult  due  to 
a  lack  of  easily  and  uniquely  identifiable  land¬ 
marks.  This  paper  outlines  current  research 
on  extraction  of  navigationally  salient  features 
from  images  and  maps,  feature  matching  and 
viewpoint  determination,  landmark  selection, 
detection  and  diagnosis  of  route  following  er¬ 
rors,  perceptual  issues  related  to  vision-based 
navigation,  and  database  and  software  avail¬ 
ability. 

1  Introduction. 

Navigation involves two closely related tasks: localization,
and route planning and following. Most often, a
navigating  agent  has  available  a  map  or  some  other 
model of the environment within which it is operating,
together  with  sensor  data  about  relevant  aspects  of  that 
environment  at  the  current  instant  in  time.  Localization 
finds  the  agent’s  position  within  the  map  or  model  frame 
of  reference.  Route  planning  involves  the  determination 
of  a  sequence  of  actions  aimed  at  accomplishing  some 
goal.  This  may  be  based  in  part  on  sensor  data  or  com¬ 
pletely  on  the  map  or  model  if  they  are  sufficiently  rich. 
Route  following  includes  those  processes  which  execute 
the  plan  and  monitor  for  errors.  These  activities  must 
be  closely  integrated.  For  example,  accurate  localization 
estimates  are  needed  for  route  planning  since  an  initial 
position  is  usually  required  and  for  route  following  to 
provide  closed  loop  control  of  position. 

Image  understanding  approaches  to  localization  must 
necessarily contain three parts: feature extraction,
matching,  and  viewpoint  inference.  Feature  extraction 
involves  the  detection  of  salient  patterns  in  both  sensed 
data  and  the  map  or  model.  Extracted  features  are  then 
matched,  establishing  a  correspondence  between  the  two 
frames  of  reference.  Finally,  this  correspondence  is  used 
to  place  the  viewpoint  in  the  map/model  frame  of  ref¬ 
erence.  At  least  in  principle,  these  steps  are  relatively 
straightforward  when  downward  looking  aerial  imagery 
is matched against a standard “plan view” map. (TERCOM
is a classic example [Andreas et al., 1978].) Features
can be either raw data or simple, derived point or
contour  properties.  Matching  is  essentially  2-D  correla¬ 
tion.  Viewpoint  determination  involves  standard  meth¬ 
ods  from  photogrammetry. 

Localization  is  much  more  difficult  when  performed  at 
or near ground level due to the 90° change in perspective
from  sensed  data  to  map.  Passive  image  understanding 
techniques  are  likely  to  have  serious  problems  estimating 
range  to  environmental  features  and  thus  the  relative  po¬ 
sition  of  those  features  to  each  other  and  to  the  viewpoint 
in  the  map  frame  of  reference.  More  sophisticated  fea¬ 
ture  extraction  and  matching  is  required  and  viewpoint 
determination methods must be able to function in
the  absence  of  accurate  3-D  information  from  sensors. 
We have made progress in the following areas:

•  Feature  extraction:  Domain  specific  feature  extrac¬ 
tion  routines  have  been  demonstrated  which  exploit 
constraints  imposed  by  the  geometry  of  terrain. 

•  Matching  and  viewpoint  determination:  Higher- 
level  symbolic  problem  solving  has  been  integrated 
with lower-level computer vision methods to produce
an image understanding system capable of
dealing  with  inference  and  ambiguity  in  localization. 

•  Landmark  selection:  Path  following  is  significantly 
aided  by  selecting  landmarks  which  minimize  local¬ 
ization  errors. 

• Diagnosis and recovery: AI-like problem solving
methods can complement lower-level computer vision
in detecting failures in route following and diagnosing
where the original error occurred.

•  Perceptual  issues:  An  understanding  of  the  abili¬ 
ties  and  limitations  of  human  perception  of  terrain 
features  can  give  insights  into  the  construction  of 
automated  navigation  and  also  lead  to  better  train¬ 
ing  methods. 

•  Database:  Examples  of  panorama  images  registered 
to  digital  elevation  data  together  with  a  variety  of 
useful  software  tools  are  being  made  available  to  the 
research  community. 

Results  of  this  work  are  of  potential  relevance  to  au¬ 
tonomous  and  semi-autonomous  mobile  vehicles,  naviga¬ 
tion  aids,  mission  planning,  simulation,  and  training. 
2 Localization.

Our work on navigation has focused primarily on problems
involving outdoor, unstructured terrain. Figure 1
shows  typical  feature  correspondences  that  must  be  es¬ 
tablished.  Since  distinctive  cultural  landmarks  are  not 
available  in  such  environments,  considerable  difficulties 
can  be  expected  in  reliably  associating  map  and  view  fea¬ 
tures.  One  way  to  approach  this  problem  is  to  use  sym¬ 
bolic  matching  by  first  independently  extracting  from 
the  view  and  map  patterns  likely  to  represent  the  same 
topographic  features  and  then  establishing  correspon¬ 
dences  using  a  hypothesize  and  test  strategy. 


Figure 1: Correspondence between map and view.


Feature  extraction  from  images  of  outdoor  terrain  is 
based  on  finding  ridge  contours  with  shapes  indicative 
of  peaks,  saddles,  and  valleys.  Peaks  and  saddles  are 
simply  vertical  extrema  in  ridge  line  contours.  Valleys 
are  more  difficult  to  find,  since  the  actual  valley  terrain 
is  usually  not  visible  and  must  be  inferred  from  other 
features  such  as  T-junctions  in  ridge  line  contours. 
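
As a rough illustration of finding vertical extrema along a ridge contour, the sketch below scans a one-dimensional profile of contour heights. It assumes that heights have already been converted so that larger values are higher in the scene, that local maxima are labeled peaks and local minima saddles, and that plateaus can be ignored; none of these details are specified above, and the function name is hypothetical.

```python
def vertical_extrema(heights):
    """Return (peaks, saddles) as index lists of strict local extrema.

    heights: heights of successive points along one ridge contour.
    Interior points only; flat runs (plateaus) are not reported.
    """
    peaks, saddles = [], []
    for i in range(1, len(heights) - 1):
        prev, cur, nxt = heights[i - 1], heights[i], heights[i + 1]
        if cur > prev and cur > nxt:
            peaks.append(i)        # local maximum along the contour
        elif cur < prev and cur < nxt:
            saddles.append(i)      # local minimum between two higher points
    return peaks, saddles


if __name__ == "__main__":
    profile = [1, 3, 2, 1, 4, 2]   # made-up contour heights
    print(vertical_extrema(profile))
```

Valleys cannot be found this way, since, as noted above, they must be inferred from cues such as T-junctions rather than read off a visible contour.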

Simple  edge  detection  alone  is  not  sufficient  to  find 
ridge  contours  in  an  image.  Images  of  large-scale,  out¬ 
door  terrain  contain  many  important  but  indistinct  fea¬ 
tures  and  many  extraneous  features  which  convey  no 
useful  information  about  the  topography.  The  contrast 
across  ridge  contours  is  often  low  and  of  limited  spa¬ 
tial  extent.  Often,  local  sections  of  a  ridge  contour  are 
lacking  in  contrast  variation  altogether,  while  many  non¬ 
ridge,  high-contrast  features  are  present. 

Figure 2 shows a 40° portion of the panorama image
shown  in  Figure  11.  Figure  3  shows  the  results  of  apply¬ 
ing  a  zero-crossing  edge  detector  to  this  image.  Hystere¬ 
sis  thresholding  was  used  and  parameters  were  carefully 
matched to the nature and scale of the image. As a result,
this represents about the best that can be expected
from  edge  detection  alone.  Figure  4  shows  a  new  edge 
image  in  which  a  variety  of  filtering  and  gap  filling  steps 
have  been  applied.  These  steps  are  based  on  exploiting 
constraints  about  how  ridge  lines  appear  in  horizontally¬ 
looking  views  of  rugged  terrain.  Finally,  Figure  5  shows 
extracted  features  and  line  segments. 
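
Hysteresis thresholding, used for Figure 3, keeps weak edge responses only when they connect to strong ones. The following is a minimal sketch on a small grid of edge strengths, assuming 8-connectivity; the threshold values and the sample grid are made up, since the parameters actually used are not given here.

```python
from collections import deque


def hysteresis_threshold(strength, low, high):
    """Boolean mask of pixels kept by hysteresis thresholding.

    Pixels with strength >= high are seeds; pixels with strength >= low
    are kept only if 8-connected (possibly through other kept pixels)
    to a seed.
    """
    rows, cols = len(strength), len(strength[0])
    keep = [[False] * cols for _ in range(rows)]
    queue = deque()
    for r in range(rows):
        for c in range(cols):
            if strength[r][c] >= high:
                keep[r][c] = True
                queue.append((r, c))
    while queue:                      # grow outward from the strong seeds
        r, c = queue.popleft()
        for dr in (-1, 0, 1):
            for dc in (-1, 0, 1):
                rr, cc = r + dr, c + dc
                if (0 <= rr < rows and 0 <= cc < cols
                        and not keep[rr][cc] and strength[rr][cc] >= low):
                    keep[rr][cc] = True
                    queue.append((rr, cc))
    return keep


if __name__ == "__main__":
    grid = [[0, 5, 9, 5, 0],
            [0, 0, 0, 0, 0],
            [5, 0, 0, 0, 0]]
    for row in hysteresis_threshold(grid, low=4, high=8):
        print(row)
```

In the sample grid the isolated response in the bottom-left corner exceeds the low threshold but is discarded because it touches no strong edge. This is the behavior that suppresses extraneous features, although, as the text notes, it is not sufficient on its own for low-contrast ridge lines.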

Figure 2: Original image.

Figure 3: Output from zero-crossing edge detector.

Figures 6 and 7 show similar results for map features.
Unlike the problem of extracting topographic structure
from images, the “map-understanding” problem does not
have to deal with the multitude of effects that can lead to
contrast variation in images. Difficulties associated with
scale  are  still  very  real,  however.  For  example,  ridge  lines 
have  a  large  spatial  extent  along  their  length.  Across  the 
length  of  a  single  ridge  line,  extent  can  vary  from  small 
(a  sharp  section  of  ridge)  to  quite  large  (a  section  where 
the  ridge  top  is  essentially  a  plateau).  Peaks  are  likewise 
more  difficult  to  accurately  detect.  Simply  finding  local 
maxima  in  elevation  is  not  sufficient.  Figure  6  shows  the 
results of applying a local ridge detector similar to [Haralick
et al., 1983] to a portion of our elevation database
(see  section  6).  Figure  7  shows  the  final  results  of  fea¬ 
ture  extraction  after  thinning  the  raw  results  and  filling 
gaps where ridge sharpness was low. In addition, peaks
are found using a large area search that is more reliable
than simple local maxima detection and ridge lines
are  organized  into  a  hierarchy  of  importance  that  allows 
significant  ridges  to  be  used  for  initial  matching  while 
making  available  subsidiary  ridges  for  subsequent  verifi¬ 
cation  operations.  (The  ridge  lines  to  the  northwest  are 
not  rendered  in  this  view  of  the  hierarchy,  since  they  are 
actually  part  of  the  parent  of  the  ridges  shown.) 
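
The difference between simple local maxima detection and a larger area search can be sketched as follows. The actual search procedure is not described here, so this example simply assumes that a peak must be the strict maximum within a square window of a chosen radius; the function name, the window shape, and the sample elevations are illustrative assumptions.

```python
def area_peaks(elev, radius):
    """Cells that are the strict maximum within a (2*radius+1)^2 window."""
    rows, cols = len(elev), len(elev[0])
    peaks = []
    for r in range(rows):
        for c in range(cols):
            h = elev[r][c]
            is_peak = True
            for rr in range(max(0, r - radius), min(rows, r + radius + 1)):
                for cc in range(max(0, c - radius), min(cols, c + radius + 1)):
                    if (rr, cc) != (r, c) and elev[rr][cc] >= h:
                        is_peak = False
            if is_peak:
                peaks.append((r, c))
    return peaks


if __name__ == "__main__":
    dem = [[1, 2, 1, 1, 1],
           [2, 5, 2, 1, 1],
           [1, 2, 1, 3, 1],
           [1, 1, 1, 1, 1]]
    print(area_peaks(dem, radius=1))  # immediate-neighbor local maxima
    print(area_peaks(dem, radius=2))  # larger search area
```

With radius 1 the minor bump at (2, 3) is reported as a peak alongside the true summit at (1, 1); widening the search area removes it.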

Features  extracted  using  these  processes  still  have  a 
great deal of ambiguity associated with them. For example,
lacking a priori information about viewing position
and/or direction, it is hard to extract features
such  as  peaks  and  ridges  known  to  correspond  in  the 
view  and  map.  This  difficulty  is  similar  to  that  faced 
by  many  symbolic  problem  solving  systems  dealing  with 
tasks such as classification and diagnosis. In [Thompson
et al., 1993], we show that high-level hypothesize and
test  strategies  can  be  integrated  with  lower-level  feature 
extraction  to  solve  difficult  localization  problems. 

Additional details about feature extraction from views
and maps can be found in [Savitt et al., 1992, Savitt,
1992, Thompson et al., 1993]. High-level strategies for
feature matching are described in [Heinrichs et al., 1989,
Thompson et al., 1990, Smith et al., 1991, Heinrichs
et al., 1992] and computational implementations using
these strategies can be found in [Bennett, 1992,
Bennett, 1993, Thompson et al., 1993, Thompson, 1993].

3  Landmark  Selection. 

We  have  previously  demonstrated  that  the  accuracy  of 
landmark-based  viewpoint  determination  is  quite  sensi¬ 
tive  to  geometric  properties  of  the  particular  configu¬ 
ration  of  landmarks  used  [Sutherland,  1992].  Recently, 
the  image  understanding  community  has  been  paying 
increased  attention  to  error  estimation.  Of  equal  impor¬ 
tance  are  approaches  which  minimize  the  amount  of  error 
which  can  occur  rather  than  only  providing  a  posteriori 
characterizations  of  the  error  distribution. 

The  extraction  of  navigationally  salient  landmarks 
typically  involves  costs  in  time,  computation,  and  sens¬ 
ing  resources.  As  a  result,  there  is  benefit  to  be  gained  if 
simple  strategies  can  be  used  to  select  a  small  set  of  land¬ 
marks  which  are  likely  to  lead  to  accurate  localization. 
Effective  landmark  selection  methods  are  also  relevant 
to  mission  planning,  where  one  of  the  criteria  entering 
into  route  selection  should  be  the  availability  of  land¬ 
marks  sufficient  to  provide  whatever  degree  of  accuracy 
is  required. 

Error  analysis  is  complicated  by  the  lack  of  general 
sensor  models  which  effectively  describe  position  vari¬ 
ability  in  properties  used  for  viewpoint  determination. 
This is particularly true when localization is based on
bearings to features over a wide field of view, since sensing
might involve mechanical scanning of cameras, fish-eye
optics, or more exotic technologies. We take a conservative
approach in which we assume that the angular
error in detected bearings to features is bounded, but
the distribution of values within this range is not known.
We then find the region within which the viewpoint
must lie to be consistent with these assumptions and are
thus able to determine if conflicts with obstacles or untraversable
terrain are possible. Figure 8 shows an example
in which the relative bearing between two landmarks
and the absolute bearing to a third landmark [Thompson
et al., 1993] separately generate possible viewpoint
regions shown in light gray, the intersections of which
are marked in dark gray.
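
The bounded-error idea can be sketched by testing candidate viewpoints on a grid and keeping those consistent with both measurements. This is an illustrative brute-force approximation, not the analysis of [Sutherland, 1992]; the landmark coordinates, the grid extent, and the assumption that a relative bearing built from two measured bearings has error at most twice the single-bearing bound are all introduced here for the example.

```python
import math


def bearing(frm, to):
    """Absolute bearing (radians) of the ray from frm to to."""
    return math.atan2(to[1] - frm[1], to[0] - frm[0])


def angle_diff(a, b):
    """Difference a - b wrapped into [-pi, pi)."""
    return (a - b + math.pi) % (2.0 * math.pi) - math.pi


def consistent(p, landmarks, rel_ab, abs_c, max_err):
    """True if viewpoint p agrees with both measurements within the bound."""
    lm_a, lm_b, lm_c = landmarks
    if p in (lm_a, lm_b, lm_c):
        return False                 # bearing undefined at a landmark
    rel = angle_diff(bearing(p, lm_b), bearing(p, lm_a))
    ok_rel = abs(angle_diff(rel, rel_ab)) <= 2.0 * max_err
    ok_abs = abs(angle_diff(bearing(p, lm_c), abs_c)) <= max_err
    return ok_rel and ok_abs


def uncertainty_region(landmarks, rel_ab, abs_c, max_err, extent=20, step=1.0):
    """Grid cells lying in the intersection of the two viewpoint regions."""
    n = int(extent / step)
    cells = []
    for i in range(-n, n + 1):
        for j in range(-n, n + 1):
            p = (i * step, j * step)
            if consistent(p, landmarks, rel_ab, abs_c, max_err):
                cells.append(p)
    return cells


if __name__ == "__main__":
    lms = ((10, 5), (-4, 12), (3, -9))      # hypothetical landmarks
    true_view = (0.0, 0.0)
    rel = angle_diff(bearing(true_view, lms[1]), bearing(true_view, lms[0]))
    abs_c = bearing(true_view, lms[2])
    for deg in (1, 5):
        region = uncertainty_region(lms, rel, abs_c, math.radians(deg))
        print(deg, len(region))
```

Because no error distribution is assumed, every cell in the returned region is treated as equally possible, which is what allows a conservative check for conflicts with obstacles.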

Figure 8: Intersection of viewpoint uncertainty regions.

Starting from the analysis of uncertainty regions, it is
possible to develop simple heuristics for selecting landmarks
likely to minimize the size of such regions [Sutherland
and Thompson, 1993, Sutherland, 1993]. Note that
this is not as easy as it might seem, since the problem
must  be  solved  with  very  minimal  knowledge  about  the 
true  viewing  location.  Figures  9  and  10  demonstrate  the 
effectiveness  of  this  method.  Simulated  navigators  have 
identified landmarks on a map. Their task is to move
along  the  segmented  path  shown  by  the  dashed  line. 
Current  position  is  estimated  at  the  beginning  of  each 
straight  path  segment,  using  relative  bearing  to  three 
landmarks.  In  Figure  9,  the  landmark  selection  heuris¬ 
tic  is  used  at  each  step  to  choose  the  three  landmarks 
on  which  localization  is  based.  In  Figure  10,  landmark 
selection  is  random.  Both  navigators  start  at  the  square 
at  the  left  end  of  the  dashed  line.  Direction  and  dis¬ 
tance  of  move  are  based  on  estimated  position.  Uniform 
multiplicative  error  is  assumed  in  both  relative  bearing 
measurement  and  in  movement.  The  squares  mark  ac¬ 
tual  navigator  positions  at  the  end  of  each  path  segment 
for fifty trials. The scatter of locations in Figure 10 is
much greater than that in Figure 9.


Figure  9:  Fifty  trials  -  “intelligent”  landmark  selection. 


4  Error  Detection  and  Diagnosis. 

Mobile robots capable of independent operation all employ
some form of “perceptual servoing” to implement
a sense-plan-act-verify cycle in which expectations about
sensor data are compared with actual observations, and
then differences are quantified and used to update estimates
of current position and desired path (e.g., [Fennema
et al., 1990]). If a match between expectations and
observations  cannot  be  established,  then  some  sort  of  re¬ 
planning  activity  is  initiated,  a  part  of  which  requires  a 
solution  to  the  localization  problem.  This  approach  is 
only  effective  when  a  rich  model  of  the  environment  is 
available,  allowing  for  complete  and  specific  predictions 
about the appearance of the world from any predicted
viewpoint. Often, such models do not exist, particularly
in  tasks  involving  outdoor  maneuvering.  (Consider  the 
effort  that  went  into  producing  5m  resolution  DEM  data 
for  the  ALV  site.)  When  this  is  the  case,  it  is  not  possible 
to  determine  with  certainty  that  an  expectation  does  or 
does  not  match  actual  sensor  values.  At  best  some  sort  of 
confidence  estimate  can  be  produced.  One  consequence 
of  this  is  that  it  is  possible  to  travel  substantial  distances 
on  what  is  in  fact  an  incorrect  path  before  determining 
with  reasonable  certainty  that  an  error  has  occurred. 

Sparse  world  models  and  the  potential  for  substantial 
delays  between  when  an  error  occurs  and  when  it  is  de¬ 
tected  mean  that  lower-level  image  understanding  tech¬ 
niques  are  not  sufficient  in  and  of  themselves  to  support 
effective  plan  monitoring  in  mobile  robotics.  We  are  ad¬ 
dressing  this  problem  by  creating  a  qualitative  model  of 
error  in  vision-based  navigation  and  using  this  model  to 
characterize  the  sorts  of  errors  that  can  occur,  how  they 
can  be  detected,  and  what  sort  of  diagnosis  is  possible  to 
determine  the  original  source  of  difficulties  [Stuck,  1992]. 
The  research  suggests  a  number  of  techniques  that  may 
usefully  complement  lower-level  perceptual  servoing. 

5  Perceptual  Issues. 

Our  approach  to  the  development  of  novel  methods 
for  vision-based  navigation  is  interdisciplinary,  involv¬ 
ing  computational  analysis,  computer  simulations,  and 
studies  of  expert  map  users.  Many  of  the  strategies  we 
use to automatically solve localization problems [Heinrichs
et al., 1992, Thompson et al., 1993] arose out of experiments
done with experts solving actual and artificial
navigation problems [Pick et al., in press]. In retrospect,
these  strategies  make  excellent  computational  sense  since 
the  experts  are  highly  adapted  to  dealing  with  the  ambi¬ 
guity  and  complexity  inherent  in  these  problems.  Nev¬ 
ertheless,  the  strategies  were  not  obvious  to  us  or  others 
until  we  undertook  our  studies. 

This  interdisciplinary  investigation  is  continuing  with 
a  current  focus  on  the  accuracy  with  which  people  are 
able  to  determine  terrain  geometry.  By  comparing  hu¬ 
man  and  machine  vision  perceptual  competence,  we  can 
better  understand  the  relevance  of  expert  strategies  to 
image  understanding  solutions.  At  the  same  time,  we 
can  identify  specific  perceptual  skills  for  which  mecha¬ 
nized  aids  and/or  alternative  training  might  significantly 
improve human performance. Elsewhere in these proceedings
we summarize two such studies [Pick et al., 1993].
One demonstrates that people are poor at estimating distance
and slope in environments of the scale and topography
typical of outdoor navigation tasks [Melendez et
al., in prep]. Since passive vision systems are also poor at
these  estimates,  the  strategies  people  use  to  compensate 
for  their  perceptual  limitations  may  also  be  relevant  in 
automated  systems.  The  second  study  deals  with  local¬ 
ization  using  visual  angle.  Again,  people  are  quite  poor 
at  using  this  cue.  On  the  other  hand,  sensors  which  are 
capable  of  measuring  large  visual  angles  with  reasonable 
accuracy  can  be  designed,  suggesting  both  possible  dif¬ 
ferences  between  machine  and  human  solutions  and  aids 
that  might  assist  people  in  performing  the  task. 
Figure 11: First panorama image.


6  Database  and  Software. 

Many  recent  papers  addressing  ground  level  localization 
have  presented  results  obtained  only  from  synthetic  im¬ 
agery.  A  few  have  used  the  highly  calibrated  data  avail¬ 
able  for  the  Martin  Marietta  ALV  test  area.  In  addition 
to  the  well-known  pitfalls  of  failing  to  test  new  image  un¬ 
derstanding  algorithms  on  real  data,  the  use  of  synthetic 
terrain  data  to  generate  test  imagery  is  problematic  since 
realistic  digital  elevation  data  is  often  in  error. 

We have produced two 360° panorama images of
mountainous  terrain  obtained  with  a  video  camera,  dig¬ 
itized  at  high  resolution,  and  digitally  photo-mosaicked. 
(Figure  11  shows  one  of  them).  They  extend  approxi¬ 
mately  6,000  pixels  horizontally  by  450  pixels  vertically. 
Viewpoint location has been registered to USGS
30m DEM data to within ±30m. Direction relative to UTM north and
tilt  are  known  within  ±0.5°.  Geometric  distortions  due 
to  misalignments  between  the  pan  axis,  the  camera,  and 
“true”  vertical  have  been  normalized  to  approximately 
±0.25°. Included in the database are 4 USGS 7.5′ DEM
quadrangles  composed  together  and  containing  the  view¬ 
points  for  the  panorama  images.  Also  available  is  soft¬ 
ware  for  converting  USGS  format  data  into  a  useful  form, 
mosaicking  DEM  quads  and  panorama  frames,  and  ren¬ 
dering  expected  views  given  map  position. 
Abstract

Work progressed on many fronts this year. A primary
focus has been on the assembly of an indoor service
robot which has the capability of searching for a specified
free-form object in a cluttered environment. The
robot mounts a number of sensors (stereo and laser
ranging) and has model-based planning capability to
optimize use of its resources. This system, which is now
functioning, is undergoing testing and continuing development.
This project is a collaborative effort by the
P.I.s. In addition, vigorous research programs have been
continuing in computer vision, planning, and mechanical
sensing. This report summarizes our program for the
year. Quite a few papers were published during the year.
A number of them are referenced in this paper. Others
are referenced in our technical papers in this workshop
proceedings.

1. Vision: A powerful new technology for recognizing
free-form 2D and 3D objects has now been brought
to a usable state. This involves fitting a high degree
implicit polynomial to the data, computing a vector of
invariants of the polynomial coefficients, followed by a
minimum probability of error comparison of that vector
with a vector of invariants stored in a data base for
each possible object to be recognized. A new geometric-
stochastic approach has been developed for completely
automated estimation of main roads and similar structures
such as rivers in aerial images. An approach has
been developed for the joint estimation of 3D structure
and camera motion based on finding corresponding regions
in two images through use of new affine moment
invariants and solving simple motion equations explicitly.

Considerable  progress  has  been  made  on  a  determinis¬ 
tic  approach  to  shape  by  myelinic  deformations  of  it  by 
Hamilton-Jacobi and reaction-diffusion equations, and
has  led  to  a  novel  theory  of  partitioning  of  visual  form 
and a geometric evolutionary view of mathematical morphology,
among others.

2. Planning and System Integration: A mobile
robot carrying infrared proximity sensors, a laser light-stripe
range sensor, and a pair of stereo cameras is now
functional.  The  system  can  recognize  free-form  objects 
and  make  optimal  use  of  the  robot’s  sensing  routines  to 
search  a  cluttered  environment  for  objects  of  a  specified 
type. 

3. Object Recognition via 3-D Surface Tracking:
A 3-D dual-drive surface tracking controller that
enables a robot to track along any specified path on the
surface of an unknown object in order to identify the object
is under development and testing. The dual-drive
controller computes the normal and tangent vectors relative
to movement along the path. The result is controlled
movement in 3-D on the surface of an object. It
is assumed that the path is generated by an external recognizer
in such a way that the data points collected by
tactile sensing along the path will maximize the probability
of correctly identifying the object.

1  Vision 

1.1 Free-Form 2D and 3D Object
Recognition Based on Implicit
Polynomials, Algebraic Invariants, and
Asymptotic Bayesian Methods.

We have now brought our free-form object recognition
based on implicit polynomials and algebraic
invariants to a useful stage where we can begin to
apply it to real problems. At present, the system
can deal with the recognition of which of L known
objects is present in 3D range data when the object
in the range data is in arbitrary position, i.e.,
location and orientation. Examples of this are object
recognition based on LADAR range data or 3D
data from stereo. Exactly the same technology handles
the recognition of which of L object boundaries
is present in image edge data when the camera view
direction is arbitrary so that the boundary in the
image is in arbitrary position and will usually have
undergone different size changes in each of two directions.
Examples of this problem are ground target
recognition based on a silhouette in an aerial
image, or recognition of an airborne plane based on
a silhouette in an image by a ground-based camera.
Our recognizer fits one general implicit polynomial
to the data, computes a vector of invariants for the
polynomial (these are functions of the polynomial
coefficients that are functions of shape only and are
invariant to object position and stretchings in two
different directions), and does a Bayesian comparison
of this vector with a stored vector of invariants
for each object in the database.
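
The final comparison step can be illustrated with a minimal sketch. It assumes, purely for illustration, that each stored object is summarized by a mean invariant vector with independent Gaussian measurement noise of known variance, and that all objects are equally likely a priori. The recognizer actually used is described in [22] and may differ; the names `posterior_over_objects` and `database` and all numeric values are hypothetical.

```python
import math

def log_gaussian_likelihood(observed, mean, variance):
    """Log-likelihood of an invariant vector under independent Gaussian noise."""
    return sum(
        -0.5 * math.log(2.0 * math.pi * var) - (obs - mu) ** 2 / (2.0 * var)
        for obs, mu, var in zip(observed, mean, variance)
    )

def posterior_over_objects(observed, database):
    """Return {object name: posterior probability}, assuming equal priors."""
    log_like = {
        name: log_gaussian_likelihood(observed, mean, variance)
        for name, (mean, variance) in database.items()
    }
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(log_like.values())
    unnormalized = {name: math.exp(v - peak) for name, v in log_like.items()}
    total = sum(unnormalized.values())
    return {name: w / total for name, w in unnormalized.items()}

# Hypothetical database: (mean invariant vector, per-component variance).
database = {
    "object_a": ([0.10, 1.50, -0.30], [0.01, 0.04, 0.01]),
    "object_b": ([0.40, 0.90, 0.20], [0.01, 0.04, 0.01]),
}
observed = [0.12, 1.45, -0.25]  # invariants computed from the fitted polynomial
posterior = posterior_over_objects(observed, database)
best = max(posterior, key=posterior.get)
```

Because the comparison works on a short vector of invariants rather than on the raw data points, its cost does not grow with the size of the database entries, only with the number of stored objects.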

This recognizer has two crucially important features.
First, it functions well even if data is over
only a portion of an object boundary due to self occlusion
or occlusion by another object. Second, the
required computation is small: linearly proportional
to the number of data points used in the recognition.
Details of the approach and a number of examples
are given in [22] in these proceedings.

One application of this approach is the following.
An approach to target recognition is to decompose
a target into parts invariantly, i.e., a decomposition
that is invariant to partial occlusion or to changes in
viewing direction. Such a decomposition is that of
the silhouette of an F-16 using an approach developed
by Kimia et al. [19]. The parts obtained by
the hierarchical decomposition are shown in Figure
1(a). The final partitioning of the plane is shown
in the upper right corner of the image. These parts
would then be recognized by our recognizer, and the
results combined for target recognition. All of the
final parts can be fit with negligible error by 4th degree
implicit polynomial curves, and it is practical
to go to higher degree if needed. Fits to a representative
group of these airplane parts are displayed in
Figure 1(b), where it is almost impossible to distinguish
the part boundaries and the fitted curves. If a
more complex part is to be represented, e.g., a rocket
at a wingtip, a 4th degree polynomial may provide
a fit of only modest accuracy, as shown. In such a
case, a higher degree polynomial can be used, e.g., as
shown, an 8th degree polynomial fits the data with
negligible error. The computational cost of fitting
an 8th degree polynomial is roughly that of fitting
a 4th degree polynomial. We are also exploring the
recognition of very complex objects using patches of
low degree polynomials without the need of repeatable
segmentation.

1.2  Completely  Automated  Recognition 
and  Estimation  of  Main  Roads  in 
Aerial  Images. 

Our ultimate goal is recognition and estimation of
all structure of interest in aerial imagery. However,
to start we are focusing on main roads and similar
structure such as rivers, because we can build
on technology which we have developed previously.
Most algorithms in the published literature require
an operator to give a pair of initial road boundary
points. This interaction simplifies the problem of
road estimation tremendously. We are after a completely
autonomous system. We build on our earlier
work on blob boundary finding that included the introduction
of “stochastic snakes”, which we termed
“the ripple filter”, the Cramer-Rao lower bound on
the minimum achievable error variance in estimating
blob boundary location, and a dynamic programming
algorithm for implementing maximum a
posteriori probability blob boundary estimation [6,
8]. We use stochastic-geometric models and model
road geometry and image intensity inside and outside
the roads by autoregressive processes which are
well suited to both image synthesis and road estimation.
We find all of the main roads in an image, irrespective
of their intersections, variability in width,
lack of image intensity discontinuity at some groups
of boundary points, whether or not they have visible
barriers, etc. This work is discussed in [2] in these
proceedings.

1.3 Estimation Of 3D Surfaces And
Camera Motion From Two or More
Images.

3D surfaces are to be estimated from two images. It
is assumed that the position of the camera at which
each image is taken is completely arbitrary and a priori
unknown. This occurs if a camera is moving in an
unknown way, or if images are taken by two or more
cameras distributed through a 3D region and moving
locally in an unknown way. Our approach is to approximate
an arbitrary 3D surface with small planar
patches where each such patch has location and orientation
that is to be estimated, and simultaneously
estimate camera position 2 with respect to camera
position 1. We do this by solving a set of explicit
equations for the unknown parameters in terms of
low computational cost, stable measurements that
we make on the images. These measurements require
matching, i.e., finding a corresponding region in
image 2 for each region in image 1. Since the image
in such a region in image 2 may be a distortion
of that in the corresponding region in image 1 because
of the differences in camera viewing directions,
we use affine moment invariants for carrying out the
matching. This approach is computationally attractive
and can be used more generally for aligning two
or more aerial images. The approach seems to work
well, and is described in [17] in these proceedings and
in [18].
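
To make the idea of an affine moment invariant concrete, here is a small sketch. It assumes the widely used first affine moment invariant, I1 = (mu20 * mu02 - mu11^2) / mu00^4, computed from central moments of a region; the "new" invariants used in [17] may well be different. The region is represented as a set of pixel coordinates whose count stands in for the area mu00, and the function name `affine_moment_invariant` is hypothetical.

```python
def affine_moment_invariant(points):
    """First affine moment invariant of a region given as (x, y) pixel coordinates."""
    n = len(points)  # plays the role of the zeroth moment (area)
    cx = sum(x for x, _ in points) / n
    cy = sum(y for _, y in points) / n
    mu20 = sum((x - cx) ** 2 for x, _ in points)
    mu02 = sum((y - cy) ** 2 for _, y in points)
    mu11 = sum((x - cx) * (y - cy) for x, y in points)
    return (mu20 * mu02 - mu11 ** 2) / n ** 4

# A 6 x 4 block of pixels, and the same block after a shear plus translation.
# The shear has determinant 1, so the pixel count is unchanged and the
# invariant should agree for the two point sets.
region = [(x, y) for x in range(6) for y in range(4)]
sheared = [(x + 2 * y + 10, y - 3) for x, y in region]

i_original = affine_moment_invariant(region)
i_sheared = affine_moment_invariant(sheared)
```

Two regions whose invariants agree are candidates for a match even when one is a sheared or stretched view of the other, which is the situation described above for images taken from different viewing directions.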

1.4  A  Hamilton-Jacobi  Approach  to 
Recognition 

We have made substantial progress in our “shape
from deformation” framework [14, 12] which is
based on shocks formed from reaction-diffusion and
Hamilton-Jacobi equations. We have shown that algebraic,
set-theoretic mathematical morphology operations
with any convex structuring element can
be viewed and implemented as geometric evolution
equations governed by Hamilton-Jacobi partial differential
equations [1]. We had earlier shown that
Gaussian smoothing is a special case of this framework
as well. The general approach has also motivated
successful application to shape-from-shading
[15]. Robust recognition of shape requires a multidimensional
representation of it in terms of its parts,
protrusions, and bends. In the past year, we have
developed a theory of partitioning for visual form,
which is based on general assumptions about object
formation and projection. Our notion of parts
is based on notions of necks and limbs and is supported
by computational, psychophysical, and ecological
constraints. The decompositions give natural intuitive
parts that are in correspondence with functional
three-dimensional parts for a range of biological
and man-made shapes [23, 20]. We have successfully
applied this scheme to military targets in
LADAR imagery. We are currently studying protrusions
and bends, the other two nodes of the shape
triangle [13], necessary to describe shape for robust
recognition.

2  Mobile  Robot  Project 
2.1  Mobile  Robot  Planning 
Over the past year, we have made a concerted effort
to  combine  our  work  in  image  understanding,  plan¬ 
ning,  and  control.  To  that  end  we  have  designed  and 
constructed  a  new  mobile  robot,  developed  software 
to  control  the  robot  and  interpret  the  data  returned 
by  its  sensors,  and  begun  conducting  experiments 
to  evaluate  our  hardware  and  software.  This  work 
builds  on  our  experience  with  a  smaller  robot,  ex¬ 
tending  the  same  basic  architecture  [9]  and  incorpo¬ 
rating  more  sophisticated  image  understanding  tech¬ 
niques.  In  the  following,  we  briefly  summarize  this 
work;  a  more  detailed  account  is  available  in  these 
proceedings [4].

Our research has been driven by a specific class
of tasks. These tasks involve navigation, obstacle
avoidance, and object recognition and focus primarily
on efficiently searching in cluttered environments
for objects specified in some suitable representation
language. In a typical task, the robot is confined
to an area of about 1000 square feet, cluttered with
obstacles and containing a number of target objects
corresponding to several known object types. The
robot is given a particular object type and required
to find an instance of that type in the enclosed area.
The  robot’s  search  is  methodical  but  the  robot  does 
not  spend  time  on  computationally  expensive  infor¬ 
mation  gathering  operations  when  less  expensive  op¬ 
erations  suffice. 

While  the  emphasis  is  on  employing  the  image  un¬ 
derstanding  algorithms  developed  in  our  overall  ef¬ 
fort,  we  use  a  variety  of  sensors  to  expedite  search. 
Familiar  with  the  advantages  and  disadvantages  of 
using  sonar,  we  decided  to  complement  our  acoustic 
ranging  capability  with  near-infrared  obstacle  detec¬ 
tion  and  laser-light-stripe  ranging.  In  addition,  the 
robot  has  a  fixed,  forward-directed  stereo  pair.  The 
sensors  and  computing  machinery  for  the  control 
system  were  built  on  a  heavy-duty  24  inch  mobile 
base  that  provides  power  for  experiments  running 
up  to  six  hours  (this  represents  a  3-5  fold  increase  in 
battery  life  with  about  a  10  fold  increase  in  onboard 
computing  power). 

We integrated the sensors into a set of robust navigation
routines that comprise the low-level control
system. The low-level control and sensor fusion algorithms
were implemented as a set of objects and
methods in C++. Due to the modularity of our software
design and the similarity of the design of the
new base with that of our earlier base, much of our
existing C++ software could be directly transferred
to the new robot.

The high-level planning system is based on our
work on temporal belief networks [10] which is designed
to address a range of planning and control
problems [16, 3]. We are now extending and refining
our  techniques  to  handle  more  complicated  stochas¬ 
tic  models  describing  the  dynamics  of  the  domain 
and  the  characteristics  of  sensors.  In  particular,  we 
are  working  on  planning  applications  that  make  use 
of  the  information  returned  from  object-recognition 
routines. 

The  tasks  that  we  are  focusing  on  require  the 
robot  to  search  for  and  recover  an  object  of  a  spec¬ 
ified  type  in  a  cluttered  environment.  In  order  to 
search  efficiently,  the  robot  has  to  deploy  its  sensors 
carefully. Laser ranging takes a few milliseconds, a
quick-and-dirty analysis of an image to identify possible
locations of the target takes a few seconds, and
a more careful matching against a prototype takes
15-30 seconds. Planning involves, among other
things,  choosing  from  among  a  set  of  information¬ 
gathering  strategies  so  as  to  expedite  search  for  the 
target  object. 
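
The trade-off between cheap and expensive sensing operations can be illustrated with a short calculation. This is only a sketch of why ordering matters, not the temporal-belief-network planner described above. The stage costs are assumed values chosen to be consistent with the timings quoted in the text, and the probabilities that a stage is inconclusive are entirely hypothetical.

```python
def expected_cascade_cost(stages):
    """Expected time of running stages in order, stopping as soon as one is conclusive.

    Each stage is (cost_in_seconds, probability_that_the_next_stage_is_needed).
    """
    expected = 0.0
    reach_probability = 1.0  # probability that execution reaches the current stage
    for cost, p_continue in stages:
        expected += reach_probability * cost
        reach_probability *= p_continue
    return expected

# Assumed costs: laser ranging ~5 ms, quick image analysis ~3 s,
# careful prototype matching ~22.5 s (midpoint of 15-30 s).
cascade = [
    (0.005, 0.5),  # laser ranging; assumed inconclusive half of the time
    (3.0, 0.2),    # quick analysis; assumed inconclusive 20% of the time
    (22.5, 0.0),   # careful matching; always conclusive in this sketch
]
cascade_cost = expected_cascade_cost(cascade)
always_careful_cost = expected_cascade_cost([(22.5, 0.0)])
```

Under these assumed numbers the cascade averages under four seconds per location, against 22.5 seconds when the careful match is always run, which is the kind of saving the planner is meant to exploit.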

Current  research  involves  the  development  of  a 
programming  environment  that  facilitates  the  design 
and  compilation  of  planning  systems  implemented  as 
temporal  belief  networks.  We  anticipate  that  by  the 
end  of  the  first  half  of  1993  we  will  be  able  to  produce 
complete  planning  systems  using  a  semi-automated, 
interactive  system  in  a  small  fraction  of  the  time  re¬ 
quired  previously. 


2.2 Mobile Robot Free-Form Object
Recognition Based on Stereo Vision

A  3D  object  is  to  be  recognized  from  a  stereo-pair 
of  images  by  estimating  a  portion  of  the  3D  object 
surface  and  then  applying  the  recognizer  from  sec¬ 
tion  1.1.  Objects  of  interest  can  be  anything  from 
simple polyhedra to complicated free-form solids.

Our  approach  to  estimating  a  surface  from  two 
or  more  images  involves  modeling  the  surface  as 
a  smooth  stochastic  process  with  occasional  depth 
discontinuities,  and  estimating  the  surface  directly 
from the images using maximum a posteriori probability
estimation. We design and use an appropriate
Markov Random Field (MRF) as a prior
model  for  the  surfaces.  Though  the  field  can  be  de¬ 
signed  to  capture  any  desired  structure,  in  the  sys¬ 
tem  used  in  these  experiments  the  purpose  of  this 
field  is  largely  to  act  as  a  smoother  (i.e.,  regulariza¬ 
tion).  Details  of  the  algorithm  are  given  in  [7,  5, 
11]. 

Having  obtained  3D  surface  points,  object  recog¬ 
nition  is  done  by  fitting  a  fourth  degree  implicit  poly¬ 
nomial  surface  to  these  points,  and  then  comparing 
the  vector  of  algebraic  invariants  for  this  polyno¬ 
mial  with  stored  vectors  of  invariants.  Comparison 
is done using a Bayesian recognizer. The advantage of
using this recognizer is that it is roughly equivalent
to checking how well the set of 3D estimated surface
points matches the stored model, and hence works
well even if the set of points is over only a
portion of the object surface due to occlusion. Details
of the Bayesian approach and other references
are given in [22, 21].

Figure 2 shows the various steps in the algorithm.
Figures 2(a) and 2(b) are images of the object. The
object is located in a busy environment. Table legs,
wires, a file cabinet and a person’s legs can be
seen in these images. The reconstructed surface produced
by the stereo algorithm is shown in Figure
2(c).  Object  recognition  is  done  using  the  Bayesian 
recognizer.  Final  verification  of  the  recognition,  if 
needed,  is  done  by  translating  and  rotating  the  data 
set  so  as  to  fit  the  database  model  as  shown  in  Figure 
2(d). 

The  problem  with  recognition  based  on  the  data 
from  one  stereo  pair  of  images  is  that  only  a  quar¬ 
ter  to  a  third  of  the  object  surface  is  seen  by  both 
cameras,  and  it  is  therefore  difficult  to  see  enough  of 
the curved surface to discriminate between similar
shapes.  Highly  reliable  detection  is  possible  if  3D 
point  estimates  from  two  or  more  stereo  pairs,  each 
pair  taken  from  a  different  position,  can  be  aligned. 
Then  estimates  over  more  of  an  object  surface  can 
be  used.  We  are  presently  completing  software  for 
this  purpose. 

3  Object  Recognition  via  3-D 
Surface  Tracking 

The  current  focus  of  this  research  has  been  the  devel¬ 
opment  and  testing  of  a  3-D  dual-drive  surface  track¬ 
ing  controller  that  enables  a  robot  to  track  along  any 
specified  trajectory  on  the  surface  of  an  unknown  ob¬ 
ject.  In  the  complete  “object-dependent”  tracking 
system,  we  envision  an  external  recognition  program 
that  uses  partial  data  sets  collected  by  tactile  sensors 
on the object’s surface to attempt an identification
and to direct future data collection in such a way as
to limit the uncertainty in making a positive identification
(see, e.g., the active vision algorithm in
[21]). This tactile data collection method is referred
to as “object-dependent” sensing because although
no  prior  information  is  used  by  the  controller,  the 
location  of  the  sensing  paths  is  driven  by  external 
sensor  data  or  by  comparisons  made  by  a  recognizer 
to  a  model  data  base.  The  application  for  such  a 
data  collection  system  is  object  recognition  tasks  in 
environmental  exploration  and  manipulation. 

The  specification  of  the  tracking  path  can  be  illus¬ 
trated  by  the  following.  An  external  sensing  system, 
i.e.  vision,  roughly  locates  bounding  points  for  the 
trajectory.  These  points  may  be  above,  below  or  on 
the  surface.  The  trajectory  that  the  robot  follows 
can be found by passing a plane along the surface
normal  between  the  bounding  points.  Force  and  ve¬ 
locity  errors  along  the  path  are  zeroed  by  the  dual¬ 
drive  controller.  The  result  is  controlled  movement 
in  3-D  on  the  surface  of  an  object. 

Data  collection  with  the  3-D  dual-drive  controller 
is  an  improvement  over  the  general  data  collection 
methods  with  the  2-D  dual-drive  controller.  The 
3-D  dual-drive  controller  allows  the  robot  to  track 
along  any  path  with  the  end  effector  in  any  orienta¬ 
tion.  The  2-D  dual-drive  controller  limits  tracking 
to  mutually  perpendicular  horizontal  and  vertical 
planes.  By  using  orientation  coordinate  transforma¬ 
tion  mappings,  the  3-D  tracking  controller  enables 
the  robot  to  move  in  any  specified  tracking  plane. 

This  controller  is  implemented  and  tested  using 
an  IBM  7565  Cartesian  robot  equipped  with  strain 
gauges  on  the  end  effector.  The  tracking  controller 
is  designed  so  that  the  robot  will  be  able  to  track 
any  free  form  complex  object. 

Figure  3(a)  and  3(b)  are  examples  of  two  4th  de¬ 
gree  polynomial  surfaces  fit  to  data  sensed  by  this 
system by tracking around an object along horizontal
slices. One 4th degree polynomial is fit to data
over  an  eggplant;  the  other  is  fit  to  data  over  a 
pear.

Figure 2: (a) and (b): Stereo images of the object;
(c): Surface reconstruction using the stereo algorithm;
(d): Reconstructed 3D surface points superimposed
on the database model for the object.


Figure  3:  Examples  of  4th  degree  polynomial  sur¬ 
faces  fit  to  data  sensed  by  robotic  tracking  around 
an  eggplant  and  a  pear,  respectively,  along  parallel 
slices. 


Figure  1:  (a)  :  Final  partitioning  of  the  plane  and 
the  parts  obtained  by  hierarchical  decomposition; 
(b): Polynomial fits to a representative group of
airplane parts.

Abstract

This  report  covers  initial  research  on 
learning  in  vision  conducted  at  the  GMU 
Center  for  Artificial  Intelligence.  The 
research  is  currently  conducted  by  two 
faculty  members  and  one  research 
assistant.  The  report  describes  our  research 
goals,  general  approach,  several  developed 
methods,  and  results  from  experiments 
with  implemented  systems.  The  research 
has  been  concerned  primarily  with  the 
development  of  efficient  methods  for 
inductive  learning  of  texture  descriptions 
from  texture  samples.  The  following 
methods  (and  implemented  systems)  are 
briefly described: Textral (employing
multilevel symbolic image transformations
and the AQ-15 inductive learning
program), PRAX (using a “principal
axes” representation of texture
descriptions), AQ-NT (oriented toward
learning from noisy inputs), AQ-GA
(combining inductive rule learning with a
genetic algorithm based rule
enhancement), and Chameleon (based on
a “model evolution” approach).

1  Introduction 

The  goal  of  this  research  is  to  explore  the 
applicability  of  machine  learning  methods  to 
problems of computer vision. The underlying
premise is that computer vision will ultimately
need to exhibit learning capabilities in order to
be  fully  successful. 


This research was supported in part by the Defense
Advanced Research Projects Agency under the grant No.
F49620-92-J-0549, administered by the Air Force Office
of Scientific Research, and the grant No. N00014-91-J-
1854, administered by the Office of Naval Research, in
part by the Office of Naval Research under the grant No.
N00014-91-J-1351, and in part by the National Science
Foundation under the grant No. IRI-9020266.


The  reasons  for  this  view  are  based  on  the 
following  observations: 

• The world changes in unpredictable ways,
therefore it is impossible, in principle, to preprogram
in the vision systems all the
knowledge necessary for image
understanding.

• Handcrafting the knowledge needed for
image understanding into computer vision
systems is a difficult and time-consuming
process; learning provides a fundamental
vehicle for simplifying this process.

• In biological vision systems, many aspects of
image perception are genetically
preprogrammed, but many are learned.
Similarly, computer vision systems should be
able to acquire some capabilities through
learning.

An important result of our initial research is a
demonstration that symbolic learning methods
can be successfully applied to selected problems
of low-level vision, in which nonsymbolic
methods have been traditionally employed.
Specifically, the results obtained demonstrate
that these methods have been very useful for
creating descriptions of textures from their
samples, obtained from the original camera-generated
images.

2  General  Approach 

The developed approach, called “Multilevel
Logical Templates” (MLT), aims at
automatically determining texture class
descriptions (“texture signatures”) from texture
samples. The basic step in this process is an
iterative (multilevel) application of symbolic
inductive learning to generate texture rules.
These rules serve as “logical templates” that are
matched against window-size samples of texture
classes.

The approach was originally proposed by
Michalski [1973], and initially applied using the
ILLIAC III image recognition computer
facilities.

The research at the GMU Center for Artificial
Intelligence has developed a variety of novel
extensions and new directions stemming from
the above general approach. The novelty is in
utilizing new types of image transformations,
self-improvement of the representation space
(constructive induction), advanced noise-tolerant
learning techniques, and new multistrategy
learning techniques.

The basic idea behind the MLT approach can be
explained as follows (Figure 1). Given an image
with labeled samples of different textures, the
learning system determines a sequence of
operators that transform this image to a
“symbolic” image, in which picture elements
are labels of corresponding texture areas.
The sequence of operators that produces such a
labeling serves as a texture description (“texture
signature”). The basic operator in this process is
an application of a set of logic-style rules to
transformed texture samples. The rules can be
applied in parallel, and serve as “logical
templates” that are applied to “events”
(attribute vectors) representing texture samples.

To recognize an unknown texture sample, the
system matches it with all candidate texture
descriptions. This is done by applying decision
rules to the events in the sample. For each event,
the class membership (texture class) is
determined.

The assignment of the sample to a given
decision class (texture) is based on determining
which of the candidate classes gets the majority
(or plurality) of votes. Thus, even if some events
in the sample are incorrectly recognized, the
classification of the sample may be correct.
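
The voting step can be sketched in a few lines. The function name `classify_sample` and the example labels are hypothetical; the sketch assumes that each event has already been assigned a class label (or `None` when no rule matched) and that ties are broken in favor of the class encountered first, a detail the text does not specify.

```python
from collections import Counter

def classify_sample(event_labels):
    """Assign a sample to the class that receives the most event-level votes."""
    votes = Counter(label for label in event_labels if label is not None)
    if not votes:
        return None  # no event in the sample was recognized
    winner, _count = votes.most_common(1)[0]
    return winner

# Five events from one sample; two of them are misrecognized.
sample_class = classify_sample(["wool", "wool", "denim", "wool", "silk"])
```

Here three of five events vote for "wool", so the sample is classified correctly even though two individual events were labeled wrongly.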


Figure 1: An illustration of the MLT approach to texture learning and recognition.

The process of learning such texture
descriptions consists of the following phases
(Figure 1):

(1) Image preprocessing (volume reduction),

(2) Training events generation (selection of
texture samples, determining attributes, and
formulating training examples),

(3) Inductive learning of texture rules (“logical
templates”), and

(4) Texture rule optimization.

The  above  process  may  be  repeated  iteratively 
until  a  desired  image  transformation  is  obtained. 

The  first  phase  of  the  process  adapts  the  “image 
volume”  to  the  texture  classes  characterized  by 
training  samples. 

This is done by modifying the spatial resolution
and the gray-level resolution of the image so that
the  similarities  between  samples  of  the  same 
texture  and  dissimilarities  between  samples  of 
different  textures  are  increased.  In  the  initial 
experiments,  the  events  were  extracted  from  the 
second  or  third  level  of  the  Gaussian  pyramid. 
The  typical  resolution  of  camera-acquired 
images  was  512  by  512  image  elements. 


The second phase extracts a set of spatial texture
samples, called events, from classified texture
regions (Module 2 in Figure 1). An event is a
vector of attribute values that represent different
image (texture) features. Initial attributes are
predefined. Additional attributes can be
determined through the process of constructive
induction [Wnek and Michalski, 1991].

There are many possible attributes that could be
determined to characterize textures. The most
desirable are those that define a description
space in which points corresponding to the same
texture class constitute easily describable
clusters.

The attributes generated by different systems
described in this report fall into one of three
categories: neighboring gray-level values,
statistical measurements, and convolution filter
outputs. Sets of events extracted from texture
classes to be learned are used as training
examples.

Texture rules are determined using the AQ-15
method for inductive concept learning from
examples [Michalski, 1986]. The rules
learned by the AQ method are represented in
VL1 (Variable-Valued Logic System 1)
[Michalski, 1972]. Advantages of this
representation are that it is amenable to parallel
execution and easy to interpret conceptually.

In the machine learning method described here,
a concept corresponds to a single texture class.
A concept description is a logical expression in
disjunctive normal form associated with a
decision class (here, a texture class).

Each conjunction in this expression together
with the associated decision class can be viewed
as a single decision rule.

The conjunctions (serving as condition parts of
the rules) are logical products of elementary
conditions in the form:

[L # R]

where:

L, called the referee, denotes an attribute.

R, called the referent, is a subset of values from
the domain of the attribute L.

# is one of the following relational symbols:
=, <, >, >=, <=, <>.

Each rule is assigned two parameters: “t” (for
“total weight”), measuring the total number of
positive training examples covered by the rule,
and “u” (for “unique weight”), measuring
the number of positive examples covered by the
given rule and not covered by any other rule for
the given decision class.


Here is an example of an AQ-15 decision rule:

[Class=1] <= [x2=1] [x4>3] [x6=1..7]: (t=6, u=2)

This rule covers 6 examples of Class 1, out of
which 2 are covered only by this rule, and not
by any other rule for this class. In the case of
texture rules, xi are attributes characterizing a
texture sample (in our experiments we used
primarily 8x8 windows). The above rule is
satisfied if attribute x2 takes value 1, attribute x4
has a value greater than 3, and attribute x6 takes
a value between 1 and 7.
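
The example rule and its two weights can be expressed as a short sketch. This is an illustration of the rule semantics only, not the AQ-15 program: the dictionary representation, the helper names `in_range`, `strictly_matches`, and `rule_weights`, the second rule, and the sample events are all hypothetical, and the range 1..7 is assumed to include both endpoints.

```python
def in_range(low, high):
    """Condition of the form [x = low..high], assumed inclusive at both ends."""
    return lambda value: low <= value <= high

# The rule [Class=1] <= [x2=1] [x4>3] [x6=1..7], stored as attribute -> test.
rule = {
    "x2": lambda v: v == 1,
    "x4": lambda v: v > 3,
    "x6": in_range(1, 7),
}

def strictly_matches(rule, event):
    """True when the event satisfies every elementary condition of the rule."""
    return all(attr in event and test(event[attr]) for attr, test in rule.items())

def rule_weights(rules, positive_events):
    """Return one (t, u) pair per rule for the positive examples of a class.

    t counts the positive events the rule covers; u counts those it covers
    that no other rule in the list covers.
    """
    covered = [
        {i for i, e in enumerate(positive_events) if strictly_matches(r, e)}
        for r in rules
    ]
    weights = []
    for k, own in enumerate(covered):
        others = set().union(*(c for j, c in enumerate(covered) if j != k))
        weights.append((len(own), len(own - others)))
    return weights

# Hypothetical positive events and a hypothetical second rule for the class.
events = [
    {"x2": 1, "x4": 5, "x6": 3},
    {"x2": 1, "x4": 4, "x6": 7},
    {"x2": 2, "x4": 5, "x6": 3},
]
other_rule = {"x4": lambda v: v > 4}
weights = rule_weights([rule, other_rule], events)
```

In this toy data each rule covers two events, one of which is shared, so both rules get t = 2 and u = 1.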

As mentioned earlier, a description of a texture
class can be viewed as a set of such rules (a
“ruleset”). In such a ruleset, individual rules are
ordered according to the decreasing values of
the t-weight. The following is an example of a
texture description:

[Texture class = sweater surface] <=
[x1=7,9,12] [x2>] [x3=0..4] [x4=0..5] [x5=0..3]
[x6=0..7] [x7=2..4] [x8=0..3] (t:28, u:21)

OR

[x1=5,7,9] [x2>2] [x3=0..2] [x4=0..4] [x5=1..4]
[x6=0..6] [x7=0..2] [x8=0..4] (t:27, u:20)

OR

[x1=2,5,7] [x2=1..12] [x3=1..2]
[x4=2..6] [x5=3..4] [x6=0..4] [x7=1..3,5,7]
[x8=0..1,3..4] (t:16, u:11)

OR

[x1=5..14] [x2>6] [x3=0..2,4..5] [x4=2..5] (t:5, u:3)

where x1 is the Laplacian edge operator, x2 is the
frequency spot, x3 is the horizontal edge
operator, x4 is the vertical edge operator, x5 is
the horizontal V-shape operator, x6 is the vertical
V-shape operator, x7 is the vertical line operator,
and x8 is the horizontal line operator.

The method uses “truncated” descriptions of
texture classes. A truncated description is
obtained by removing from the initially
generated rules those with a very low t-weight.
The reason for this is that rules with a
low t-weight can be viewed as insignificant, or as
representing noise.

We  have  discovered  experimentally  that  so 
truncated  descriptions  often  give  a  higher 
texture  recognition  performance  than  non- 
truncated  descriptions.  Since  truncated 
descriptions  are  also  simpler,  then  such  a 
truncation  process  is  highly  desirable.  A  detailed 
study  of  this  phenomenon  (in  the  context  of 
non-vision  applications)  have  been  described  in 
[Bergadano  et  al.,  1992]. 

The  learned  texture  descriptions  are 
generalizations  of  the  observed  texture  events 
(i.e.,  attribute-value  vectors  characterizing 
window-size  texture  samples).  Therefore,  they 
can  be  used  to  classify  unobserved  texture 
samples.  There  are  two  methods  for  applying 
the  descriptions  for  recognizing  the  class 
membership  of  an  event:  the  strict  match  and  the 
flexible  match. 

In the strict match, the system tests whether an
event strictly satisfies (the condition part of) a
rule. The satisfied rule determines the
classification decision. In the flexible match, the
system computes a degree of match between the
event and candidate rules. The degree of match
can vary in the range from 0.0 (no match) to 1.0
(complete match). The rule with the highest
degree of match determines the classification
decision.

To explain the calculation of the degree of
match, assume that a recognition rule contains a
condition $[x = a_j]$. If the domain of the attribute
$x$ is a set of numerical values
$\langle a_1, a_2, \ldots, a_n \rangle$, and an event includes the
statement $[x = a_k]$, the normalized degree of
match between the condition in the rule and the
statement in the event is defined as:

$$1 - \frac{|a_j - a_k|}{n}$$

If the condition has several values in the referent
(on its right-hand side), the value closest to $a_k$ is
used. The degree of match between a rule
containing several conditions and an event was
computed as the average of the degrees of match
between the rule conditions and the event.

The degree of match between a class description
(which may have several rules) and a given
testing event (an example) is determined as the
maximum of the degrees of match between
individual rules in the description and the event.
The description with the highest match among
classes determines the recognition decision. The
measure of recognition accuracy of a rule when
applied to a set of testing events is the
percentage of correctly classified test events
relative to the total number of testing events
in the set.
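
The flexible match described above can be sketched in a few lines. This is an illustrative sketch under stated assumptions rather than the authors' code: each condition is represented by the list of values in its referent, `domain_sizes` gives the number n of values in each attribute's domain, and the function names are hypothetical. A condition scores one minus the absolute difference between the event value and the closest referent value, divided by n; a rule scores the average over its conditions; a class scores the maximum over its rules.

```python
def condition_match(value, referent, domain_size):
    """Degree of match of one condition, using the closest referent value."""
    closest = min(referent, key=lambda a: abs(a - value))
    return 1.0 - abs(closest - value) / domain_size


def rule_match(event, rule, domain_sizes):
    """Average degree of match over the conditions of one rule."""
    scores = [condition_match(event[attr], referent, domain_sizes[attr])
              for attr, referent in rule.items()]
    return sum(scores) / len(scores)


def class_match(event, ruleset, domain_sizes):
    """Maximum degree of match over the rules of one class description."""
    return max(rule_match(event, rule, domain_sizes) for rule in ruleset)


def classify(event, descriptions, domain_sizes):
    """Pick the class whose description matches the event best."""
    return max(descriptions,
               key=lambda name: class_match(event, descriptions[name], domain_sizes))


# Illustrative data only: two classes, one rule each, domains of 10 values.
descriptions = {
    "A": [{"x1": [3, 4, 5], "x2": [0, 1]}],
    "B": [{"x1": [8, 9], "x2": [5]}],
}
sizes = {"x1": 10, "x2": 10}
print(classify({"x1": 7, "x2": 1}, descriptions, sizes))  # A
```

In this toy case class A scores about 0.9 and class B about 0.75, so the event is assigned to A even though it satisfies neither rule strictly.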

3 Implemented Systems

3.1 Learning Texture Signatures: TEXTRAL

The TEXTRAL system implements a version of
the MLT approach (“Multilevel Logical
Templates”). The system generates multiple
levels of descriptions (rulesets) by applying the
same learning process to images generated at
each level. The first level ruleset relates to the
original camera-acquired image. The next level
ruleset relates to a “symbolic image” that
consists of numerical labels associated with the
texture classes. These labels are generated by the
application of the first level ruleset, and
represent texture classes assigned to texture
events in the original image [Bala and Michalski,
1991]. Subsequent levels of rulesets are
generated by reapplying this same process to the
symbolic images generated at the previous step.

Here is a more detailed description of the
algorithm:

• Step 1 extracts a random set of training events
from the training areas in the original images by
applying various local operators (such as Laws
masks, statistical measures, convolution
operators, etc.), and learns the “first-level”
texture rules;

• Step 2 determines rulesets generalizing the
training events. These rulesets are applied to the
training areas of the original image, and a new
image (a “symbolic image”) is created. The
pixels of the new image (the next level image)
are numerical labels of texture classes assigned
by the ruleset to corresponding events in the
original (previous level) image.

• Step 3 determines the match between the
texture training areas labeled by the teacher and
the corresponding areas in the symbolic image.
If the match is sufficiently high (or the system
reaches a designated number of levels), then the
process stops. Otherwise, control returns to
Step 1. The events are then extracted from the
symbolic image (the last level image) and
assigned classes corresponding to the training
assignment of pixels in the original image (i.e.,
representing the “correct” partitioning of the
image into texture classes done by the teacher).
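
The three steps above form a loop, which can be sketched as follows. This is a simplified illustration, not the TEXTRAL code: `extract` (window to attribute value) and `learn` (events to classifier) are hypothetical stand-ins for the local operators and the AQ learner, every window is used rather than a random subset, and each window takes the teacher's label at its top-left pixel.

```python
def windows(image, size):
    """Yield (row, col, window) for every size x size window of the image."""
    for r in range(len(image) - size + 1):
        for c in range(len(image[0]) - size + 1):
            yield r, c, [row[c:c + size] for row in image[r:r + size]]


def label_image(image, size, extract, classify):
    """Step 2: build the symbolic image of class labels, one per window."""
    rows = len(image) - size + 1
    cols = len(image[0]) - size + 1
    symbolic = [[None] * cols for _ in range(rows)]
    for r, c, win in windows(image, size):
        symbolic[r][c] = classify(extract(win))
    return symbolic


def agreement(symbolic, teacher):
    """Step 3: fraction of labels that agree with the teacher's labels."""
    pairs = [(s, t) for s_row, t_row in zip(symbolic, teacher)
             for s, t in zip(s_row, t_row)]
    return sum(s == t for s, t in pairs) / len(pairs)


def learn_levels(image, teacher, size, extract, learn, threshold, max_levels):
    """Learn one ruleset per level until the labels match the teacher well enough."""
    rulesets, score = [], 0.0
    for _ in range(max_levels):
        # Step 1: events are window features paired with the teacher's label.
        events = [(extract(win), teacher[r][c])
                  for r, c, win in windows(image, size)]
        classify = learn(events)  # hypothetical learner: events -> classifier
        rulesets.append(classify)
        image = label_image(image, size, extract, classify)
        score = agreement(image, teacher)  # zip() ignores the teacher's extra border
        if score >= threshold:
            break
    return rulesets, score
```

Each pass replaces the image with the symbolic image produced by the ruleset just learned, so the next level's events describe label patterns rather than raw pixel properties.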


We have performed a number of experiments
with the system for various numbers of texture
classes (between 4 and 16), representing
fine-grained textures, such as sand, paper, pebbles,
etc. Training events were determined from
texture samplings using 8x8 windows, and
selected from texture training areas. The texture
training and testing areas for each texture class
were determined by a teacher.

Tables 1 and 2 show the confusion matrices
characterizing the system’s recognition rates (in
%) for individual texture events (using 8x8
windows) selected from testing areas of four
texture classes, C1, C2, C3 and C4. Table 1
shows the recognition rates for the first level
rules, and Table 2 for the second level rules.
Recall that the conditions of the second level
rules apply not to properties of the original
image, but to the distribution of texture labels
generated by the first level rules.


                 Recognized texture class
Correct class    C1    C2    C3    C4
C1               84    15    16    23
C2               10    78    20    10
C3                7    14    79    27
C4               26    17    27    67

Table 1. Recognition rates using the first level rules.


                 Recognized texture class
Correct class    C1    C2    C3    C4
C1               94     3     2     6
C2                4    96     4     4
C3                4     9    88    12
C4               13     7    12    80

Table 2. Recognition rates using the second level rules.

The average correct recognition rate of
individual events for the 4 class experiment was
77% when using the first level rules, and 89.5%
when using the second level rules. At the same
time, the average misclassification rate decreased
from 17.6% to 6.6%, respectively. Thus, the
experiment has demonstrated that multilevel
learning (using higher level rules) can increase
the system’s recognition of individual events.
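
These averages can be recomputed from the entries of Tables 1 and 2. The sketch below assumes that the correct recognition rate is the mean of the diagonal entries and that the misclassification rate is the mean of the off-diagonal entries; the second assumption is ours, chosen because it reproduces the reported figures. The name `average_rates` is hypothetical.

```python
# Rows: correct class C1..C4; columns: recognized class C1..C4 (values in %).
TABLE_1 = [[84, 15, 16, 23], [10, 78, 20, 10], [7, 14, 79, 27], [26, 17, 27, 67]]
TABLE_2 = [[94, 3, 2, 6], [4, 96, 4, 4], [4, 9, 88, 12], [13, 7, 12, 80]]


def average_rates(matrix):
    """Return (mean of diagonal entries, mean of off-diagonal entries)."""
    n = len(matrix)
    correct = [matrix[i][i] for i in range(n)]
    wrong = [matrix[i][j] for i in range(n) for j in range(n) if i != j]
    return sum(correct) / n, sum(wrong) / len(wrong)


print(average_rates(TABLE_1))  # correct rate 77.0; misclassification about 17.67
print(average_rates(TABLE_2))  # correct rate 89.5; misclassification about 6.67
```

The diagonal means match the reported 77% and 89.5% exactly. The off-diagonal means, 212/12 and 80/12, agree with the reported 17.6% and 6.6% if those figures were truncated rather than rounded.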

It should be clearly noted, however, that to
recognize a given sample of a texture, one would
extract from the unknown texture not just a single
event (representing an 8x8 window), but also
several neighboring events. In such a case, the
texture identification decision will be based on
the majority of class assignments of individual
events in the neighborhood.

Therefore, even when there is a relatively low
recognition rate of individual events, one can
achieve a 100% recognition rate of the sample (it
is sufficient that the plurality of events in the
sample are recognized correctly). A problem
may occur mainly when a sample is taken from
a border area between different textures, or
includes events characterizing rare local texture
distortions.
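
The sample-level decision described above amounts to a plurality vote over the class labels of the events in the sample. A minimal sketch follows; the function name `classify_sample` and the example labels are illustrative assumptions.

```python
from collections import Counter


def classify_sample(event_labels):
    """Return the class assigned to the largest number of events in the sample."""
    return Counter(event_labels).most_common(1)[0][0]


# Only three of the six events are labeled C1, yet C1 still wins the vote.
print(classify_sample(["C1", "C2", "C1", "C4", "C1", "C3"]))  # C1
```

This is why a modest per-event recognition rate can still give a correct decision for the whole sample, as long as the errors are spread over several competing classes.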


Figure  2  presents  the  recognition  rate  of 
individual  events  on  learning  of  12  textures, 
using  rules  of  level  1, 2,  3  and  4. 


Figure  2:  An  increase  of  the  system 
performance  with  the  rule  level. 

Figure 2A shows the increase of the recognition
rate of individual events from 12 texture classes
with the rule level. Figure 2B shows the
corresponding decrease of the standard
deviation (in %) of the recognition rates with the
rule level. The average recognition rate of
individual events increased from 48% with level
one rules to 58% with level four rules. At the
same time, the standard deviation (of correct
recognition values) decreased from above 20 to
15, respectively. The minimum recognition rate
increased from 21% to 36%.

The TEXTRAL method represents an extension
and improvement of our earlier method
implemented in the TEXPERT system [Channic,
1989]. TEXPERT used our earlier inductive
learning system GEM (“Generalization of
Examples by Machine”; also called AQ14)
[Reinke, 1984]. TEXPERT was applied to the
problem of recognizing faults in laminated
aircraft materials using ultrasound images.

3.2 Learning a Large Number of Classes: PRAX

In the TEXTRAL system, each texture class is
represented by a ruleset [Bala et al., 1992]. If
there are very many texture classes, there will be
correspondingly many rulesets, and the learning
and recognition process may become complex.
The PRAX system represents an alternative
approach to the problem of learning a large
number of concept descriptions (in our
application, texture classes).

The  basic  idea  is  to  designate  some  concepts  to 
be  basic,  and  describe  the  remaining  concepts  in 
terms  of  the  relations  to  the  basic  concepts.  This 
idea  can  be  simply  illustrated  by  the  example  in 
Figure  3. 

If the system already knows the concepts of
“orange” (Des1) and “lemon” (Des2), then it
can learn the concept of “grapefruit” by
relating properties of the grapefruit to those of
the lemon and the orange (Des3''), rather than in
terms of original properties (Des3').


[Figure 3 diagram. Phase I, Learning Basic Concepts: ORANGE -> Des1 = F(color, taste, etc.); LEMON -> Des2 = F(color, taste, etc.) (descriptions of basic concepts). Phase II, Learning New Concept: GRAPEFRUIT -> Des3' = F(color, taste, etc.); Des3'' = F(Des1, Des2).]

Figure 3: A simple illustration of the PRAX
method.

In the PRAX method, descriptions of the basic
concepts are called “principal axes.” They are
learned in a similar way as in the TEXTRAL
system.


To learn a new, non-basic concept, the system
determines a similarity matrix (SM) for that
concept. The SM specifies the average degrees
of similarity between the training examples of
the new concept and all the principal axes.

The degree of similarity between an event and
each principal axis is determined according to a
procedure called ATEST [Michalski et al.,
1986]. The procedure determines the
accumulated difference between the attribute
values in the event and the conditions in each
rule in the principal axis. To obtain a uniform
representation of all class descriptions, the
similarity matrix is also computed for all basic
concepts.

These degrees of similarity can be viewed as
values of the newly constructed attributes. Thus,
this method represents a special case of
constructive induction. (The general concept of
constructive induction includes any method that
self-modifies the concept representation space
during the induction process. Generating
additional, problem-oriented attributes is an
important form of such self-modification of the
representation space [Michalski, 1978; Wnek &
Michalski, 1991].)

To  recognize  an  unclassified  event,  the  method 
creates  an  SM  for  it,  that  is,  determines  a  matrix 
of  similarities  between  the  event  and  the 
principal  axes.  Subsequently,  the  system 
determines  the  best  match  between  the  SM  of 
that  event  and  SMs  of  all  candidate  concepts. 
The  best  match  indicates  the  class  membership. 
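
The recognition step can be illustrated with a deliberately simplified sketch. Several things here are assumptions made for illustration and are not taken from the paper: the SM is reduced to one average similarity per principal axis, the `similarity` argument is a stand-in for the ATEST procedure, and the "best match" between SMs is taken to be the smallest squared difference, since the text does not specify the matching measure.

```python
def similarity_vector(event, principal_axes, similarity):
    """Similarities between one event and every principal axis."""
    return [similarity(event, axis) for axis in principal_axes]


def concept_signature(examples, principal_axes, similarity):
    """Average similarity of a concept's training examples to each axis."""
    vectors = [similarity_vector(e, principal_axes, similarity) for e in examples]
    return [sum(column) / len(column) for column in zip(*vectors)]


def recognize(event, signatures, principal_axes, similarity):
    """Return the concept whose signature is closest to the event's vector."""
    vector = similarity_vector(event, principal_axes, similarity)

    def distance(name):
        return sum((a - b) ** 2 for a, b in zip(vector, signatures[name]))

    return min(signatures, key=distance)


# Toy data: events and axes are plain numbers; similarity decays with distance.
def toy_similarity(event, axis):
    return 1.0 / (1.0 + abs(event - axis))


axes = [0, 10]  # stand-ins for the "orange" and "lemon" descriptions
training = {"orange": [0, 1], "lemon": [9, 10], "grapefruit": [4, 6]}
signatures = {name: concept_signature(examples, axes, toy_similarity)
              for name, examples in training.items()}
print(recognize(5, signatures, axes, toy_similarity))  # grapefruit
```

Note that "grapefruit" is never described in terms of the original attribute; it is characterized only by how similar its examples are to the two basic concepts.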

The method was empirically evaluated by
applying it to the problem of learning 24 texture
classes from examples (Table 3). Each example
was described in terms of eight multivalued
attributes (representing detectors of various basic
geometrical concepts, such as the presence of
lines, edges, V-shapes, etc.). The performance of
the PRAX-derived descriptions was compared
with the performance of the k-NN classifier.
Different levels of misclassification noise were
added to test the robustness of the method.

The main strength of the method lies in a
problem-relevant transformation of the
descriptor space. The new descriptors form
generalized sub-spaces of the initial, training
space. In addition, the method uses a non-linear
distance metric to calculate values of constructed
attributes. The distance metric, based on the idea
of flexible matching, is less sensitive to noise
than the traditional Euclidean distance metric
often used by pattern recognition methods.

Method    Noise level    Recognition rate (in %) of examples from unknown texture
PRAX      No noise
          5% noise       100%
          10% noise      100%
k-NN      No noise       53^5
          5% noise       92%
          10% noise      87%
Table 3: The results from comparing PRAX with
the k-NN method.

The current problem with the method is that it
does not have a mechanism for deciding how to
choose basic concepts. Choosing the minimal
subset of concepts to be used for principal axes
generation is important for the method to be
efficient. This problem will be a subject of
future research. Another weakness is that the
similarity matrix is a relatively complex
representation.

3.3 Learning From Noisy Data: AQ-NT

The AQ-NT method represents a novel way of
handling problems of learning from noisy
real-world data [Pachowicz and Bala, 1991]. It is
based on the idea that events covered by rules
with a low t-weight may represent noise in
the data. The assumption is that a system
learning from a dataset that does not contain
such events has a greater chance of producing
correct concept descriptions than when learning
from the original events.

The process of learning concept descriptions (in
the form of a ruleset) is done in the following
two phases:

Phase 1: Performs a rule-based “filtration” of
the noise from the training data. This is done in
the following way:

1. Induce decision rules from a given dataset
using the AQ learning program.

2. Truncate concept descriptions by removing
“the least significant” rules, defined as rules
that cover only a small portion of the training
data (have a small t-weight relative to the
t-weight of other rules).

3. Create a new training dataset that includes
only training examples covered by the
modified concept descriptions.

4. If the size of the dataset falls below an
assumed percentage of the training data
(which reflects an assumed error rate in the
data), then go to Phase 2. Otherwise, return to
step 1.

Phase 2: Acquire concept descriptions from the
reduced training dataset using the AQ
learning program.

Figure 4 shows results from one run of the
AQ-NT system for six texture classes.

[Figure 4 plots; horizontal axis: Iterations, 0 to 15.]

Figure 4: The AQ-NT results.

Figure 4A shows the increase of the recognition
accuracy (in %) of individual events with the
number of iterations. After 12 iterations, the
recognition accuracy reached 95.3%. Figure 4B
shows the average number of rules for each
iteration. The average number of rules can be
viewed as a measure of description complexity.
Figure 4B shows a significant decrease in the
average number of rules (from 37 to 3). This
result is a significant indication of the
advantages of the proposed approach.
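
The filtration phase of AQ-NT described in this section (induce rules, drop the weakest rules, keep only the examples that the remaining rules cover, and repeat) can be sketched as a loop. This is a sketch under stated assumptions, not the AQ-NT program: `learn_rules` and `covers` are hypothetical stand-ins for the AQ learner and its coverage test, "least significant" is taken to mean a t-weight below `keep_ratio` times the largest t-weight, and the extra stop when no example is removed is our addition to guarantee termination.

```python
def aq_nt_filter(data, learn_rules, covers, keep_ratio=0.1, stop_fraction=0.9):
    """Phase 1: repeatedly drop examples covered only by low t-weight rules.

    learn_rules(data) must return a list of (rule, t_weight) pairs, and
    covers(rule, example) must say whether the rule covers the example.
    """
    original_size = len(data)
    while True:
        rules = learn_rules(data)                                 # step 1
        top = max(t for _, t in rules)
        strong = [r for r, t in rules if t >= keep_ratio * top]   # step 2
        reduced = [e for e in data
                   if any(covers(r, e) for r in strong)]          # step 3
        nothing_removed = len(reduced) == len(data)
        small_enough = len(reduced) < stop_fraction * original_size
        if nothing_removed or small_enough:                       # step 4
            return reduced
        data = reduced
```

Phase 2 then amounts to calling the learner once more on the returned dataset, for example `learn_rules(aq_nt_filter(data, learn_rules, covers))`.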

3.4 Rule Improvement by Genetic Algorithm: AQ-GA

The size, complexity, variability and inherent
noise in the vision data pose significant
difficulties in developing a reliable concept
learning system. The AQ-GA multistrategy
system was developed to address some of these
issues [Bala et al., 1993]. This system integrates
two forms of learning, symbolic inductive
generalization and genetic algorithm based
learning. The integration is done in a
closed-loop fashion in order to achieve robust
concept learning capabilities.

The learning process cycles through two phases
(Figure 5).

[Figure 5 diagram; recoverable labels: Training Data Set, Final Learned Description.]

Figure 5: The AQ-GA architecture.

In the first phase, initial concept descriptions are
acquired by running a noise-tolerant extension
of the AQ15 rule induction system. The
resulting concept descriptions may not be
optimal from the performance viewpoint, due to
the AQ bias to generate simple,
cognitively-oriented descriptions. Therefore, in
the second phase, the system attempts to improve
the performance of the descriptions by
employing a genetic algorithm (GA).

The descriptions obtained from AQ15 are
semi-randomly modified, using basic genetic
operators: mutation and crossover. The resulting
descriptions are evaluated according to a
performance criterion. The criterion was the
recognition accuracy of the descriptions on the
“tuning” data (a subset of the training set of
events). The best performing descriptions are
selected from the population, and the process is
repeated for a new generation. The process stops
when a desirable performance level is achieved,
or the number of generations exceeds some limit.
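
The second phase can be outlined as a standard generational loop. The sketch below is an illustration under stated assumptions rather than the AQ-GA implementation: `fitness` stands for recognition accuracy on the tuning data, `mutate` is any operator that returns a modified copy of a description, the `keep` best descriptions are carried over unchanged, and crossover is omitted for brevity.

```python
import random


def evolve(population, fitness, mutate, target, max_generations, rng, keep=2):
    """Improve descriptions until `target` fitness or the generation limit."""
    for _ in range(max_generations):
        ranked = sorted(population, key=fitness, reverse=True)
        if fitness(ranked[0]) >= target:
            break
        parents = ranked[:keep]  # the best performers survive unchanged
        children = [mutate(rng.choice(parents), rng)
                    for _ in range(len(ranked) - keep)]
        population = parents + children
    return max(population, key=fitness)
```

Because the best descriptions are always kept, the best fitness in the population never decreases from one generation to the next.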

The  effectiveness  of  this  multistrategy  approach 
was  tested  on  several  texture  recognition 
problems. 


Genetic algorithms typically represent
individuals in a population (here, concept
descriptions) using fixed-length binary strings.
However, if the effective cooperative learning A
novelty of this method is that it uses, instead of
binary strings, concept descriptions (formally,
VL1 expressions) produced by AQ15. To this
end, a special mutation operator was designed to
introduce small changes to selected condition
parts of the rules in each concept description.
The condition parts are selected by randomly
generating two pointers: the first selects a rule,
and the second selects a condition in this
rule.

The leftmost or the rightmost value of the
referent in this condition is slightly modified.
For example, the condition [x1 = 10..23] might
be mutated to any of the following: [x1 = 10..20],
[x1 = 10..24], [x1 = 12..23] or [x1 = 8..23], as
well as others. Such a mutation process
samples the space of possible concept
description boundaries to improve the
performance criterion. The mutation process can
be viewed as equivalent to various transmutations
(knowledge transformations; Michalski, 1993)
of the conditional part of a rule:

• specialization: [x5 = 3, 10..23] => [x5 = 3, 10..20]

• generalization: [x5 = 3, 10..23] => [x5 = 3, 10..24]

• variation: [x5 = 3, 10..23] => [x5 = 5, 10..23]
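
A mutation operator of this kind can be sketched as follows. The representation is a simplifying assumption: each rule is a dictionary mapping an attribute to a single (low, high) interval, whereas the referents in the text may also list isolated values, and the bounds of the attribute domain are not enforced. The function name `mutate` and the `max_shift` parameter are hypothetical.

```python
import random


def mutate(description, rng, max_shift=3):
    """Return a copy of the description with one interval boundary shifted."""
    mutated = [dict(rule) for rule in description]
    rule = rng.choice(mutated)        # first pointer: select a rule
    attr = rng.choice(sorted(rule))   # second pointer: select a condition
    low, high = rule[attr]
    shift = rng.choice([s for s in range(-max_shift, max_shift + 1) if s != 0])
    if rng.random() < 0.5:
        low = min(low + shift, high)   # move the leftmost value
    else:
        high = max(high + shift, low)  # move the rightmost value
    rule[attr] = (low, high)
    return mutated


description = [{"x1": (10, 23)}]
print(mutate(description, random.Random(0)))
```

Shrinking the interval corresponds to specialization and widening it to generalization; the original description is left untouched so that parents can be kept in the population.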

The crossover operation is performed by
splitting a concept description into two parts,
upper rules and lower rules. These parts are
exchanged between parent concept descriptions
to produce new child concept descriptions. Since
the degree of match of a given tuning event
depends on the degree of match of this event to
each rule of the concept description, this exchange
process enables inheritance of information about
strong rules (strongly matching) in the
individuals of the next evolved population. An
example of crossover applied to a short, four-rule
description is depicted below:

Parent description 1

1 [x1=7..8] [x2=8..19] [x3=8..13] [x5=4..54]

2 [x1=15..54] [x3=11..14] [x6=0..9] [x7=0..11]

- crossover position -

3 [x1=9..18] [x3=16..21] [x4=9..10]

4 [x1=10..14] [x3=13..16] [x4=14..54]

Parent description 2

1 [x1=16..54] [x5^..6] [x7=5..12]

2 [xJ=&.25] [x3=8..23] [x4=9..11] [x5=0..3]

- crossover position -

3 [x4=0..22] [x5=8..9] [x6=0..7] [x7=11..48]

4 [x2=5..8] [x3=7..8] [x4=8..11] [x5=0..3]

The result of the crossover operation (one of two
child descriptions) is the following:

1 [x1=7..8] [x2=8..19] [x3=8..13] [x5=4..54]

2 [x1=15..54] [x3=11..14] [x6=0..9] [x7=0..11]

3 [x4=0..22] [x5=8..9] [x6=0..7] [x7=11..48]

4 [x2=5..8] [x3=7..8] [x4=8..11] [x5=0..3]
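
The exchange of upper and lower rules shown above is a one-point crossover on rule lists. A minimal sketch follows, with rules abbreviated to labels (A1..A4 for parent description 1, B1..B4 for parent description 2); the function name `crossover` is an assumption.

```python
def crossover(parent1, parent2, position):
    """Swap the lower rules of two descriptions at the crossover position."""
    child1 = parent1[:position] + parent2[position:]
    child2 = parent2[:position] + parent1[position:]
    return child1, child2


parent1 = ["A1", "A2", "A3", "A4"]
parent2 = ["B1", "B2", "B3", "B4"]
child1, child2 = crossover(parent1, parent2, 2)
print(child1)  # ['A1', 'A2', 'B3', 'B4']
print(child2)  # ['B1', 'B2', 'A3', 'A4']
```

The first child corresponds to the result listed in the text: rules 1 and 2 of parent description 1 followed by rules 3 and 4 of parent description 2.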

The experiments performed, involving learning
rules for describing texture classes, demonstrated
that the classification results obtained with the
hybrid learning algorithm of Figure 5 (AQ
Training Data -> AQ, and Tuning Data -> GA)
exceed the performance of the AQ algorithm
used alone (AQ Training Data + Tuning Data
-> AQ). Figure 6 presents recognition rates
from one of the experiments.


Figure  6:  Recognition  accuracy  during  the  GA 
generations. 

The  result  shows  the  recognition  rates  for  the 
class  that  was  optimized  by  the  GA.  White  marks 
on  the  diagrams  represent  results  obtained  for 
tuning  data  and  black  marks  represent 
recognition  results  for  testing  data. 

3.5 Texture Recognition Under Varying Perceptual Conditions: Chameleon

In recognizing natural objects outdoors we have
to deal with a great variability of perceptual
conditions that influence changes in object
visual characteristics. Most vision research on
object recognition in outdoor environments,
however, has been focused on recognizing
objects under stationary conditions rather than
dynamic conditions (varying resolution, lighting,
pose, and state). Object models, particularly
texture models, when learned under given
conditions are not effective in recognizing
objects under different perceptual conditions.

To develop robust object recognition systems,
we have to implement system adaptation
capabilities that can mitigate the influence of
changing perceptual conditions on the
effectiveness of object recognition. Our ultimate
goal is to integrate learning and vision modules
in such a way that learning functions can
support adaptation functions of the vision
system.

The variability of texture characteristics under
changing resolution, lighting and pose has been
investigated. It was found that the texture attribute
distribution (for different attribute extraction
methods) can vary significantly when these
conditions are changed. The shape of the
distribution often contains a multimodality of
texture characteristics. In most cases studied,
the distribution cannot be determined a priori.
We also observe that the variability of perceptual
conditions causes a significant translation of the
attribute distribution within the attribute space.

Relatively little has been done on the application
of machine learning methods to the adaptability
of vision systems to the dynamic environment.
Bhanu et al. [1989, 1990] apply genetic
algorithms to image segmentation problems with
an extension towards segmentation under
variable perceptual conditions.

The  variability  of  object  appearance 
(particularly  texture  characteristics)  requires  the 
development  of  system  capabilities  that  will 
dynamically  reconfigure  and  update  object 
models  (knowledge).  In  our  approach 
[Pachowicz,  1991],  system  adaptation  is  applied 
to  recognize  objects  on  images  acquired  over 
time. 


In order to recognize an object on images
sequentially, the system has to iteratively update
the object model with regard to changes in
object characteristics. In this approach, a time
sequence of images monitors slight changes in
resolution, lighting and surface positioning from
one image to the next; that is, a sequence of
images is affected by continuous changes in
perceptual conditions.

We integrate the learning and recognition
processes within a closed loop to update texture
models. Analysis of system recognition
effectiveness, performed over a sequence of
images, detects changes in textures. If this
effectiveness decreases, then the system activates
incremental learning processes of model
modification to improve the model's
discriminating power. The system learns initial
texture models from teacher-provided examples.
Then, the system updates these descriptions
automatically without teacher help.

Two systems were developed: CHAMELEON '91
and CHAMELEON '92. The first system had
only some of the adaptability functions
implemented [Pachowicz et al., 1992; Pachowicz,
1992]. In this system, a teacher segments each
image in a sequence. The system was useful for
investigating stability problems and for the
modification of object models performed
on-line. The second system is more autonomous,
and needs much less help from a human
operator. The underlying methodology, system
architecture, and experimental results are
presented in a separate paper in the Proceedings
[Pachowicz, 1993].


4  Summary 

We have presented a general approach, called
Multilevel Logical Templates (MLT), and several
implemented systems for inductive learning of
descriptions of texture classes from texture
samples. These systems represent different
variations and extensions of the general
approach, oriented toward various types of
learning problems:

• TEXTRAL: for multilevel learning to
maximally improve the recognition accuracy
of new textures,

• AQ-NT: for learning from data with noise,

• PRAX: for learning from large numbers of
texture classes,

• AQ-GA: for automatic tuning and
enhancement of concept descriptions
obtained by a standard AQ inductive learning
method.

problem of learning of the texture descriptions,
but also to other types of problems in vision.

There are several limitations of the current
methods. We have not investigated issues of the
robustness and the sensitivity of the methods to
various invariant texture transformations (e.g.,
significant changes in the illumination). Also, it
is unclear how the performance of the methods
depends on the number of texture classes.

There are several other major topics to be
investigated in future research:

(i) enhancements to the current learning
methodology to include capabilities for
automatically generating higher level
problem-relevant attributes (constructive
induction);

(ii) the applicability of multistrategy learning
(e.g., combining symbolic rule learning
with neural network learning; the issue of
representing and learning of imprecisely
defined visual concepts);

(iii) extensions of the methodology to other
problems in vision, e.g., learning of shape
classes;

(iv) learning new visual concepts in terms of
differences and similarities from known
concepts, and developing a calculus for
representing symbolic differences between
visual concepts.

Abstract

We have made progress in a number of areas this past
year, primarily in Physics-Based Vision. We have built
and are continuing to develop different versions of a new
type of sensor called a polarization camera, which we
believe will make the more general capabilities of
polarization vision more accessible to many image
understanding applications. An advancement in the modeling
of diffuse reflection from smooth dielectric surfaces has been
made, resulting in a relatively simple formula that
significantly generalizes Lambert’s Law. We have developed
a new stereo technique which utilizes photometric ratios
as an invariant across a stereo pair of cameras for point
correspondence on smooth surfaces, producing accurate
dense depth maps. Shape recovery techniques have been
studied which utilize both reflectance and range data to
compute shape more accurately than from each individual
data source alone.

1 Polarization Vision

There is a compelling motivation to study polarization
vision: polarization affords a more general description
of light than does intensity, and can therefore provide a
richer set of descriptive physical constraints for the
interpretation of an imaged scene. As intensity is the linear
sum of polarization components, intensity images
physically represent reduced polarization information. Because
the study of polarization vision is more general than
intensity vision, there are polarization cues that can
immensely simplify some important visual tasks (e.g.,
material classification, reflection component analysis,
identification of specular reflection, image region segmentation,
etc.) which are more complicated or possibly infeasible
when limited to using intensity and color information. A
detailed description of a variety of polarization-based
vision methods is contained in [14], [15], [25], [16], [3].

1.1  Polswisatioii  Camera  Computational 
Senaora:  The  Big  Picture 

A  criticism  that  has  sometimes  been  leveled  at 
polarisation-based vision methods is the inconvenience of
obtaining polarisation component images by having to
place a linear polarising filter in front of an intensity CCD
camera and mechanically rotating this filter by hand or
by motor into different orientations. This inconvenience
is simply a result of commercially available camera sensors
being geared towards taking intensity images instead
of polarisation images. In our conception, polarisation vision
is no more a “multiple view” problem than is color
vision, and a camera can be developed that can automatically
sense polarisation components and even automatically
compute physical scene properties that are directly
related  to  this  polarisation  information.  Such  a  polariza¬ 
tion  camera  sensor  was  originally  suggested  in  fl^,  and  in 
the past year at Johns Hopkins we have built a liquid crystal
implementation of a polarisation camera [26]. We are
continuing to build a variety of other polarisation camera
sensors  including  a  beamsplitter  implementation,  and  a 
self-contained  VLSI  chip  implementation  some  of  which  is 
discussed  in  an  article  contained  in  these  proceedings  [23]. 

Because  humans  do  not  observe  polarisation  directly 
except  with  the  aid  of  special  filters,  it  is  beneficial  for 
a  polarisation  camera  to  produce  some  kind  of  visualisa¬ 
tion  for  representing  sensed  polarisation  information  (e.g., 
intensity-color  representation)  for  scene  analysis.  We  uti¬ 
lise  a  hue-saturation-intensity  visualisation  for  partially 
linearly polarised light [26], [23]. Such a scheme was suggested
by Bernard and Wehner [1] as a functional similarity
between polarisation vision and color vision for biological
vision systems. Whether a polarisation camera
computes  a  visualisation  of  sensed  polarisation  informa¬ 
tion  at  each  pixel,  or  computes  a  visualisation  of  physical 
information  (e.g.,  dielectric/metal  composition)  at  each 
pixel  related  to  sensed  polarisation,  a  polarisation  cam¬ 
era  is  inherently  a  computational  sensor.  The  speed  at 
which  such  computations  can  be  performed  is  important 
for  real-time  applications. 
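
As a rough illustration of such a per-pixel computation, the sketch below derives intensity, partial polarisation, and orientation from three component images and maps them onto hue, saturation, and value. It is a minimal sketch and not the implementation of [26] or [23]: it assumes ideal polariser orientations of 0°, 45°, and 90°, and the function names and the `max_intensity` scaling are hypothetical.

```python
import colorsys
import math


def polarization_state(i0, i45, i90):
    """Return (intensity, partial polarisation, orientation in radians)
    for one pixel, given component values measured through an ideal
    linear polariser at 0, 45 and 90 degrees (assumed orientations)."""
    s0 = i0 + i90            # total intensity
    s1 = i0 - i90            # horizontal/vertical preference
    s2 = 2.0 * i45 - s0      # diagonal preference
    if s0 <= 0.0:
        return 0.0, 0.0, 0.0
    partial = min(1.0, math.hypot(s1, s2) / s0)
    orientation = (0.5 * math.atan2(s2, s1)) % math.pi
    return s0, partial, orientation


def to_rgb(i0, i45, i90, max_intensity=2.0):
    """Map orientation to hue, partial polarisation to saturation and
    intensity to value, then convert to RGB for display."""
    intensity, partial, orientation = polarization_state(i0, i45, i90)
    hue = orientation / math.pi                    # 0..1 over a half turn
    value = min(1.0, intensity / max_intensity)    # hypothetical scaling
    return colorsys.hsv_to_rgb(hue, partial, value)
```

With this mapping an unpolarised pixel appears grey whatever its orientation estimate, while strongly polarised pixels appear as saturated colors whose hue encodes the plane of polarisation.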

We feel that there are considerable advantages to building
a polarisation camera sensor geared toward doing polarisation
vision. There already exist polarisation-based
vision methods that can significantly benefit a number
of application areas such as aerial reconnaissance, autonomous
navigation (e.g., UGV), target recognition, inspection,
and manufacturing and quality control. A polarisation
camera would make polarisation-based vision
methods  more  accessible  to  these  application  areas  and 
others.  It  should  be  fully  realised  that,  as  intensity  is 
a  compression  of  polarisation  component  information,  a 
polarisation  camera  can  function  as  a  conventional  in¬ 
tensity  camera,  so  that  intensity  vision  methods  can  be 
implemented  by  such  a  camera  either  alone,  or,  together 
with  polarisation-based  vision  methods.  As  intensity- 
based  methods  are  physical  instances  of  polarisation- 
based  methods,  a  camera  sensor  geared  towards  polarisa¬ 
tion  vision  does  not  in  any  way  exclude  intensity  vision,  it 
only generalises it, providing more physical input to an automated
vision system. Adding color sensing capability to
a  polarisation  camera  makes  it  possible  to  sense  the  com¬ 
plete  set  of  electromagnetic  parameters  of  light  incident 
on  the  camera. 

We are also considering, once a high resolution VLSI
polarisation camera chip has eventually been made,
using such a chip for polarization goggles extending human
vision into the polarisation domain. This is in analogy
with the way night vision goggles extend human vision to
“see” other wavelengths of light. Polarisation goggles may
be useful to a number of areas which require this type of
enhanced vision. It is known that many animals
(e.g., bees and fish) receive important visual information
in  the  polarisation  domain. 

A  patent  is  pending  for  a  variety  of  these  polarisation 
camera and polarisation goggle devices discussed above [13].

1.2  Liquid  Crystal  Polarization  Camera 

A  polarisation  camera  has  been  designed  and  built  in 
the  Computer  Vision  Laboratory  that  utilises  two  Twisted 
Nematic (TN) liquid crystals in series with a fixed polariser
analyser placed in front of a standard intensity CCD
camera.  See  Figure  1.  The  TN  liquid  crystals  electro- 
optically  rotate  the  plane  of  the  polarisation  of  light,  con¬ 
trolled  by  electrical  voltages  placed  across  the  liquid  crys¬ 
tals  in  synchronisation  with  camera  video.  Not  only  does 
this  obviate  the  need  for  mechanically  rotating  a  polaris¬ 
ing  filter  in  front  of  a  CCD  camera  to  image  polarisation 
components,  but  optical  distortion  caused  by  the  wobbling 
from such mechanical rotation is virtually eliminated.
Components of polarisation are imaged under full automatic
computer control, and these are processed on a Datacube
MV-20 board programmable via ImageFlow software
from a SUN workstation. For details see [26], [23].
One  program  on  the  Datacube  MV-20  computes  from  po¬ 
larisation  component  images  a  hue-saturation-intensity  vi¬ 
sualisation  at  each  pixel  for  partial  linear  polarisation  rep¬ 
resenting,  respectively,  the  orientation  of  the  plane  of  the 
linear  component  of  polarisation,  partial  polarisation  (i.e., 
percentage  of  linear  polarisation  content),  and,  intensity. 
Another  program  computes  dielectric/metal  composition 
from  polarisation  component  images.  Our  liquid  crystal 
polarisation  camera  can  generate  up  to  2.5  polarization 
images a second. The main timing bottleneck is the relaxation
time of 100 ms for each of the liquid crystals to
switch  states.  With  the  most  current  faster  liquid  crystals 
we  can  at  least  double  the  rate  of  polarisation  images  per 
second,  and  we  intend  to  incorporate  these  newer  liquid 
crystals  in  our  implementation.  A  nice  feature  about  our 
liquid  crystal  polarisation  camera  is  that  with  the  Dat¬ 
acube  MV-20  board,  it  is  a  programmable  computational 
sensor  in  that  sensed  polarisation  components  can  be  pro¬ 


Depending upon interest from the image understanding
community, a self-contained optical head can be made
from liquid crystals and a polarizer, with appropriate electrical
contacts, that can be mounted on the lens for a CCD
camera. Assuming the existence of hardware for digitising
and processing polarisation component images, this is
a low cost way of converting an intensity CCD camera
into a polarisation camera.

1.3  Polarization  Camera  Using  Beamsplitter 

A  common  design  for  high  quality  color  cameras  is  to 
use a beamsplitter that directs equal amounts of incoming
light onto 3 separate CCD chips for red, green, and
blue.  A  similar  idea  can  be  used  to  direct  light  onto  mul¬ 
tiple  CCD  chips,  each  chip  covered  by  a  uniquely  oriented 
polarising  filter.  Unfortunately  the  polarising  properties 
of  most  common  kinds  of  beamsplitters  can  be  variable 
across  standard  wide  fields  of  view. 

We  are  developing  a  prototype  for  a  2-CCD  polarisa¬ 
tion  camera  utilising  a  polarizing  plate  beamsplitter.  See 
Figure  2.  The  simplicity  of  this  design  stems  from  the  use 
of a special coating on a glass plate producing a beamsplitter
that affects the polarization of transmitted and reflected
light in a nearly constant known way across a fairly
wide range of angles (i.e., ±20°). Some details are described
in [23] contained in these proceedings. The polarization
state of reflected and transmitted light is affected
in a linearly independent way by the plate beamsplitter.
If  reflected  and  transmitted  light  are  incident  on  different 
CCD  chips,  the  horizontal  and  vertical  components  of  po¬ 
larization  can  be  resolved  by  solving  a  linear  set  of  simulta¬ 
neous  equations,  and  without  the  need  for  any  polarising 
filters  on  the  CCD  chips.  The  current  trade-off  between 
this  type  of  polarization  camera  and  our  liquid  crystal 
polarization  camera  is  speed  vs.  amount  of  polarisation 
information.  While  the  2-CCD  polarisation  camera  with 
beamsplitter  can  operate  at  least  at  15  frames/second  and 
probably  at  30  frames/second,  only  two  components  of  po¬ 
larization  are  resolved  as  compared  with  the  complete  set 
of three components resolved by the liquid crystal polarization
camera, which are needed to compute a complete state of partial
linear polarization. One way of extending the 2-CCD
camera design to resolve three components of polarisation
is to add a TN liquid crystal intercepting light before it
reaches  the  beamsplitter.  There  are  obvious  extensions 
using  three  CCD  chips  at  the  expense  of  more  difficult 
registration  problems.  However,  the  2-CCD  polarization 
camera with polarizing plate beamsplitter currently appears
to be a simple robust design that may be able in
the  short  term  to  give  near  frame-rate,  partial  capability 
for  polarization  vision  for  applications  such  as  the  DARPA 
Unmanned  Ground  Vehicle. 
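
The linear solve mentioned above can be sketched as follows. This is an illustration under stated assumptions rather than the design of [23]: each CCD reading is modeled as a known linear mix of the horizontal and vertical components, and the mixing coefficients `r_h`, `r_v`, `t_h`, `t_v` are hypothetical calibration values for the reflected and transmitted paths.

```python
def resolve_components(i_reflected, i_transmitted, r_h, r_v, t_h, t_v):
    """Solve the 2x2 system
         i_reflected   = r_h * H + r_v * V
         i_transmitted = t_h * H + t_v * V
    for the horizontal (H) and vertical (V) polarization components
    at one pixel, using Cramer's rule."""
    det = r_h * t_v - r_v * t_h
    if abs(det) < 1e-12:
        # The two paths must respond to H and V in linearly independent ways.
        raise ValueError("beamsplitter responses are not linearly independent")
    h = (i_reflected * t_v - r_v * i_transmitted) / det
    v = (r_h * i_transmitted - t_h * i_reflected) / det
    return h, v
```

Because the system is only 2x2 and the coefficients are fixed by the coating, the per-pixel cost is a handful of multiplications, which is consistent with the near frame-rate operation described above.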

1.4 Polarisation Camera Chips

In collaboration with Prof. Andreas Andreou in
the  Electrical  and  Computer  Engineering  department  at 
Johns  Hopkins  we  are  in  the  process  of  developing  self- 
contained  VLSI  versions  of  polarization  cameras  that 
sense  complete  states  of  partial  linear  polarization  on- 
chip.  Currently  we  are  experimenting  with  chips  (designed 
at  Hopkins  and  fabricated  at  MOSIS)  that  compute  on- 
chip  a  hue-saturation-intensity  representation  of  partial 
linear  polarisation  at  each  “polarization  pixel”  and  output 
this  directly  in  video.  VLSI  offers  very  high  computational 
throughput  so  that  VLSI  polarization  cameras  can  poten¬ 
tially  operate  at  very  high  speeds  and  benefit  a  number 
of  application  areas  that  require  real-time  operation. 

1.5  Modeling  of  Polarization  in  Outdoor 
Scenes 

Skylight  is  partially  polarized,  dependent  upon  posi¬ 
tion  relative  to  the  sun  according  to  the  law  of  Rayleigh 
scattering.  Reflected  polarization  from  different  terrain 
types  (e.g.,  water,  vegetation,  soil,  etc...)  is  dependent 
upon  which  patch  of  skylight  is  producing  the  reflection. 
A  polarisation  reflectance  model  is  proposed  in  [19]  which 
combines  the  polarisation  model  originally  presented  in 
[14]  with  a  polarisation  model  for  skylight  illumination. 
Such  a  model  can  be  used  to  predict  reflected  polarization 
dependent  upon  viewer  orientation  relative  to  surface  ter¬ 
rain  and  could  be  potentially  useful  in  providing  insight 
into  polarization-based  identification  of  terrain  type  for 
autonomous  navigation. 
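
For orientation, the dependence of skylight polarization on position relative to the sun can be sketched with the textbook single-scattering Rayleigh result, in which the degree of polarization depends only on the angle between the sun direction and the viewed sky direction. This is an idealized illustration and not the reflectance model of [19]; real skies are less polarized because of multiple scattering, and the function name is hypothetical.

```python
import math


def rayleigh_degree_of_polarization(sun_dir, view_dir):
    """Ideal single-scattering degree of polarization of skylight,
    sin^2(g) / (1 + cos^2(g)), where g is the angle between the
    direction to the sun and the direction to the viewed sky patch.
    Both directions are 3-vectors and need not be normalized."""
    dot = sum(a * b for a, b in zip(sun_dir, view_dir))
    norm = math.sqrt(sum(a * a for a in sun_dir)) * \
        math.sqrt(sum(b * b for b in view_dir))
    cos_g = dot / norm
    return (1.0 - cos_g * cos_g) / (1.0 + cos_g * cos_g)
```

Under this idealization the sky is most strongly polarized 90° away from the sun and unpolarized looking directly toward or away from it, which is why the polarization reflected from terrain depends on which patch of sky is being reflected.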

2  Reflectance  Modeling 
2.1  Diffuse  Reflection 

Perhaps the most widely used assumption about reflectance
from materials in computer vision and in computer
graphics is Lambert's Law for diffuse reflection [8].
Lambert  predicted  that  diffuse  reflection  from  a  material 
contributed  by  light  incident  from  a  specified  direction  is 
proportional to the cosine of the angle between this incident
direction and the surface normal, independent of the
direction  of  reflection.  While  relatively  little  physical  mo¬ 
tivation  was  given  for  this  law  when  it  was  first  published 
over  200  years  ago,  it  has  been  adopted  by  the  computer 
vision  and  computer  graphics  communities  primarily  be¬ 
cause  it  serves  as  a  reasonably  accurate  and  computation¬ 
ally  simple  approximation  for  describing  diffuse  reflection 
under  a  number  of  conditions. 

A  prevalent  class  of  materials  encountered  both  in  com¬ 
mon  experience  and  in  vision/robotics  environments  are 
inhomogeneous  dielectrics  which  include  plastics,  ceram¬ 
ics,  and,  rubber.  It  has  been  known  that  diffuse  reflection 
from  smooth  inhomogeneous  dielectric  surfaces  can  seri¬ 
ously  deviate  from  Lambert’s  Law  under  certain  condi¬ 
tions.  We  have  formally  derived  from  first  physical  princi¬ 
ples  and  extensively  empirically  verified  a  relatively  simple 
formula  for  diffuse  reflection  from  smooth  inhomogeneous 
dielectric surfaces that accurately explains striking deviations
from Lambertian behavior [21], [18], [17]. Using
the geometry depicted in Figure 3, if light is incident with
radiance, $L$, at incidence angle, $\psi$, through a small solid
angle, $d\omega$, on a smooth dielectric surface, then

$$\varrho L \times \bigl(1 - F(\psi, n)\bigr) \times \cos\psi \times \Bigl(1 - F\bigl(\sin^{-1}(\sin\phi / n),\, 1/n\bigr)\Bigr)\, d\omega \qquad (1)$$

describes the diffuse reflected radiance into emittance angle
(i.e., viewing angle), $\phi$. The terms $F$ refer to the Fresnel
reflection coefficients [11], $n$ is the index of refraction
of the dielectric medium, and $\varrho$ is the diffuse albedo. This
diffuse  reflectance  formula  accurately  describes  the  depen¬ 
dence  of  diffuse  reflection  on  viewing  angle,  falling  off  to 
zero  as  viewing  approaches  grazing.  This  formula  also  ac¬ 
curately  shows  that  diffuse  reflection  falls  off  faster  than 
predicted  by  Lambert’s  law  as  a  function  of  angle  of  inci¬ 
dence,  particularly  as  angle  of  incidence  approaches  close 
to 90°. For exact details see [22] in these proceedings.

When the Fresnel reflection coefficients are close to
zero,  formula  1  becomes  almost  identical  to  Lambert’s 
Law,  and  this  is  true  to  good  approximation  when  in¬ 
cidence  and  emittance  angles  are  both  within  the  range  of 
0° to 50°. This explains why Lambert’s Law has generally
been  accepted  as  a  reasonably  good  approximation.  How¬ 
ever,  if  either  one  or  both  incidence  and  emittance  angles 
are  outside  this  range,  major  deviations  from  Lambert’s 
Law  occur  as  the  Fresnel  coefficients  become  significant. 
In  [22]  (these  proceedings)  experiments  are  shown  that 
illustrate  striking  non-Lambertian  effects  near  occluding 
contours under oblique illumination that are accurately
explained  by  our  new  formula. 
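
The size of the deviation from Lambert's Law can be checked numerically. The sketch below is an illustration only: it assumes that F denotes the standard Fresnel reflectance for unpolarized light (the average of the two polarization states), and the function names are hypothetical. It evaluates the two transmission factors of formula 1 and compares them with their value at normal incidence and normal viewing, where the behavior is Lambertian.

```python
import math


def fresnel_unpolarized(theta, n):
    """Fresnel reflectance for unpolarized light at incidence angle
    theta (radians) onto a medium of relative refractive index n."""
    sin_t = math.sin(theta) / n
    if sin_t >= 1.0:
        return 1.0                      # total internal reflection
    cos_i = math.cos(theta)
    cos_t = math.sqrt(1.0 - sin_t * sin_t)
    r_s = (cos_i - n * cos_t) / (cos_i + n * cos_t)
    r_p = (n * cos_i - cos_t) / (n * cos_i + cos_t)
    return 0.5 * (r_s * r_s + r_p * r_p)


def transmission_product(psi, phi, n):
    """The two Fresnel transmission factors of formula 1: light entering
    at incidence angle psi and leaving toward emittance angle phi."""
    theta_internal = math.asin(math.sin(phi) / n)
    entering = 1.0 - fresnel_unpolarized(psi, n)
    leaving = 1.0 - fresnel_unpolarized(theta_internal, 1.0 / n)
    return entering * leaving


def diffuse_radiance(albedo, radiance, psi, phi, n, d_omega):
    """Formula 1 evaluated for given albedo, incident radiance and solid angle."""
    return albedo * radiance * transmission_product(psi, phi, n) \
        * math.cos(psi) * d_omega


def ratio_to_lambert(psi, phi, n):
    """Formula 1 divided by a Lambertian law matched at psi = phi = 0."""
    return transmission_product(psi, phi, n) / transmission_product(0.0, 0.0, n)
```

For an index of refraction of 1.5 this ratio stays within a few percent of 1 when both angles are at most 50°, and falls toward zero as either angle approaches grazing.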

Formula  1  has  bearing  on  virtually  any  technique  in 
computer  vision  and  image  understanding  that  relies  upon 
the  Lambertian  assumption  applied  to  dielectric  surfaces, 
including  shape  from  shading,  shape  and/or  roughness  de¬ 
termination  from  multiple  light  source  illumination  (e.g., 
photometric stereo) and shape from interreflection. It is
impossible  to  reference  all  of  the  related  works  but  the 
book  by  Horn  and  Brooks  [6]  contains  a  number  of  ap¬ 
plicable  papers.  This  result  makes  it  possible  to  precisely 
analyze  the  conditions  under  which  it  is  reasonable  to  as¬ 
sume  the  Lambertian  model  for  a  particular  technique, 
and  the  conditions  under  which  the  Lambertian  model 
breaks  down.  In  turn  our  formula  can  be  utilized  to 
precisely analyze when certain image understanding techniques
are valid. This more general diffuse reflectance
model  provides  a  more  solid  physical  foundation  upon 
which  to  develop  accurate  object  feature  extraction  tech¬ 
niques  in  computer  vision. 

2.2 Relative Strength of Specular and Diffuse
Reflection: How Bright is a Specularity?

An  additional  advancement  that  our  diffuse  reflectance 
model for inhomogeneous dielectrics makes is that it directly
relates the diffuse albedo, $\varrho$, to physical surface parameters.
As explained in [21], [18], [17], and [22] (contained
in these proceedings), inhomogeneous dielectric material
is  modeled  as  a  collection  of  scatterers  contained  in  a  uni¬ 
form  dielectric  medium  with  index  of  refraction  different 
from that of air. Diffuse reflected intensity results from
the  process  of  incident  light  refracting  into  the  dielectric 
medium,  producing  a  subsurface  diffuse  intensity  distribu¬ 
tion  from  multiple  internal  scattering,  and  then  refraction 
of  this  subsurface  diffuse  intensity  distribution  back  out 
into air. We show that the diffuse albedo, $\varrho$, is dependent
upon both the single scattering albedo, $\rho$, describing the
proportion of energy reradiated upon each subsurface single
scattering, and the index of refraction $n$. The exact
relationship  is  given  in  [18],  [17],  [22]. 

Whereas  before  diffuse  albedo  was  just  an  ad  hoc 
scaling  coefficient  our  diffuse  reflection  model  explains 
the  physical  origin  of  diffuse  albedo.  This  means  that 
combined  diffuse  and  specular  reflection  can  be  modeled 
purely  in  terms  of  physical  material  parameters  [20],  [22]. 
An  important  consequence  of  this  is  that  the  relative 
brightness  of  diffuse  and  specular  reflection  can  be  pre¬ 
dicted  in  terms  of  these  material  parameters.  It  has  al¬ 
ways  been  known  that  specularities  are  typically  brighter 
than  diffuse  reflection  and  there  are  a  number  of  im¬ 
age  understanding  techniques  that  segment  specularities 
using  contrast  thresholding  without  real  physical  moti¬ 
vation  for  selecting  such  thresholds.  Specularity-diffuse 
contrasts  can  be  precisely  predicted  from  our  physical 
model  as  described  in  [20],  [22].  Even  if  the  material  sur¬ 
face  and  illumination  conditions  are  completely  unknown, 
we  have  derived  physical  lower  bounds  for  specularity- 
diffuse  contrasts  below  which  it  is  physically  impossible  for 
specularity-diffuse  contrasts  to  occur.  This  may  be  useful 
as  an  added  feature  to  image  understanding  techniques  in 
discerning  specular  reflection  on  dielectric  surfaces. 

3 3-D Stereo Correspondence Utilizing
Photometric Ratios as an Invariant

Correctly  corresponding  points  on  a  smooth  featureless 
surface utilising intensity values between a stereo pair of
images  can  in  practice  be  very  difficult  to  do  for  a  variety 
of  reasons  having  to  do  with  the  nature  of  video  cam¬ 
eras  and  object  reflectance.  Influencing  image  grey  values 
are  F-stop,  image  plane-to-lens  distance,  angle  between 
incident  light  on  a  pixel  and  the  optic  axis  in  a  perspec¬ 
tive  image,  and,  camera  gain,  all  of  which  can  be  at  least 
slightly  variable  in  an  unpredictable  way  across  a  stereo 
pair  of  cameras.  In  addition,  as  was  seen  in  the  last  sec¬ 
tion  with  respect  to  our  research  on  diffuse  reflection  from 
dielectrics,  such  reflection  is  in  fact  viewpoint  dependent. 

The  work  of  Wolff  and  Angelopoulou  [24]  (in  these  pro¬ 
ceedings)  shows  that  photometric  ratios  produced  from 
diffuse reflection under different, but not necessarily
known, illumination conditions are reliable for accurate
correspondence  of  object  points  along  epipolar  lines  across 
a  stereo  pair  of  images.  The  methodology  utilizes  multiple 
stereo  pairs  of  images,  each  stereo  pair  taken  of  exactly 
the  same  scene  but  under  different  illumination.  With 
just  two  stereo  pairs  of  images  taken  respectively  for  two 
different  illumination  conditions,  a  stereo  pair  of  photo¬ 
metric  ratio  images  can  be  produced;  one  for  the  ratio 
of  left  images,  and  one  for  the  ratio  of  right  images.  We 
show  that  such  photometric  ratio  images  are  invariant  to 
changes in video camera parameters listed above. Because
formula 1 above is a separable function in variables
incidence angle, $\psi$, and emittance angle, $\phi$, photometric
ratios  of  diffuse  reflection  are  invariant  to  varying  view¬ 
point  as  well.  We  show  that  object  points  having  the 
same  photometric  ratio  with  respect  to  two  different  il¬ 
lumination  conditions  comprise  a  well-defined  equivalence 
class of physical constraints defined by local surface orientation
relative to illumination conditions. Stereo correspondence
of equal photometric ratios is in essence the
correspondence of equivalent physical constraints across a
stereo  pair  of  images  without  ever  having  to  know  explic¬ 
itly  what  these  physical  constraints  actually  are. 
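
A toy version of this correspondence can be sketched as follows. It is not the method of [24] in detail: it assumes rectified images, so that epipolar lines coincide with image rows, it uses a brute-force nearest-ratio search, and the function names are hypothetical. Images are plain lists of rows.

```python
def ratio_image(img_a, img_b, eps=1e-9):
    """Pixelwise photometric ratio of two images of the same view
    taken under two different illumination conditions."""
    return [[a / max(b, eps) for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(img_a, img_b)]


def match_scanline(left_ratios, right_ratios):
    """For each pixel of a left ratio row, return the column of the
    right ratio row whose photometric ratio is closest."""
    return [min(range(len(right_ratios)),
                key=lambda j: abs(right_ratios[j] - r))
            for r in left_ratios]
```

Any multiplicative camera factor, such as gain, scales both images from the same camera equally and cancels in the ratio, so the left and right ratio rows can be compared directly even when the two cameras respond differently.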

The  advantage  of  using  photometric  ratios  for  stereo 
correspondence  is  that  it  is  a  robust  way  of  obtaining 
a  dense  3-D  depth  map  of  smooth  featureless  surfaces, 
something  which  is  normally  hard  to  do  from  image  fea¬ 
ture  correspondence  (e.g.,  edge  correspondence).  While 
two  different  illumination  conditions  are  required,  these 
conditions  can  be  arbitrary  (e.g.,  extended  light  source, 
point  light  source,  etc...)  and  never  need  to  be  known, 
and  this  technique  works  in  full  perspective  views.  We 
demonstrate  in  [24]  experimental  3-D  depth  determina¬ 
tion  of  a  dense  set  of  points  using  our  stereo  technique 
on  smooth  objects  of  known  ground  truth  shape  that  are 
accurate  to  well  within  ±1%  relative  depth. 

We  have  also  noticed  that  isoratio  curves  formed  by 
image  pixels  with  equal  photometric  ratio  are  invariant 
to  surface  albedo  and  can  serve  as  a  useful  photometric 
invariant  for  object  recognition  [24].  Isoratio  curves  can 
be  used  to  distinguish  important  geometric  characteris¬ 
tics  between  different  objects  fairly  independent  of  diffuse 
reflectance  properties. 

Previous  work  on  using  intensity  values  to  determine 
surface  shape  from  stereo  correspondence  of  reflectance 
includes the work of Grimson [5] and Smith [12]. Ikeuchi
[7] pioneered a technique called “dual photometric stereo”
which  utilizes  photometric  stereo  to  determine  surface  ori¬ 
entation  from  a  stereo  pair  of  orthographic  views,  and 
then  corresponds  surface  orientation  constraints  making 
sure  to  preserve  consistency  between  surface  orientation 
and  depth. 

4 3-D Shape Recovery From Reflectance
and Range Data

The problem of finding 3-D shape of a smooth object
from a single intensity image is very difficult
even when light source incident orientation and the reflectance
map are known precisely [6]. Researchers have
also  used  depth  information  from  range  finders  to  deter¬ 
mine  3-D  shape  [2].  We  [Mancini  and  Wolff]  [9],  [10]  have 
been exploring shape recovery methodologies that combine
range and reflectance data to determine local surface
orientation  more  accurately  than  is  possible  by  each  data 
source  individually.  Even  more,  we  have  extended  our 
technique  to  solve  simultaneously  for  initially  unknown 
point  light  source  position  a  finite  distance  away,  and  lo¬ 
cal  surface  orientation. 

The  first  step  is  to  get  initial  local  surface  orientation 
estimates  from  least  squares  fitting  of  local  quadric  sur¬ 
faces  to  range  data  which  has  a  specified  range  error.  For 
the  case  where  light  source  incident  orientation  is  known, 
the  next  step  is  to  make  these  initial  local  surface  orien¬ 
tation  estimates  consistent  with  reflectance  which  is  as¬ 
sumed  to  be  Lambertian  (a  good  assumption  for  smooth 
dielectrics  when  the  angle  of  incidence  of  the  light  source 
and  the  angle  of  emittance  are  both  within  the  range 
0° to 50°; see Section 2). To do this, for each pixel we
project  the  initial  local  surface  orientation  estimate  onto 
the  closest  point  on  the  conic  section  in  gradient  space 
that  is  consistent  with  image  irradiance.  Then  we  en¬ 
force  surface  integrability  using  the  method  developed  by 
Frankot  and  Chellappa  [4].  The  procedure  iterates  be¬ 
tween  projection  onto  the  conic  section  nearest  point  and 
enforcing integrability, and typically converges after about
10-20 iterations. In simulations we have found that the average
local surface orientation error of initial estimates was
more than cut in half by our shape recovery procedure
(e.g., from over 7° average error to under 2.5° average
error).
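
The integrability step can be illustrated with a small frequency-domain projection in the spirit of Frankot and Chellappa [4]. This is a sketch under stated assumptions, not the code used in our simulations: it assumes periodic boundaries, discards the mean slope, and uses a naive discrete Fourier transform that is adequate only for small grids. The function names are hypothetical; `p` and `q` are lists of rows holding the x and y surface gradients.

```python
import cmath
import math


def dft(seq, inverse=False):
    """Naive 1-D discrete Fourier transform."""
    n = len(seq)
    sign = 1.0 if inverse else -1.0
    out = []
    for k in range(n):
        total = sum(seq[t] * cmath.exp(sign * 2j * math.pi * k * t / n)
                    for t in range(n))
        out.append(total / n if inverse else total)
    return out


def dft2(grid, inverse=False):
    """2-D transform: rows first, then columns."""
    rows = [dft(row, inverse) for row in grid]
    cols = [dft(list(col), inverse) for col in zip(*rows)]
    return [list(row) for row in zip(*cols)]


def angular_frequencies(n):
    """Angular frequency (radians per pixel) of each DFT bin."""
    return [2.0 * math.pi * (k if k < n / 2 else k - n) / n for k in range(n)]


def project_integrable(p, q):
    """Return (z, p_hat, q_hat): a surface z and the integrable gradient
    pair closest in the least squares sense to p = dz/dx, q = dz/dy."""
    n_rows, n_cols = len(p), len(p[0])
    wx, wy = angular_frequencies(n_cols), angular_frequencies(n_rows)
    P, Q = dft2(p), dft2(q)
    Z = [[0j] * n_cols for _ in range(n_rows)]
    for v in range(n_rows):
        for u in range(n_cols):
            denom = wx[u] ** 2 + wy[v] ** 2
            if denom > 0.0:   # the zero-frequency term stays 0
                Z[v][u] = -1j * (wx[u] * P[v][u] + wy[v] * Q[v][u]) / denom

    def real_inverse(spectrum):
        return [[c.real for c in row] for row in dft2(spectrum, inverse=True)]

    z = real_inverse(Z)
    p_hat = real_inverse([[1j * wx[u] * Z[v][u] for u in range(n_cols)]
                          for v in range(n_rows)])
    q_hat = real_inverse([[1j * wy[v] * Z[v][u] for u in range(n_cols)]
                          for v in range(n_rows)])
    return z, p_hat, q_hat
```

In the iteration described above, a projection of this kind would alternate with the per-pixel projection onto the conic section in gradient space; gradients that are already integrable pass through unchanged.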

For  the  case  where  incident  point  light  source  orienta¬ 
tion  is  initially  unknown,  after  initial  local  surface  orienta¬ 
tion  estimates  are  generated  by  least  squares  local  quadric 
fitting  to  range  data,  initial  estimates  for  incident  source 
orientation  at  each  point  are  generated  by  a  least  squares 
fitting to a local neighborhood of reflectance data. The
position  of  the  point  light  source  a  finite  distance  away 
is  then  least  squares  triangulated  by  light  source  incident 
orientation  rays  emanating  from  each  surface  point.  Then 
a  combination  of  projecting  light  source  incident  orienta¬ 
tion  and  local  surface  orientation  onto  nearest  points  of 
conic  sections,  consistent  with  image  irradiance,  together 
with  enforcing  integrability,  is  iterated.  In  simulations  we 
were able to achieve about 5° average incident orientation
error across the surface, and reduce initial local surface
orientation errors by about 30%.
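
The least squares triangulation of the source position can be sketched as the point that minimizes the summed squared perpendicular distance to all incident orientation rays. This is a generic formulation offered as an illustration, not necessarily the exact procedure of [9], [10]; the function names are hypothetical, and each ray is given by a surface point and an estimated direction toward the source.

```python
import math


def solve_linear(a, b):
    """Solve a small dense system a x = b by Gauss-Jordan elimination
    with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("rays are parallel; position is not determined")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def triangulate_source(points, directions):
    """Least squares intersection of rays: accumulates the normal
    equations sum(I - u u^T) x = sum(I - u u^T) o over all rays with
    origin o and unit direction u, then solves for x."""
    a = [[0.0] * 3 for _ in range(3)]
    b = [0.0] * 3
    for o, d in zip(points, directions):
        norm = math.sqrt(sum(c * c for c in d))
        u = [c / norm for c in d]
        for i in range(3):
            for j in range(3):
                m_ij = (1.0 if i == j else 0.0) - u[i] * u[j]
                a[i][j] += m_ij
                b[i] += m_ij * o[j]
    return solve_linear(a, b)
```

With noisy orientation estimates the rays do not meet in a single point, and the solution is the position closest to all of them in the least squares sense.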

We  want  to  apply  this  to  actual  experimental  data  us¬ 
ing  a  range  finder  and  imager. 

Abstract

Our  work  is  concerned  with  several  areas  of  im¬ 
age  understanding:  perceptual  organization, 
motion  processing,  model-based  vision,  image 
understanding  software,  and  applications  in 
domains  such  as  terrestrial  robotics,  industrial 
and  biomedical  image  processing.  This  paper 
is  an  overview  of  the  different  papers  from  our 
lab in the IU Workshop Proceedings. We first
describe  the  design  and  initial  prototyping  of 
the user interface of the DARPA Image Understanding
Environment (IUE) and the tools
for documentation, tutorials, and publication
that will facilitate the use and adoption of the
IUE. We then present different motion processing
algorithms. These algorithms involve
generalizing  earlier  work  for  processing  trans¬ 
lational  image  sequences  to  less  restricted  mo¬ 
tions;  extensions  to  factorization  methods  to 
allow  for  linear  features  which  are  less  depen¬ 
dent  on  precise  feature-point  matching;  and 
the  incorporation  of  models  in  processing  dy¬ 
namic  images.  We  finally  present  a  set  of  new 
algorithms  for  range-free  qualitative  naviga¬ 
tion  which  enable  mobile  robots  with  limited 
recognition  capabilities  to  form  effective  spa¬ 
tial  maps  for  navigation  and  exploration. 

1 Prototyping the IUE and Tools to
Facilitate  Its  Use 

The user interface of the Image Understanding Environment
(IUE) is intended to provide flexible, simple, and
powerful  tools  for  exploring  data,  algorithms,  and  sys¬ 
tems.  The  general  principles  of  object  oriented  design 
used in developing the IUE object hierarchy and programming
constructs have also been applied to the interface:
abstraction over common operations to provide
a  small  number  of  interface  objects  which  can  be  freely 
combined  by  a  user.  The  interface  has  been  designed  to 
have a consistent interaction with IUE objects and their
semantics, especially the abstraction in the IUE object
hierarchy.  Thus,  the  display  and  browsing  operations 
express  the  class  similarities  for  objects  such  as  images, 
image-registered  features,  and  spatial  objects.  We  are 
hopeful  that  using  and  becoming  comfortable  with  the 
interface  will  not  involve  understanding  a  large  number 
of  unrelated  things. 

An  equally  important  part  of  the  user  interface  is  what 
we do not intend to develop. The IUE user interface
must  leverage  extensively  off  existing  (and  emerging)  in¬ 
terface  and  graphics  packages  and  standards.  The  in¬ 
terface  must  be  supported  by  ongoing  and  future  de¬ 
velopments  in  software  environments  and  graphical  user 
interfaces.  This  is  critical  for  the  long  term  use  of  the 
IUE because we can depend on continuous advances in
these  areas  that  we  will  want  to  take  advantage  of  in 
terms  of  capabilities  and  cost. 

To  realize  this,  the  interface  is  being  developed  in 
terms  of  three  levels.  The  Graphics  Level  is  the  un¬ 
derlying  “machine  independent”  package  for  display  and 
graphic  operations  which  tell  the  screen  what  to  do.  Ex¬ 
amples  would  be  X,  GL,  OpenGL,  and  Phigs.  The  In¬ 
terface  Support  Level  involves  packages  for  the  cre¬ 
ation  and  rapid  prototyping  of  user  interfaces  and  related 
tools  which  are  built  on  top  of  graphics  level  software. 
This  also  includes  the  tools  found  in  the  selected  soft¬ 
ware  development  environment  such  as  editors  and  de¬ 
buggers.  The  Image  Understanding  Environment 
User Interface (IUEUI) Level consists of the interface
objects specialized for image understanding. This
includes  such  things  as  object  displays,  plotting  displays, 
several  types  of  browsers,  and  structures  for  describing 
the interface context. The IUEUI consists of a small set
of  objects  which  can  be  freely  combined  for  very  powerful 
results.  The  specifications  of  these  objects  are  relatively 
independent  of  the  other  two  levels  although  much  of  the 
current prototyping and design activities are directed towards
understanding how to best realize the functionality
of the IUEUI objects with respect to these two levels,
especially for accessibility and limiting the eventual cost
of the IUE for users.

The basic functional components of the IUE interface
are: 

•  Displays:  These  deal  with  mapping  spatial  objects 
and images (or sets of spatial objects and images)
onto two-dimensional display windows. There are
several types of displays: for displaying images and
image-registered features; for plotting functional relations
between attributes and components of spatial
objects; and for displaying surfaces.

• Browsers: These deal with presenting textual and
symbolic  information  about  objects.  There  are  dif¬ 
ferent  types  of  browsers  for  such  operations  as  in¬ 
specting  the  values  in  a  spatial  object,  for  perform¬ 
ing  interactive  queries  with  respect  to  databases 
and  sets  of  objects,  and  for  inspecting  relational 
graphs  and  networks. 

•  Interface  Context  Descriptors:  These  are  for 
describing  the  state  of  the  interface  and  interface 
objects.  Examples  are  such  things  as  the  current 
color-look-up  table  for  a  given  display;  the  current 
display  window  or  browser;  and  links  between  in¬ 
terface  objects  which  describe  related  views. 

•  Command  Language  and  Command  Buffer: 
Users  can  control  their  interaction  with  objects  us¬ 
ing  an  interactive  command  language.  This  also 
provides  a  complete  description  of  the  functionality 
of  the  user  interface. 

• Simplified, programmable access to GUI objects:
This is intended to provide programmer access
to several of the objects commonly found in
Graphical User Interface Construction Kits such as
knobs, sliders, text buffers, and menus. These can then
be used in applications and to extend the interface.

We now look at these in slightly more detail and refer
the  reader  to  the  respective  paper  in  the  proceedings  for 
a  more  complete  discussion. 

1.1  Object  Displays 

Object  Displays  are  for  viewing  and  interacting  with  ob¬ 
jects  by  mapping  them  onto  a  two-dimensional  display 
window. This involves nearly all IUE objects: images,
curves,  regions,  object  models,  surfaces,  vector  fields, 
etc.  Object  displays  support  several  types  of  operations 
for  controlling  the  mapping  of  an  object  onto  a  win¬ 
dow  such  as  the  viewing  transformation;  mapping  val¬ 
ues  through  pixel-mapping  functions  and  color  look-up 
tables;  the  specification  of  overlay  planes;  transparency 
effects;  interacting  with  displayed  objects  through  selec¬ 
tion  operations  and  interactive  function  application. 

There  are  different  types  of  object  displays: 

•  The  image  display  is  for  viewing  images  and  image- 
registered  features. 

•  The  local  graphics  display  displays  objects  by 
mapping  their  values  onto  parameterized  graphic 
objects  such  as  lines  and  cubes.  Examples  are  dis¬ 
playing  vector  fields  and  edgels. 

•  The  surface  display  is  for  displaying  objects  that 
get  mapped  onto  mesh  or  rendered  surfaces. 

• The plot display is for displaying functional
relations between objects. Examples are one-dimensional,
two-dimensional, and three-dimensional graphs,
histograms, scattergrams, and views of functions and tables.


These  different  types  are  distinguished  by  specific 
methods  but  all  inherit  a  large  number  of  similar  meth¬ 
ods  from  the  general  display  class.  For  example,  overlay 
operations  are  similar  for  a  surface  display  and  for  an 
image display, although they can look quite different (in
one case it appears like drawing in solid colors in
image-registered coordinates on top of a displayed image
and in the other it would be rendering the colors onto a
displayed surface). Plot displays have many similarities
with  object  displays  in  terms  of  such  things  as  overlays 
and  interaction  methods. 
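The inheritance relationship described above can be sketched in a few lines. This is only an illustration under our own assumptions: the class and method names (Display, ImageDisplay, SurfaceDisplay, add_overlay, draw_overlay) are hypothetical and are not taken from the IUE specification. The shared overlay bookkeeping lives in the general class, and each display type supplies its own rendering.

```python
class Display:
    """General display: state and operations shared by every display type."""

    def __init__(self, name):
        self.name = name
        self.overlays = []          # (label, color) pairs, drawn in order

    def add_overlay(self, label, color):
        # Shared bookkeeping, inherited unchanged by all display types.
        self.overlays.append((label, color))

    def draw_overlay(self, label, color):
        # Each display type renders an overlay in its own way.
        raise NotImplementedError

    def render(self):
        return [self.draw_overlay(label, color) for label, color in self.overlays]


class ImageDisplay(Display):
    def draw_overlay(self, label, color):
        return f"{self.name}: draw {label} in solid {color} at image-registered coordinates"


class SurfaceDisplay(Display):
    def draw_overlay(self, label, color):
        return f"{self.name}: render {color} for {label} onto the displayed surface"
```

The same add_overlay call works on either display type, while the result of render differs, which mirrors the remark that overlay operations are similar but can look quite different.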

1.2  Browsers 

Browsers are used for interacting with text-based or
symbolic descriptions of objects. They are used for actions
such as queries over sets of objects, determining and
inspecting relationships between objects, process
monitoring, and inspecting values in an object. There are
two general types of browsers: Field Browsers and
Graph Browsers, of which only field browsers are
currently being implemented.

Field Browsers consist of a regular array of fields.
Fields can be filled with text, icons, colors, colored text,
or text in particular fonts. Fields can have actions
associated with them when they are selected or a user changes
the values in them. We distinguish between four types
of field browsers which inherit from the general Field
Browser class:

• Set/Database Browser: This is presented as an
array of fields. Each row of fields corresponds to
selected attributes of a particular object and each
column corresponds to common attributes over the
set (or database) of objects. An example would be
browsing the database which describes the current
active objects in the IUE to find the image most
recently created by some operation.

•  Object  Attribute  Browser:  Each  row  corre¬ 
sponds  to  the  value  of  an  attribute  for  an  object. 
This  would  usually  be  used  for  inspecting  a  single 
object. 

• Hierarchical Browser: Useful for text-based
inspection of graph structures and trees. When an
item is selected, the related items (along some
relational dimension) are displayed in the next column.

•  Object-Registered  Browser:  This  contains  val¬ 
ues  extracted  from  a  spatial  object,  such  as  the 
intensity  values  in  some  square  neighborhood  of 
an  image.  Depending  on  the  dimensionality  of 
the  object  (or  relationships  between  component  ob¬ 
jects),  this  can  be  presented  as  a  one-dimensional 
array, a two-dimensional array, or multiple
two-dimensional arrays and be used to describe curves,
images, image sequences, and pyramids.
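As a rough illustration of the Set/Database Browser layout (one row per object, one column per attribute), the following sketch builds the array of fields from a list of object records and answers the example query about the most recently created image. The record layout, the attribute names, and the function names are assumptions made for this example only.

```python
def set_browser_rows(objects, attributes):
    """Return a header row plus one row of field strings per object."""
    header = list(attributes)
    rows = [[str(obj.get(attr, "")) for attr in attributes] for obj in objects]
    return [header] + rows


def most_recent(objects, kind):
    """Query: the most recently created object of a given kind, or None."""
    matches = [obj for obj in objects if obj.get("kind") == kind]
    return max(matches, key=lambda obj: obj["created"], default=None)


# Hypothetical database of currently active objects.
active_objects = [
    {"name": "img-1", "kind": "image", "created": 10, "source": "load"},
    {"name": "edges-1", "kind": "curve-set", "created": 12, "source": "edge-detect"},
    {"name": "img-2", "kind": "image", "created": 15, "source": "smooth"},
]
```

Calling set_browser_rows(active_objects, ["name", "kind"]) gives the field array a browser would present, and most_recent(active_objects, "image") selects the record for img-2.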

1.3  Command  Buffer  and  Command  Language 

Users will be able to specify all interface actions through
an interactive command language and be able to access
all the functionality of the interface. Display operations
can be performed interactively through the command
buffer. The command language will have intelligent
defaults and abbreviations (such as displaying to the
current window if none is specified). In addition, the
commands will be usable in non-interactive code for
creating scripts and general display routines.

In the actual operation of the IUE, it is not necessary
that all interactions take place through this command
language: some will be invoked by menus and special
keys and refer to the current display context. An
important part of the design of the IUE interface entails
how  commands  (and  which  commands)  are  mapped  onto 
menus  and  other  interactive  devices.  This  is  especially 
important  since  the  interface  will  support  a  wide  com¬ 
munity  of  users  ranging  from  naive  users  who  are  inter¬ 
acting  with  hardened  applications  to  developers.  Naive 
users  may  want  many  interactive  devices  such  as  spe¬ 
cialized  menus  while  experienced  users  will  want  more 
powerful,  abbreviated  commands.  Advanced  users  will 
become  very  adept  at  shortcuts  that  should  be  provided. 
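A minimal sketch of how abbreviations and intelligent defaults might fit together is given below. It assumes, purely for illustration, that an abbreviation is any unambiguous prefix of a command name and that the default target is a single "current window" kept in the buffer. The names CommandBuffer, register, resolve, execute, and display are hypothetical and do not describe the actual IUE command language.

```python
class CommandBuffer:
    def __init__(self):
        self.commands = {}                  # full command name -> handler
        self.current_window = "window-1"    # part of the interface context
        self.log = []                       # (object, window) pairs displayed so far

    def register(self, name, handler):
        self.commands[name] = handler

    def resolve(self, word):
        """Expand an abbreviation to the unique command it is a prefix of."""
        if word in self.commands:
            return word
        matches = [name for name in self.commands if name.startswith(word)]
        if len(matches) != 1:
            raise ValueError(f"unknown or ambiguous command: {word!r}")
        return matches[0]

    def execute(self, line):
        word, *args = line.split()
        return self.commands[self.resolve(word)](self, *args)


def display(buffer, obj, window=None):
    # Intelligent default: use the current window when none is given.
    target = window or buffer.current_window
    buffer.current_window = target
    buffer.log.append((obj, target))
    return f"{obj} -> {target}"
```

Because display is an ordinary function, the same command can be typed interactively through execute or called directly from non-interactive scripts.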

1.4  Interface  Context  Description 

Contextual  descriptions  of  the  state  of  the  interface  and 
the status of displayed IUE objects are used for intelligent
default behaviors so users needn’t specify every
detail  of  interacting  with  an  object  and  can  build  com¬ 
plex  displays  incrementally.  Many  interface  operations 
involve  accessing  and  setting  the  attributes  of  data  struc¬ 
tures  which  describe  the  current  context  for  such  things 
as  the  current  or  active  window;  the  current  mapping 
from  objects  to  displays  (such  as  the  viewing  transforma¬ 
tion  and  color  look-up  table);  established  links  between 
windows  (for  specifying  the  relations  between  displays 
in  different  windows  such  as  window  to  window  panning 
and  zooming);  the  thickness  of  lines  in  graphic  overlays; 
and  the  layout  of  windows  and  browsers  on  a  screen. 
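One of the context items listed above, links between windows for window-to-window panning, can be sketched as follows. This is an assumption-laden illustration: the Window class, its attributes, and the choice of symmetric links are ours and are not drawn from the IUE design.

```python
class Window:
    def __init__(self, name):
        self.name = name
        self.offset = (0, 0)      # current pan position
        self.links = []           # windows that pan together with this one

    def link(self, other):
        """Record a two-way link so that the two windows pan together."""
        if other not in self.links:
            self.links.append(other)
            other.links.append(self)

    def pan(self, dx, dy, _visited=None):
        """Pan this window and propagate the pan along established links."""
        visited = set() if _visited is None else _visited
        if id(self) in visited:   # already panned during this operation
            return
        visited.add(id(self))
        self.offset = (self.offset[0] + dx, self.offset[1] + dy)
        for other in self.links:
            other.pan(dx, dy, visited)
```

The visited set keeps a pan from bouncing back and forth between linked windows, so each window in a linked group moves exactly once per operation.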

1.5  Simplified  Access  to  GUI  Objects 

GUIs  (Graphical  User  Interfaces)  generally  consist  of 
several  standard  types  of  interface  widgets  that  can  be 
used in the interface. The IUE interface should provide
routines that allow users to access these constructs and
use them in their programs and the interface. The IUE
should  provide  simplified,  interactive  access  to  the  inter¬ 
face  objects  found  in  GUI  Kits  such  as  sliders,  knobs, 
buttons,  menus,  and  text  input/output  fields.  This  in¬ 
cludes  methods  which  enable  user  code  to  access  infor¬ 
mation  from  specified  interface  objects  or  to  prompt  a 
user. 

1.6  Interface  Prototyping  Activities 

We  are  currently  prototyping  many  different  parts  of 
the  user  interface  to  complete  the  functional  specifica¬ 
tion  and  to  answer  basic  implementation  questions  about 
choices  regarding  GUIs  and  user  interface  toolkits.  This 
will  help  to  simplify  the  job  of  the  eventual  integrating 
contractor. For reasons of rapid development, current
implementation is taking place in C and C++ on Silicon
Graphics machines using the GL graphics library, Motif,
and the FORMS user interface toolkit. We have been
able to put up the general display object and the different
browsers and hope to use these as initial browsers
and displays specialized for use with the Data Exchange
Format. We are exploring extensions to GNUPlot so it
is compatible with the methods associated with the general
display class and can provide an inexpensive plotting
package. We are also evaluating OpenGL as a possible
machine independent graphics package.

1.7  Documentation  and  Tutorial  Tools 

The IUE will be supported by on-line documentation
and tutorials. The tools for implementing these will also
be available for enhanced communication and
publication by scientists and developers who use the IUE. While
there  is  significant  activity  in  developing  documentation 
and  hypermedia  toolkits,  they  remain  largely  machine 
dependent  with  no  clear  standardization.  We  are  de¬ 
veloping  a  simple  documentation  tool  called  Knowledge 
Weasel  (KW)  which  is  based  on  Lucid  Emacs  19  and 
existing  media  editing  tools. 

Knowledge  Weasel  (KW)  is  a  presentation  and  author¬ 
ing  system  designed  to  support  annotation  using  several 
different  types  of  media.  A  simple  analogy  for  KW  is 
reading  a  book  or  attending  a  lecture  and  being  able 
to  make  diverse  types  of  comments  and  annotations  on 
the  material.  In  reality,  such  unrestricted  annotations 
and  comments  made  with  respect  to  real  books  and  lec¬ 
tures  could  create  a  significant  mess  (especially  if  made 
by  several  different  people),  so  in  developing  KW  we 
have  extended  this  simple  metaphor  in  several  ways.  The 
first  is  to  provide  a  general  format  for  annotations  that 
can  include  several  different  types  of  media.  An  anno¬ 
tation  is  a  common  record  structure  wrapped  around 
instances  of  different  types  of  media  such  as  text  files, 
sound, drawings, PostScript files, GNU-plots, code
running in the GDB debugger, and others. Annotations are
implemented much like property lists in Lisp, with
attributes and values, and are displayed as buttons with
an associated region of support. When an annotation is
selected it performs an operation specific to the type of
annotation selected. Annotations are created using
existing media editing tools for operations such as recording
a sound, using drawing packages, calling other branched
processes, or grabbing a portion of the screen. The second
extension  has  been  to  develop  different  types  of  naviga¬ 
tion,  organization  and  presentation  tools  to  keep  users 
from  being  overwhelmed  with  a  great  deal  of  possibly 
irrelevant  information.  Users  can  prune  the  set  of  anno¬ 
tations  that  they  want  to  deal  with  and  also  how  they  are 
displayed.  Annotations  are  structured  to  make  possible 
intelligent  processing,  perhaps  eventually  including  rule- 
based  processing  for  automatic  presentation  and  “ferret¬ 
ing”  of  information  (hence  the  name). 
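The annotation record described above can be pictured with a small sketch. KW itself is written in Emacs Lisp. The Python below is only an analogy under our own assumptions, with a dictionary standing in for a property list, hypothetical file names, and handlers that merely describe the operation they would perform.

```python
def make_annotation(kind, author, region, **properties):
    """An annotation as a property list: attribute names mapped to values."""
    annotation = {"kind": kind, "author": author, "region": region}
    annotation.update(properties)
    return annotation


# One handler per media type; each returns a description of what it would do.
HANDLERS = {
    "text": lambda a: f"open text file {a['file']}",
    "sound": lambda a: f"play sound {a['file']}",
    "postscript": lambda a: f"preview PostScript file {a['file']}",
}


def select(annotation):
    """Selecting an annotation runs the operation specific to its type."""
    handler = HANDLERS.get(annotation["kind"])
    if handler is None:
        return f"no handler for {annotation['kind']} annotations"
    return handler(annotation)


def prune(annotations, **wanted):
    """Keep only the annotations whose attributes match every requested value."""
    return [a for a in annotations
            if all(a.get(key) == value for key, value in wanted.items())]
```

Because every annotation shares the same record structure, pruning by author, type, or any other attribute needs no knowledge of the wrapped media.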

We  are  implementing  KW  on  Lucid  Emacs  19  which 
is  in  turn  based  on  the  X  window  system.  Lucid’s  im¬ 
plementation  of  Emacs  Lisp  provides  primitives  for  han¬ 
dling  display  attributes  such  as  windows,  fonts,  and  col¬ 
ors.  Lucid  Emacs  version  19  has  a  built-in  Lisp  inter¬ 
preter  for  Emacs  Lisp  and  this  Lisp  variant  provides  a 
wide variety of primitives that are useful for
manipulating text, processes, and/or files. It is available via
anonymous FTP on the Internet, and is also the basis
of a commercial product. Knowledge Weasel is chiefly
written in Emacs Lisp but some parts, such as the part
which interacts with the operating system’s lock
daemon (lockd), are in C and communicate via pipes with
the Emacs Lisp portion of the implementation.

We  have  begun  using  an  initial  version  of  KW  to  de¬ 
velop  an  on-line  version  of  the  Low  Level  Vision  course 
taught  at  Georgia  Tech.  We  also  plan  to  use  it  as  part  of 
a  computer  vision  algorithms  course  where  students  will 
select  a  paper  from  the  literature,  implement  the  corre¬ 
sponding  algorithms  and  use  KW  to  develop  a  tutorial 
presentation  of  the  paper. 

1.7.1 CD-ROM version of DARPA IU
Workshop Proceedings

A significant instance of technology transfer is the
DARPA IU Proceedings and workshop. For the next
meeting, we hope to enhance this by having the
workshop proceedings available on CD-ROM, and integrated
with the Data Exchange Format, a documentation and
browsing tool such as Knowledge Weasel, and,
possibly, the IUE itself. This will enable an extraordinary
type  of  paper  which  includes  data,  code,  additional  refer¬ 
ences,  animations,  and  extensive  annotations  and  cross- 
references. 

2  Dynamic  Image  Processing 

2.1  Translational  Decomposition  of  Flow  Fields 

This  paper  presents  a  set  of  algorithms  for  processing 
optic  flow  fields  by  approximating  them  as  local  trans¬ 
lations  of  the  corresponding  portions  of  the  environ¬ 
ment.  This  is  theoretically  interesting  since  it  dramat¬ 
ically  simplifies  the  equations  for  inferring  motion  pa¬ 
rameters  from  optic  flow  and  also  supplies  a  low  level 
representation  of  image  motion  that  might  be  useful  for 
inferring  motion  properties  from  non-rigid  motions.  Its 
practical  use  involves  its  robust  nature  for  motion  con¬ 
strained  to  an  unknown  plane  which  characterizes  much 
of  terrestrial  robotics.  It  can  also  use  a  small  number 
of  points  for  inferring  motion  parameters  from  an  optic 
flow  field. 

In  previous  work  [Lawton,  1982],  we  developed  a  tech¬ 
nique  to  process  relative  translational  motion  of  a  sen¬ 
sor  with  respect  to  a  stationary  environment  or  inde¬ 
pendently  translating  objects.  This  and  related  algo¬ 
rithms  [Burger  and  Bhanu,  1989;  Jain,  1983]  are  based 
on  the  strong  geometric  constraints  on  image  motion  in 
the  case  of  translation  -  radial  motion  of  image  features 
from  a  focus  of  expansion  determined  by  the  intersec¬ 
tion  of  the  axis  of  translation  with  the  imaging  surface. 
The  technique  in  [Lawton,  1982]  was  based  on  a  mea¬ 
sure  which  described  the  quality  of  feature  displacements 
along  the  radial  flow  paths  associated  with  a  potential 
axis  of  translation.  The  measure  was  then  optimized  by 
searching  over  the  surface  of  a  unit  sphere  where  each 
point  corresponded  directly  to  a  possible  direction  of 
translation.  The  optimization  combined  the  determina¬ 
tion  of  the  direction  of  translation  and  the  corresponding 
image  displacements  into  a  single,  mutually  constraining 
computation.  It  was  possible  to  determine  the  direction 
of  translation  to  within  a  few  degrees  in  small  image 
areas  with  a  few  features. 
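The idea of searching the unit sphere for the direction of translation can be sketched as follows. This is not the measure of [Lawton, 1982], which scored feature displacements along radial flow paths. As a stand-in we assume that flow vectors are already available and use a simpler algebraic residual: under pure translation T, the ray p = (x, y, 1) through an image point and its flow d = (u, v, 0) satisfy T . (p x d) = 0. Unit focal length, the 2-degree sampling step, and all function names are assumptions of this example.

```python
import math


def flow_normals(points, flows):
    """Unit normals n = p x d for rays p = (x, y, 1) and flow d = (u, v, 0)."""
    normals = []
    for (x, y), (u, v) in zip(points, flows):
        n = (-v, u, x * v - y * u)
        length = math.sqrt(sum(c * c for c in n))
        if length > 1e-12:                 # ignore features that did not move
            normals.append(tuple(c / length for c in n))
    return normals


def fit_value(direction, normals):
    """Sum of squared violations of T . n = 0 (0 means a perfect fit)."""
    return sum(sum(t * c for t, c in zip(direction, n)) ** 2 for n in normals)


def search_translation(points, flows, step_deg=2.0):
    """Exhaustively search one hemisphere of the unit sphere."""
    normals = flow_normals(points, flows)
    best, best_fit = None, float("inf")
    for i in range(int(90 / step_deg) + 1):
        theta = math.radians(i * step_deg)      # angle from the optical axis
        for j in range(int(360 / step_deg)):
            phi = math.radians(j * step_deg)
            t = (math.sin(theta) * math.cos(phi),
                 math.sin(theta) * math.sin(phi),
                 math.cos(theta))
            fit = fit_value(t, normals)
            if fit < best_fit:
                best, best_fit = t, fit
    return best, best_fit


def synthetic_flow(points, depths, t):
    """Image flow of static points at the given depths for camera translation t."""
    return [((x * t[2] - t[0]) / z, (y * t[2] - t[1]) / z)
            for (x, y), z in zip(points, depths)]
```

Only one hemisphere is searched because this residual cannot distinguish T from -T. With noise-free synthetic flow the recovered direction is limited by the sampling step, and the returned fit value plays the role of a quality score for the translational interpretation.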


We  can  extend  the  translational  processing  algorithm 
to  work  with  more  general  cases  of  motion  by  applying 
the  translational  procedure  to  local  portions  of  a  flow 
field.  This  allows  us  to  associate  a  direction  of  rela¬ 
tive  environmental  motion  with  the  corresponding  lo¬ 
cal  portion  of  a  flow  field.  We  call  this  description  of 
image  motion  the  local  translational  decomposition 
(LTD).  This  is  a  low  level  representation  of  environmen¬ 
tal  motion  which  considerably  simplifies  the  recovery  of 
the  sensor  motion  parameters. 

Computing  the  LTD  begins  by  decomposing  a  flow 
field  into  small  overlapping  neighborhoods  and  then  ap¬ 
proximating  the  motion  for  each  neighborhood  as  being 
produced  by  translational  motion  of  the  corresponding 
portion  of  the  environment.  This  is  accomplished  by  ap¬ 
plying  a  procedure  which  extracts  the  relative  direction 
of  translational  motion  within  small  image  areas  over  a 
flow  field.  This  approximates  more  general  motion  as  an 
array  of  local  environmental  translations,  and  interprets 
local  image  motions  as  if  they  resulted  from  translational 
motion  of  the  corresponding  portions  of  the  environment. 
This  associates  with  local  portions  of  a  flow  field  a  unit 
vector  corresponding  to  the  direction  of  motion  relative 
to  the  sensor  of  the  corresponding  portion  of  the  en¬ 
vironment.  Each  unit  vector  has  an  associated  fit-value 
reflecting  the  validity  of  the  translational  approximation. 
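The first step, splitting a flow field into small overlapping neighborhoods and running a local estimator in each, might be organized as in the sketch below. The window layout, the minimum of two flow vectors per window, and the function names are assumptions of this example, and the estimate argument is a placeholder for whatever local translational procedure is used.

```python
def overlapping_neighborhoods(width, height, size, stride):
    """Return (x0, y0, x1, y1) windows of side `size` placed every `stride` pixels.

    Windows overlap whenever stride < size. Pixels beyond the last full
    window on the right or bottom edge are not covered.
    """
    if not 0 < stride <= size:
        raise ValueError("stride must be positive and no larger than size")
    windows = []
    for y0 in range(0, max(height - size, 0) + 1, stride):
        for x0 in range(0, max(width - size, 0) + 1, stride):
            windows.append((x0, y0, min(x0 + size, width), min(y0 + size, height)))
    return windows


def local_decomposition(flow_vectors, windows, estimate):
    """Apply `estimate` to the flow vectors (x, y, u, v) inside each window."""
    result = []
    for (x0, y0, x1, y1) in windows:
        local = [f for f in flow_vectors if x0 <= f[0] < x1 and y0 <= f[1] < y1]
        if len(local) >= 2:       # too few vectors give no usable constraint
            result.append(((x0, y0, x1, y1), estimate(local)))
    return result
```

Each entry of the result pairs a neighborhood with whatever the estimator returns for it, for example a unit direction vector together with its fit value.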

Once  the  directions  of  motion  have  been  established, 
we  can  then  use  these  as  constraints  to  determine  the  ac¬ 
tual  parameters  of  motion  and  to  recover  the  structure 
and  layout  of  environmental  surfaces.  This  is  broken 
into  four  different  cases:  motion  constrained  to  a  known 
plane  (the  normal  to  the  plane  is  known);  motion  con¬ 
strained  to  an  unknown  plane  (the  normal  is  not  known); 
motion  constrained  to  surfaces  which  are  locally  planar; 
and  arbitrary  motion  with  no  assumptions. 

2.2 Interactive Model Based Vehicle Tracking

While  most  work  in  motion  processing  has  involved  very 
minimal assumptions about objects such as rigidity, a
very  important  area  for  future  work  is  motion  process¬ 
ing  which  incorporates  object  models.  We  have  begun  to 
investigate  this  in  the  restricted  domain  of  tracking  ve¬ 
hicles  from  a  stationary  camera  in  outdoor  road  scenes. 
The  key  idea  is  that  motion  is  a  critical  source  of  infor¬ 
mation  for  instantiating  object  models  and  that  motion 
processing  is  in  turn  simplified  by  the  constraints  sup¬ 
plied  by  object  models. 

Processing  begins  with  a  human  forming  a  rough  inter¬ 
pretation  of  a  scene  by  interactively  manipulating  mod¬ 
els  of  objects  such  as  terrain  surface  patches,  roads,  grav¬ 
ity,  and  vehicles.  This  initial,  human-directed  interpre¬ 
tation  consists  of  incompletely  specified  two  dimensional 
drawings  of  expected  image  features  and  associated  three 
dimensional  object  models  which  are  also  initially  in¬ 
completely  specified.  Once  an  interpretation  is  in  place, 
tracking  algorithms  then  autonomously  refine  and  ex¬ 
tend  the  interpretation.  For  example,  a  human  will  indi¬ 
cate  that  a  particular  area  is  a  road  as  a  two-dimensional 
drawing.  The  system  will  then  track  movement  along  the 
road  and  fit  a  constraint-based  description  of  a  vehicle 
to this movement. As vehicles are tracked, the
three-dimensional shape of the road can be recovered. The
system can determine that a vehicle has just gone off the
road (or that it is behaving inconsistently with respect
to the model of a vehicle) and report back to a human
about unusual occurrences or behavior that cannot be
accounted for.

Object  models  are  related  by  constraints  specifying 
necessary  geometrical  properties  and  relationships  be¬ 
tween  objects.  The  use  of  constraints  allows  for  flexible 
object  instantiation.  A  user  can  indicate  a  vehicle  and 
this  directs  perceptual  processing  routines  to  determine 
the  corresponding  local  surface  orientation  and  roads,  or 
he  can  instantiate  a  road  segment  to  direct  the  extrac¬ 
tion  and  tracking  of  vehicles. 

The  work  with  the  local  translational  approximation 
described above has been found to be useful for tracking
vehicles  and  determining  three  dimensional  information. 
Moving  vehicles  can  often  be  treated  as  rigid  objects 
which  are  translating  over  short  periods  of  time.  For 
example,  as  a  vehicle  goes  around  a  curve,  because  of 
turning  radii  constraints,  the  axis  of  rotation  is  often 
far  away  from  the  vehicle  itself  and  the  vehicle  motion 
can  be  treated  as  a  sequence  of  small  translations  cor¬ 
responding  to  tangents  of  the  curve  of  motion.  The  lo¬ 
cal  translation-based  tracker  determines  the  direction  of 
motion  of  a  set  of  extracted  image  points  over  time,  and 
fits  their  motion  to  an  estimate  of  the  current  direction 
of  motion  of  the  corresponding  vehicle  in  three  dimen¬ 
sions.  The  effect  of  this  tracker  can  be  visualized  as  a 
unit  sphere  with  an  axis  corresponding  to  the  current  di¬ 
rection  of  motion.  As  the  vehicle  and  the  corresponding 
set  of  points  move,  the  position  of  the  axis  changes  with 
respect  to  the  sphere.  We  expect  that  this  processing 
will  work  well  with  temporal  filters  since  there  are  con¬ 
straints  on  how  quickly  a  vehicle  can  change  its  direction 
of  motion.  Vehicle  rotation  is  indicated  by  areas  of  the 
image  which  show  differences  over  time,  but  for  which  no 
clear  axis  of  translation  can  be  determined.  Conversely, 
if  there  is  an  instantiated  three-dimensional  road  model 
and  a  rough  estimate  of  the  position  of  the  vehicle  along 
the  road  has  been  established,  the  tangent  information 
associated  with  the  road  model  can  be  used  to  initial¬ 
ize  the  search  for  the  axis  of  translation.  If  there  is  an 
instantiated  vehicle  model,  it  restricts  the  features  that 
the  local  translational  tracker  uses. 

This work will be useful for applications such as
telerobotic monitoring systems where low bandwidth
communication is critical. The human would produce a rough
scene interpretation from sensory information from a
telerobot. The resulting interpretation is a model of
the world that the telerobot would refine, use to control
its behavior, or report back to a human. In this way,
the human directs the telerobots by initializing and
constraining their processing, and communication between
the robot and the human takes place in the context of
a shared model of the world which makes possible
infrequent, semantically meaningful, and very low bandwidth
communication.


2.3  Shape  and  Motion  from  Linear  Features 

The  extraction  of  environmental  structure  and  motion 
from  a  sequence  of  two-dimensional  images  is  a  com¬ 
mon  problem  in  computer  vision.  Typically  solutions  to 
this  problem  are  expressed  in  camera-centered  coordi¬ 
nate  systems  where  environmental  geometry  is  specified 
by  the  depth  along  an  image  feature’s  ray  of  projection. 
Unfortunately,  parameters  computed  from  this  camera- 
centered  representation  are  dependent  upon  the  depth  to 
environmental  features.  This  leads  to  erroneous  results 
for  objects  located  far  from  the  camera. 

The  recently  introduced  factorization  method  [Tomasi 
and  Kanade,  1990;  Tomasi  and  Kanade,  1992;  Boult  and 
Brown,  1992]  has  attempted  to  overcome  the  disadvan¬ 
tages  associated  with  a  camera-centered  representation. 
This  method  uses  a  world-centered  coordinate  system, 
along  with  an  orthogonal  projection  assumption,  in  order 
to  compute  shape  and  motion  without  the  intermediate 
calculation  of  depth.  A  matrix  of  image  measurements 
is  constructed  by  making  point  correspondences  between 
image  frames.  The  matrix  is  then  factored  into  a  shape 
matrix  and  a  motion  matrix  using  Singular  Value  De¬ 
composition. 
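In outline, and in notation that is ours rather than this paper's, the factorization step can be written as follows. With F frames and P tracked points, let the registered measurement matrix (image coordinates with each row's mean subtracted) be denoted by W-tilde. Under the projection assumption above and without noise it has rank at most three, so a rank-3 truncated Singular Value Decomposition yields motion and shape estimates:

```latex
\tilde{W} = R\,S, \qquad
\tilde{W} \in \mathbb{R}^{2F \times P},\quad
R \in \mathbb{R}^{2F \times 3},\quad
S \in \mathbb{R}^{3 \times P}

\tilde{W} \approx U_3 \Sigma_3 V_3^{\top}, \qquad
\hat{R} = U_3 \Sigma_3^{1/2}, \qquad
\hat{S} = \Sigma_3^{1/2} V_3^{\top}

R = \hat{R}\,Q, \qquad S = Q^{-1}\hat{S}
```

The estimates are determined only up to an invertible 3 x 3 matrix Q, which is fixed by requiring that the two rows of R belonging to each frame are unit length and mutually orthogonal.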

One  problem  with  the  factorization  method  is  that  it 
relies  upon  accurate  point  correspondences  between  im¬ 
age  frames.  This  paper  introduces  a  method  of  extract¬ 
ing  shape  and  motion  from  directionally  selective  lin¬ 
ear  feature  correspondences.  This  line-based  algorithm 
is  capable  of  reconstructing  shape  and  motion  without 
computing  depth  as  an  intermediate  step.  In  addition  to 
the  orthogonality  assumption,  we  assume  that  the  three- 
dimensional  direction  of  gravity  is  known  relative  to  each 
image  in  a  motion  sequence. 

The  algorithm  begins  by  searching  for  the  orientation 
of one of the lines in the environment. This is a
one-dimensional search over 180°, constrained by the
projection of the line on one of the image planes. Each
candidate line orientation, along with the position of gravity,
forms  a  set  of  quadratic  equations  which  constrain  all 
the  other  lines,  as  well  as  the  rotation  between  image 
frames.  An  error  measure  is  computed  from  the  derived 
line  orientations  and  used  to  evaluate  each  shape  and 
motion  configuration.  Once  the  line  orientations  and 
parameters  of  rotation  have  been  derived,  the  relative 
positions  of  the  lines  can  also  be  computed  from  simple 
linear  equations. 

3 Range-Free Qualitative Navigation

Qualitative  Navigation  [Kuipers  and  Byun,  1987;  Levitt 
and  Lawton,  1990]  concerns  spatial  learning  and  path 
planning  in  the  absence  of  a  single  global  coordinate  sys¬ 
tem  for  describing  locations  and  the  positions  of  land¬ 
marks.  It  is  based  on  a  multi-level  representation  of 
space,  which,  at  its  most  abstract  level,  is  based  on  topo¬ 
logical  properties  which  allow  a  robot  to  describe  a  loca¬ 
tion  using  the  directions  of  visually  salient  patterns  (with 
no  associated  range  measurements)  and  then  navigating 
using  the  occlusions  that  occur  among  them  as  a  basic 
cue  to  control  movement  through  the  environment.  An 
advantage is that the robot can use landmarks for which
exact positions cannot be determined. Thus, if a robot
sees  a  building  in  the  distance,  it  may  not  know  or  be 
able  to  recognize  the  structure  as  a  building  or  determine 
its  exact  position  in  space  but  it  can  still  incorporate  this 
to  form  an  effective  spatial  memory.  This  is  actually 
quite  intuitive:  it  is  doubtful  that  animals  navigate  by 
detecting  landmarks,  determining  ranges  to  them,  and 
then  storing  everything  in  a  single  frame  of  reference. 
It  also  removes  the  effects  of  incremental  errors  due  to 
drift. 
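One simple range-free cue of this kind can be sketched directly. It is an illustration of ours, not necessarily one of the algorithms in the paper: from the bearings of two landmarks alone, a robot can tell which side of the line joining them it is on, and a change of sign signals that the line has been crossed. The function names are hypothetical, and the bearing helper exists only to simulate sensing in a test.

```python
import math


def side_of_landmark_pair(bearing_a, bearing_b):
    """Return +1, -1 or 0 for the side of the line through landmarks A and B.

    Bearings are in radians, measured counterclockwise from any fixed
    reference direction (a compass, or the robot's own heading).
    +1: B appears counterclockwise of A by less than half a turn.
    -1: B appears clockwise of A by less than half a turn.
     0: the landmarks are aligned or opposite (the robot is on the line).
    """
    s = math.sin(bearing_b - bearing_a)
    if abs(s) < 1e-9:
        return 0
    return 1 if s > 0 else -1


def bearing(robot, landmark):
    """Bearing of a landmark from a robot position (used only to simulate sensing)."""
    return math.atan2(landmark[1] - robot[1], landmark[0] - robot[0])
```

No distance to either landmark enters the computation, and adding the same offset to both bearings leaves the result unchanged, so the cue does not depend on a global compass.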

Our work [Lawton and Levitt, 1989; Levitt and
Lawton, 1990] in qualitative navigation developed while
trying to produce basic navigation and recognition
capabilities in an autonomous land vehicle. Initially we worked
with  a  terrain  representation  based  upon  an  a  priori  ter¬ 
rain  grid,  which  describes  terrain  in  terms  of  a  regular 
square  grid  of  features  referenced  with  respect  to  a  sin¬ 
gle  global  coordinate  system.  There  are  several  prob¬ 
lems  associated  with  this  involving  difficulties  with  up¬ 
dating  a  terrain  grid;  difficulties  in  establishing  exact 
three-dimensional  positions  of  landmarks;  and  dealing 
with  the  limited  recognition  capabilities  of  robots. 

In  this  paper  we  describe  qualitative  navigation  al¬ 
gorithms  which  work  completely  at  the  topological  level, 
dealing  with  landmarks  for  which  there  are  no  range  esti¬ 
mates.  In  addition,  we  introduce  several  distinctions  for 
qualitative  navigation  algorithms.  One  type  of  distinc¬ 
tion  concerns  landmarks.  We  consider  two  basic  types: 
distinct  landmarks  which  can  always  be  recognized  as 
the  same  from  wherever  they  are  seen  and  non-distinct 
landmarks  which  may  not  be  recognized  as  being  the 
same  when  seen  from  different  points  of  view.  We  as¬ 
sume  that  once  landmarks  are  seen,  they  can  be  tracked 
over  time  until  they  disappear.  The  other  distinction 
involves  whether  or  not  the  navigation  algorithms  use  a 
compass  to  yield  a  fixed  direction.  We  also  distinguish 
two  different  types  of  compass.  The  direction  associated 
with  a  local  compass  can  change  from  place  to  place, 
but  at  a  given  place,  it  will  always  point  in  the  same 
direction. An example is a compass which is affected by
fixed  magnetic  influences  at  different  locations.  The  lo¬ 
cal  compass  can  also  be  a  very  strong  landmark  which  is 
visible  from  a  wide  set  of  views.  A  global  compass  will 
always  point  in  the  same  direction  regardless  of  where 
the  robot  is  located.  We  can  express  these  distinctions 
as  a  table  corresponding  to  the  different  types  of  topo¬ 
logical  navigation  algorithms  we  have  developed: 


Topological Qualitative Navigation Algorithms

                          Compass      No Compass
distinct landmarks        Very Good    Good
non-distinct landmarks    Good         Difficult!

For example, qualitative navigation without a compass
and in a world of identical, non-distinct landmarks
is very difficult and depends critically on matching
viewframes based exclusively upon the angular orientations
of landmarks. More practical algorithms are those
which are based upon the use of a local compass and a
limited number of distinct landmarks. This corresponds
to a freely navigating robot which can build maps and
navigate using landmarks which are based on simple
visual features, such as colored regions and edges aligned
with gravity.

Recent progress in image understanding
research  at  the  University  of  Washington  2 
is  described  in  this  paper.  The  main  fo¬ 
cus  of  the  research  has  to  do  with  perfor¬ 
mance  characterization  of  computer  vision 
algorithms.  We  provide  an  overview  of  our 
approach  to  performance  characterization 
and  discuss  ongoing  theoretical  and  exper¬ 
imental  work. 

1  Introduction 

Our  current  focus  in  Image  Understanding 
research  at  the  University  of  Washington  is 
on  performance  characterization  of  com¬ 
puter  vision  algorithms.  Our  present  re¬ 
search  objective  is  to  develop  the  method¬ 
ology  for  evaluating  the  performance  of 
image  understanding  algorithms  and  sys¬ 
tems.  In  the  first  section  of  this  paper  we 
summarize  the  general  approach  we  use  for 
performance characterization. Subsequent
sections discuss accomplished and ongoing
work in performance characterization.


2  Performance  Characterization

Image Understanding systems employ
different algorithms applied in sequence.
Determination of the performance of the
complete IU algorithm is possible if the
performance of each of the sub-algorithm
constituents is given. An algorithm
employed at any stage in the image analysis
sequence employs a representation for the
data with which it works. In our approach,
we address questions such as: What kinds
of conditions exceed the limits of the
representation? When is reality not covered
by the representation? What conditions
make the numerical computations that the
algorithm performs unstable? Finally,
since the algorithm works with noisy
data, data which has been perturbed from
its ideal form, the results of the algorithm
will be perturbed from their ideal form
too. To what degree will a perturbation
of the input data affect the accuracy of
the output, in a qualitative sense and in a
quantitative sense?

The methodology involves both black-box
and white-box perspectives. In the
black-box perspective, the problem is one
of determining what the requirements of
a computer vision task are, and performing
empirical evaluations to verify whether
these requirements are met. In the
white-box perspective, the algorithm is
examined from the inside out. Under a given
set of well-defined assumptions, does the
algorithm provide guaranteed answers? This
is determined by performing a theoretical
evaluation of the algorithm. The consistency
of the assumptions used in the theoretical
analysis with the reality that the algorithm
is supposed to handle is established by
an experimental reality-assumption
validation test.

What  does  performance  characteriza¬ 
tion  mean  for  an  algorithm  which  might 
be  used  in  an  IU  system?  Each  algorithm
is  designed  to  accomplish  a  specific  task. 
If  the  input  data  is  perfect  and  has  no 
noise  and  no  random  variation,  the  out¬ 
put  produced  by  the  algorithm  should  also 
be  perfect.  Otherwise,  there  is  something 
wrong  with  the  algorithm.  So,  measuring 
how  well  an  algorithm  does  on  perfect  in¬ 
put  data  is  not  interesting.  Performance 
characterization  has  to  do  with  establish¬ 
ing  the  correspondence  of  the  random  vari¬ 
ations  and  imperfections  which  the  algo¬ 
rithm  produces  on  the  output  data  caused 
by  the  random  variations  and  the  imper¬ 
fections  on  the  input  data.  This  means 
that to do performance characterization,
we must first specify a model for the ideal
world in which only perfect data exist.
Then we must give a random perturbation
model which specifies how the imperfect
perturbed data arises from the perfect
data. Finally, for the last algorithm in a
vision sequence we need a criterion function
which quantitatively measures the
difference between the ideal output arising
from the perfect ideal input and the
calculated output arising from the
corresponding randomly perturbed input.
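
As an illustration only, the three ingredients named above (an ideal-world
model, a random perturbation model, and a criterion function) can be sketched
for a toy problem. The Python sketch below assumes a least-squares line fit as
the algorithm under test, i.i.d. Gaussian noise on the ordinate as the
perturbation model, and absolute slope error as the criterion. None of these
choices, nor the function names, come from the paper; they are hypothetical
stand-ins.

```python
import random


def ideal_points(slope, intercept, n):
    """Ideal-world model: n noise-free samples of y = slope * x + intercept."""
    return [(float(x), slope * x + intercept) for x in range(n)]


def perturb(points, sigma, rng):
    """Random perturbation model (assumed): i.i.d. Gaussian noise on y."""
    return [(x, y + rng.gauss(0.0, sigma)) for x, y in points]


def fit_line(points):
    """Algorithm under test (assumed): ordinary least-squares line fit."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def criterion(ideal_output, computed_output):
    """Criterion function (assumed): absolute error in the fitted slope."""
    return abs(ideal_output[0] - computed_output[0])


def characterize(sigma, trials=500, seed=0):
    """Mean criterion value over repeated random perturbations of ideal data."""
    rng = random.Random(seed)
    ideal = ideal_points(slope=0.5, intercept=2.0, n=20)
    ideal_output = fit_line(ideal)
    errors = [criterion(ideal_output, fit_line(perturb(ideal, sigma, rng)))
              for _ in range(trials)]
    return sum(errors) / trials


if __name__ == "__main__":
    for sigma in (0.0, 0.5, 1.0, 2.0):
        print(f"sigma={sigma:.1f}  mean slope error={characterize(sigma):.4f}")
```

Sweeping `sigma` relates a parameter of the input perturbation model to a
measured output error, which is the kind of correspondence the text describes.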

The  difficulty  in  performance  evaluation
is  in  deciding  what  the  appropriate  ran¬ 
dom  perturbation  model  must  be  for  each 
input  or  output  a  vision  algorithm  com¬ 
ponent  may  have.  Sometimes  the  algo¬ 
rithm  component  may  change  data  struc¬ 
tures  from  input  to  output.  This  means 
that  the  random  perturbation  model  must 
be  different  from  input  to  output.  And  of 
course,  the  choice  of  the  random  perturba¬ 
tion  model  for  a  vision  algorithm  compo¬ 
nent  output  must  be  suitable  for  the  input 
to  the  subsequent  vision  algorithm  compo¬ 
nent.  Very  quickly  one  finds  that  the  clas¬ 
sical  Gaussian  models  are  not  appropriate. 


3  Current  and  Ongoing 
Work 

Recent work [1] was focussed on the
white-box perspective. We were interested
in setting up random perturbation models
for typical computer vision algorithms
and relating the model parameters to
performance measures of algorithms. In the
past year, we theoretically analyzed an
example vision algorithm
sequence  that  involves  edge  finding,  edge 
linking,  and  line/arc  fitting.  By  starting 
with  an  appropriate  noise  model  for  the 
input  data  we  derived  random  perturba¬ 
tion  models  for  the  output  data  at  each 
stage  of  our  example  sequence.  These  ran¬ 
dom  perturbation  models  are  useful  for 
performing  model  based  theoretical  com¬ 
parisons  of  the  performance  of  vision  al¬ 
gorithms.  Parameters  of  these  random 
perturbation  models  are  related  to  mea¬ 
sures  of  error  such  as  the  probability  of 
misdetection  of  feature  units,  probability 
of false alarm, and the probability of
incorrect grouping. In Ramesh and Haralick
[1], we described a theoretical model
by  which  pixel  noise  can  be  successively 
propagated  through  an  edge  labeling  al¬ 
gorithm,  an  edge  linking  algorithm  and  a 
boundary  gap  filling  algorithm.  Assuming 
an  edge  idealization  of  a  linear  ramp  edge 
and  i.i.d.  Gaussian  random  perturbations
on  pixel  gray  values  it  was  shown  how  one 
could  model  the  breakage  of  a  true  line  seg¬ 
ment  as  a  renewal  process  with  alternat¬ 
ing  segment  and  gap  intervals.  It  was  also 
shown  that  if  one  ignores  the  dependencies 
between  adjacent  gradient  estimates  then 
the  segment  and  gap  interval  lengths  are 
exponentially  distributed  with  parameters 
that are related to the true edge gradient,
the neighborhood operator size and the
gradient  threshold  employed.  It  was  also 
shown  how  the  output  after  a  gap  filling 
operation  could  also  be  modeled  as  an  al¬ 
ternating  renewal  process  and  we  derived 
the  expressions  for  the  probability  distri¬ 
butions  of  the  lengths  of  segment  and  gap 


intervals  obtained  after  the  filling  opera¬ 
tion. 
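
The alternating renewal model described above can be explored by simulation.
The following Python sketch is hypothetical: it draws segment and gap lengths
from exponential distributions with hand-picked means (the paper ties these
parameters to the edge gradient, operator size, and threshold, which this
sketch does not attempt), and it models gap filling as relabeling interior
gaps no longer than an assumed `max_gap`.

```python
import random


def simulate_breakage(line_length, mean_segment, mean_gap, rng):
    """Alternate exponentially distributed segment and gap intervals along
    a line of the given length. Returns a list of (kind, length) pairs."""
    intervals, position, kind = [], 0.0, "segment"
    while position < line_length:
        mean = mean_segment if kind == "segment" else mean_gap
        length = min(rng.expovariate(1.0 / mean), line_length - position)
        intervals.append((kind, length))
        position += length
        kind = "gap" if kind == "segment" else "segment"
    return intervals


def fill_gaps(intervals, max_gap):
    """Relabel interior gaps no longer than max_gap as segment, then merge
    adjacent intervals of the same kind."""
    merged = []
    last_index = len(intervals) - 1
    for index, (kind, length) in enumerate(intervals):
        interior = 0 < index < last_index
        if kind == "gap" and interior and length <= max_gap:
            kind = "segment"
        if merged and merged[-1][0] == kind:
            merged[-1] = (kind, merged[-1][1] + length)
        else:
            merged.append((kind, length))
    return merged


def mean_length(intervals, kind):
    """Average length of all intervals of the given kind (0.0 if none)."""
    lengths = [length for k, length in intervals if k == kind]
    return sum(lengths) / len(lengths) if lengths else 0.0


if __name__ == "__main__":
    rng = random.Random(1)
    raw = simulate_breakage(1000.0, mean_segment=20.0, mean_gap=3.0, rng=rng)
    filled = fill_gaps(raw, max_gap=2.0)
    print("mean segment before:", mean_length(raw, "segment"))
    print("mean segment after: ", mean_length(filled, "segment"))
```

The output of `fill_gaps` is again an alternating sequence of segments and
gaps, so its empirical interval-length histograms could be compared against
derived distributions.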

In  reality,  there  is  an  overlap  between 
the  edge  detector  neighborhoods  centered 
around  pixels.  Hence  there  is  some  de¬ 
pendence  between  gradient  estimates  ob¬ 
tained  for  neighboring  windows.  In  addi¬ 
tion,  if  one  assumes  that  the  noise  at  each 
pixel  is  locally  dependent  then  the  corre¬ 
lation  in  the  noise  would  introduce  corre¬ 
lation  in  the  gradient  estimates.  In  ad¬ 
dition,  the  analysis  in  [1]  did  not  include 
positional  errors.  These  positional  errors 
are  of  significance  if  one  wishes  to  analyze 
higher-level  matching  algorithms.  Hence, 
we  extended  the  results  presented  in  [1]  to 
handle  dependencies  between  gradient  es¬ 
timates  for  neighboring  edge  pixels.  Under 
the  assumption  that  the  gradient  across 
the  edge  is  constant  along  the  entire  mod¬ 
el  line  segment,  we  illustrated  how  the  de¬ 
pendencies  between  neighboring  pixels  can 
be  captured  by  modeling  the  sequence  of 
labeled  edge  and  non-edge  pixels  along  the 
true  model  line  as  a  binary  Markov  Chain 
of  a  particular  order.  The  transition  prob¬ 
abilities  for  the  Markov  chain  are  shown  to 
be  related  to  the  true  edge  gradient,  the 
edge  operator  width,  the  noise  variance, 
and  the  edge  operator  threshold. 
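
A minimal sketch of the Markov chain view follows. It assumes a first-order
chain (the text says only "a particular order") and uses hand-picked
self-transition probabilities rather than values derived from the edge
gradient, operator width, noise variance, and threshold; all names are
hypothetical.

```python
import random


def simulate_edge_labels(n, p_stay_edge, p_stay_gap, rng, start=1):
    """Simulate n edge (1) / non-edge (0) labels along the true model line
    as a first-order binary Markov chain with the given probabilities of
    remaining in the current state."""
    labels, state = [], start
    for _ in range(n):
        labels.append(state)
        stay = p_stay_edge if state == 1 else p_stay_gap
        if rng.random() >= stay:
            state = 1 - state
    return labels


def stationary_edge_probability(p_stay_edge, p_stay_gap):
    """Long-run fraction of pixels labeled edge for the two-state chain."""
    leave_edge = 1.0 - p_stay_edge
    leave_gap = 1.0 - p_stay_gap
    return leave_gap / (leave_edge + leave_gap)


if __name__ == "__main__":
    rng = random.Random(0)
    labels = simulate_edge_labels(200000, 0.95, 0.6, rng)
    print("empirical edge fraction: ", sum(labels) / len(labels))
    print("stationary edge fraction:", stationary_edge_probability(0.95, 0.6))
```

Because consecutive labels are dependent, runs of edge and non-edge pixels in
this model are geometrically distributed, the discrete counterpart of the
exponential interval lengths obtained when dependencies are ignored.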

In  other  work,  [2],  we  focussed  on  per¬ 
forming  theoretical  model-based  compari¬ 
son  of  gradient  based  edge  finding  schemes 
and  mathematical  morphology  based  edge 
finding  schemes.  The  performance  analy¬ 
sis  was  done  by  assuming  an  ideal  edge 
model and a noise model and by deriving
expressions for probability of false alarm
and probability of misdetection of edge
pixels. Under the Gaussian noise model
assumption, the theory indicated that
the morphological edge detector is superior
to conventional gradient based edge
detectors, which label edges based on
gradient magnitude, when a 3 by 3 window is
used. We performed experiments to validate
our theoretical results, and the empirical
plots indicated that the morphological
edge operator was also superior when a 5
by 5 window is used. However, the
theoretical plots did not confirm this because
the theory provided only an upper bound.
In  [2]  we  also  included  comparisons  of  re¬ 
sults  obtained  for  real  images.  A  simple 
analysis  of  hysteresis  linking  was  also  done 
in  this  paper  and  it  was  shown  that  hys¬ 
teresis  linking  improves  the  performance 
of  the  edge  operators. 

Our  recent  work  involved  the  design  and 
implementation  of  an  experimental  proto¬ 
col  to  compare  the  accuracy  of  the  edge 
locations  obtained  for  the  two  operators. 
We  generated  synthetic  images  containing 
ramp  edges  of  various  orientations  and  ad¬ 
ditive  noise  of  varying  degree.  We  defined 
the  edge  pixel  location  error  as  the  dis¬ 
tance  along  the  gradient  direction  from 
the  true  edge  pixel  to  the  nearest  labeled 
edge  pixel  (if  one  exists)  in  the  edge  detec¬ 
tor  output.  We  applied  the  morphological 
edge  operator  and  the  gradient  based  op¬ 
erator  on  these  images  and  we  are  in  the 
process  of  evaluating  the  results. 
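
The edge pixel location error defined above can be computed along the
following lines. This Python sketch is an assumption-laden illustration: it
represents the detector output as a set of (row, column) pixels, samples the
gradient ray at unit steps in both directions, and uses a hypothetical search
limit `max_distance`; the paper does not specify these details.

```python
import math


def edge_location_error(true_pixel, gradient_angle, labeled_pixels,
                        max_distance=10):
    """Search outward from true_pixel along +/- the gradient direction,
    one unit step at a time, and return the step distance at which a
    labeled edge pixel is first hit, or None if none is found."""
    row0, col0 = true_pixel
    d_row, d_col = math.sin(gradient_angle), math.cos(gradient_angle)
    for step in range(max_distance + 1):
        for sign in (1, -1):
            row = round(row0 + sign * step * d_row)
            col = round(col0 + sign * step * d_col)
            if (row, col) in labeled_pixels:
                return float(step)
    return None
```

Returning `None` when no labeled pixel is found keeps misdetections separate
from location errors, matching the "if one exists" qualification in the
definition.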

We  are  also  in  the  process  of  evaluat¬ 
ing  line  finding  schemes.  We  have  de¬ 
vised  an  experimental  protocol  for  evaluat¬ 
ing  line  finders  and  are  conducting  the  ex¬ 
periments  now.  Specifically,  we  have  gen¬ 


erated  synthetic  image  data  consisting  of 
ramp  edges  with  varying  orientations  and 
additive Gaussian noise of different levels.
We are in the process of estimating the
random  perturbation  model  parameters  at 
each  stage  of  the  line  detection  schemes 
employed.  The  results  from  these  experi¬ 
ments  will  be  used  to  validate  the  theory 
discussed  in  [1].  We  plan  to  perform  simi¬ 
lar  evaluation  using  RADIUS  imagery. 

In  the  black-box  mode  of  performance 
characterization,  we  have  defined  the 
meaning  of  an  experimental  protocol.  We 
have set up an experimental protocol for
evaluating the performance of an algorithm
which computes the exterior orientation
given a set of 3D model points with their
corresponding 2D perspective projections
[3]. The exterior orientation algorithm
computes the rotation and translation
by which the model reference frame can
be transformed into the camera reference
frame. The experimental protocol in [3]
illustrated how ideal data could be
randomly generated and how this ideal data
was randomly perturbed. The random
perturbations employed included both small
perturbations, which affected most of
the generated perspective projection data
points, and large perturbations, which
affected a small part of the generated data
points. Experiments were conducted
to compare the standard iterative equally
weighted least-squares technique against
the iterative reweighted least-squares
technique.
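
The contrast between equally weighted and iteratively reweighted least
squares can be illustrated on a much simpler problem than exterior
orientation. The Python sketch below swaps in a straight-line fit, applies
small Gaussian perturbations to every point and large perturbations to a small
fraction of points, and uses Huber weights with a median-based scale estimate.
The line-fit setting, the Huber weight function, and all names and constants
are assumptions for illustration and are not taken from [3].

```python
import random


def weighted_line_fit(points, weights):
    """Weighted least-squares fit of y = a * x + b."""
    sw = sum(weights)
    mx = sum(w * x for w, (x, _) in zip(weights, points)) / sw
    my = sum(w * y for w, (_, y) in zip(weights, points)) / sw
    sxx = sum(w * (x - mx) ** 2 for w, (x, _) in zip(weights, points))
    sxy = sum(w * (x - mx) * (y - my) for w, (x, y) in zip(weights, points))
    a = sxy / sxx
    return a, my - a * mx


def irls_line_fit(points, iterations=20, k=1.345):
    """Iteratively reweighted least squares with Huber weights (assumed)."""
    a, b = weighted_line_fit(points, [1.0] * len(points))
    for _ in range(iterations):
        residuals = [y - (a * x + b) for x, y in points]
        # Robust scale estimate from the median absolute residual.
        scale = sorted(abs(r) for r in residuals)[len(residuals) // 2] / 0.6745
        scale = max(scale, 1e-12)
        weights = [1.0 if abs(r) <= k * scale else k * scale / abs(r)
                   for r in residuals]
        a, b = weighted_line_fit(points, weights)
    return a, b


def make_data(n, outlier_fraction, rng):
    """Ideal line y = 2x + 1 with small Gaussian noise on every point and
    a large perturbation on a small fraction of points."""
    points = []
    for i in range(n):
        y = 2.0 * i + 1.0 + rng.gauss(0.0, 0.1)
        if rng.random() < outlier_fraction:
            y += rng.uniform(20.0, 40.0)
        points.append((float(i), y))
    return points


def line_error(fit, points, true_a=2.0, true_b=1.0):
    """RMS distance in y between the fitted and the ideal line."""
    a, b = fit
    squared = [(a * x + b - (true_a * x + true_b)) ** 2 for x, _ in points]
    return (sum(squared) / len(squared)) ** 0.5


if __name__ == "__main__":
    data = make_data(100, 0.1, random.Random(0))
    equal = weighted_line_fit(data, [1.0] * len(data))
    robust = irls_line_fit(data)
    print("equally weighted RMS error:", line_error(equal, data))
    print("reweighted RMS error:      ", line_error(robust, data))
```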

Planned  future  work  includes  evaluation 
of  algorithms  for  circle  finding,  ellipse  find¬ 
ing,  and  rectangle  finding,  particularly  as 
these algorithms are employed in the
understanding of aerial images. In this
application rectangles correspond to roof
tops and circles and ellipses correspond to
spherical holding tanks or circular
chimneys.

Abstract

A  goal  of  this  research  is  effective  recognition  of 
complex  objects  in  realistic,  operational  scenes 
with  moderate  complexity,  using  general  meth¬ 
ods  with  a  rigorous  scientific  basis.  This  re¬ 
search  is  intended  to  contribute  to  ATR,  map¬ 
ping  and  site  monitoring.  Major  research  issues 
are:  1)  robust  recognition  despite  image  com¬ 
plexity;  2)  exploitation  of  multi-spectral,  multi¬ 
sensor  data;  3)  low-complexity  algorithms;  and 
4)  automation  of  development  of  recognition 
programs. 

The technologies of this research that
contribute to resolving these major issues are: a)
quasi-invariants; b) structured Bayesian
inference; c) segmentation and measurement; and
d) shape representation.

From the point of view of computational
complexity, the most important problem in
recognition is figure-ground discrimination. At the
last IU Workshop, effective figure-ground
discrimination was demonstrated based on
quasi-invariants derived for Generalized Cylinder
(GC) parts. The shape of 3D objects was
inferred from monocular images [Sato and
Binford 92a, 92b].

New  theoretical  results  have  been  achieved  in 
quasi-invariants,  including:  1)  a  new  mathe¬ 
matical  definition  of  quasi-invariants;  2)  deriva¬ 
tion  and  proof  of  two  new  strong  quasi¬ 
invariants  for  four  coplanar  points;  3)  results 
about  the  taxonomy  of  quasi-invariants. 
Recognition  was  demonstrated  for  an  aircraft 
at  San  Francisco  airport  and  canisters  in  a  com¬ 
plex  crash  image,  based  on  quasi-invariants. 
Progress has been made toward the goal of
recognition by Bayesian networks that are
generated automatically from object models. A
new object and constraint system, Classics, was
implemented to facilitate the geometric
reasoning necessary to generate Bayesian networks
automatically.

eration” and supported in part by a contract with NASA-
AMES Research Laboratory NCC 2-574: “Nap of the Earth
Flight by Helicopter”.

Extended  edges  for  a  variety  of  scenes  were 
generated  using  a  preliminary  local  linking  of 
edgels  from  a  new  Wang-Binford  operator.  Ex¬ 
tensive  performance  evaluation  was  done  to 
build  a  statistical  model  for  the  new  Wang- 
Binford  operator  to  incorporate  in  Bayesian 
networks.  Image  measurements  of  position  and 
orientation  are  an  order  of  magnitude  more  ac¬ 
curate  than  those  in  ACRONYM,  Stanford’s 
system  from  1980,  making  possible  an  order 
of  magnitude  more  accurate  estimates  of  part 
dimensions  and  stereo  measurement.  Effective 
measurement  appears  possible  to  a  few  percent 
for  surfaces  with  images  extending  only  5x10 
pixels. 


I:  Introduction 

The  goal  of  this  research  is  to  develop  effective  meth¬ 
ods  to  contribute  to  ATR,  cartography  and  surveillance. 
This  requires  effective  recognition,  interpretation  and 
measurement  of  complex  objects  in  realistic,  operational 
scenes  with  moderate  complexity,  using  general  methods 
with  a  rigorous  scientific  basis.  Considerable  progress 
has  been  achieved. 

Major  research  issues  are:  1)  robust  recognition  de¬ 
spite  image  complexity;  2)  exploitation  of  multi-spectral, 
multi-sensor  fusion;  3)  development  of  low-complexity 
algorithms;  and  4)  automation  of  development  of  recog¬ 
nition  programs. 

The  technologies  of  this  research  that  contribute  to  im¬ 
plementing  mechanisms  that  solve  those  problems  are: 
a)  quasi-invariants;  b)  Bayesian  inference;  c)  segmen¬ 
tation  and  measurement;  and  d)  shape  representation. 
This  research  takes  an  integrated  system  approach  to 
building  component  technology  and  the  Successor  sys¬ 
tem. 

Robust  recognition  depends  on  accurate  measurement 
of object dimensions, based on both accurate
measurement of image dimensions and effective interpretation of
image  features  as  object  surfaces.  Effective  segmentation 
based  on  minimum  variance  methods  enables  accurate 
measurements  of  image  dimensions.  A  benefit  of  this 
research  is  effective,  robust  recognition  with  fewer  pix¬ 
els  on  target  than  possible  with  other  methods.  Quasi¬ 
invariants  enable  generation  of  hypotheses  assigning  im¬ 
age  features  to  object  surfaces,  and  allow  estimating 
shape  parameters  in  space.  Quasi-invariants  also  enable 
new  stereo  correspondence. 

Bayesian inference networks that incorporate
quasi-invariants [Binford 87a] with physical constraints were
motivated by multi-sensor and information fusion.

Recognition  is  fundamentally  limited  by  computa¬ 
tional  complexity.  Complexity  of  brute-force,  view- 
sensitive  methods  is  enormous  (e.g.  aspect  graphs). 
View-sensitive  methods  for  pose  estimation  match  all 
combinations  of  image  features  with  all  views  of  all  ob¬ 
jects.  The  dominant  contribution  to  computational  com¬ 
plexity  is  matching  all  combinations  of  image  features. 
For  typical  aerial  imagery,  scene  complexity  dominates 
over  the  number  of  views  of  objects,  i.e.  the  num¬ 
ber  of  combinations  of  m  features  from  all  image  fea¬ 
tures  is  much  larger  than  the  number  of  views  of  all  ob¬ 
jects.  Our  research  provides  quasi-invariant  mechanisms 
to  reduce  this  combinatorial  complexity  enormously  by 
figure-ground  discrimination  in  clutter  to  match  only 
groups  of  features  belonging  to  a  single  object  to  avoid 
matching  all  combinations  of  image  features.  From  the 
point  of  view  of  computational  complexity,  figure-ground 
discrimination  (grouping)  is  a  central  problem  in  recog¬ 
nition. 

A  smaller  yet  significant  part  of  the  computational 
complexity  in  recognition  by  view-sensitive  methods  is 
matching  the  number  of  views  of  objects.  Invariants, 
where  available,  are  view-invariant;  they  avoid  the  com¬ 
putational  complexity  of  view-sensitive  methods  and  en¬ 
able  indexing  for  object  hypotheses.  Quasi-invariants 
extend  invariants  greatly,  making  these  view-insensitive 
methods  much  more  widely  applicable.  Avoiding  match¬ 
ing  all  views  of  all  objects  is  the  problem  of  hypothesis 
generation  and  indexing  for  similar  objects. 

From another standpoint, the inability of
view-sensitive methods and pose estimation to accommodate
variability of natural scenes is also very important.
Quasi-invariants allow considerable variation within
object class and viewing conditions.

Considerable  progress  has  been  made  toward  automat¬ 
ing  the  building  of  recognition  programs  from  object 
models. 

II:  Approach 

The system achieves recognition by a hypothesis
generation and verification paradigm implemented in
Bayesian decision networks. There are four modules:
1) figure/ground discrimination of generalized cylinder
(GC) primitive parts from background based on
quasi-invariant relations among extended curves along image
discontinuities; 2) estimation of 3D shape of GC
primitive parts from 2D image data based on quasi-invariants;
3) generation of object hypotheses by indexing based on
estimation of shape of compound objects from shapes
of GC parts; and 4) verification or refutation of object
hypotheses by Bayesian networks.

At  the  last  IU  Workshop,  results  were  demonstrated
for  figure-ground  discrimination  for  a  variety  of  images. 
The  methods  rely  on  rigorous  quasi-invariant  relations 
among  curves  on  cross  sections  (parallels)  and  meridians 
of  GCs.  The  estimation  of  shape  of  cross  sections  was 
demonstrated  also.  Those  results  have  been  extended  in 
two  ways:  1)  A  new  algorithm  for  determining  relations 
among  curves  has  been  designed  and  implemented.  It 
is  used  for  several  recognition  examples  and  cuts  the 
computational  complexity  dramatically.  An  extended 
algorithm  for  figure-ground  discrimination  has  been  de¬ 
signed  but  not  implemented;  it  extends  the  applicable 
class of GCs greatly. 2) New quasi-invariants and
improved probabilistic characterization of quasi-invariants
extend estimation of 3D shape from 2D image data.

Substantial progress has been made toward automating
the building of Bayesian decision networks from
object models. A new object and constraint system,
Classics, was implemented to facilitate the geometric
reasoning necessary to generate the Bayesian networks
automatically. Examples of recognition presented at the last
IU Workshop will soon be fully automated, based on
Classics models.

Extended  edges  for  a  variety  of  scenes  were  generated 
using  a  preliminary  local  linking  of  edgels  from  a  new 
Wang-Binford  operator.  Extensive  performance  evalua¬ 
tion  was  performed  to  build  a  statistical  model  for  the 
new  Wang-Binford  operator  to  enable  its  use  in  Bayesian 
networks.  Linking  edgels  into  extended  edges  has  proved 
very  difficult  in  IU.  Complexity  of  linking  is  exponential
in  the  error  of  position  and  orientation  of  edgel  measure¬ 
ments.  There  is  a  high  priority  and  payoff  in  accurate 
measurement  that  is  typically  overlooked. 

Improved estimation improves recognition by enabling
more reliable discrimination of object parts from clutter
and more accurate estimates of part dimensions.
These dimension measurements are typically an order of
magnitude more accurate than in typical IU systems,
typically 1% error for an image region of 10 pixels. It is
quite feasible to work with regions of 5x10 pixels with
image measurements to a few percent. With an order of
magnitude better resolution now, routine counting of types
of aircraft at airports seems at hand. Stereo reliability
and accuracy in height measurement are aided in the
same way.

III.1.a:  Quasi-Invariants:  Theory

Quasi-invariant observables are also called
semi-invariants or local invariants. Quasi-invariant
observables are locally invariant about some observation. The
explanation follows. An observation is a measurement
made from some viewpoint. An observable is a
measurement repeatable by different observers with coordinate
frames that differ by rotation and translation.
Observables are invariant under rotation and translation, i.e.
isometries.

Invariant  observables  under  perspective  projection  in 
imaging  are  observables  that  have  a  constant  value  from 
all  projections.  E.g.  for  four  colinear  points,  the  cross 
ratio  is  invariant.  Quasi-invariant  observables  under  per¬ 
spective  projection  are  observables  that  are  constant  lo¬ 
cally  about  some  projection,  i.e.  the  gradient  of  the  ob¬ 
servable  with  respect  to  projection  parameters  vanishes 
at  that  projection.  E.g.  for  three  colinear  points,  the 
ratio  of  intervals  is  quasi-invariant  about  projection  for 
a  distant  line  parallel  to  the  image  plane. 
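
The distinction can be checked numerically. In the Python sketch below, the
perspective projection of points on a line is modeled by a one-dimensional
projective map x -> (a*x + b) / (c*x + d); this parametrization, the
particular cross-ratio convention, and the function names are assumptions
made for illustration. The cross ratio of four collinear points is unchanged
by any such map, while the ratio of intervals of three points is preserved
only in the affine case c = 0, which corresponds to a line parallel to the
image plane.

```python
def project(x, a, b, c, d):
    """1D projective map standing in for perspective projection of a
    point with coordinate x on a line."""
    return (a * x + b) / (c * x + d)


def cross_ratio(p1, p2, p3, p4):
    """Cross ratio of four collinear points given by 1D coordinates."""
    return ((p3 - p1) * (p4 - p2)) / ((p3 - p2) * (p4 - p1))


def interval_ratio(p1, p2, p3):
    """Ratio of the two intervals defined by three collinear points."""
    return (p2 - p1) / (p3 - p2)


if __name__ == "__main__":
    points = [0.0, 1.0, 3.0, 7.0]
    perspective = [project(x, 2.0, 1.0, 0.1, 1.0) for x in points]
    affine = [project(x, 2.0, 1.0, 0.0, 1.0) for x in points]
    print("cross ratio:   ", cross_ratio(*points), cross_ratio(*perspective))
    print("interval ratio:", interval_ratio(*points[:3]),
          interval_ratio(*affine[:3]), interval_ratio(*perspective[:3]))
```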

Quasi-invariants  were  introduced  by  Binford  in  the 
late  1960s  to  extend  algebraic  invariants  and  were  used 
numerous  times  since.  The  definition  given  in  [Binford 
93]  is  a  refinement  of  the  mathematical  definition  used 
in  the  past.  Over  the  twenty  years  of  their  use,  a  large 
number  of  useful  quasi-invariants  were  invented  by  the 
author. 

Several new quasi-invariants were derived. The most
powerful are two strong quasi-invariants for four points in a plane.
These  results  extend  invariants;  there  are  known  to  be 
two  invariants  for  five  points  in  the  plane.  There  are 
known  to  be  no  invariants  for  four  points  and  three 
points  in  the  plane.  There  were  known  to  be  two  quasi¬ 
invariants  for  three  points  in  the  plane  [Binford  87a].  For 
four  points  in  the  plane  there  are  four  quasi-invariants. 
The  two  new  quasi-invariants  are  strong.  For  strong 
quasi-invariants  all  second  derivatives  with  respect  to 
viewpoint  parameters  vanish  [Binford  93].  Strong  quasi¬ 
invariants  are  nearly  invariant.  These  quasi-invariants 
are  expected  to  be  extremely  valuable. 

A  taxonomy,  a  classification  scheme,  for  quasi¬ 
invariants  was  developed.  The  taxonomy  includes:  in¬ 
variants,  generic  observables,  strong  quasi-invariant  ob¬ 
servables,  and  quasi-invariant  observables. 

We  are  investigating  not  only  the  local  behavior  of 
quasi-invariants  but  their  global  behavior  as  well.  If 
quasi-invariants  were  stable  only  in  a  small  region  around 
the  observation,  they  would  have  limited  value.  It  turns 
out  that  quasi-invariants  investigated  to  date  are  “sta¬ 
ble  in  large  measure”,  e.g.  they  are  constant  to  30% 
over  I  of  the  viewing  sphere.  Strong  quasi-invariants 
are  much  more  nearly  constant.  The  global  analysis  of 
the  colinear  ratio,  a  strong  quasi-invariant,  showed  that 
the  standard  deviation  is  1%  over  a  typical  human  limit, 
that  the  projected  colinear  ratio  is  invariant  to  within 
30%  over  almost  the  full  range  of  viewing  for  an  object 
2  miles  long  in  imagery  from  aircraft  flights,  i.e.  for  ex¬ 
tremely  large  objects.  The  projected  colinear  ratio  is 
much  more  nearly  invariant  for  smaller  objects. 

These  results  are  very  favorable.  There  is  reason 
to  believe  that  these  results  are  widely  true  for  quasi¬ 
invariants  and  strong  quasi-invariants.  Systematics  of 
global  analysis  of  quasi-invariants  are  under  investiga¬ 
tion  now  [Binford  92]. 

III.2.b:  Quasi-Invariants:  Exploitation 

[Mundy  et  al  92]  demonstrate  recognition  of  buildings 
using  invariants  for  five  points  in  a  plane.  A  rectan¬ 


gular  building  has  only  4  points,  but  there  might  be 
small  structures  on  the  roof.  An  L-shaped  building  has 
6  points  in  a  plane;  a  U-shaped  building  has  8  points  in 
a plane. In these examples, there are typically 10,000
points or lines in a large image. The number of
combinations of 5 points is $\binom{10^{4}}{5} \approx 10^{18}$
calculations of 2 invariants and 2D index operations. That
number is prohibitive. Simple grouping, corresponding to
curvilinearity, was used to reduce the number of combinations.
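
The size of this number is easy to check. The short Python sketch below
counts the 5-point subsets of 10,000 image features using the standard
library; the function name is hypothetical.

```python
import math


def feature_combinations(num_features, group_size):
    """Number of unordered groups of group_size features out of num_features."""
    return math.comb(num_features, group_size)


if __name__ == "__main__":
    count = feature_combinations(10_000, 5)
    print(f"{count:.3e} combinations of 5 points out of 10,000")
```

The count is slightly below 10^18, since 10,000^5 / 5! is about 8.3 x 10^17,
which is why exhaustive evaluation of five-point invariants is impractical
without grouping.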

By  contrast,  for  10,000  objects,  there  might  be  about 
10^  total  views.  That  is  a  very  large  number  but  minis¬ 
cule  in  comparison  with  the  number  of  combinations  of 
image features [Binford 93]. Conclusion: image
complexity dominates over the number of views of objects.
From the computational point of view, figure-ground
discrimination is the most important operation in computer
vision.

Quasi-invariants  based  on  generalized  cylinders  pro¬ 
vide  the  basis  for  figure-ground  discrimination.  Quasi¬ 
invariants  also  enable  the  estimation  of  shape  parameters 
for  parts  and  objects,  enabling  indexing  to  generate  ob¬ 
ject  hypotheses  to  reduce  the  complexity  of  matching  all 
views  of  objects. 

An  obvious  way  to  recognize  objects  is  to  make  mea¬ 
surements  of  object  dimensions  in  space  and  compare 
them  with  dimensions  in  tables  of  objects.  These  mea¬ 
surements  of  object  dimensions  are  Euclidean  invariants, 
i.e.  invariant  under  rotations  and  translations.  There 
is  obviously  no  quasi-invariant  for  length,  a  usual  Eu¬ 
clidean  invariant,  but  there  are  quasi-invariants  for  ratios 
of  lengths.  For  many  Euclidean  invariants,  such  quasi¬ 
invariants  have  been  found.  The  investigators  believe 
that  it  will  be  possible  to  develop  a  systematic  method 
to  generate  quasi-invariants,  just  as  has  been  developed 
for  some  types  of  algebraic  invariants. 

One  big  advantage  of  quasi-invariants  is  that  there  are 
many  of  them  and  they  appear  to  be  widely  available,  i.e. 
quasi-invariants  are  there  when  needed,  for  most  objects 
in  most  situations.  There  are  few  invariants. 

Another  advantage  of  quasi-invariants  is  that  exploita¬ 
tion  is  intuitively  clear  once  the  paradigm  switch  has 
been  made.  Recognition  is  done  by  interpreting  quasi¬ 
invariants  as  approximate  measurements  of  objects  in 
space.  Quasi-invariants  determine  figure-ground  dis¬ 
crimination,  i.e.  generating  hypotheses  of  object  parts. 
Quasi-invariants  determine  hypotheses  about  propor¬ 
tions  of  object  parts  and  objects,  generating  hypotheses 
of  object  shape  that  enable  indexing. 

The  paradigm  shift  is  from  brute  force  matching  fea¬ 
ture  sets  from  models  to  all  combinations  of  feature  sets 
in  the  image  domain,  to  generating  object  hypotheses, 
generating  3-space  descriptions  of  objects  from  images,
indexing  to  generate  hypotheses  of  matching  objects, 
and  detailed  verification. 

As  a  simple  example,  invariants  are  not  possible  for 
any  plane  figure  with  three  or  four  points,  e.g.  a  tri¬ 
angular  cross  section  or  quadrilateral.  There  are  two 
quasi-invariants  for  three  points  in  a  plane;  there  are 
two  additional  strong  quasi-invariants  for  four  points  in 
a  plane.  Thus,  approximate  measurement  of  ratio  of  di¬ 
mensions  of  a  triangular  face  and  recognition  based  on 
these  measurements  are  possible  for  monocular  images,
and  very  good  approximate  measurements  for  quadrilat¬ 
eral  faces  are  possible.  For  five  points  or  more  in  a  plane, 
invariants  are  possible  also.  These  results  are  very  pow¬ 
erful.  They  are  complete  in  a  mathematical  sense  in  that 
any  polygon  can  be  recognized  approximately  with  these 
methods.  Quasi-invariants  give  shape  measures  that  can 
be used for indexing. New results are expected for non-polygonal
faces to extend completeness results.

A  major  use  of  quasi-invariants  has  been  in  stereo  vi¬ 
sion  [Arnold  and  Binford  80,  Baker  and  Binford  81].  Pro¬ 
jected  angles  in  two  stereo  views  of  an  edge  in  space  are 
nearly  equal,  i.e.  there  is  a  narrow  distribution  of  the  dif¬ 
ference  of  angles  in  two  views,  with  1  degree  full-width 
half-maximum  for  human  stereo  at  1  meter.  There  is 
much  more  to  be  done  in  stereo  using  quasi-invariants. 
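
The stereo quasi-invariant above can be illustrated with a small simulation. The sketch below is not the authors' analysis: it assumes orthographic cameras that differ only by a rotation about the vertical axis, a 65 mm baseline (a typical human interocular distance, chosen here for illustration), and edge orientations drawn uniformly at random. The function names are hypothetical.

```python
import math
import random

def projected_angle(direction, yaw):
    """Image-plane angle (radians) of a 3-D edge direction for an
    orthographic camera rotated by `yaw` about the vertical (y) axis."""
    dx, dy, dz = direction
    x_img = dx * math.cos(yaw) + dz * math.sin(yaw)  # horizontal image component
    y_img = dy                                       # vertical component is unchanged
    return math.atan2(y_img, x_img)

def angle_difference_deg(direction, vergence):
    """Difference of projected edge angle between the two views, in degrees."""
    diff = projected_angle(direction, vergence / 2) - projected_angle(direction, -vergence / 2)
    # Edges are undirected, so wrap the difference into [-90, 90) degrees.
    diff = (diff + math.pi / 2) % math.pi - math.pi / 2
    return math.degrees(diff)

def median_abs_difference(n_edges=20000, baseline_m=0.065, distance_m=1.0, seed=0):
    """Median |angle difference| over random edge orientations (assumed uniform)."""
    rng = random.Random(seed)
    vergence = 2 * math.atan(baseline_m / (2 * distance_m))
    diffs = []
    for _ in range(n_edges):
        # An isotropic Gaussian vector has a uniformly distributed direction.
        d = (rng.gauss(0, 1), rng.gauss(0, 1), rng.gauss(0, 1))
        diffs.append(abs(angle_difference_deg(d, vergence)))
    diffs.sort()
    return diffs[len(diffs) // 2], math.degrees(vergence)

median_deg, vergence_deg = median_abs_difference()
print(f"vergence {vergence_deg:.2f} deg, median |difference| {median_deg:.2f} deg")
```

Edges lying in the image plane project to the same angle in both views; the difference grows only as an edge turns toward the line of sight, which is why the distribution of differences is narrow.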

Generalized  cylinders  are  used  to  describe  a  very 
large  class  of  objects.  Generalized  cylinders  (GCs)  ex¬ 
press  powerful  quasi-invariants  for  grouping  image  fea¬ 
tures  that  define  a  mechanism  for  discrimination  of  GC 
parts,  figure-ground  discrimination.  Generalized  cylin¬ 
der  primitives  are  defined  by  cross  section  and  sweeping 
rule.  Quasi-invariants  permit  estimates  of  shapes  of  GC 
parts  and  objects  formed  of  GC  parts  that  enable  gener¬ 
ation  of  hypotheses  that  are  verified  by  globally  coherent 
interpretation  by  a  Bayes  net  for  evidential  inference. 

An  important  part  of  this  paradigm  of  use  of  quasi¬ 
invariants  follows.  Quasi-invariants  are  inexact;  a  quan¬ 
titative  probabilistic  interpretation  enables  efficient  use 
of  quasi-invariants  in  recognition  in  a  paradigm  of  hy¬ 
pothesis  generation  and  verification.  The  statistical 
analysis  is  for  hypothesis  generation,  not  accuracy  of 
the  final  match.  This  is  an  important  distinction.  For 
a  majority  of  cases,  quasi-invariant  observables  are  ap¬ 
proximations  to  corresponding  body  measurements  in 
space;  those  quasi-invariants  generate  good  approximate 
hypotheses.  Distributions  of  deviations  are  known  or 
can  be  derived  for  use  in  evidential  inference.  For  a  mi¬ 
nority  of  cases,  quasi-invariant  observables  give  bad  esti¬ 
mates  of  corresponding  body  measurements;  those  quasi¬ 
invariants  generate  false  hypotheses  that  are  rejected  by 
a  verification  phase,  in  most  cases  at  low  cost.  This  hy¬ 
pothesis  generation  and  verification  mechanism  depends 
on  making  some  accurate  measurements  and  some  ac¬ 
curate  hypotheses;  the  mechanism  tolerates  local  errors. 
Computational  complexity  can  be  low. 
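
A minimal sketch of hypothesis generation from a single inexact quasi-invariant measurement follows. It is an illustration under stated assumptions, not the system described here: the model library, the choice of a length-to-width proportion as the measurement, the Gaussian model for the log deviation, and the pruning threshold are all hypothetical.

```python
import math

# Hypothetical model library: name -> length/width proportion of the part.
MODELS = {"fuselage": 8.0, "wing": 4.0, "elbow": 1.5}

def generate_hypotheses(measured_ratio, sigma_log=0.25, keep=0.05):
    """Rank models by posterior probability given one measured proportion.

    Assumes log(measured / true) is Gaussian with standard deviation
    `sigma_log` and that all models are equally likely a priori.
    """
    scores = {}
    for name, true_ratio in MODELS.items():
        z = math.log(measured_ratio / true_ratio) / sigma_log
        scores[name] = math.exp(-0.5 * z * z)
    total = sum(scores.values())
    posterior = {name: s / total for name, s in scores.items()}
    # Keep only plausible hypotheses; the verification phase handles the survivors.
    return sorted(((p, name) for name, p in posterior.items() if p >= keep), reverse=True)

for probability, name in generate_hypotheses(7.0):
    print(f"{name}: {probability:.3f}")
```

The point of the sketch is the division of labor: the deviation model only has to be good enough to keep the right hypothesis on the short list, because a wrong hypothesis that survives is left to verification.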

III.2: Bayesian Networks and SUCCESSOR System

The objective of this part of the effort is to provide a comprehensive
way of integrating available information, performing
sensor fusion. A further objective is to solve the
associated  software  problem  by  automating  the  building 
of  Bayesian  decision  networks  from  object  models. 

III.2.a:  SUCCESSOR  System 

We  are  working  toward  automated,  model  based  3-D 
interpretation  of  imagery.  The  algorithms  are  intended 
for multi-sensor fusion; here algorithms are tested with
monocular gray scale images, a difficult problem. The
algorithms  described  in  this  section  begin  with  linked 
edges,  group  them  into  GC  surfaces,  and  ultimately  es¬ 
tablish  a  3-D  object  interpretation  of  the  surfaces.  There 
are three primary components in this section of our interpretation
system: the Bayesian network interpretation
kernel,  the  Volume-Surface-Curve-Point  (VSCP)  mod¬ 
elling  system,  and  Classics,  a  highly-typed  object  and 
constraint  system. 

The  Bayesian  network  is  a  graph  structure  of  ob¬ 
ject  model  hypotheses  and  conditional  probabilities  be¬ 
tween  the  hypotheses.  Geometric  and  physical  models 
determine  geometric  constraints  with  detailed  probabil¬ 
ity  models  for  measurements  and  estimation  algorithms. 
Bayesian  networks  are  at  the  core  of  our  interpretation 
approach;  they  have  been  demonstrated  successfully  in 
the  3-D  interpretation  of  an  aircraft  in  optical  imagery 
from  San  Francisco  airport  and  in  3-D  interpretation  of  a 
plastic elbow, a plumbing part. Work will be completed
soon that will generate Bayesian inference networks automatically
from the VSCP modelling system.

The  VSCP  graph  (Volume-Surface-Curve-Point)  rep¬ 
resents  the  topology  of  neighbor  relations  among  the 
topological  types  of  objects.  It  is  an  intermediate  level  of 
modeling  that  will  be  derived  from  object  models.  The 
goal  of  the  VSCP  modelling  system  is  to  define  geomet¬ 
ric  information  to  permit  automated  computer  reasoning 
and  interpretation,  and  do  it  in  a  way  that  is  concise  and 
usable.  To  achieve  this  goal  we  have  built  our  models  up 
using  mathematically  sound  definitions.  In  addition,  we 
have developed a special constraint system, Classics, to
permit  the  necessary  level  of  symbolic  expression. 

The  Classics  system  is  a  highly  typed  constraint  sys¬ 
tem.  It  supports  the  definition  of  object  classes  in  terms 
of  other  classes  using  constraints. 

We  have  achieved  3-D  interpretation  of  an  aircraft  in 
an  airport  scene  and  a  plastic  elbow,  a  plumbing  part. 
We  used  Bayesian  networks  to  guide  the  recognition,  ac¬ 
cumulate  evidence  and  provide  a  measure  of  the  most 
likely  interpretation  of  the  data.  These  results  are  shown 
in the figures below. These results demonstrate the viability
of aspects of our approach: a) generating object
hypotheses  from  quasi-invariants;  b)  accurate  probabil¬ 
ity  models;  and  c)  using  Bayesian  networks  to  aggregate 
evidence  both  in  support  of  correct  hypotheses  and  in 
refutation  of  incorrect  hypotheses. 

Interpretation proceeds by: 1) generating part hypotheses
by finding corresponding curves from quasi-invariants
that make generalized cylinder part hypotheses;
2) estimating part measurements from quasi-invariants;
3) generating object hypotheses from part hypotheses
and indexing based on object shape from quasi-invariants
to determine object model hypotheses; 4) verification
in Bayesian networks. All of steps 1-4 use
generic models that utilize information available at that
stage. None of these steps is specific to object models; all
steps accommodate great variation in the generic models.
This approach avoids the combinatorics of attempting
to match every combination of image features with every
view of every known object model. In addition, at
every level of interpretation, the model is used to predict
and refine edge and image information to aggregate
more data. Strong data give rise to hypothesized models,
which are used to predict and include other data.

Probabilistic  relationships  between  the  models  and  the 
data  are  expressed  by  the  Bayesian  Network.  Probabilis¬ 
tic  relationships  lead  to  a  mathematically  sound  algo¬ 
rithm  and  avoid  the  many  ad-hoc  decisions  common  to 
recognition algorithms. Generic methods for creating 3-D
interpretations from 2-D data avoid the computational
problem  of  too  many  false  hypotheses. 
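
How evidence accumulates for and against a hypothesis can be sketched with the simplest possible network: one hypothesis node whose observations are assumed conditionally independent given the hypothesis. This is a deliberate simplification for illustration, not the SUCCESSOR network; the function name and the likelihood ratios in the usage lines are invented.

```python
def posterior_probability(prior, likelihood_ratios):
    """Combine a prior with independent pieces of evidence.

    Each likelihood ratio is P(observation | hypothesis true) divided by
    P(observation | hypothesis false). Ratios above 1 support the
    hypothesis; ratios below 1 count against it.
    """
    if not 0.0 < prior < 1.0:
        raise ValueError("prior must lie strictly between 0 and 1")
    odds = prior / (1.0 - prior)
    for ratio in likelihood_ratios:
        odds *= ratio
    return odds / (1.0 + odds)

# Two supporting observations and one refuting observation (illustrative values).
print(posterior_probability(0.2, [6.0, 4.0, 0.5]))
```

Working in odds makes the two roles of evidence symmetric: support and refutation are the same multiplication, with ratios on either side of 1.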

III.2.b:  Classics 

We  have  completed  an  implementation  of  Classics,  a 
highly  typed  object  system  and  constraint  system.  Clas¬ 
sics  can  be  thought  of  as  a  combination  of  an  object- 
oriented  data  base  and  a  symbolic  declarative  program¬ 
ming  language.  The  goal  of  Classics  is  to  support  the 
creation  of  a  geometric  modelling  system  with  suffi¬ 
cient  power  and  symbolic  representation  to  perform  true 
model-based  recognition.  Classics  is  written  in  Com¬ 
mon  LISP  on  top  of  the  Common  LISP  Object  System 
(CLOS)  as  implemented  by  Portable  Common  Loops 
(PCL).  Classics  is  unique  in  that  it  not  only  defines  an 
object  class  hierarchy,  but  does  it  in  a  way  that  solu¬ 
tion  algorithms  can  be  derived  automatically  from  the 
constraints  that  define  classes. 

We  are  building  the  Volume-Surface-Curve-Point 
(VSCP)  object  modelling  system  using  Classics.  From 
the  VSCP  the  algorithm  will  automatically  derive  the 
Bayesian  network  in  real  time  and  control  the  interpre¬ 
tation.  The  goal  is  to  provide  automatic  image  interpre¬ 
tation  from  object  models.  At  this  intermediate  step, 
the  system  builds  a  Bayesian  network  from  VSCP  mod¬ 
els.  The  user  creates  high  level  models  of  specific  man¬ 
ufactured  parts.  All  of  the  geometric  relationships  and 
observable  features  of  the  part  will  be  extracted  auto¬ 
matically.  The  system  will  automatically  integrate  the 
part  into  the  probabilistic  characterization  of  the  low 
level  models. 

III.3: Segmentation

What seems like an incomprehensible variety of imaging
problems breaks down into a small set of segmentation
components based on careful analysis of the physics
and differential geometry of observation [Binford 87b].
There  is  much  in  common  between  segmenting  depth 
data,  segmenting  intensity  data,  and  segmenting  SAR 
and  IR  data.  Multi-sensor  segmentation  is  achievable. 
Complete  segmentation  is  feasible  in  an  interesting  sense, 
not  in  the  sense  of  perfect  or  faithful  segmentation,  for 
which  information  might  not  be  available.  Instead,  a 
complete segmentation is possible in that physics and
differential  geometry  make  possible  a  small  enumeration 
of  local  image  configurations. 

Extended  edges  for  a  variety  of  scenes  were  generated 
using  a  preliminary  local  linking  of  edgels  from  a  new 
Wang-Binford operator. Extensive performance evaluation
was done for the new Wang-Binford operator. Improved
estimation accuracy results in improved and more
reliable  discrimination  of  object  parts  from  clutter  and 
more  accurate  estimates  of  part  dimensions. 

III.3.a:  Local  Discontinuities 

Binford  and  Wang  [Wang  and  Binford  93]  have  de¬ 
veloped  a  step  edge  estimation  operator,  starting  from 
a  modified  Canny  operator.  They  have  improved  sensi¬ 
tivity  by  a  factor  of  4,  i.e.  decreased  the  threshold  for 
gradient  magnitude  by  a  factor  of  4.  The  only  parameter 
used  is  a  measured  parameter,  i.e.  sensor  noise  variance. 
There  are  no  free  parameters,  no  tweaking.  Those  pa¬ 
rameters  are  measured  from  the  sensor  or  measured  from 
image  content.  Analysis  has  eliminated  biases  from  ori¬ 
entation  and  position,  biases  that  are  so  strong  that  sev¬ 
eral  researchers  do  not  use  orientation  from  the  Canny 
operator.  With  the  Wang-Binford  operator,  orientation 
is  accurate  to  a  few  degrees  over  a  wide  range  of  condi¬ 
tions. 

A  range  image,  SAR  or  IR  image,  or  optical  intensity 
image  is  a  compound  mathematical  surface.  Differential 
geometry  applied  to  the  physics  of  image  formation  al¬ 
lows  us  to  classify  the  local  classes  of  image  surface.  A 
small  image  patch  corresponds  to:  1)  a  single  continuous 
image  surface;  2)  a  pair  of  image  surfaces  with  an  edge 
discontinuity  at  their  intersection;  a  single  image  sur¬ 
face  with  a  spot  discontinuity;  3)  three  or  more  image 
surfaces  with  curve  discontinuities  meeting  at  a  vertex 
discontinuity;  4)  more  complex  configurations  that  may 
not  be  discriminable  with  fixed  image  resolution. 

The  conclusion  is  that  robust  segmentation  for  a  va¬ 
riety  of  imagery  can  be  achieved  with  a  small  number 
of  local  segmentation  components.  The  objective  of 
this  part  of  the  effort  is:  a)  to  design  and  implement 
the  complete  set  of  local  segmentation  components;  b) 
to  design  and  implement  the  linking  method  that  con¬ 
structs  quasi-local  and  global  segmentations  from  local 
segmentations.  Segmentation  corresponds  to  generat¬ 
ing  hypotheses  about  the  image  surface.  Interpreta¬ 
tion  corresponds  to  generating  hypotheses  about  surfaces 
in  space  and  objects  from  image  surfaces.  An  impor¬ 
tant  part  of  this  work  is  building  an  accurate  statistical 
model  of  behavior  of  segmentation  that  is  valuable  in  the 
Bayesian  network  for  combining  evidence,  especially  in 
multi-sensor  problems. 

Edge  discontinuity  detection  extracts  boundaries  of 
image  surfaces  and  discontinuities  of  orientation  of 
boundaries,  etc.  It  has  played  an  important  role  in  the 
early  process  of  a  vision  system,  and  has  a  direct  impact 
on  performance  of  subsequent  processes.  When  we  look 
at  the  output  of  typical  edge  segmentation  algorithms  on 
simple  images,  they  look  adequate,  superficially.  When 
we  take  a  close  look  at  their  performance  in  complex  im¬ 
ages  we  find  important  faults  that  can  be  overcome.  For 
example,  the  Canny  operator,  which  depends  on  the  gra¬ 
dient  of  the  image  values,  is  sensitive  to  shading.  This 
causes  false  edgels  where  there  is  no  discontinuity,  on 
curved surfaces like the fuselage of an aircraft. The effect
of gradients also causes large biases in the estimate
of orientation and position of edgels. These biases contribute
to inaccuracy of measurement that induces an exponential
complexity in linking and in recognition. The
reason  for  the  biases  is  that  the  Canny  operator  uses  an 
inaccurate  and  incomplete  model  of  the  local  image  sur¬ 
face.  The  Canny  operator  is  designed  for  a  step  edge 
between  flat  image  surfaces.  Such  edges  are  a  minority. 
On  any  other  edges,  the  operator  has  biases  from  the 
inappropriate  model. 

The  Wang-Binford  algorithms  are  two  directional  op¬ 
erators  designed  to  minimize  inaccuracy  including  biases 
in  edgel  orientation  and  position  estimation  from  a  raw 
image.  The  second  of  the  two  algorithms  will  be  de¬ 
scribed  here.  It  uses  2-dimensional  fitting,  instead  of  1- 
dimensional  fitting,  that  is  less  sensitive  to  factors  which 
are  not  included  in  a  step  edge  model.  The  operator  also 
uses  the  logarithm  of  the  gradient  magnitude  to  map  a 
nonlinear  fitting  problem  to  a  linear  problem,  thus  re¬ 
ducing  the  computational  complexity  greatly. 

This  operator  is  effective  for  step  edges  in  intensity  im¬ 
ages.  It  is  not  complete  in  the  sense  that  the  operator  is 
not  effective  for  many  image  features,  e.g.  lines  or  spots. 
The author has promoted development of a complete set
of operators [Herskovits and Binford 69].

A  Gaussian  derivative  is  convolved  with  the  intensity 
image  to  estimate  gradient  components  in  i  and  j  direc¬ 
tions.  An  estimate  is  made  of  the  local  surface  curvature 
near  local  maxima  of  the  log  of  the  gradient  magnitude. 
The  gradient  magnitude  along  the  transverse  direction 
of  a  step  edge  is  a  Gaussian;  its  logarithm  is  parabolic. 
A  linear  2-dimensional  quadratic  surface  is  fit  to  every 
maximum  with  a  3  by  3  support  to  get  the  position, 
orientation,  and  contrast  information  of  the  edgel. 
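
The reason the logarithm turns the fit into a linear problem can be seen in one dimension. If the gradient magnitude across the edge is g(x) = A exp(-(x - x0)^2 / (2 s^2)), then log g(x) is a parabola, and three samples around the maximum determine x0, s and A in closed form. The sketch below is a 1-D simplification written for illustration, not the 3 by 3 two-dimensional fit of the operator described above; the function names and test values are hypothetical.

```python
import math

def fit_log_parabola(g_minus, g_zero, g_plus):
    """Fit a parabola to log gradient magnitude sampled at offsets -1, 0, +1.

    Returns (sub-pixel offset of the peak from the centre sample,
    Gaussian width sigma, peak gradient magnitude).
    """
    l_minus, l_zero, l_plus = math.log(g_minus), math.log(g_zero), math.log(g_plus)
    curvature = l_minus - 2.0 * l_zero + l_plus  # equals -1 / sigma**2 for a Gaussian
    if curvature >= 0.0:
        raise ValueError("log profile is not concave at the centre sample")
    offset = 0.5 * (l_minus - l_plus) / curvature
    sigma = math.sqrt(-1.0 / curvature)
    peak = math.exp(l_zero + offset ** 2 / (2.0 * sigma ** 2))
    return offset, sigma, peak

def gaussian_profile(x, amplitude=40.0, centre=0.3, sigma=1.2):
    """Noise-free transverse gradient profile used only to exercise the fit."""
    return amplitude * math.exp(-((x - centre) ** 2) / (2.0 * sigma ** 2))

offset, sigma, peak = fit_log_parabola(
    gaussian_profile(-1.0), gaussian_profile(0.0), gaussian_profile(1.0)
)
print(offset, sigma, peak)
```

On a noise-free Gaussian profile the three-point fit recovers the parameters up to rounding; with sensor noise the same linear form is what allows a least-squares fit over a larger support.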

Estimation of the improvement offered by the Wang-Binford
algorithm is underway, by building statistical models of this
algorithm and others, e.g. the Roberts-Cross operator,
the Canny operator, etc.

Conclusion 

Quasi-invariants  from  GC  shape  representation  pro¬ 
vide  a  basis  for  figure-ground  discrimination  of  GC 
parts hypotheses. Quasi-invariants also enable estimation
of 3-D shape of parts and objects that enables view-insensitive
hypothesis generation and indexing. These
two  mechanisms  reduce  drastically  the  first  and  second 
most  computationally  complex  operations  in  recognition 
in  moderately  complex  imagery.  They  also  enable  recog¬ 
nition  with  targets  with  a  high  degree  of  variability. 

Classics implements an object system with a strong
mathematical type hierarchy. The shape representation
implemented  in  Classics  enables  automation  of  target- 
specific  recognition  from  VSCP  object  models. 

Improvements  in  segmentation  demonstrating  ex¬ 
tended  edges  are  important  to  robust  recognition  and 
measurement. 

We  plan  to  demonstrate  image  interpretation  of  more 
complex  object  models,  including  multiple  parts  with 
occlusion,  within  the  next  quarter.  The  recognition  will 
be completely automatic and model-driven.

Abstract

We discuss the need for a new series of benchmarks
in  the  vision  field,  to  provide  a  direct  quantitative  mea¬ 
sure  of  progress  understandable  to  sponsors  of  research 
as  well  as  a  guide  to  practitioners  in  the  field.  A  first 
set of benchmarks in two categories is proposed: (1) static
scenes  containing  manmade  objects,  and  (2)  static  nat¬ 
ural/outdoor  scenes.  The  tests  are  “end-to-end”  and 
involve determining how well a system can identify instances
(an item or condition is present or absent) in
selected  regions  of  an  image.  The  scoring  would  be  set 
up  so  that  the  automatic  setting  of  adjustable  parame¬ 
ters  is  rewarded  and  manual  tuning  is  penalized.  To 
show how far machine vision has yet to go, a Benchmark
2000  problem  is  also  suggested  using  children’s  “what  is 
wrong”  puzzles  in  which  defective  objects  in  a  line  draw¬ 
ing  of  a  scene  must  be  found. 

1  Introduction 

Speech  and  natural  language  researchers  at  DARPA  have 
made  extensive  use  of  a  benchmarking  approach  to  ob¬ 
tain  a  measure  of  the  progress  in  these  fields.  This  ex¬ 
ercise  has  had  two  distinct  positive  effects  on  the  fields. 
First,  since  the  results  of  the  benchmarks  provide  a  di¬ 
rect  quantitative  measure  understandable  by  people  out¬ 
side  these  fields,  they  have  been  of  great  use  “politically” 
within  DARPA.  Second,  through  the  rigorous  compari¬ 
son  among  various  techniques  the  benchmarking  exercise 
has  spurred  advances  in  the  field.  This  paper  discusses 
the  strategy  for  devising  and  using  a  new  series  of  bench¬ 
marks  in  the  vision  field.  We  believe  that  the  vision  field 
requires  such  benchmarking  efforts  to  objectively  mea¬ 
sure  its  progress.  There  have  been  some  previous  bench¬ 
marking  attempts  in  vision  at  DARPA  and  elsewhere, 
but  they  dealt  mainly  with  measuring  the  performance  of 
computer  architectures  running  vision  algorithms,  rather 
than  with  the  performance  of  the  vision  algorithms  and 
systems. 

The  goal  of  creating  challenging  benchmark  problems 
in machine vision is to pose a reasonably comprehensive
set of vision problems on which proposed advances can
be subjected to experimental evaluation. The problems
should  be  formulated  in  terms  of  tasks,  for  a  module 
or for a whole system, independent of any specific techniques
- e.g., evaluation of three-dimensional shape recovery
rather than evaluation of a shape-from-X method,
or  evaluation  of  natural  scene  understanding  rather  than 
evaluation  of  an  expert  system-based  interpretation  sys¬ 
tem.  There  would  be  challenging  problems  in  various 
categories,  e.g.,  outdoor  scenes,  manmade  objects,  time- 
varying  scenes,  and  so  on.  After  discussing  issues  and 
methodologies  in  creating  vision  benchmark  problems, 
this paper presents a few example problems in the domain
of static natural scenes and in static scenes containing
manmade objects.

2 Previous Benchmarking in Vision

The  DARPA  benchmark  carried  out  from  1986-89  [1,2] 
was an attempt to characterize the performance of machine
architectures running IU algorithms. As such, it
is not directly applicable to the current IU benchmark
effort.  However,  there  were  several  lessons  learned,  pri¬ 
marily  the  time-consuming  and  somewhat  expensive  na¬ 
ture  of  the  operation. 

A  more  pertinent  benchmark  approach  is  the  Un¬ 
manned  Ground  Vehicle  set  of  evaluations  for  all  sub¬ 
systems,  including  stereo,  LADAR,  road-following,  and 
path  planning.  The  stereo  evaluation,  described  in  these 
proceedings  [3],  is  of  particular  interest  in  this  regard. 
The  overall  plan  was  to  pursue  a  three-pronged  ap¬ 
proach,  including  analytic  models,  qualitative  “behav¬ 
ioral”  models,  and  statistical  performance  models.  The 
analytic  models  are  used  to  estimate  the  expected  depth 
precision  computable  with  a  specific  camera  configura¬ 
tion.  The  qualitative  models  are  used  to  identify  key 
problems for future research. The statistical model is
used to produce quantitative estimates of such key factors
as the smallest obstacle detectable at a specified distance.
Data gathering and preparation required a large
amount of effort. Imagery was collected from five groups;
49  image  pairs  were  selected  for  analysis  and  converted 
to  a  standard  format.  An  interesting  initial  result  of  the 
evaluation  was  the  identification  of  the  strengths  and 
weaknesses  of  the  various  stereo  techniques,  leading  to 
the possibility of combining them in a system that produces
more complete and accurate results than any of
the  individual  techniques. 

3 Issues in Vision Benchmark Design

There  are  a  few  important  issues  that  arise  in  vision 
benchmarking:

•  specifying  the  scope  of  a  problem, 

•  balancing  competitive  and  collaborative  aspects  of 
benchmarking,  and 

•  devising an evolutionary problem selection mechanism
for future benchmarks.

3.1  Vision  Problem  Specification 

The  critical  difference  between  producing  vision  bench¬ 
marks,  and  producing  those  for  language  and  speech, 
is  that  the  field  of  machine  vision  does  not  yet  have 
widely acceptable specifications for generic problem domains
(or representations) on which to base the problem
definitions.  Further,  the  number  of  sample  images  and 
supporting  data  necessary  to  cover  any  given  problem 
domain  without  artificial  (unknown)  biases  seems  to  be 
far  larger  than  in  language  and  speech.  In  language, 
topics  and  languages  certainly  vary  a  lot;  yet  a  large 
enough  number  of  news  articles,  novels,  etc.,  will  reason¬ 
ably  cover  the  problem  variations,  and  text  files  contain 
everything  the  benchmarking  algorithms  need  to  employ. 
In speech, there are variations in frequency, dialects, etc.,
yet  a  “numeral  digit  recognition”  problem  or  a  limited 
vocabulary  problem  provides  some  reasonable  bound  to 
a  problem  domain,  and  a  large  enough  number  of  speech 
samples  will  cover  the  variations.  A  high-quality  tape 
of  speech  (with  various  types  of  background  noise)  is  a 
good  universal  basic  representation  for  the  input  data. 

In  image  understanding,  the  direct  analogies  do  not 
work  as  well.  Outdoor  natural  scenes  do  not  seem  to 
have  an  accessible  technical  definition,  except  that  peo¬ 
ple  can  probably  classify  a  given  image  as  depicting  a 
natural scene or not. A universal representation/medium
for  sensed  data  describing  a  scene  does  not  exist.  More¬ 
over,  the  nature  of  the  input  devices,  the  way  we  acquire 
images  and  specify  the  resolution,  the  measurable  infor¬ 
mation,  etc.  can  themselves,  singly  or  in  combination, 
constitute  major  research  problems. 


3.2 Competitive and Collaborative Aspects

The  principal  goals  of  the  proposed  vision  benchmarks 
are  to  evaluate  scientific  progress  in  specific  problem  ar¬ 
eas,  and  to  make  the  extent  of  such  progress  apparent  to 
the  sponsors  of  the  research  as  well  as  to  the  scientists 
working  in  this  field.  Evaluation  of  a  set  of  alternative 
solutions to a problem naturally involves comparing the
resultant scores and thus ranking the techniques. We
can’t avoid competition. Making the results of the evaluation
difficult to interpret or keeping the identity of the
participants  secret  eliminates  the  incentive  to  enter  the 
evaluation  and  exert  the  necessary  effort  to  do  well. 

Nonetheless,  it  is  very  important  to  make  sure  that  we 
are  competing  on  the  right  problems,  that  the  competi¬ 
tion  is  fair,  and  that  we  don’t  poison  the  currently  excel¬ 
lent  cooperative  atmosphere  that  exists  in  the  DARPA 
vision  research  community. 

Among  its  positive  benefits,  benchmarking  will  pro¬ 
mote  collaboration.  Many  researchers  will  not  be  able  to 
afford  to  develop  all  the  system  components  themselves 
in  order  to  enter  the  evaluation.  A  module  or  compo¬ 
nent  that  has  been  proven  to  have  high  performance  will 
be  transferred  from  the  hands  of  the  developer  to  other 
sites  whose  main  research  focus  is  not  the  module,  but 
rather  access  to  its  functionality. 

In  the  specific  case  of  algorithms  that  are  intended  to 
run  autonomously,  i.e.,  without  manual  tuning,  it  is  crit¬ 
ical  for  the  purposes  of  believability  that  the  test  data 
NOT be given to the contestants prior to the benchmark.
Further, the problem of automatically finding
settings for adjustable parameters (present in almost every
vision algorithm) is a key vision problem in which
progress should be encouraged - the benchmark could
be a positive influence in this regard.

3.3  Evolution  of  Benchmarks 

Since  there  is  enough  diversity  of  opinions  about  what 
constitutes  the  correctness  of  the  output  of  any  com¬ 
ponent,  practical  benchmarking  tends  to  be  performed 
on  “end-to-end”  systems  performing  a  well-understood 
task.  This  emphasis  on  system  evaluations  can  have  both 
positive  and  negative  impacts  on  the  field.  Positive  ef¬ 
fects  are:  the  promotion  of  research  because  there  exist 
accepted  criteria  of  progress;  the  establishment  of  some 
priorities  on  problems  to  be  solved;  and  an  increased 
awareness  of  the  availability  and  usefulness  of  a  broader 
range  of  techniques  for  performance  improvement.  Po¬ 
tential  negative  effects  are:  the  temptation  to  use  any 
trick  that  improves  performance  on  the  evaluated  task; 
the  premature  stifling  of  new  directions  in  the  field,  and 
the reliance on some statistical methods which tend to
produce better “average” results. Some of these phenomena,
positive and negative, have appeared in the language
and speech fields since the introduction of benchmarks.

To  achieve  the  positive  effects  and  avoid  the  negative 
ones,  the  vision  benchmarks  should  allow  for  evolution 
and  expansion  as  we  improve  our  understanding  of  the 
field.  This  is  especially  important  in  vision  because  vi¬ 
sion  tasks  are  not  static  -  they  expand.  The  benchmark 
problems  cannot  be  a  casually  controlled  ad-hoc  collec¬ 
tion  of  problems,  or  a  set  of  problems  carefully  tailored 
for  small  cliques,  each  with  a  special  view  of  how  the 
problem  should  be  solved.  If  a  group  of  investigators 
wants  to  pursue  a  promising  new  approach  which  can¬ 
not  be  evaluated  appropriately  within  the  current  set  of 
benchmarks,  there  must  be  a  mechanism  to  define  a  new 
DARPA  benchmark  if  appropriate  criteria  are  satisfied. 

4 Deriving Benchmark Problems

4.1  Problem  Selection 

We  must  first  define  the  problem  domains,  such  as  static 
outdoor  scenes,  static  scenes  of  manmade  objects,  out¬ 
door  image  scene  sequences,  and  image  sequences  of 
manmade  objects.  A  panel  will  be  organized  to  carefully 
divide  the  vision  field  into  categories  and  subcategories 
because this categorization is one of the most critical issues
in the design of the benchmark. A relatively small
number of problems (fewer than four initially) in each category
would be carefully selected by community consensus.
These problems should be based on some important
vision  function,  NOT  some  vision  architecture,  represen¬ 
tation,  or  technique. 

These  should  generally  be  retained  in  the  benchmark 
until  they  are  “solved”  or  no  longer  of  scientific  impor¬ 
tance. 

The  complete  benchmark  should  cover  the  vision  field 
by defining between five and ten separate problems for
the  competition;  it  may  be  necessary  to  have  two  prob¬ 
lems  in  some  categories  to  separately  deal  with  the  main 
dichotomy  of  strong  vs.  weak  models  (e.g.,  man-made 
environments  vs.  natural  outdoor  scenes).  Each  prob¬ 
lem  category  may  require  separate  subcategories  for  dif¬ 
ferent  sensing  modalities,  viewing  conditions,  and  envi¬ 
ronmental  factors;  in  the  case  of  sensing  modalities,  the 
subcategories  could  be 

•  intensity images vs. range images

•  black-and-white vs. color (or multispectral)

•  significant  perspective  distortion  vs.  essentially  or¬ 
thographic  projection 

The  problems  listed  in  Appendix  A  are  a  few  ab¬ 
stracted  versions  of  possible  benchmark  entries  for  the 
static outdoor scenes and man-made object scenes. They
are offered for discussion in the light of all the above
sentiments,  but  the  task  of  choosing  the  actual  problems 
still  remains.  It  is  hoped  that  for  an  initial  benchmark 
a  total  of  no  more  than  four  problems  will  be  selected 
from  the  set  of  all  submissions.  This  would  allow  us  to 
work  out  the  details  of  the  process  before  an  excessive 
amount  of  effort  is  expended. 

Finally,  it  is  important  to  remember  that  the  proposed 
benchmark  will  only  cover  a  small  subset  of  the  impor¬ 
tant  problems  in  machine  vision.  We  have  not  done  away 
with  all  the  traditional  methods  of  reporting  and  evalu¬ 
ating  progress. 

4.2  Evaluation  Method 

Human  examiners  would  select  (but  not  necessarily  re¬ 
veal  to  the  contestants)  a  few  locations  in  each  im¬ 
age  that  contain  obvious  instances  (item  or  condition 
is  present  or  absent)  of,  for  example,  the  existence  of  a 
road  or  a  material  like  grass  or  rock.  The  scoring  at  each 
location  is  binary  -  correct  or  incorrect. 

Problems in, for example, recognizing natural objects
are believed to be difficult enough that no currently
known technique, or brute-force approach, can perform
well (i.e., within 50% of human performance on the
recognition problems and somewhat higher on the
geometry problems, depending on the availability of
calibration data and the nature of the prior models) without
additional constraints on the problem (or the provision of
auxiliary information, such as manual parameter adjustment
to match the given imagery). An obvious advance
would be a performance improvement of, say, 5 to 10%
over that of the previous best known technique. When
performance of a computer vision technique reaches (say)
90-95% of human performance, the corresponding problem
is considered to have a reasonable scientific solution,
and further advance is then also considered in terms of
engineering criteria (cost, speed, complexity, etc.).
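
The scoring rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the record layout (`ProbeResult`), the function names, and the choice of 0.90 as the "solved" threshold (the lower end of the 90-95% range) are hypothetical and are not part of the proposal.

```python
from dataclasses import dataclass


@dataclass
class ProbeResult:
    """One examiner-selected image location (hypothetical record layout)."""
    truth_present: bool      # examiner's ground truth: is the item/condition present?
    predicted_present: bool  # the program's answer at the same location


def fraction_correct(results):
    """Binary scoring: each location counts as simply correct or incorrect."""
    if not results:
        raise ValueError("no scored locations")
    hits = sum(r.truth_present == r.predicted_present for r in results)
    return hits / len(results)


def relative_to_human(machine_score, human_score):
    """Express a machine score as a fraction of human performance."""
    if human_score <= 0:
        raise ValueError("human score must be positive")
    return machine_score / human_score


def is_scientifically_solved(relative_score, threshold=0.90):
    """Apply the 90-95% criterion; the default threshold is an assumption."""
    return relative_score >= threshold
```

For instance, a program that answers three of four probe locations correctly scores 0.75; against a human score of 1.0 that is 75% of human performance, which falls short of the "solved" criterion.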

4.3  Competition  Procedure 

It  is  intended  that  there  would  be  a  competition  once 
a  year  to  choose  the  best  performing  program  in  some 
(or  all)  problem  categories.  The  programs  would  have  to 
run  on  specified  machine  configurations,  must  take  the 
input  images  in  a  specified  format,  and  must  produce 
answers in a specified time interval. To ensure an initial
reasonable  baseline  of  performance  for  the  most  difficult 
problems,  an  operator  would  be  allowed  to  place  a  speci¬ 
fied  number  of  labeled  markers  in  an  overlay  of  the  given 
test  image  and/or  be  given  (in  advance)  a  small  window 
from  the  test  image  to  allow  system  calibration  and  pa¬ 
rameter  adjustment.  Typical  images  from  each  category 
would  be  provided  in  advance,  and  would  not  change  in 
nature  or  difficulty  from  year  to  year. 

Entry  in  the  competition  implies  the  entrant  is  willing 
to  make  public  the  theory  (and  possibly  pseudo-source 
code) for his algorithms and allow the use of his object
code  (at  least)  for  scientific  purposes. 

4.4  Specific  Proposal 

1.  A  list  of  5-10  problem  domains  will  be  selected  for 
the  benchmark. 

2.  A  panel  of  experts  would  be  chosen  to  define  the 
problems  and  select  the  sample  and  test  imagery 
and  the  contextual  information  to  be  made  avail¬ 
able.  Sample  imagery  would  be  available  prior  to 
the  competition.  The  actual  test  imagery  would  be 
provided  to  all  interested  parties  after  the  competi¬ 
tion. 

3.  The  nominal  approach  would  be  for  the  panel  to  se¬ 
lect  a  few  locations  in  each  image  that  contain  obvi¬ 
ous  instances  (item  or  condition  absent  or  present) 
of  the  challenge  problem  subject  matter;  this  infor¬ 
mation  would  not  be  revealed  (for  the  test  imagery) 
until  after  the  competition;  scoring  at  each  location 
is  binary,  correct  or  incorrect. 

4.  The test could be held yearly (e.g., at the IU meeting
or at some selected contractor site) on machines
provided  or  approved  by  DARPA.  Programs  must 
produce  answers  in  a  specified  time  interval.  It  is 
intended  that  the  programs  will  be  run  without  in¬ 
tervention  by  the  contestants,  but  some  provision 
might  be  made  to  allow  a  contestant  to  tune  his 
program  at  a  specified  penalty  to  his  test  score. 

5.  Theory  (and  possibly  pseudo  source  code)  must  be 
provided  in  report  form,  and  the  compiled  code  ac¬ 
tually  used  in  the  competition  made  available  (free, 
but  possibly  under  license)  for  scientific  use. 

5  Conclusion 

We  have  taken  the  initial  steps  in  developing  a  new  set 
of  machine  vision  benchmarks  in  the  areas  of  manmade 
object  scenes  and  natural  scenes.  The  next  steps  involve 
more  careful  delineation  of  the  experimental  protocol, 
selection  of  the  specific  problems,  and  the  gathering  of 
imagery  and  other  test  data.  We  welcome  comments  on 
this  new  DARPA  benchmarking  effort. 
APPENDIX

This  appendix  offers  a  set  of  abstracted  versions  of  pos¬ 
sible  benchmark  entries  for  discussion  in  the  light  of  the 
goals  and  issues.  The  task  of  specifying  the  actual  prob¬ 
lems  still  remains. 

A.1  Man-Made Object Scenes

The  key  issues  in  setting  up  the  problems  for  man-made 
object  scenes  include: 

•  amount  of  clutter  in  a  scene 

•  amount  of  occlusion  of  objects 

•  class  of  shapes  of  objects  (e.g.,  polyhedral  vs. 
curved,  planar  vs.  3D,  fixed  vs.  articulated  or  de¬ 
formable) 

•  class  of  surfaces  (e.g.,  textured,  specular,  diffuse) 

•  lighting  conditions 

•  kind  of  imagery  (2D  vs.  3D,  grey-scale  vs.  color) 

•  class  of  transformations  applied  to  object  model 

Even more important is an evaluation method. We can
use the ROC (Receiver Operating Characteristic) curve,
which plots the false negative rate vs. false positive rate,
as the overall indication of the performance of a system.
We should evaluate the accuracy of computed pose as well
as the number of free parameters in the system. We should
also define a series of increasingly harder problems, such
as those presented below.
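
As a rough illustration of the ROC-style evaluation described above, the following Python sketch sweeps a decision threshold over detector scores and reports the false positive and false negative rates at each setting. The function names and the convention that a candidate is reported when its score meets the threshold are assumptions made for this example.

```python
def error_rates(scores, labels, threshold):
    """Return (false_positive_rate, false_negative_rate) at one threshold.

    scores: detector confidence for each candidate.
    labels: True where the candidate really is an instance of the object.
    A candidate is reported as a detection when its score >= threshold.
    """
    positives = sum(1 for y in labels if y)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one positive and one negative example")
    false_pos = sum(1 for s, y in zip(scores, labels) if s >= threshold and not y)
    false_neg = sum(1 for s, y in zip(scores, labels) if s < threshold and y)
    return false_pos / negatives, false_neg / positives


def roc_points(scores, labels):
    """Use every distinct score as a threshold, highest first.

    Each entry is (threshold, false_positive_rate, false_negative_rate).
    """
    thresholds = sorted(set(scores), reverse=True)
    return [(t, *error_rates(scores, labels, t)) for t in thresholds]
```

Plotting the second and third elements of each tuple against one another gives the curve; lowering the threshold trades missed objects for false alarms.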

PROBLEM  1:  Flat  parts  (with  little  or  no  texture 
on  the  parts  or  background)  and  known  camera  orienta¬ 
tion.  But  include  significant  clutter  (e.g.  as  little  as  10% 
of  the  features  in  the  image  are  associated  with  the  ob¬ 
ject)  and  significant  occlusion  (perhaps  as  little  as  25% 
of  the  object  is  visible)  as  well  as  noise.  The  goal  is  to 
identify  and  locate  as  many  instances  as  possible  from  a 
small  library  of  known  models.  This  problem  is  probably 
fairly  well  solved  by  several  existing  algorithms.  Open 
issues  include  how  to  provide/obtain  the  models  and  a 
range  of  shapes  for  each  of  the  models.  We  can  allow 
the  objects  to  scale,  rotate,  and  translate  in  the  image 
plane. 

Example objects include 2D parts (e.g. teletype parts)
and tools (wrenches, screw drivers, etc.).

PROBLEM  2:  Solid  rigid  objects  with  no  articula¬ 
tion,  but  with  significant  clutter  and  occlusion.  Texture 
is  allowed  on  the  objects  and  background.  One  version 
of the problem would be 3D shape recovery from a single
2D  image;  a  second  version  could  be  3D  shape  recovery 
from  a  3D  image  (i.e.  from  range  imagery). 

The  objects  should  include  a  range  of  shapes  and  even 
several  shapes  that  differ  only  in  a  few  places,  so  that 
saliency  can  be  an  issue.  Examples  include  models  of 
vehicles  or  planes,  simple  office  scenes,  etc. 

PROBLEM  3: 

Generic  objects.  This  could  include  parameterized  ob¬ 
ject  classes  and  articulated  objects.  The  idea  is  to  allow 
for  objects  that  are  nonrigid,  while  still  structured.  A 
tank  is  an  obvious  example,  given  the  movable  turret. 
Classes  of  vehicles  in  general  provide  the  next  level  of 
complexity (e.g. categorizing a vehicle as a sedan, a station
wagon, a van, etc., as well as localizing it).

PROBLEM  4:  Recognition  by  function.  A  classic 
example  is  a  “chair”,  where  the  recognition  system  has 
to  identify  objects  not  only  by  shape,  but  also  by  whether 
they  meet  certain  functional  constraints  (such  as  stable 
support,  a  flat  surface  on  which  to  sit,  etc.) 

A.2  Static Natural Scenes

1.  Generic  (natural)  Object  Recognition  or  Classifica¬ 
tion:  Recognize  (point  to  or  delineate)  rocks  and 
trees  in  single  images  of  outdoor  scenes.  Distinguish 
between  rocks  and  sand  in  a  desert  scene. 

2.  Specific  Object  Recognition:  Recognize  (point  to, 
or  delineate)  the  presence  of  a  specific  known  object 
in  an  image  (e.g.  a  particular  person).  The  object 
(preferably  non-rigid)  can  be  partially  occluded  or 
seen  from  any  aspect. 

3.  Feature  Extraction  and  Delineation: 

-Extract  a  road  network  from  an  aerial  image  that 
can  be  either  vertical  or  oblique  and  low  or  high 
resolution.  For  example,  the  image  could  even  be  a 
view  of  a  partially  occluded  freeway  from  a  window 
in  a  nearby  building. 

-Given  the  skeleton  of  a  tree  trunk  or  tree  limb  (in  a 
forest  scene),  accurately  delineate  the  corresponding 
edges and measure the width of the trunk/limb.

4.  Geometric Recovery From a Single Image of a Natural
Scene:

-Given  a  line  overlaying  an  image,  determine  rela¬ 
tive  depth  (from  the  camera)  along  the  line. 

-Determine (scene) surface orientation at a given
set  of  points  (locations)  in  an  image.  The  scale  of 
interest  will  be  provided. 

5.  Geometric Recovery From Multiple Views Of Some
Specified  Object: 

-Model  (i.e.,  recover  the  3D  geometry)  a  building  in 
an  urban  scene  from  multiple  views. 


-Render  the  profile  of  the  skyline  seen  from  a  specific 
ground  location,  given  an  overhead  stereo  pair  of  a 
mountain  or  valley. 

-Given  a  dense  sequence  of  images  containing  an 
object  of  interest  (e.g.,  a  tank  or  a  rock)  taken  from 
a  camera  mounted  on  a  moving  truck,  recover  the 
geometry  of  the  object. 

6.  Track a Specific Moving Object: For example, a
specific  fish  in  a  fish-bowl  containing  many  fish  and 
other  objects  that  can  cause  occlusion  or  temporary 
disappearance  of  the  fish  being  tracked.  Other  pos¬ 
sibilities  are  a  person  in  a  crowded  store,  or  a  specific 
tank  in  a  formation  moving  through  a  wooded  area. 

7.  Generic  Problems  (edge  and  surface  classification) 
in  Image  Analysis: 

-Classify  specified  locations  in  an  image  as  either 
edge  or  not-edge;  if  edge,  then  further  classify  the 
nature  of  the  edge  as  either  occlusion,  orientation, 
illumination  (e.g.  shadow),  or  reflectance  edge. 

-Classify  the  material  type  of  selected  surface 
patches  in  an  image  as  either  wood,  metal,  glass, 
water,  vegetation,  sand/rock,  brick,  soil,  sky,  cloud, 
asphalt,  or  concrete. 

-Classify,  into  say  three  spectral  bands,  the  color 
spectrum  of  selected  surface  patches  in  a  natural 
outdoor  image. 

A.3  Benchmark  2000 

Some  day  we  would  like  vision  to  be  integrated  with 
knowledge  and  reasoning.  A  challenge  problem  to  eval¬ 
uate  this  capability  would  require  a  system  to  identify 
objects, use knowledge about these objects and the real
world,  and  apply  reasoning  to  solve  a  visual  problem.  We 
suggest  a  challenge  problem  along  these  lines  for  the  year 
2000  using  a  class  of  children’s  picture  puzzles  in  which 
the  child  is  asked  to  find  “mistakes”  in  the  picture.  The 
problem  posed  to  a  vision  system  would  be:  Given  a  line 
sketch  taken  from  a  children’s  book  of  “what  is  wrong” 
puzzles,  see  Fig.  1,  devise  an  automatic  vision  program 
that  can  perform  as  well  as  a  five  year  old  child.  The 
absolute  score  for  any  algorithm  is  the  percent  of  defec¬ 
tive  objects  found  in  a  scene;  the  relative  score  can  be 
obtained  by  comparison  with  a  child’s  performance. 
ABSTRACT

Image analysis is a labor-intensive
activity that grows increasingly intensive due to
the volume of imagery and collateral being
collected. Image analysts (IAs) and photo
interpreters need to extract accurate yet timely
information from the data. Strategically
automating portions of the processing will help
analysts achieve both objectives, first by
eliminating some of the tedium of the activity, then
by accelerating the process. Model-supported
exploitation (MSE) was identified as the
technology that will provide this automation.
This paper discusses in detail the various MSE
design constraints, first as they pertain to the
RADIUS problem domain, and then in the context of
their impact on the design of an MSE workstation.

1.  INTRODUCTION 

Image processing (IP) and image
understanding (IU) have been the subjects of
research for many years, both in the academic and
the industrial worlds. In 1990, a consortium of MSE
experts and researchers surmised that IU
technology was sufficiently mature for
implementation in an operational capacity.
Specifically, IU was thought to be particularly
well-suited for application to the problem of image
analysis. The IA community was targeted as the
end user of this technology, and the MSE concept
was designed to address its needs.

The goal of MSE is to support IAs in their
work  by  using  three-dimensional  site  models  to 
produce  visual  aids  and  provide  geographically 
referenced  collateral  information.  The  design  of  an 
MSE  workstation  must  take  into  account  the 
following  three  constraints: 

[1]  Operational  needs 

[2]  State  of  technology 

[3]  Cost  and  development  schedule 

Fully-automated processing has often been
touted as the theoretical goal, but the desirability
of this is questionable since the ultimate use of the
end product is of such import that skilled IAs will
always be needed to verify the processing results.
Furthermore, technology that is both robust and
reliable enough to achieve this objective is
currently not realistic. Practical development of an
MSE workstation also requires adherence to
budgetary and scheduling constraints.

1.1.  Image  Analysis 

Four  basic  activities  were  identified  as 
tasks for which IAs are responsible:

[1]  Change  detection  -  the  location  and 
identification of significant changes in an
image,  both  physical  (e.g.,  construction) 
and  logistical  (e.g.,  an  increased  number 
of  vehicles  in  a  delineated  area) 

[2]  Negation  -  the  determination  of  when  a 
change  first  appeared  at  a  site 

[3]  Detection  and  counting  -  the  counting  of 
all  instances  of  specified  objects 
regardless  of  quantity,  location,  or 
orientation  in  an  image 

[4]  Trends and history - or trend analysis,
the  chronological  documentation  of 
events  at  a  site 

Based on an analysis of current IA
activities, it was hypothesized that all image
analysis activities consisted of some combination of
these four tasks. In addition, the consortium
identified four tools for supporting IAs in their
work:

[1]  Registration  -  the  establishment  of  a 
mathematical  point-to-point  mapping 
between  two  data  types  (e.g.,  images, 
site  models,  maps)  to  correlate  key 
objects,  regardless  of  collection  source 

[2]  Perspective, geometric modeling, and
orientation (PG&O) - the two-dimensional
or three-dimensional rendering of data
types (e.g., images, site models, maps)
to produce visual aids

[3]  Recognition  guides  -  digitized  image 
chips, drawings, and text that assist IAs
in  identifying  objects 

[4]  Interpretation  aids  -  site  models  and 
recognition  guides  for  non-literal  image 
analysis 

Due to the present manual nature of image
analysis, these tools do not currently exist in the
operational environment. Consequently, their
impact on IA productivity remains to be validated.
Finally, two databases were identified as being
part and parcel of the MSE concept:

[1]  Site  baselines  -  the  primary  source  of  site 
specific  information  (e.g.,  equipment 
normalcy  ranges,  functional  area  layouts, 
historical  information) 

[2]  Site  folders  -  contain  other  collateral 
data  specific  to  a  site  (e.g.,  reference 
imagery,  maps,  textual  reports) 

Site  baselines  and  folders  currently  exist  in 
hardcopy  form.  These  ten  items  are  collectively 
known  as  the  RADIUS  application  concepts.  The 
design for an MSE workstation must provide
interactive  tools  for  addressing  each  of  these 
application  concepts,  either  in  part  or  in  whole. 

1.2.  Definition of MSE

The MSE concept is grounded in the
existence and exploitation of site models for image
analysis and consists of two components:

[1]  The development and maintenance of site
models using IU techniques

[2]  The use of site models for image
exploitation tasks, either directly by IAs
or via a set of model-based IU
exploitation tools

The IA community has expressed great
enthusiasm for the exploitation component in
general and for the exploitation tools in particular.
IU researchers recognize, however, that an
efficient site modeling capability must exist for
automation of exploitation to be possible.

1.3.  User Concerns

Productivity, accuracy, and timeliness in
image analysis are issues of fundamental
importance to the IA community. Two government-supported
projects are specifically geared toward
addressing these issues. They are the Research and
Development for Image Understanding Systems
(RADIUS) project and Workstation 2000.

1.3.1.  RADIUS 

RADIUS is planned to progress in two
phases. The goal of Phase 1 was to characterize
the state of technology and to define the MSE
operations concept, resulting in the design of a
testbed system. Phase 2 is geared towards testbed
development and evaluation in an operational
environment. In total, RADIUS is scheduled to run
five years, with Phase 1 being a two-year exercise
and Phase 2 being a three-year endeavor.

The RADIUS Phase 1 contract was
awarded to Hughes Aircraft Company in mid-1991,
with BDM Federal, Computing Devices
International (CDI), the Hughes Research
Laboratories (HRL), and the University of
Southern California (USC) forming its
subcontracting team. Hughes, HRL and USC
formed the IU technology assessment subteam,
while BDM and CDI worked on the MSE concept
definition. USC offered consultation and insight
into continued MSE research. The efforts of the
Hughes team resulted in the conclusions
summarized in this paper. The contract is
scheduled to end in June 1992, with the Request for
Proposal for Phase 2 being issued in late July.

1.3.2.  Workstation  2000 

Workstation 2000 is an umbrella concept
that covers the gamut of IA concerns regarding
implementation of MSE in an operational
environment. One concern is that any MSE
workstation designed to function in an operational
capacity must exploit existing collateral; that is,
the workstation must be integrated with the
databases currently available to the IA community.
Another concern is that the workstation tools must
facilitate image analysis as defined by IAs and
must be perceived as unobtrusive and non-invasive.
A third concern is the need for speed, the
desire for timeliness as well as accuracy. These
concerns were primary factors in the MSE
workstation design.
2.  OPERATIONAL NEEDS

The  two  operational  components  of  an  MSE 
system  are  the  modeling  subsystem  and  the 
exploitation  subsystem.  The  modeling  subsystem 
provides  a  means  for  generating  the  three- 
dimensional  site  models  and  for  linking  collateral 
data  to  these  models.  The  exploitation  subsystem 
provides the visual aids that support IAs in their
activities. A series of experiments was conducted
to determine what IAs wanted from site modeling
and  exploitation  tools.  The  requirements  the 
analysts  placed  on  the  subsystems  were  used  to 
derive  the  MSE  operational  needs. 

2.1.  The  Modeling  Subsystem 

Informal  site  models  are  currently  being 
used by IAs. These include physical entities (e.g.,
sketches,  NEL-produced  diagrams)  as  well  as  the 
knowledge  an  analyst  internalizes  as  he 
familiarizes  himself  with  a  site.  For  a  site  model 
to  be  useful  to  an  MSE  workstation,  however,  all 
pertinent  data  must  be  stored  explicitly. 
Furthermore,  related  information  (i.e.,  collateral 
data)  must  be  easily  accessible. 

A site model is a three-dimensional,
georeferenced wireframe representation of the
objects at a site. Included are both physical (e.g.,
buildings) and non-physical (e.g., functional areas)
objects, their attributes (e.g., labels), and links to
the collateral data that is associated with the
particular site. While it is obvious that site
models must include objects which are of analyst
interest, it is equally important that they include
objects which are of processing utility. By
definition, then, site models contain more
components than an analyst would want to see or an
algorithm could be designed to process.

The appropriate level of model detail is an
issue over which IAs differ and for which the
capabilities of IU vary. For example, an analyst
who is primarily interested in detecting gross
changes (e.g., demolition of a building) may be
satisfied with a simple block diagram, while one
who is interested in the function of a building may
require that substructures (e.g., doors, windows) be
included. The modeling subsystem must therefore
be capable of generating models at a variety of
detail levels to support the variety of IA needs.


Another issue of concern is the question of
which objects are necessary and sufficient for a site
model. Just as the appropriate level of model
detail is a function of the exploitation goals of
the IA, so too is the nature of the objects in which
the IA has a particular interest. To
support the variety of disparate interests without
inundating any particular IA with superfluous
information, the concept of layers was introduced.
Site model layers are subsets of the site model and
reflect either community-wide standards (e.g., the
baseline layer) or individual interests (i.e., user
layers). Access to the various layers may be global
or privileged.

From  the  user  perspective,  one  of  the  most 
important  conditions  for  site  models  to  be 
considered  useful  is  that  they  are  kept  reasonably 
up-to-date.  Thus,  in  addition  to  the  tools  for 
creating  the  initial  model,  the  modeling  subsystem 
must  also  provide  a  mechanism  for  site  model 
update:  for  creating  new  objects,  modifying  existing 
objects  and  their  collateral  links,  or  deleting 
destroyed  objects.  The  MSE  system  must  be  capable 
of  maintaining  the  change  history  of  a  site  to 
enable IAs to trace the evolution of the site.
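
To make the modeling requirements of this section concrete, the following Python sketch shows one way a site model with attributes, collateral links, layers, and a change history might be organized. It is a sketch under assumptions only: every class, field, and method name here is hypothetical, and nothing in it reflects the actual RADIUS testbed design.

```python
from dataclasses import dataclass, field


@dataclass
class SiteObject:
    """One modeled object; all field names are illustrative assumptions."""
    object_id: str
    label: str                 # attribute shown to the analyst
    physical: bool             # True for a building, False for a functional area
    vertices: list = field(default_factory=list)          # georeferenced 3D wireframe points
    collateral_links: list = field(default_factory=list)  # references to site-folder items
    layers: set = field(default_factory=lambda: {"baseline"})


@dataclass
class SiteModel:
    """Objects at a site plus a log of every create, modify, and delete."""
    objects: dict = field(default_factory=dict)
    history: list = field(default_factory=list)  # (date, action, object_id) tuples

    def create(self, obj, date):
        self.objects[obj.object_id] = obj
        self.history.append((date, "create", obj.object_id))

    def modify(self, object_id, date, **changes):
        obj = self.objects[object_id]
        for name, value in changes.items():
            if not hasattr(obj, name):
                raise AttributeError(name)
            setattr(obj, name, value)
        self.history.append((date, "modify", object_id))

    def delete(self, object_id, date):
        del self.objects[object_id]
        self.history.append((date, "delete", object_id))

    def layer(self, name):
        """Return the subset of objects that belong to the named layer."""
        return [o for o in self.objects.values() if name in o.layers]

    def changes_for(self, object_id):
        """Return the change history of a single object, oldest first."""
        return [h for h in self.history if h[2] == object_id]
```

Keeping the history as an append-only log, rather than overwriting objects silently, is what lets a deleted object still be traced through the evolution of the site.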

2.2.  Exploitation 

Those IAs interviewed during the RADIUS
Phase  1  experiments  summarized  their  exploitation 
responsibilities  in  the  following  sequence  of 
activities: 

[1]  Prioritize  the  imagery  to  be  exploited 

[2]  Locate  the  site  on  the  image 

[3]  Take  a  quick  look  around  the  site 

[4]  Detect  and  count  things  of  interest 

[5]  Look for site activity

[6]  Look  for  site  changes 

[7]  Discuss findings with other IAs

[8]  Take  a  final  look  around  the  site 

[9]  Report  findings  and  conclusions 

[10]  Retask  the  site  if  necessary 

While these activities seem ill-correlated
with the four basic IA tasks described in Section
1.1, upon further consideration, it will be seen that
the fundamental analysis process, that of assessing
the existence, magnitude, type, and timing of
change, is reflected in both sets of task
descriptions. What is important to note is the IA's
need to validate and verify change personally. In
essence, then, the exploitation subsystem must
provide the IA with interactive tools for making
these assessments. The subsystem must also offer
mechanisms for generating reports of IA findings.

3.  TECHNOLOGY  ASSESSMENT 

The underlying assumption is that IU
technologies will be available in the RADIUS
Phase 2 timeframe to fulfill the MSE operational
needs as outlined in the previous section. IU
systems were assessed during Phase 1 to project
realistic automation levels for the various MSE
tools in the Phase 2 testbed system. The assessment
focused on technologies that addressed the
following application concepts: site model
construction and update, registration, PG&O, and
site baselines and folders.

3.1.  Technology  Goals 

Any MSE technology targeted for inclusion
in the Phase 2 testbed system must have the basic
objective of lightening the IA's workload. Because
IAs have exploited imagery without IU tools for
years, successful introduction of MSE to the IA
community depends largely on the cultivated
perception of MSE as an aid rather than a
hindrance. With this caveat, automation levels
(i.e., manual, semi-automated, or automated) and
the relevant user interface issues were found to be
useful performance measures.

The primary issue with manual tools is the
user interface. The tools must provide sufficient
visual feedback to the IA so as to achieve the
required accuracy. Such tools must permit rapid,
intuitive operation without overwhelming the user
with options. For semi-automated tools, the issue
expands to include real-time response. The
objective in this case is to provide accurate results
in less time than would be needed by an IA using
manual tools. For automated algorithms, run-time,
accuracy and false alarms become key concerns.

Automated exploitation algorithms face
far more stringent requirements than those for site
model construction, since exploitation is often a
time-critical activity that demands quick
turnaround. Unfortunately, speed is often obtained
at the price of poor accuracy and high false alarm
rates. These errors undermine the value of
automation by forcing the IA to compensate for the
inaccuracies. Since site model construction is less
likely to be bound by such tight timing constraints,
modeling algorithms can be designed to emphasize
accuracy over speed.

3.2.  Characterization  Methodology 

An IU system qualified for Phase 1
characterization  if  it  satisfied  one  of  two 
conditions: 

[1]  It  was  a  commercial  product 

[2]  It  was  the  subject  of  active  research 

The literature is replete with descriptions
of systems that address elements of the RADIUS
problem domain. Rarely, however, have the
algorithms been subjected to extensive testing.
Success of the Phase 2 testbed system therefore
depends on the availability of the original
developers to support integration, bug fixes, and
system enhancements. While other highly
relevant systems have been presented in the
literature, it is doubtful whether they would be
ready for presentation to an IA within the RADIUS
Phase 2 timeframe.

The last RADIUS report* outlined plans to
provide test sets of imagery to system developers,
along with ground truth and code for initial on-site
characterization. Generating representative sets of
unclassified image data and having developers
characterize their work without compensation
have been inordinately time-consuming.
Consequently, this paper reflects a mixture of
formal characterization results and an
internal assessment of these systems and the
techniques they employ.

A decomposition of the MSE applications
was presented in the RADIUS Technology
Development Plan^. That decomposition was used
as the framework for characterizing the
functionality and the availability of the relevant
IU systems. Near-, mid- and long-term
availability correspond to the three years of
RADIUS Phase 2: 1994, 1995, and 1996. The
following subsections offer a conservative estimate
of the technology available to support MSE. It is
not a recommended architecture for Phase 2, but
rather a response to the assumptions regarding IU
maturity and the degree to which IU can satisfy
MSE operational needs.
3.3.  Site Model Construction

Figure 1 shows the decomposition of site
model construction and the technologies as they are
projected to be available during Phase 2. For
example, during the first year, Imagery Selection
is supported by cloud detection, which allows the
system to select only those images offering a clear
view of the site. Image-to-Image Registration can
be implemented via SMED (Hughes) or MAIR
(Harris Corporation). SMED takes advantage of
all available camera parameters and other
system-specific data, relieving the IA from having to
select more than two pairs of tie points. As tested
against the RADIUS characterization test set,
MAIR automatically registered image pairs with
a mean Y-disparity of less than 2.5 pixels. Terrain
Extraction is automatically performed by GLSM
(GDE Systems Incorporated - GDE), which is being
incorporated by the Defense Mapping Agency
(DMA) into their mapping workstations this year.
All these systems were designed to work on
classified imagery and their associated collateral.
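
The registration figure quoted above can be illustrated with a short Python sketch. The paper does not define Y-disparity or say how it was measured, so the sketch assumes it is the absolute row (vertical) offset, in pixels, between matched points once the second image has been mapped into the frame of the first. The function names and the tie-point layout are hypothetical.

```python
def mean_y_disparity(tie_points):
    """Mean absolute row offset, in pixels, over matched point pairs.

    tie_points: list of ((x_a, y_a), (x_b, y_b)) pairs, where the second
    point has already been mapped into the first image's coordinate frame.
    """
    if not tie_points:
        raise ValueError("no tie points")
    total = sum(abs(y_a - y_b) for (_, y_a), (_, y_b) in tie_points)
    return total / len(tie_points)


def within_tolerance(tie_points, max_pixels=2.5):
    """Check a registration against a mean Y-disparity bound (default 2.5)."""
    return mean_y_disparity(tie_points) < max_pixels
```

A small residual in the row direction matters because downstream stereo steps typically search for matches along image rows.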


Several options exist for Object Modeling in
the near-term. For manual modeling, three systems
offer broad support for the various MSE
applications: the RADIUS Common Development
Environment (RCDE; SRI/GE), GLMX (CDI), and SOCET
SET (GDE). RCDE is based on SRI's Cartographic
Modeling Environment, a manual site modeling and
IP package. GLMX was the basis of the RADIUS
concept validation experiments and has been used
in a classified operational environment to rapidly
generate site models for bomb damage assessment.
SOCET SET is a commercially available manual
site modeling package that draws on GDE's
experience with DMA mapping workstations.

These same three sources are addressing
near-term semi-automated Object Modeling. SRI,
as a RADIUS-related Broad Area Announcement
(BAA) awardee, is extending its model-based
optimization technology to extract roads,
railroads and buildings. CDI has developed a
means of using stereo imagery to accurately refine
coarsely-specified three-dimensional wireframe
primitives. GDE has enhanced SOCET SET with
algorithms that track roads and railroads and
locate rooftops in stereo imagery given an initial
user estimate. The constraint-based semi-
automated algorithm developed at GE/CRD is
estimated to be a mid-term capability. The
specification of model primitives for use by the
GE/CRD system may require more mathematical
expertise than possessed by a typical IA, thus
requiring a library of ready-made primitives or a
simplified user interface.

Figure 1. Site Model Construction. Each algorithm is coded according to its automation level. Similarly coded
are the boxes extending below each task, reflecting the automation level of that task for each year of Phase 2.
Tasks for which no technology is assigned are handled manually via database and other non-IU technologies.

Near-term  automated  Object  Modeling  is 
likely  to  be  available  as  systems  from  USC  and 
Carnegie-Mellon University (CMU). USC has two
approaches to automated structure extraction, one
based on stereo and the other on shadow evidence.
The CMU offering is based on the BABE system.
Each  of  the  three  systems  is  being  enhanced  to 
handle  oblique  imagery.  These  fully-automated 
systems  have  been  rated  as  near-term  technologies 
with  the  understanding  that  they  operate  with  an 
acceptably  low  false  alarm  rate.  The  long-term 
automated  building  extraction  algorithm  being 
developed  by  the  University  of  Maryland  (UMd) 
under  a  BAA  extracts  evidence  from  stereo  imagery 
and  uses  a  truth  maintenance  system  to  search 
among the building hypotheses generated.

The  manual  site  modeling  systems 
mentioned  above  are  applicable  to  the  Edit  Site 
Model task, in which the user may refine, add or
delete elements of the model. A model extension
capability is being developed at the University of
Massachusetts, Amherst (UMass) under a BAA.
Positional accuracy of site model elements will be
refined automatically via induced stereo generated
over  multiple  overlapping  views  of  the  site.  This 
algorithm  is  considered  a  long-term  capability. 

A  body  of  two-dimensional  campus-style 
maps,  to  which  collateral  information  is  attached, 
is currently available to IAs. Buildings, parking
areas,  points  of  entry  and  other  intelligence- 
related  site  features  are  typical  components  of 
these  maps.  More  importantly,  the  maps  include 
labels  and  identifiers  keyed  to  the  various  map 
elements.  Registration  of  this  map  to  imagery  of 
the  site  facilitates  automated  entry  of  these  labels 
and identifiers into the generated site model. After
projecting the campus-style map into the geometry
of an image of the site, SMED offers a near-term
manual capability for aligning the map with the
imaged site. An automated map-to-image
registration is considered a mid-term capability.

3.4.  Site  Model  Update 

Many  of  the  tasks  and  technologies  for  Site 
Model  Update  are  shared  with  Site  Model 
Construction.  Consequently,  Figure  1  also  applies 
to  site  model  update.  The  primary  difference  is  the 
availability  of  the  site  model  as  a  starting  point  in 
Site  Model  Update.  New  images  are  registered 
initially  to  the  projected  site  model,  instead  of  to 
other  images.  SMED  provides  a  "drag-and-drop" 
near-term  manual  registration  capability.  As  is 
the  case  with  map-to-image  registration,  an 
automated  model-to-image  registration  capability 
can be developed for mid-term testbed inclusion.

3.5.  Site  Baselines  and  Site  Folders 

Site baselines are textual descriptions that
are  updated  annually  or  as  frequently  as  needed. 
They  document  normal  activity  and  highlight 
specific  areas  of  interest.  Site  folders  contain 
reference  data  (e.g.,  maps,  charts,  select  images) 
that aid IAs in exploiting imagery of the site. IU
support  for  baselines  and  folders  comes  in  the  form 
of  model-to-reference  data  registration  (Figure  2). 

3.6.  Exploitation  Tasks 

RADIUS  Phase  1  did  not  include  the  four 
basic  image  analysis  tasks  (Section  1.1).  While 
Phase  2  will  explore  these  application  concepts 
more  extensively,  change  detection  has  been 
identified  as  a  BAA  research  area.  Figure  3  shows 
the  preliminary  requirements  common  to  all 
exploitation:  Determine  Image  Utility  and 
Register  Image  to  Model.  The  combination  of  cloud 
detection and model-to-image registration helps
automate the image prioritization process, thereby
streamlining the exploitation process. UMd is
developing a semi-automated change detection
capability,  scheduled  for  completion  by  the  third 
year  of  Phase  2. 

4.  TECHNOLOGY  STUDIES 

As  shown  in  Section  3,  registration  is 
fundamental  to  the  success  of  RADIUS.  Several 
data types must be registered to one another (i.e.,
imagery,  maps,  models).  While  pieces  of  these 
registration  tasks  have  been  addressed,  most 
notably image-to-image, there are aspects of the
problem specific to the proposed operational
environment and data. To validate the concepts
that would allow timely and accurate automated
registration, the Hughes team undertook two
studies. The first relates to image-to-model
registration, specifically the registration of large
site models to large images. The second study
investigated the utility of the campus-style
collateral maps mentioned in Section 3.3.

4.1. Registration of Large Models to Large Images

The problem of matching three-
dimensional models to images is similar to object
recognition where an object in an unknown position
and orientation is recognized. There are, however,
significant differences. The site model can consist
of several hundred objects, some of which may lie
off the image. The model is constrained to lie on a
known ground plane, leaving only one unknown
orientation parameter. Since scale is given, there
are only two degrees of freedom for position.
Matching is further restricted by a priori
knowledge of the camera parameters.

Since the site model is very large, the first
step in registration is to decompose the site model
into distinctive objects. By using only the most
distinctive objects in the model-to-image matching
process, computational expense is reduced
significantly. After the objects are matched to
features extracted from the image, the camera
parameters are then updated.

The registration system was developed
using unclassified data of Fort Hood. CDI provided
camera models and a site model containing 148
buildings. Two images, one nadir (FH1) and one
oblique (FH2), were used in the experiments. FH1
was used to build the site model; FH2 was not. The
image data was scaled from 16 to 8 bits per pixel,
and the images were subsampled by a factor of
four. Since the camera models were fairly accurate
in overlaying the site model onto the image, two
types of error were introduced to test the system.
Translation error was introduced by translating the
projected model in the image plane. Model error


Figure 2. Site Folders and Baselines. Multi-source registration is vital to site baselines and folders.

Figure 3. Exploitation. Exploitation depends on clear imagery and accurate model-to-image registration.

Figure 4. Registration of Fort Hood Site Model to FH1. The registration technique yielded a robust solution.


was introduced by randomly stretching each side of
a building in one dimension. The latter error did not
reflect true modeling error, but it did provide a
means of testing the effect of noise in the model.
Several tests were run using different error values.
Out of the 148 buildings in the site model, 20
distinctive buildings were chosen for matching.
Volume, area, and shape as defined by the total
number of roof vertices were used to compute
distinctiveness. Building locations were not
considered, although they turned out to be well
distributed spatially. LSC used extracted and
extended lines to compute junction features. Figures
4 and 5 show the full registrations of FH1 and FH2.
Figure 6 shows an enlarged portion of Figure 4.
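
As a rough illustration of how a distinctiveness ranking of this kind might be computed, the sketch below scores each building by its distance to the nearest other building in a normalized (volume, area, roof-vertex count) feature space and keeps the top k. The text does not say how the three measures were combined, so the scoring rule, the function name `distinctiveness_ranking`, and the toy building data are all assumptions made for illustration only.

```python
import math

def distinctiveness_ranking(buildings, k):
    """Return the k building names that are farthest from their nearest
    neighbour in normalized feature space.

    buildings: dict mapping name -> (volume, area, roof_vertex_count).
    The nearest-neighbour score is an assumed rule, not the paper's.
    """
    names = list(buildings)
    columns = list(zip(*(buildings[n] for n in names)))
    means = [sum(col) / len(col) for col in columns]
    # Population standard deviation; fall back to 1.0 for constant columns.
    stds = [
        math.sqrt(sum((x - m) ** 2 for x in col) / len(col)) or 1.0
        for col, m in zip(columns, means)
    ]
    normalized = {
        n: [(x - m) / s for x, m, s in zip(buildings[n], means, stds)]
        for n in names
    }

    def nearest_neighbour_distance(name):
        return min(
            math.dist(normalized[name], normalized[other])
            for other in names
            if other != name
        )

    return sorted(names, key=nearest_neighbour_distance, reverse=True)[:k]

# Hypothetical site: four similar buildings and one very different one.
site = {
    "barracks_a": (900.0, 300.0, 4),
    "barracks_b": (910.0, 305.0, 4),
    "barracks_c": (905.0, 298.0, 4),
    "depot": (950.0, 310.0, 4),
    "hangar": (12000.0, 1500.0, 8),
}
most_distinctive = distinctiveness_ranking(site, 1)
```

A rule like this ignores building location, as the experiments did; the later remarks on adding spatial information would correspond to extending the feature vector or the scoring function.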

A Hough transform was used to compare the
locations of predicted and extracted junctions over
the search window. Knowledge of the translation
error was used to restrict the parts of
Figure 5. Registration of the Fort Hood Site Model to FH2. The twenty most distinct buildings are highlighted
in white. Half of them extend beyond the image boundary.

Figure 6. A Building in FH1. The site model overlay
is black, the extracted corner junctions white.


the image that were processed and to restrict the
neighborhood around the projected model corner in
which image junctions were matched. Results were
described by the estimated translation error and by
the number of junctions found at that translation.
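
The voting scheme described above can be sketched as a translation-only Hough accumulator: every pairing of a predicted model junction with an extracted image junction inside the search window votes for the offset that would align them, and the offsets with the most votes are reported. This is a minimal sketch under assumptions; the function names, the integer pixel binning, and the toy junction lists are hypothetical and are not taken from the Hughes implementation.

```python
from collections import Counter

def vote_translations(predicted, extracted, max_shift):
    """Accumulate votes for integer (row, col) translations.

    Each predicted/extracted junction pair whose offset lies within
    max_shift pixels in both axes casts one vote for that offset.
    """
    accumulator = Counter()
    for pr, pc in predicted:
        for er, ec in extracted:
            dr, dc = er - pr, ec - pc
            if abs(dr) <= max_shift and abs(dc) <= max_shift:
                accumulator[(dr, dc)] += 1
    return accumulator

def top_entries(accumulator, n=5):
    """Return (count, count/peak, translation) rows, strongest first."""
    ranked = accumulator.most_common(n)
    if not ranked:
        return []
    peak = ranked[0][1]
    return [(count, round(count / peak, 2), shift) for shift, count in ranked]

# Toy data: four model junctions, shifted by a known amount, plus clutter.
predicted = [(10, 10), (20, 35), (40, 15), (55, 60)]
true_shift = (3, -2)
extracted = [(r + true_shift[0], c + true_shift[1]) for r, c in predicted]
extracted += [(5, 70), (33, 33)]  # spurious image junctions

accumulator = vote_translations(predicted, extracted, max_shift=10)
best_count, best_ratio, best_shift = top_entries(accumulator)[0]
```

In this toy case all four model junctions agree on the true shift, so it wins outright. When many model junctions fall outside the image, fewer consistent votes are cast and clutter offsets can compete with the correct one.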
Figure 7. Accumulator Array for First Test Using
FH1. The maximum for the correct peak (165) is
nearly  double  that  of  the  next  closest  estimate  (85). 


In the first test using FH1, no error was
added to the site model, but a translation error of
T = (300 rows, 250 columns) was introduced. The
search window in which the model was expected to
lie was 400 x 400 pixels, effectively 1600 x 1600
pixels in the original image. Figure 7 shows the top
five entries in the Hough accumulator array
indicating the matched junction count, the count-to-
peak ratio, and the estimated translation for the
first test.

In  the  second  test,  each  side  of  each 
building  in  the  model  was  stretched  by  10  pixels, 
while  T  was  held  at  (0,  0).  As  before,  the  search 
window  was  set  at  400  x  400  pixels.  Once  again, 
the peak was very strong (Figure 8).
Figure 8. Accumulator Array for Second Test Using
FH1. A robust solution is generated even in the
presence  of  model  inaccuracies. 


The  same  two  tests  were  performed  on  FH2. 
Inherent  discrepancies  were  known  to  exist  between 
the  site  model  and  the  image  since  FH2  had  not 
been  used  to  generate  the  model.  Consequently, 
smaller  errors  were  added.  In  the  first  test  of  FH2, 
no error was added to the site model and T = (-100,
-150). The search window was 200 x 200 pixels,
effectively  800  x  800  pixels  in  the  original, 
unreduced  image. 


Count   Count/Peak   T
56      1.00         (-29, -26)
50      0.89         (-100, -149)
47      0.84         (-154, -194)
46      0.82         (8, -102)
45      0.80         (22, -112)

Figure  9.  Accumulator  Array  for  First  Test  Using 
FH2.  A  sharp  peak  is  not  produced,  since  half  of 
the  distinct  elements  extend  past  the  image. 
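
The Count/Peak column in Figure 9 appears to be each junction count divided by the largest count in the accumulator. The short check below recomputes those ratios from the counts reported for the first FH2 test, and the ratio of the two counts quoted in the Figure 7 caption; the helper name `count_to_peak` is an assumption introduced only for this illustration.

```python
def count_to_peak(counts):
    """Divide each count by the largest count, rounded to two decimals."""
    peak = max(counts)
    return [round(count / peak, 2) for count in counts]

# Counts reported in Figure 9 (first test using FH2).
fh2_ratios = count_to_peak([56, 50, 47, 46, 45])

# Counts quoted in the Figure 7 caption (first test using FH1).
fh1_ratios = count_to_peak([165, 85])
```

The FH2 ratios all lie at or above 0.80, which is the flat accumulator the caption describes, whereas the FH1 runner-up reaches only about half of the peak.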


In this example, the estimated translation
error closest to the actual translation, (-100, -149),
was not indicated by a strong peak (Figure 9), unlike
the examples of FH1. Nevertheless, it was
within the top five maxima. As shown in Figure 5,
a  large  number  of  distinct  buildings  used  for 
matching  lie  off  the  image,  increasing  the 
likelihood  of  false  matches. 

In  the  second  test  of  FH2,  each  side  of  each 
building  in  the  model  was  stretched  by  10  pixels  in 
a  single  direction,  while  T  was  held  at  (0,  0).  The 


search  window  was  again  set  at  200  x  200  pixels.  As 
in  the  first  test,  the  estimated  translation  error 
closest  to  the  actual  translation,  (1,  0),  was  not 
indicated  by  a  strong  peak  (Figure  10). 
Figure 10. Accumulator Array for Second Test Using
FH2. Although not indicated as the best solution,
the correct translation was within the top five.


These  experiments  demonstrate  that  model 
decomposition  followed  by  a  simple  matching 
procedure  is  a  promising  technique  for  registering 
large  site  models  to  large  images.  Decomposition 
reduces  the  search  space,  since  only  distinctive 
buildings  need  to  be  matched.  Decomposition  can  be 
improved  by  including  spatial  information.  One 
approach  is  to  treat  closely-situated,  similar 
structures as non-distinctive. A second tack is to
identify distinctive patterns of structures and to
later match them as a unit. Finally, additional
processing is required in cases where only a portion
of the site is imaged (e.g., as in Figure 5).

4.2.  Registration  of  Campus-Style  Maps  to  Imagery 

Registration  of  campus-style  maps  to 
imagery  is  the  problem  of  registering  two- 
dimensional  site  models  to  imagery.  This  problem 
was  solved  for  two  scenarios  on  the  SCORPIUS 
program.  The  goal  of  SCORPIUS  (the  Strategic 
Computing  Object-directed  Reconnaissance 
Parallel-processing  Image  Understanding  System) 
was  to  demonstrate  automated  exploitation  of 
aerial  imagery.  Registration  on  SCORPIUS  used 
image  acquisition  parameters  to  project  the  site 
model into the image plane and then matched pre-
computed tie points to features extracted from the
image.  The  problem  was  one  of  determining  the 
rotation  and  translation  between  the  projected 
model  and  the  underlying  data. 

The  registration  system  was  developed 
using  a  campus-style  map  and  imagery.  Two 
oblique  images  of  a  site  were  selected,  one 
reflecting  summer  conditions  (e.g.,  clear  weather) 
and  the  other  showing  winter  conditions  (e.g.,  haze 
and  snow).  The  map  was  scanned  into  softcopy  and 
stored as a semantically-labelled two-dimensional
site model. For study purposes, tie points were
determined and labelled manually. The site model
was  projected  by  the  government-furnished  camera 
simulator  into  the  geometry  of  the  selected  images. 
Using  SCORPIUS  software,  the  map  and  its 
related  collateral  were  registered  successfully  to 
the  two  images. 

The  hypothesis  is  that  information  from 
the  registered  map  will  provide  valuable  a  priori 
knowledge to subsequent IU algorithms. More
specifically,  object  modeling  systems  will  have 
access  to  cues  regarding  the  number,  location, 
orientation,  size,  and  footprints  of  important 
objects  in  an  image.  As  a  result,  automation  of  the 
site  modeling  process  is  simplified,  being  modified 
from  a  general  to  a  directed  search  problem.  This 
study demonstrated the viability of using pre-
existing campus-style maps to obtain these cues.

5.  CONCEPTUAL  DESIGN 

The  conceptual  design  for  an  MSE 
workstation  must  satisfy  the  operational  needs 
enumerated  in  Section  2  while  taking  into  account 
the  technological  constraints  outlined  in  Section  3. 
The critical concerns for IAs are those of confidence
in an accurate site model and in the results of
automated processing. The key caveats issued by
the IU technologists focus on the boundary
conditions  and  accuracy  rates  of  their  algorithms. 

5.1.  Operational  Concept 

The  success  of  an  MSE  workstation  design  is 
grounded  in  the  integrity  and  usability  of  the 
underlying site models, as required by both the IAs
and the IU algorithms. The initial emphasis had
been on automation as the end goal in workstation
technology. In crafting a realistic operations
concept, however, the Hughes team found that IAs
questioned the desirability of full automation.
Analysts believed that ultimate responsibility for
the processing results was still theirs, thereby
making  a  more  interactive,  checks-and-balances 
system  more  appealing. 

Technology  characterization  (Section  3) 
validated the IAs' concerns, showing that systems
which  were  said  to  be  fully  automated  still  fell 
short  of  meeting  operational  requirements.  To 
accommodate the wishes of the IA community and
the limitations of existing technology, the MSE
operations  concept  includes  three  levels  of 
automation:  manual,  semi-automated,  and 
automated,  with  manual  capabilities  existing  as  a 
fallback  position  to  ensure  system  functionality 
under  all  operating  conditions.  The  RADIUS 
technology  studies  (Section  4)  demonstrated  that 
through  judicious  combination  of  existing  data  and 
technology,  such  an  operations  concept  is  a 
reasonable  and  realistic  objective. 

5.2.  Technical  Requirements 

The  technical  requirements  for  the  MSE 
workstation  stem  from  the  functional  flows  shown 
in  Figures  1,  2  and  3.  Although  the  flows  for  site 
modeling/update  and  exploitation  were 
conceptualized prior to IA consultation, the
RADIUS  Phase  1  experiments  proved  the  viability 
of  these  designs.  The  technical  requirements  that 
support  these  functional  flows  in  three  levels  of 
automation  focus  on  user  interface  issues  and  data 
exchange  standards. 

A  fundamental  characteristic  of  a  workable 
workstation  design  is  a  minimization  of  the 
difference  between  the  output  of  automated 
processing and any manual effort. The end product,
whether it be the site model itself or an
exploitation report, must have a standard look-
and-feel no matter which portion or how much of
the product was generated by IU or a human
analyst. A significant implication of this
requirement is that the various processing
algorithms  must  work  on  components  of  the 
modeling  or  exploitation  problem  that  are 
intuitively  self-contained  to  facilitate  human 
interaction.  Having  multiple  options  for  each 
processing  stage  also  mandates  the  formal 
definition  and  strict  enforcement  of  data  formats  to 
ensure  easy  system  integration  and  upgrade. 

6.  CONCLUSION 

As  with  any  system  supported  by  research, 
there  is  a  tension  between  the  needs  of  the  user 
community  and  the  capabilities  of  the  underlying 
technology.  By  including  multiple  automation 
options in the conceptual design for an MSE
workstation,  an  attempt  has  been  made  to  provide 
the  user  with  ultimate  control  over  the  system 
while  giving  automation  every  opportunity  to 
demonstrate its effectiveness. The insistence upon
user and data interface standards serves to simplify
system development and integration.

Abstract

This paper presents an overview of a system for site
model extension being developed for the Radius (Research
and Development for Image Understanding Systems)
project by the University of Massachusetts. Automatically
registering imagery to a geometric site
model, and extending that model to include new features
of interest seen from multiple views, represents
an important application of image understanding technology
to the task of model-supported image exploitation.
The completed system will contain modules for
performing feature extraction, model matching, pose
determination, and triangulation.

1  Introduction 

Acquiring accurate 3D site models from a set of images
is a difficult task. Due to the ambiguity inherent in the
projection of three dimensions down to two, general
structure recovery is not possible without additional
constraining knowledge. We begin by assuming that
a partial scene model is available and that for each image
the internal (lens) and external (pose) parameters
of the camera are approximately known. Our models
consist of sets of points, lines, and planes, represented
in a three-dimensional world coordinate system. The
model is partial in that we do not assume all the important
features of the scene are modelled. It is not
necessary that the features be connected or confined to
a single object, nor is it even necessary that they be
known precisely. In this sense we use the term 'model'
in a much broader way than is commonly used in the
graphical modelling community, although our definition
covers the traditional usage as well and thus includes
wire-frame and surface-based models.

Given an initial partial model, and a set of images of
the site, our goal is to extend the model to include
previously unmodeled site features (model extension),
and to reduce the inaccuracies in the existing model
(model refinement). This process can be repeated as
new images become available, each updated model becoming
the initial partial model for the next iteration.
Thus, over time, the site model will be steadily improved
to become more complete and more accurate.

The process of model extension and refinement can
be broken into four important subtasks. First,
feature extraction routines are run on new images
to reduce the vast amount of incoming data into a
manageable set of symbolic geometric descriptions.
Second, a model matching procedure uses an initial
guess of the camera lens and pose parameters for
each image to guide its discovery of the correspondence
between model features and extracted image data features.
Third, these correspondences are used to perform a
resection of the pose parameters for each camera,
to find a more accurate description of the transformation
between the three-dimensional site model and
each two-dimensional image. Finally, the updated
transformation parameters enable a process of multi-image
triangulation to determine the position of new
features in the model coordinate system, and to update
the positions of old features based on the new image
data. Images of significantly disparate viewpoint can
yield a quite large baseline, resulting in very accurate
reconstructions.
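
To make the final triangulation step concrete, the sketch below intersects two viewing rays using the midpoint method: it finds the closest point on each ray to the other and returns the point halfway between them. The paper does not state which triangulation formulation the system uses, so the midpoint method here is a stand-in, and the function name `triangulate_midpoint`, the camera centres, and the ray directions are all assumed for illustration.

```python
def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def _along(origin, direction, s):
    return tuple(o + s * d for o, d in zip(origin, direction))

def triangulate_midpoint(c1, d1, c2, d2):
    """Estimate a 3D point from two rays c1 + s*d1 and c2 + t*d2.

    c1, c2 are camera centres and d1, d2 are viewing directions
    (not necessarily unit length). Returns the midpoint of the
    shortest segment joining the two rays.
    """
    w = _sub(c1, c2)
    a, b, c = _dot(d1, d1), _dot(d1, d2), _dot(d2, d2)
    d, e = _dot(d1, w), _dot(d2, w)
    denom = a * c - b * b
    if abs(denom) < 1e-12:
        raise ValueError("rays are parallel; no unique closest point")
    s = (b * e - c * d) / denom
    t = (a * e - b * d) / denom
    p1, p2 = _along(c1, d1, s), _along(c2, d2, t)
    return tuple((u + v) / 2.0 for u, v in zip(p1, p2))

# Two hypothetical cameras 10 units apart, both viewing the point (1, 2, 0).
target = (1.0, 2.0, 0.0)
cam_a, cam_b = (0.0, 0.0, 10.0), (10.0, 0.0, 10.0)
estimate = triangulate_midpoint(cam_a, _sub(target, cam_a),
                                cam_b, _sub(target, cam_b))

# Two skew rays that never meet: the estimate lies halfway between them.
skew = triangulate_midpoint((0.0, 0.0, 0.0), (1.0, 0.0, 0.0),
                            (0.0, 1.0, 1.0), (0.0, 1.0, 0.0))
```

The `denom` term shrinks toward zero as the two rays become parallel, which is one way to see why a short baseline gives poorly conditioned estimates and a large baseline gives accurate ones.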

1.1  The  role  of  the  imagery  analyst 

The imagery analyst (IA) plays a key role in the system
being developed. Early versions will require IA
guidance at several stages. Most obvious, the initial
partial site model we require has to come from somewhere,
and the most likely provider in the short term
is the IA. Since our models are required to be merely
a collection of geometric primitives known in three
dimensions, considerable flexibility is available in their
specification. The model may consist of nothing more
than the known 3D locations of a set of visually distinctive
scene points. Alternatively, the IA could fabricate
wire-frame 'boxes' of the appropriate dimensions and
locations to fit several significant buildings at the site.
Since model extension locates new scene features with
respect to preexisting ones, the best results are obtained
when initial model features are spread out over
the whole area of interest, both in terms of ground
positions and in height. If enough initial features are
given, none of them needs to be specified very precisely,
since they will be automatically refined later in
the model extension process.

Another important piece of information that will need
to be provided to the system is an estimate of the camera
lens and pose parameters for each image, which
provides the system with an initial estimate of the
transformation mapping three-dimensional site features
into observed two-dimensional image features.
This estimate is used to build the appropriate geometric
template for matching the initial site model to
extracted image features, and to restrict the search
for feature correspondences to a manageable set of
likely candidates. Some subsets of parameters may
be known beforehand, or recorded when the picture is
taken. For example, it is customary to assume the internal
camera lens parameters are known and remain
constant for multiple images taken by a single camera.
Other parameters related to camera pose such as altitude
or look angle may be measured at the time the
photo is taken, and are provided with some estimate of
their uncertainty. If at the time of interpretation some
parameters still have no recorded values, an estimate
will need to be provided by the analyst. It should be
stressed that the initial camera parameters supplied to
the system do not have to be accurate enough to provide
precise 3D triangulation, but only good enough
to allow model matching to find the correct correspondences
between model and image features with reasonable
computational cost. These estimates will be refined
via camera resection following the determination
of model-to-image feature correspondences.

One final area where analyst interaction can greatly
benefit the model extension process is in selecting
which new image features are interesting enough to
be worth adding to the model. The automated system
could conceivably be turned loose to add every feature
for which a correspondence is found across at least two
images; the result would be analogous to a digital terrain
model, being refined to finer and finer resolutions
as new images are processed. While this may be exactly
what is wanted for some tasks, it would not be
particularly meaningful for interpreting urban and industrial
sites, where one might like to structure the
developing model in terms of high-level concepts such
as 'road' and 'building', and thus derive additional detail
only for particular areas of the model.

1.2  Paper  overview 

The remainder of this paper describes, section by section,
each of the four stages of model extension: feature
extraction, model matching, camera resection and
triangulation. The aim is to present a brief overview
of the entire process, rather than an in-depth analysis
of any one piece. Two of the components, feature
extraction and model matching, are currently under
evaluation on the model board images, and illustrative
examples are included in those sections.

Future research will focus on ways to further automate
the model extension process. For example, we hope to
acquire initial partial site models from scratch. This
process requires determination of correct feature correspondences
across multiple images. When accurate
initial estimates of the camera parameters for each image
are available this is indeed feasible. Otherwise,
the very difficult problem of finding correspondences
between two images taken by different cameras, possibly
from significantly different viewpoints, must be
solved. To this end, we are pursuing a parallel research
track investigating the use of projective invariants for
image-to-image registration and for planar and 'nearly
planar' scene reconstruction. The benefit of this approach
is that the dependence on prior estimates of
camera lens and pose parameters is minimized. One
aspect of our work on invariants, automated image rectification
(unwarping of oblique views) for matching
coplanar structures, is described in these proceedings
[Collins93].

2  Feature  Extraction 

In order to extend a partial site model from a set of images,
it is necessary to determine where in each image
the model appears. A model will normally be specified
at a much higher level of geometric abstraction than
the image intensity values. For this reason, feature
extraction routines are first run on the images to pull
out higher-level symbolic features of a type compatible
with the features represented in the initial model.
The type of geometric features that can potentially be
extracted from the image includes straight line segments,
line pencils, rectilinear line groupings, curves,
corner points, regions of homogeneous intensity, and
textured areas. Our current matching algorithm relies
exclusively on straight line segments: edges of a wireframe
model are matched to straight lines extracted
from the image.

2.1  Straight  line  extraction 

We  are  applying  two  straight  line  segment  extrac¬ 
tion  algorithms  to  the  Radius  model  board  imagery. 
The Burns algorithm [Burns86] begins by labeling pixels
in the intensity plane according to coarsely quantised
gradient orientation. A connected-components algorithm
is then run to determine line-support regions, i.e. sets
of pixels with an intensity surface that supports the
presence of a straight line. Representative lines are
extracted by intersecting a plane corresponding to the
average intensity of a line-support region with a
least-squares planar fit of the underlying intensity
surface of that region.
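
The first two steps described above (orientation labeling and connected-components grouping) can be sketched as follows. This is a simplified illustration, not the authors' implementation: the function name, the central-difference gradient, the single bucket partition, the magnitude threshold, and the 4-connectivity are all assumptions, and the planar-fit step that produces the final line is omitted.

```python
import math

def line_support_regions(image, n_buckets=8, min_magnitude=1.0):
    """Group pixels with the same quantised gradient orientation.

    `image` is a list of rows of intensities. Returns a list of regions,
    each a list of (row, col) pixels. Hypothetical helper for illustration.
    """
    rows, cols = len(image), len(image[0])
    labels = [[None] * cols for _ in range(rows)]
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            gx = (image[r][c + 1] - image[r][c - 1]) / 2.0
            gy = (image[r + 1][c] - image[r - 1][c]) / 2.0
            if math.hypot(gx, gy) < min_magnitude:
                continue  # too flat to support a line
            angle = math.atan2(gy, gx) % (2 * math.pi)
            labels[r][c] = int(angle / (2 * math.pi) * n_buckets) % n_buckets

    regions, seen = [], set()
    for r in range(rows):
        for c in range(cols):
            if labels[r][c] is None or (r, c) in seen:
                continue
            # Flood fill over 4-connected neighbours sharing the same label.
            stack, region = [(r, c)], []
            seen.add((r, c))
            while stack:
                y, x = stack.pop()
                region.append((y, x))
                for ny, nx in ((y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and (ny, nx) not in seen
                            and labels[ny][nx] == labels[y][x]):
                        seen.add((ny, nx))
                        stack.append((ny, nx))
            regions.append(region)
    return regions
```

On a small image containing a single vertical step edge, the sketch returns one region covering the pixels on either side of the step.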

In  contrast,  the  Boldt  grouping  algorithm  [Boldt89] 
extracts  local  edges  and  then  hierarchically  groups 
them  via  geometric  relations  that  were  inspired  by  the 
Gestalt  laws  of  perceptual  organisation.  The  initial 
edges are the zero-crossing points of the Laplacian of
the  intensity  plane.  In  an  iterative  process,  two  edges 
are  linked  and  replaced  by  a  single  longer  edge  if  their 
end  points  are  close  and  their  orientation  and  contrast 
are  similar,  resulting  in  increasingly  longer  lines. 
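
A minimal sketch of one such linking decision is given below. The segment representation (start point, end point, signed contrast), the function names, and every threshold are hypothetical choices made for illustration; the actual geometric relations and hierarchy used in [Boldt89] are not reproduced here.

```python
import math

def can_link(seg_a, seg_b, max_gap=3.0, max_angle_deg=10.0, max_contrast_ratio=1.5):
    """Decide whether the end of seg_a may be joined to the start of seg_b.

    Each segment is ((x1, y1), (x2, y2), contrast). Thresholds are assumed.
    """
    (ax1, ay1), (ax2, ay2), ca = seg_a
    (bx1, by1), (bx2, by2), cb = seg_b
    gap = math.hypot(bx1 - ax2, by1 - ay2)
    ang_a = math.atan2(ay2 - ay1, ax2 - ax1)
    ang_b = math.atan2(by2 - by1, bx2 - bx1)
    diff = abs(ang_a - ang_b) % math.pi
    diff = min(diff, math.pi - diff)  # orientation difference, ignoring direction
    same_sign = ca * cb > 0
    ratio = max(abs(ca), abs(cb)) / max(min(abs(ca), abs(cb)), 1e-12)
    return (gap <= max_gap
            and math.degrees(diff) <= max_angle_deg
            and same_sign
            and ratio <= max_contrast_ratio)

def link(seg_a, seg_b):
    """Replace two linkable segments by a single longer one."""
    return (seg_a[0], seg_b[1], (seg_a[2] + seg_b[2]) / 2.0)
```

Repeating this test over all nearby pairs, and replacing each accepted pair by its merged segment, yields progressively longer lines.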

Our  current  implementation  of  the  Burns  algorithm 
runs  much  faster  than  Boldt,  because  it  makes  fewer 
local  decisions  at  each  stage  of  processing,  and  main¬ 
tains  fewer  intermediate  data  structures.  Indeed,  a 
stripped  down  version  of  the  Burns  algorithm  has  been 
used as a fast line finder for robot navigation
experiments [Fennema90]. Our current implementation of
the Burns algorithm also works on larger image sizes.
The  one  drawback  is  that  oddly-shaped  support  re¬ 
gions  caused  by  slow  gradient  intensity  changes  in  the 
image  can  skew  the  orientation  of  the  resulting  lines. 
Current  work  is  aimed  at  correcting  this  known  prob¬ 
lem.  Figure  1  shows  a  portion  of  one  of  the  model 
board  images,  while  Figure  2  presents  a  set  of  lines 
extracted from this image by the Burns algorithm.

2.2  Vanishing  point  detection 

Vanishing points can be an important source of
information in urban and industrial scenes where
buildings and roads are laid out in a rectangular grid. By
grouping  together  pencils  of  lines  that  converge  to  a 
vanishing  point,  a  higher  level  of  geometric  abstraction 
and  data  reduction  is  achieved.  Under  known  camera 
lens  parameters,  vanishing  points  allow  the  computa¬ 
tion  of  three-dimensional  line  and  plane  orientations 
from  a  single  image,  and  thus  allow  the  orientation  of 
the  camera  with  respect  to  the  scene  to  be  inferred 
(see  Section  3.1).  When  the  camera  lens  parameters 
are  not  known,  they  can  be  determined  to  a  limited 
extent  from  vanishing  point  information  [Wang91]. 

A practical algorithm for finding vanishing points from
a set of line segments in an image must address two
issues: how to cluster line segments going to a single
vanishing point, and how to estimate an accurate
vanishing point from a given line cluster. The former is
handled elegantly using a Hough transform that maps
line segments onto great circles in a histogram
representing the surface of a unit sphere [Barnard83].
Potential vanishing points are detected as peaks in the
histogram, corresponding to areas where several great
circles intersect. While this approach excels at quickly
clustering line segments into convergent groups, the
final estimate of vanishing point location and variance
should be based on the line segments themselves rather
than the arbitrary bucket boundaries of a histogram
data structure. We therefore use the Hough transform
only as an initial clustering method and as an efficient
spatial access mechanism. The final vanishing point
locations are computed using a statistical estimation
technique that estimates each vanishing point location
as the polar axis of an equatorial distribution on the
unit sphere [Collins90]. The resulting polar axis points
towards the location in the image where the converging
lines intersect, and points parallel to the image plane
when the underlying 2D line segments are parallel.
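
The geometry behind the final estimation step can be illustrated with a small least-squares sketch. Each image segment, together with the optical centre, defines a plane whose unit normal lies on the great circle perpendicular to the vanishing direction; the polar axis is then the direction most nearly perpendicular to all of the normals. The code below is a simplified stand-in for the statistical estimator of [Collins90], assuming a known focal length, image coordinates measured from the principal point, and hypothetical function names; it returns no variance estimate.

```python
import math

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return tuple(x / n for x in v)

def vanishing_direction(segments, focal_length=1.0, iterations=200):
    """Estimate the polar axis for a cluster of 2D segments.

    `segments` is a list of ((u1, v1), (u2, v2)) pairs. Returns a unit
    3-vector; the vanishing point is focal_length * (x / z, y / z).
    """
    normals = [normalize(cross((u1, v1, focal_length), (u2, v2, focal_length)))
               for (u1, v1), (u2, v2) in segments]
    # Scatter matrix M = sum of n n^T; the polar axis is its eigenvector
    # with the smallest eigenvalue.
    m = [[sum(n[i] * n[j] for n in normals) for j in range(3)] for i in range(3)]
    trace = m[0][0] + m[1][1] + m[2][2]
    # Power iteration on (trace * I - M) converges to that eigenvector.
    b = [[(trace if i == j else 0.0) - m[i][j] for j in range(3)] for i in range(3)]
    v = normalize((1.0, 1.0, 1.0))
    for _ in range(iterations):
        v = normalize(tuple(sum(b[i][j] * v[j] for j in range(3)) for i in range(3)))
    return v
```

When the segments are parallel in the image, the z component of the returned axis goes to zero, which corresponds to the polar axis lying parallel to the image plane.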

3  Model  Matching 

The  second  stage  of  the  model  extension  process  is 
model  matching.  Figure  3  shows  a  partial  wireframe 
model  constructed  from  the  model  board  ground  truth 
data.  This  model  encompasses  only  those  buildings 
where  enough  ground  truth  points  were  available  to 
determine  their  shape  and  location.  Note  that  some 
buildings  are  incompletely  specified. 

Given  a  partial  3D  wireframe  site  model,  and  a  set  of 
extracted  straight  lines,  the  goal  of  model  matching 
is  to  find  the  correspondence  between  model  lines  and 
data  lines.  To  find  this  correspondence  we  are  evaluat¬ 
ing  a  novel  model  matching  algorithm  due  to  Beveridge 
[Beveridge90]. Based on the local search approach to
combinatorial optimisation, this algorithm seeks the
transformation that brings the projected model into
subpixel alignment with the underlying image data.

The  local  search  matching  algorithm  searches  the 
discrete  space  of  correspondence  mappings  between 
model  and  image  features  for  one  that  minimises  a 
match  error  function.  The  match  error  depends  upon 
the  relative  placement  implied  by  the  correspondence, 
and  the  amount  of  coverage  of  the  model  by  the  data. 
More  particularly,  to  compute  the  match  error  the 
model  is  placed  in  the  scene  so  that  the  appearance 
of  model  features  is  most  similar  to  the  appearance 
of  corresponding  image  features.  The  more  similar  the 
appearance  the  lower  the  match  error.  The  mathemat¬ 
ical  transformation  mapping  model  features  to  scene 
features  is  essentially  a  module  of  the  system.  Our 
current  implementation  handles  the  four  parameter  2D 
similarity  transform  and  the  full  3D  pose  transform. 
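
To make the four-parameter 2D similarity transform concrete, the sketch below fits one by least squares. It assumes point correspondences rather than the line segments used by the matcher described here, and the function name is hypothetical. Writing points as complex numbers, the transform is w = a z + b, where the modulus and phase of a give scale and rotation and b gives translation.

```python
def fit_similarity(model_pts, image_pts):
    """Least-squares 2D similarity transform between matched point lists.

    Returns complex (a, b) such that image ~= a * model + b.
    """
    zs = [complex(x, y) for x, y in model_pts]
    ws = [complex(x, y) for x, y in image_pts]
    n = len(zs)
    zc = sum(zs) / n
    wc = sum(ws) / n
    num = sum((w - wc) * (z - zc).conjugate() for z, w in zip(zs, ws))
    den = sum(abs(z - zc) ** 2 for z in zs)
    a = num / den          # scale = abs(a), rotation = phase of a
    b = wc - a * zc        # translation
    return a, b

def fit_error(model_pts, image_pts, a, b):
    """Sum of squared residuals after applying the transform."""
    return sum(abs(complex(*w) - (a * complex(*z) + b)) ** 2
               for z, w in zip(model_pts, image_pts))
```

A residual of this kind is one plausible ingredient of a match error; the coverage term mentioned above is not modelled in this sketch.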

To  find  the  optimal  match,  probabilistic  local  search 
relies  upon  a  combination  of  iterative  improvement 
and  random  sampling.  Iterative  improvement  refers  to 
a repeated generate-and-test procedure by which the
algorithm  moves  from  an  initial  match  to  one  that  is 
locally  optimal  via  a  sequence  of  incremental  changes 
that  continually  reduce  the  match  error.  In  an  effort  to 
find  the  global  optimum,  the  algorithm  is  run  multiple 
times,  starting  with  different  initial  correspondences 
from  the  model  to  data  line  match  space.  Even  if  the 
probability  of  seeing  the  optimal  match  on  a  single  trial 
is  low,  the  probability  of  seeing  the  optimal  match  in  a 
large  number  of  trials  started  from  uniformly  random 
positions  in  the  match  space  is  high. 
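
The combination of iterative improvement and random restarts can be sketched on a toy problem. Here a match is a bit vector (one bit per candidate model-data pairing) and the neighbourhood is all single-bit flips; these representational choices, the first-improvement acceptance rule, and the function names are assumptions for illustration, not a description of the algorithm in [Beveridge90]. The last function expresses the argument of the final sentence above: if one trial finds the optimum with probability p, then t independent trials find it at least once with probability 1 - (1 - p)^t.

```python
import math
import random

def local_search(error, n_bits, start):
    """Flip single bits while doing so reduces the error."""
    current = list(start)
    best = error(current)
    improved = True
    while improved:
        improved = False
        for i in range(n_bits):
            current[i] ^= 1
            e = error(current)
            if e < best:
                best, improved = e, True
            else:
                current[i] ^= 1  # undo the flip
    return current, best

def random_restarts(error, n_bits, trials, seed=0):
    """Run local search from several random starts and keep the best result."""
    rng = random.Random(seed)
    best_match, best_err = None, math.inf
    for _ in range(trials):
        start = [rng.randint(0, 1) for _ in range(n_bits)]
        match, err = local_search(error, n_bits, start)
        if err < best_err:
            best_match, best_err = match, err
    return best_match, best_err

def trials_needed(p_single, confidence):
    """Smallest t with 1 - (1 - p_single)**t >= confidence."""
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p_single))
```

For example, if a single trial succeeds only one time in ten, 44 trials are enough to see the optimal match with 99% probability.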

Figure 1: Model board image 8

Figure 2: Burns lines for image 8

Figure 3: Partial wire-frame model

Figure 4: Initial model projection

Figure 5: Candidate data correspondences

Figure 6: Best match of model to data


3.1  Initial  camera  model 

Essentially all model-to-image correspondence problems
involve solving for both a discrete correspondence
between model and image features and an associated
transformation mapping model features into the
image.  The  two  problems  together  constitute  model 
matching:  a  match  being  a  correspondence  plus  a 
transformation.  The  most  general  transformation  typ¬ 
ically  considered  involves  full  3D  pose:  a  rigid  3D 
model  is  rotated  and  translated  relative  to  the  cam¬ 
era  and  then  projected  into  the  image  using  a  known 
camera  model. 

However,  given  good  estimates  of  the  lens  and  approxi¬ 
mate  pose  parameters  of  the  camera,  the  3D  model  can 
be projected onto the image before matching begins,
turning the problem into a search for the 2D
transformation that best brings the 2D projected model lines
into  correspondence  with  the  data.  This  is  the  under¬ 
lying  motivation  for  the  2D  similarity  version  of  the 
model  matcher.  Finding  the  best  2D  similarity  trans¬ 
form  between  model  and  data  is  much  faster  than  solv¬ 
ing  for  full  3D  pose.  Because  the  method  is  fast,  it  is 
possible  to  run  more  trials  in  a  given  amount  of  time, 
thereby  increasing  the  confidence  in  finding  the  best 
correspondence.  Figure  4  displays  a  projection  of  the 
wire-frame  site  model  onto  our  example  model  board 
image  using  initial  estimates  of  the  camera  lens  and 
pose  parameters. 

The camera parameters to be estimated for each image
may be presented in matrix form as a transformation
that maps the 3D coordinates x, y, and z of a model
point into the 2D coordinates u and v of an image
point. The transformation from model to image
coordinates is broken into a 3 x 4 matrix of pose parameters
and a 3 x 3 matrix of lens parameters. The pose
parameter matrix is partitioned into a 3 x 3 orthonormal
rotation matrix R and a 3 x 1 translation vector t.
The lens parameters considered are s_u and s_v, the
focal length in pixels along each of the image axes, and
u_0 and v_0, the pixel coordinates of the principal point.
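
One standard way of writing a transformation with this structure is shown below. It is a reconstruction from the description in the text, not a copy of the original equation: the homogeneous scale factor k and the sign conventions on the lens matrix entries are assumptions.

```latex
k \begin{pmatrix} u \\ v \\ 1 \end{pmatrix}
=
\begin{pmatrix} s_u & 0 & u_0 \\ 0 & s_v & v_0 \\ 0 & 0 & 1 \end{pmatrix}
\begin{pmatrix} R & t \end{pmatrix}
\begin{pmatrix} x \\ y \\ z \\ 1 \end{pmatrix}
```

Dividing the first two rows of the result by the third removes k and gives the image coordinates u and v.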

For  our  experiments,  initial  estimates  of  the  lens  and 
pose  parameters  for  the  Radius  model  board  imagery 
were  determined  as  follows.  First,  nominal  values  for 
the  camera  lens  parameters  were  filled  in  from  infor¬ 
mation  supplied  with  the  model  board  data  (namely 
the  focal  length  in  mm  and  the  dimensions  of  a  pixel  in 
mm), and by assuming the principal point to be in the
numeric  center  of  the  image.  The  orientation  of  the 
camera  with  respect  to  the  scene  was  determined  by 
vanishing point analysis up to a four-fold ambiguity,
which was resolved by identifying the direction of true north in
the  image  by  hand.  The  distance  of  the  camera  from 
the  ground  was  determined  from  the  reported  Ground 
Scale  Distance  (GSD);  to  date  our  experiments  have 
only  used  the  18  inch  GSD  images.  Finally,  the  inter¬ 
section  of  the  camera’s  line  of  sight  with  the  ground 
plane  was  estimated  manually  (see  Section  1.1  on  the 
role  of  the  imagery  analyst). 
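
The arithmetic behind the nominal lens parameters and the camera distance can be sketched as follows. The function names and the numeric values in the test are hypothetical, the image centre is taken as half the image dimensions, and the distance formula assumes a simple pinhole camera looking roughly straight down, so that one pixel on the ground spans the GSD.

```python
def initial_lens_parameters(focal_mm, pixel_w_mm, pixel_h_mm, width_px, height_px):
    """Nominal (s_u, s_v, u_0, v_0) from the supplied camera data."""
    s_u = focal_mm / pixel_w_mm   # focal length in pixels along u
    s_v = focal_mm / pixel_h_mm   # focal length in pixels along v
    u_0 = width_px / 2.0          # principal point assumed at image centre
    v_0 = height_px / 2.0
    return s_u, s_v, u_0, v_0

def camera_distance(gsd, focal_px):
    """Approximate camera-to-ground distance, in the units of `gsd`.

    Assumes a near-nadir pinhole view, where gsd = distance / focal_px.
    """
    return gsd * focal_px
```

Under these assumptions an 18 inch (0.4572 m) GSD and a focal length of 5000 pixels would place the camera about 2286 m from the ground.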

3.2  Setting  up  the  match  space

Model  matching  performs  a  search  through  the  space 
of  possible  model  to  data  correspondences.  This  space 
is  initially  set  up  by  deciding  which  data  lines  in  the 
image are to be considered as candidate matches for
each model line. Careful pruning of this space at the
start is crucial to achieving tractable run times. Once
again the problem is greatly simplified when good
initial estimates of the camera transformation
parameters are available. The better the initial estimates,
the tighter the filters for picking out possible
candidate data lines can be, in terms of orientation,
position in the image, and length.

Even though the metric used to score potential
correspondences is purely geometric, photometric
expectations such as the sign and magnitude of contrast across
a line can be enforced in the final match by prefiltering
for these properties in the initial candidate generation
phase. Our tendency has been to underspecify
rather than overspecify filter parameters, however,
because once a correct line pairing has been excluded by
overzealous filtering in the candidate generation stage,
that correct pairing can never contribute to the match
that is eventually found.

For  the  experiments  we  have  run  on  the  model  board 
imagery,  all  lines  located  within  100  pixels  and  having 
an  orientation  within  10  degrees  of  a  projected  model 
line  are  selected  as  possible  matching  candidates.  Fig¬ 
ure  5  shows  the  complete  set  of  candidate  data  lines 
considered in our example matching problem.
Figure 6 displays the overlaid model after application of
the 2D similarity transform associated with the best
correspondence found.
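
The candidate filter with the 100 pixel and 10 degree thresholds can be sketched as below. The text does not say how the distance between two lines is measured; this sketch assumes midpoint-to-midpoint distance, and the function names are hypothetical.

```python
import math

def orientation_deg(line):
    """Undirected orientation of a line in [0, 180) degrees."""
    (x1, y1), (x2, y2) = line
    return math.degrees(math.atan2(y2 - y1, x2 - x1)) % 180.0

def midpoint(line):
    (x1, y1), (x2, y2) = line
    return ((x1 + x2) / 2.0, (y1 + y2) / 2.0)

def is_candidate(model_line, data_line, max_dist=100.0, max_angle=10.0):
    """True if data_line is close enough in position and orientation."""
    d_angle = abs(orientation_deg(model_line) - orientation_deg(data_line))
    d_angle = min(d_angle, 180.0 - d_angle)
    (mx, my), (dx, dy) = midpoint(model_line), midpoint(data_line)
    return math.hypot(mx - dx, my - dy) <= max_dist and d_angle <= max_angle

def candidate_space(model_lines, data_lines):
    """Map each projected model line index to its candidate data line indices."""
    return {i: [j for j, d in enumerate(data_lines) if is_candidate(m, d)]
            for i, m in enumerate(model_lines)}
```

Loosening `max_dist` or `max_angle` enlarges the match space, which costs search time but lowers the risk of excluding a correct pairing.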

4  Camera  Resection 

The  result  of  model  matching  is  a  set  of  model  to  image 
feature  correspondences  between  3D  wire-frame  edges 
and 2D image lines. The next stage in the model
extension process uses these correspondences to resect
more  accurate  estimates  of  camera  pose.  These  up¬ 
dated  parameter  estimates  will  be  used  to  triangulate 
the  positions  of  new  scene  features. 

Kumar has developed optimisation techniques for
finding 3D camera pose from point and line-based feature
correspondences [Kumar92a, Kumar92b, Kumar92c].
His line-based constraints are similar to those
developed by Liu, Huang and Faugeras [Liu88] from the
observation that the 3D lines in the camera coordinate
system must lie on the projection plane formed
from the corresponding image line and the optical
center. Using this fact, Liu et al. separated the constraints
for rotation from those of translation, leading to a
solution in which rotation is solved for first, and then
translation is obtained using the rotation results.
Unfortunately, small errors in computing the rotation are
amplified into large errors in translation.
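
The projection-plane observation leads to two residuals per line correspondence, sketched below. With n the unit normal of the plane through the optical centre and the image line, a model line with direction d and a point p on it must satisfy n . (R d) = 0, which involves rotation only, and n . (R p + t) = 0. The function names and argument layout are assumptions; the code only evaluates the constraints and does not solve for pose.

```python
import math

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def mat_vec(m, v):
    return tuple(dot(row, v) for row in m)

def line_constraint_residuals(R, t, p, d, image_line, focal_length):
    """Return (rotation residual, translation residual) for one line pair.

    R is a 3x3 rotation (list of rows), t a 3-vector, p a model point on the
    line, d the model line direction, image_line = ((u1, v1), (u2, v2)).
    """
    (u1, v1), (u2, v2) = image_line
    n = cross((u1, v1, focal_length), (u2, v2, focal_length))
    norm = math.sqrt(dot(n, n))
    n = tuple(x / norm for x in n)
    rot_res = dot(n, mat_vec(R, d))
    p_cam = tuple(a + b for a, b in zip(mat_vec(R, p), t))
    trans_res = dot(n, p_cam)
    return rot_res, trans_res
```

Both residuals vanish at the true pose; the first depends on R alone, which is what allows rotation to be solved before translation.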

Kumar’s pose solution differs from that of Liu et al. in
two  significant  ways.  First,  rotation  and  translation 
are  solved  for  simultaneously,  which  makes  more  effec¬ 
tive  use  of  the  constraints  and  is  more  robust  in  the 
presence  of  noise.  Second,  the  nonlinear  least-squares 
optimisation  algorithm  used  to  solve  for  rotation  and 
translation is adapted from Horn [Horn90]. Horn’s
method,  based  on  the  quaternion  representation  of  ro¬ 
tations,  provides  much  better  convergence  properties 
than  solution  methods  based  on  Euler  angles. 
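
As background on the quaternion representation, the sketch below converts a quaternion (w, x, y, z) to a rotation matrix using the standard formula. It is not Horn's optimisation procedure, and the function name is hypothetical. Because the quaternion is normalised first, any nonzero 4-vector yields a valid rotation, which avoids the singular configurations of Euler angles.

```python
import math

def quaternion_to_rotation(q):
    """Return the 3x3 rotation matrix (list of rows) for quaternion (w, x, y, z)."""
    w, x, y, z = q
    n = math.sqrt(w * w + x * x + y * y + z * z)
    w, x, y, z = w / n, x / n, y / n, z / n
    return [
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z), 2 * (x * z + w * y)],
        [2 * (x * y + w * z), 1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y), 2 * (y * z + w * x), 1 - 2 * (x * x + y * y)],
    ]
```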

It  is  well  known  that  least  squares  optimisation  tech¬ 
niques  are  prone  to  errors  when  there  are  outliers  in 
the  data.  Kumar  developed  a  second  suite  of  pose  op¬ 
timisation  methods  using  robust  statistics  in  order  to 
minimise  the  effect  of  outliers.  In  these  algorithms, 
the  median  of  the  error  function  is  minimised,  rather 
than  the  mean  squared  error.  This  approach  is  robust 
over  data  sets  containing  up  to  50%  outliers,  at  the  ex¬ 
pense  of  the  increased  computation  needed  to  sample 
multiple  subsets  of  data  to  find  one  devoid  of  outliers. 
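
The idea of minimising the median rather than the mean of the squared error can be illustrated on a much simpler problem than pose estimation. The sketch below fits a line y = a x + b to 2D points by trying the line through every pair of points and keeping the one with the smallest median squared residual. The line-fitting setting, the exhaustive enumeration of pairs (in place of random sampling of subsets), and the function name are all assumptions made for the example.

```python
import itertools
import statistics

def lmeds_line(points):
    """Least-median-of-squares fit of y = a*x + b; returns (a, b)."""
    best = None
    for (x1, y1), (x2, y2) in itertools.combinations(points, 2):
        if x1 == x2:
            continue  # vertical pair cannot be written as y = a*x + b
        a = (y2 - y1) / (x2 - x1)
        b = y1 - a * x1
        med = statistics.median((y - (a * x + b)) ** 2 for x, y in points)
        if best is None or med < best[0]:
            best = (med, a, b)
    return best[1], best[2]
```

With six points on y = 2x + 1 and four gross outliers, the fit recovers the line exactly, whereas an ordinary least-squares fit would be pulled towards the outliers.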

Based  on  a  model  of  image  noise  and  the  assumption 
that  the  3D  model  data  is  accurate,  closed  form  expres¬ 
sions  for  the  uncertainty  in  the  pose  refinement  results 
(rotation  and  translation)  have  been  derived.  Kumar 
has  shown  analytically  that  the  error  in  the  output 
parameters  is  linearly  related  to  the  noise  in  the  input 
data  [Kumar92b].  He  also  studied  the  effect  of  er¬ 
rors  in  estimates  of  the  image  center  and  focal  length 
on  the  resulting  pose,  showing  that  incorrect  knowl¬ 
edge  of  the  camera  center  does  not  significantly  affect 
the  computed  3D  location  of  the  sensor  (although  the 
computed  rotation  is  affected),  and  that  incorrect  esti¬ 
mation  of  the  camera  focal  length  significantly  affects 
only  the  z-component  (depth)  of  the  computed  pose. 

For  images  where  the  lens  parameters  are  not  avail¬ 
able,  or  not  known  very  accurately,  the  pose  deter¬ 
mination  process  could  conceivably  be  extended  to 
solve  for  both  lens  and  pose  parameters.  The  resulting 
highly  nonlinear  set  of  equations  could  best  be  solved 
if multiple images taken with the same camera were
available,  in  which  case  a  joint  optimisation  procedure 
could  be  used  to  determine  the  single  set  of  lens  pa¬ 
rameters  at  the  same  time  the  pose  parameters  for 
each  view  were  computed.  We  are  investigating  the 
feasibility  of  this  approach  for  general  applications. 


5  Triangulation

Our  approach  to  model  extension  began  with  the 
search  for  correspondences  between  a  partial  site 
model  and  geometric  image  features.  Finding  these 
correspondences  and  computing  the  camera  pose  re¬ 
lating  the  model  coordinate  system  and  the  image  co¬ 
ordinate  system  of  each  view  has  been  discussed.  Now, 
using the computed model to image transformations,
correspondences of unmodeled features over the
multiple views can be backprojected to locate new 3D model
points  and  lines  in  the  model  coordinate  system  by  tri- 
angulation. 

Currently, only code for triangulation of point features
is implemented. The estimation of new 3D points can
be done in either batch or iterative sequential mode.
Triangulation requires at least two frames and
therefore the minimum batch size is two. Results from batch
to batch can be integrated by the standard Kalman-filter
covariance-based updating equations.

Due to noise both in image measurements and camera
pose estimates, image projection rays will not
exactly intersect at a point. Kumar has developed a
3D pseudo-intersection method that minimises an
error equation based on the same constraints that
determine the pose [Kumar92b]. The criterion underlying
this error equation is that the best estimate for
any model point location is the point that minimises
the least-squares distance between the predicted
image location of the projected model point and its
actual image location, taking into account covariances in
the measured image positions and the computed pose.
Two non-linear error equations are obtained for each
scene point for each image frame, thus a minimum
of two frames is needed to solve the system of
equations. Techniques for the solution of nonlinear systems
of equations generally require an initial estimate that
is close to the true solution. The initial estimate in
this case is chosen as the point that minimises the sum
of squares of perpendicular distances to all the image
projection rays, a point that is easily found by solving
a linear system of equations. Using this initial guess,
an iterative procedure is employed to solve the system
of non-linear equations for each point. The iterative
procedure is repeated until there is convergence.
Usually only one iteration is sufficient for accurate results
[Kumar92b]. A byproduct of this calculation is an
approximate covariance matrix for the derived 3D model
point position.
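
The linear initial estimate can be written down directly. For a ray with centre c and unit direction d, the squared perpendicular distance from a point X is |(I - d d^T)(X - c)|^2, so the minimiser over all rays solves the 3 x 3 system [sum (I - d d^T)] X = sum (I - d d^T) c. The sketch below covers only this initial estimate, not the subsequent nonlinear refinement or the covariance computation; the function names and the use of Cramer's rule are choices made for the example.

```python
import math

def det3(m):
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

def solve3(a, b):
    """Solve a 3x3 linear system by Cramer's rule."""
    d = det3(a)
    out = []
    for k in range(3):
        ak = [[b[i] if j == k else a[i][j] for j in range(3)] for i in range(3)]
        out.append(det3(ak) / d)
    return out

def pseudo_intersection(rays):
    """Point minimising the summed squared distance to all rays.

    `rays` is a list of (centre, direction) pairs of 3-vectors; at least two
    non-parallel rays are required.
    """
    a = [[0.0] * 3 for _ in range(3)]
    b = [0.0] * 3
    for centre, direction in rays:
        norm = math.sqrt(sum(x * x for x in direction))
        d = [x / norm for x in direction]
        for i in range(3):
            for j in range(3):
                p = (1.0 if i == j else 0.0) - d[i] * d[j]
                a[i][j] += p
                b[i] += p * centre[j]
    return solve3(a, b)
```

For two skew rays the result is the midpoint of their common perpendicular, which is a reasonable starting point for the iterative procedure.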

The  method  described  above  can  also  be  used  for 
model  refinement.  In  this  case  initial  model  points 
have  input  covariances  associated  with  them.  The 
pseudo-intersection  method  is  used  to  calculate  a  new 
estimate  for  each  initial  model  point.  The  covariance 
matrices  of  a  new  estimate  and  an  initial  model  point 
are  used  to  fuse  the  two  estimates  and  provide  a  new 
uncertainty  matrix  using  the  standard  Kalman  filter¬ 
ing  equations. 
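
The fusion step can be sketched for the special case of diagonal covariance matrices, where each coordinate is updated independently. This restriction and the function name are simplifying assumptions; the general case uses full 3 x 3 covariance matrices in the same Kalman update.

```python
def fuse(x_old, var_old, x_new, var_new):
    """Fuse two estimates of a 3D point with per-axis variances.

    Returns (fused position, fused variances).
    """
    fused, fused_var = [], []
    for xo, vo, xn, vn in zip(x_old, var_old, x_new, var_new):
        gain = vo / (vo + vn)                 # Kalman gain for this axis
        fused.append(xo + gain * (xn - xo))   # pull towards the new estimate
        fused_var.append((1.0 - gain) * vo)   # uncertainty always shrinks
    return fused, fused_var
```

The estimate with the smaller variance dominates the result, and the fused variance is never larger than either input variance.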

Kumar assumes that the lens parameters are known,
so only uncertainty in the pose parameter estimates is
considered when computing the error in a triangulated
point position. We are extending this approach to
handle uncertainty in the camera lens parameters as well.
We are also extending the triangulation equations to
work for lines as well as points.

6  Summary 

A  system  for  automated  site  model  matching  and 
extension  is  under  development  at  the  University  of 
Massachusetts.  A  sketch  of  the  final  system  has 
been presented, based on algorithms that have
already been developed for performing feature
extraction, model matching, pose determination, and
triangulation. These algorithms are currently being tested
on  the  Radius  model  board  imagery.  Given  the  excel¬ 
lent  model  extension  results  we  have  obtained  in  past 
robot  navigation  applications,  we  are  optimistic  about 
the  utility  of  these  routines  in  the  aerial  photogram- 
metry  domain. 

Abstract

The  University  of  Maryland  (with  TASC  as 
a  subcontractor)  is  one  of  the  group  of  insti¬ 
tutions  doing  research  on  aerial  image  under¬ 
standing  in  support  of  the  RADIUS  program. 

The emphasis of our research is on knowledge-based
change detection (CD) using site models and the
domain expertise of image analysts (IAs). Change
detection involves classifying changes in the
imagery as being due to site updates or activity,
or as irrelevant changes due to illumination
differences, seasonal variations, etc. The IA’s
expertise is crucial in identifying relevant
changes, which depend on the site and the
intelligence agenda. Our focus is on ways in which
image understanding (IU) techniques can aid the IA
in performing CD. We are designing a system that
allows the IA to specify what are to be considered
relevant changes, and to select appropriate IU
algorithms for detecting these changes.

Before  CD  can  be  attempted,  the  acquired  im¬ 
ages  have  to  be  registered  to  the  site  model. 

We are developing efficient constrained search
mechanisms for image-to-site model registration,
using techniques based on non-monotonic
reasoning (Assumption-based Truth Maintenance
Systems (ATMSs) and their variants).
We are also using such techniques to facilitate
interactive IA guidance for CD and site model
updating.

1  Introduction 

The  process  of  locating  and  identifying  significant 
changes  or  new  activities,  known  as  change  detection 
(CD),  is  one  of  the  most  important  imagery  exploita¬ 
tion  tasks  [1].  Previous  research  on  CD  has  emphasized 
the  development  of  general-purpose  methods  that  can 
be employed to screen a wide variety of imagery and
determine, without access to any site-specific model
information, whether any significant changes or events have
occurred  between  the  times  of  acquisition  of  the  imagery. 
These  methods  have  been  found  to  be  unreliable  because 
a) CD techniques based on more or less sophisticated
differencing of images (possibly after attempted corrections
for viewpoint and illumination differences) are extremely
sensitive  to  errors  in  registration  and  in  the  photomet¬ 
ric  models  (e.g.  reflectance,  illumination)  that  are  used; 
b)  too  many  inconsequential  changes  occur  in  any  natu¬ 
ral  environment.  Even  if  general-purpose  methods  could 
be  developed  for  screening  out  all  changes  due  to  varia¬ 
tions  in  viewpoint,  sensor  and  illumination,  there  would 
still  be  many  differences  between  the  images  whose  sig¬ 
nificance  could  only  be  determined  by  an  image  analyst 
(IA) using comprehensive site knowledge and the
relevant intelligence agenda. Thus the goal of relieving the
IA of the burden of screening large subsets of acquired
imagery  is  unlikely  to  be  achieved  using  such  general- 
purpose  methods. 

We plan, instead, to develop a model-based vision
system for CD incorporating image understanding (IU)
techniques whose primitives are specific to a particular
site type, and that can be employed by the IA to
direct the IU system to conduct spatially constrained
analyses whose outcomes may be indicative of
occurrences of changes that have intelligence significance. The
system will be site-model based, employ a heavily
visual man/machine interface, and will be based on three
classes of primitives: object primitives, which
correspond to the specific objects that occur in a particular
site model and to the generic object classes supported
by the IU system; spatial primitives, for the
construction of search locales and the specification of constraints
on the search for object types within locales; and
temporal primitives, which can constrain or parameterize the
analysis by factors such as time of day, day of week, time
of year, etc. The system will assist the IA by highlighting
areas on an image where there are relevant activities or
new or upgraded facilities.

As reported in [1], IAs have identified two ways in
which IU can be useful in CD: the “quick-look” (QL)
and “final-look” (FL) modes. In the QL mode, small
areas where any change would be considered significant
are declared a priori, and when the system is presented
with a series of images, only those that satisfy the
conditions in the QL profile are marked. In the FL mode,
a set of less important areas to be examined for change
is specified. Although these areas matter less, the IA
wants to examine them to ensure complete coverage of
the site. As the IA gains experience, both the QL and
FL profiles can be modified. The CD system that we
plan to build could include these two options.
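
As a rough illustration of the QL mode, the sketch below marks only those images in which some detected change falls inside an area declared in advance. The representation of a QL profile as a list of axis-aligned rectangles, the representation of detected changes as image points, and the function name are all hypothetical; the text does not specify how profiles or changes are encoded.

```python
def images_to_mark(ql_areas, detections_by_image):
    """Return the ids of images with a detected change inside any QL area.

    `ql_areas` is a list of (x0, y0, x1, y1) rectangles.
    `detections_by_image` maps image id -> list of (x, y) change locations.
    """
    marked = []
    for image_id, detections in detections_by_image.items():
        if any(x0 <= x <= x1 and y0 <= y <= y1
               for (x, y) in detections
               for (x0, y0, x1, y1) in ql_areas):
            marked.append(image_id)
    return marked
```

An FL profile could reuse the same structure with a different, lower-priority list of areas.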

The  site  models  considered  in  the  current  phase  of 
RADIUS  encode  only  the  spatial  relationships  between 
fixed  objects  of  interest  in  a  site,  such  as  buildings,  roads, 
etc.  An  important  issue  in  training  new  analysts  or  re¬ 
viewing  infrequently  analyzed  sites  is  the  coding  of  the 
temporal  relationships  which  describe  changes  in  the  site 
such  as  movements  of  vehicles  under  normal  or  abnor¬ 
mal  circumstances — i.e.,  a  site  activity  model.  The  CD 
system  described  above  will  be  a  valuable  step  toward 
the  development  of  a  site  activity  modeling  capability. 

Generally  the  first  step  in  a  CD  task  is  the  registra¬ 
tion  of  an  image  to  an  existing  site  model.  Depending 
on  the  CD  task,  using  the  existing  site  model  and  cam¬ 
era  parameters,  regions  of  interest  in  the  given  image  will 
first  be  outlined.  Subsequently,  objects  such  as  buildings 
and  vehicles  that  are  characteristically  present  in  the  site 
can  be  extracted  and  analyzed  for  CD  purposes.  Such 
object extraction algorithms cannot be purely bottom-up.
For example, in extracting buildings [2], heuristics
based on the expected shapes of roofs (site-specific
information) are very useful for completing any partial roof
hypotheses that result from imperfect bottom-up
processing. Likewise, shadow analysis is very useful for
obtaining height information [3, 4], or allowing the IU
system to explain why some building features that are in
the field of view cannot be identified in the image. Site
models can also be very useful for providing geometric
and photometric constraints that reduce matching
ambiguities.

In addition to image-to-site model registration, we are
also interested in image-to-image registration where two
images acquired from possibly severe off-nadir viewing
conditions need to be registered. Image-to-image
registration is useful for building site models, and for
performing the subtask of transforming a given image to a
“favored orientation” [1]. The images to be analyzed
as part of the RADIUS-related research program are
high-resolution images of complicated sites. In many
of the currently used image registration algorithms, tie
points need to be manually selected. This can be a
laborious task. Automatic registration of the two
images is desirable. Given the variability of viewing
directions, illumination conditions and resolution, the
features used for matching may be poorly localized or
occluded. Automatic image-to-image registration will be
accomplished using appropriate cues from site models
and camera models. Although the use of site models
for image registration and CD has attractive features,
there are some problems associated with this approach.
Site models are usually not complete; only a few
images may have been used in building a site model.
Inferences based on incomplete site models may be
erroneous or incomplete. The matching algorithms used in
image registration should be able to deal with
uncertainties. Currently used matching algorithms based on
relaxation, interpretation trees [5] and constraint
satisfaction networks [6] lead to inefficient, repetitive and
uncontrolled search for matches. We propose to use
search methods based on non-monotonic reasoning, in
particular Assumption-based Truth Maintenance
Systems (ATMSs) [7, 8]. ATMSs achieve efficiency by
keeping track of unsuccessful or undesirable search paths.
Conditions that lead to unfruitful search are declared
and stored as nogoods. These nogoods can be declared a
priori or during analysis, based on the IA’s expertise in
terms of undesirable matchings or in terms of
inconsistencies of features and their groupings as predicted by
the rendered scenes or the site models themselves.
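
The role of nogoods in pruning a matching search can be illustrated with a toy backtracking matcher. This is a much-simplified sketch and not an ATMS: it keeps no justifications or environment labels, and the function name, the pairwise consistency test, and the data layout are assumptions. A nogood is a set of (model feature, image feature) pairings known to be unacceptable; any partial match containing a stored nogood is skipped without further testing, and newly discovered inconsistent pairs are recorded for later reuse.

```python
def search_with_nogoods(model_feats, candidates, consistent, nogoods=None):
    """Enumerate matches, pruning with a priori and discovered nogoods.

    `candidates` maps each model feature to its candidate image features.
    `consistent(pair_a, pair_b)` tests two pairings against each other.
    Returns (list of complete matches as dicts, final set of nogoods).
    """
    nogoods = set(nogoods or ())
    solutions = []

    def extend(i, assignment):
        if i == len(model_feats):
            solutions.append(dict(assignment))
            return
        m = model_feats[i]
        for img in candidates[m]:
            pair = (m, img)
            trial = assignment | {pair}
            if any(ng <= trial for ng in nogoods):
                continue  # pruned by a stored nogood
            bad = [frozenset({pair, other}) for other in assignment
                   if not consistent(pair, other)]
            if bad:
                nogoods.update(bad)  # remember the failure
                continue
            extend(i + 1, trial)

    extend(0, frozenset())
    return solutions, nogoods
```

An analyst-supplied nogood, such as ruling out one particular pairing, is passed in through the `nogoods` argument and removes every match that contains it.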

ATMSs  can  also  be  used  for  updating  a  site  model  once 
the CD module and the IA identify the changes of
consequence. This process involves additions and deletions
of  assertions  or  hypotheses  that  exist  as  part  of  the  site 
model.  The  subproblems  involved  are  generating  sym¬ 
bolic  descriptions  of  the  image  regions  where  changes 
have  occurred  and  incorporating  these  descriptions  into 
the  model. 

It is evident that the IA will perform a crucial role in
directing, manipulating and correcting the results of IU
algorithms. An important part of our approach is the
inclusion of early feedback, by users familiar with the
final application, as to the usability of the algorithms
developed under this program. Although formal usability
test sessions are not envisioned, subjects will be asked
to perform routine and specialized tasks for evaluation.
These evaluations will provide valuable information with
respect to the likely models and levels of interaction to be
expected from IAs, the clarity and intuitive
understandability of the IU algorithms, and whether the typical IA
is able to tailor the responses of the algorithm to his/her
needs.

The  interactions  of  the  various  modules  described 
above  are  illustrated  in  Figure  1. 


Figure 1: Schematic representation of the change detection
process.


2  Research  Areas 

2.1  Site-Specific  System  for  Change  Detection 

We are developing a system whose primitives are specific
to a particular site type. This system will be employed
by the IA to direct the IU system to determine the
occurrence of changes that have intelligence significance.
The system will have the following characteristics:

1. It will be site-model based, i.e., the specification
of changes of interest will be in terms of object
types in the site model, type-specific changes, and
three-dimensional spatial relations that can be used
to direct and constrain the IU algorithms. This is
critical because these changes must be detected in
future imagery whose source characteristics (viewpoints,
spatial resolution) will not be known in advance.

2. The man/machine interface will be visual. The
IA will construct a change specification by graphical
manipulation of the current elements of the site
model and of generic class models supported by the
IU system (e.g., vehicle classes, road models), and
insertion of visual tokens into the site model that
will provide to the IU system information concerning
where to search, what to search for, and the
specificity of the search.

3. The system design will be based on three classes
of primitives: object primitives, which correspond
to the specific objects that occur in a particular
type of site model and the generic object
classes supported by the IU system; spatial primitives
for the construction of search locales and the
specification of constraints concerning the search for
object types within locales; and temporal primitives,
which might constrain or parameterize the
analysis by factors such as time of day, day of week,
time of year, etc.

We  illustrate  these  ideas  with  two  example  tasks: 

1.  Counting  objects  within  locales.  Many  types  of 
changes  involve  counting  instances  of  objects  within 
locales  and  comparing  these  counts  either  against  an 
absolute  standard  or  with  their  values  at  previous 
times.  For  example,  the  analyst  might  wish  to  mon¬ 
itor  the  number  of  vehicles  in  a  parking  area  and 
be  informed  if  their  number  is  ever  above  a  given 
threshold.  The  specification  of  this  change  would 
involve: 

(a)  Indicating  the  locale  of  interest  by  either  point¬ 
ing  to  it  in  a  visualization  of  the  site  model  (if 
it  is  an  explicit  component  of  the  site  model) 
or  using  a  graphical  tool  to  indicate  its  position 
and  extent.  For  example,  while  a  parking  lot 
may  be  an  element  of  a  site  model,  the  analyst 
may  only  be  interested  in  increases  in  vehicles 
in  the  part  of  the  lot  near  some  building  of  in¬ 
terest.  In  such  a  case,  graphical  tools  can  be 
used  to  construct  an  arbitrary  locale  for  search. 

(b)  Choosing  one  or  more  vehicle  model  types  from 
a  menu  of  object  classes. 

(c) Choosing a “count” option from an analysis
menu, and specifying the criteria for a significant
change.

(d)  Specifying  temporal  constraints  on  when  to 
conduct  the  analysis.  For  example,  it  may  be 
of  interest  to  count  vehicles  only  on  weekdays. 


2.  Activity  modeling.  Detecting  changes  in  a  com¬ 
plicated  environment  often  involves  the  fusion  of 
multiple  change  cues.  The  individual  changes  may 
not  be  significant,  but  their  simultaneous  occurrence 
may  be  very  significant.  As  an  example,  consider 
the monitoring of airfield activity. The arrival of
a  significant  number  of  new  aircraft  at  an  airfield 
is  not  unusual  in  support  of  an  ongoing  training  or 
exercise  activity.  Increased  activity  at  the  weapons 
storage  facility  associated  with  the  airfield  is  also 
not  uncommon  in  support  of  resupply  and  training 
activities.  However,  the  simultaneous  occurrence  of 
the  arrival  of  a  large  number  of  aircraft  not  nor¬ 
mally  seen,  together  with  increased  activity  at  the 
weapons  storage  area,  might  indicate  preparations 
for  hostile  actions.  The  interface  will  be  designed 
to  incorporate  specification  of  interactions  between 
cues, defined in terms of object, spatial, or temporal
primitives. 
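
As a minimal sketch of how the two example tasks above might be encoded, the following Python fragment represents a counting specification with object, spatial and temporal primitives, and fuses several change cues for activity modeling. All names (`CountSpec`, `count_change`, `activity_alert`) are hypothetical illustrations and are not part of the system described here; the polygon test and the weekday rule are simplifying assumptions.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class CountSpec:
    """Hypothetical 'count objects within a locale' change specification."""
    locale: list              # spatial primitive: polygon vertices (x, y) on the ground plane
    object_classes: set       # object primitives chosen from a class menu
    threshold: int            # a change is significant when the count exceeds this
    weekdays_only: bool = True  # temporal primitive


def point_in_polygon(p, poly):
    """Even-odd ray-casting test of a point against a simple polygon."""
    x, y = p
    inside = False
    for i in range(len(poly)):
        x1, y1 = poly[i]
        x2, y2 = poly[(i + 1) % len(poly)]
        if (y1 > y) != (y2 > y):  # edge straddles the horizontal ray through p
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def count_change(spec, detections, when):
    """Return (count, significant), or None if the temporal constraint excludes `when`.

    detections is a list of (class_name, (x, y)) pairs produced by some detector.
    """
    if spec.weekdays_only and when.weekday() >= 5:
        return None
    count = sum(1 for cls, pos in detections
                if cls in spec.object_classes and point_in_polygon(pos, spec.locale))
    return count, count > spec.threshold


def activity_alert(cues):
    """Activity modeling: report only when all individual change cues co-occur."""
    return all(cues)
```

For the airfield example, `activity_alert([new_aircraft, storage_activity])` would report only the simultaneous occurrence of the two cues, each of which is unremarkable on its own.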

2.2  Registration  Algorithms 

We  are  investigating  two  types  of  registration  processes, 
image-to-site  model  registration  and  image-to-image  reg¬ 
istration.  Prior  to  any  CD  task,  the  newly  acquired  im¬ 
age  needs  to  be  registered  to  the  existing  site  model. 
Depending  on  the  particular  CD  task,  e.g.,  if  building  or 
vehicle  related  activity  is  being  monitored,  we  can  use 
the  site  model  and  viewing  direction  of  the  new  image 
to  identify  regions  in  the  image  that  need  further  study. 
We can subsequently invoke the necessary IU algorithms
related to building detection, vehicle location and counting
(and road extraction, if construction of roads is monitored).
For tasks such as these, there are two very important
model-based IU techniques:

Registering  images  to  site  models.  Consider,  for 
example,  the  problem  of  identifying  the  region  in  an 
aerial  image  corresponding  to  a  given  parking  lot.  While 
estimates  of  sensor  and  platform  parameters  are  known, 
it  is  not  sufficient  to  simply  project  the  parking  lot 
boundaries  onto  the  image  plane  using  these  parame¬ 
ters,  since  these  parameters  are  subject  to  errors.  Fur¬ 
thermore,  determining  which  parts  of  the  parking  lot 
are  visible  in  the  image  (since  parts  of  the  parking  lot 
can  be  occluded  by  other  elements  of  the  site  model) 
and  the  illumination  conditions  in  the  visible  part  of  the 
parking  lot  (parts  of  which  may  be  in  shadow  depend¬ 
ing  on  sun  angle  and  site  model  geometry)  are  critical 
to  subsequently  making  a  correct  decision  as  to  whether 
there  is  a  significant  difference  between  the  numbers  of 
observed  and  expected  vehicles  in  the  parking  lot.  In 
fact,  the  feasibility  of  performing  a  CD  task  depends  on 
the IU system correctly modeling the relationship between
a given image and the site model (for example, if
we were interested in whether a large number of vehicles
are parked near a certain building, it could be important
to determine if that part of the parking lot is, in fact,
visible in the image). As described in the next section,
ATMS-based techniques can also be used for registering
images to site models.

Model-based object recognition. Many CD tasks
involve the identification of objects in constrained parts
of the image, along with an analysis of their density and
spatial distribution. If the image has sufficient spatial
resolution, and CAD models are available for the objects
of interest, then IU techniques such as alignment [9] and
geometric hashing [10] are appropriate. Given that good
estimates of the height and orientation of the sensor with
respect to the ground plane are known a priori (and can
be improved by the site-model-to-image registration process),
a method such as the one developed at Maryland
by Silberberg et al. [11] can be employed; this method
has the minimal combinatorial complexity of any 3-D object
recognition algorithm, requiring that only a single
image feature be matched against a single model feature
to determine the image-to-model transformation. In situations
where the spatial resolution is not sufficient, or
where geometric models are not available, one can employ
general-purpose model-based image segmentation
algorithms such as Programmed Picture Logic (PPL)
[12], which can be extended to “learn” the descriptions
of objects of interest (through a few visual examples),
and can then identify remaining instances viewed under
similar conditions of viewpoint and illumination.

In  addition  to  image-to-site  model  registration,  which 
will  be  directly  useful  for  CD,  we  are  also  developing  a 
general-purpose  image-to-image  registration  algorithm. 
Such  an  algorithm  will  be  useful  for  building  site  mod¬ 
els  and  for  orienting  an  image  in  a  “favored  position”. 
The  traditional  stereo  paradigm  [13]  for  inferring  3-D 
structure is not applicable to images acquired from severe
off-nadir viewing directions. Our goal is to develop
a  completely  automatic  registration  algorithm  using  site 
models  and  any  auxiliary  information  such  as  camera  pa¬ 
rameters.  Site  models  will  be  very  useful  for  registering 
two  severely  off-nadir  images,  as  we  can  predict  the  con¬ 
trasts  of  features  in  both  images,  occlusions  of  features 
and  shadow  regions. 

An  integral  component  of  site  model  based  registration 
and  change  detection  is  the  availability  of  site  models. 
We  are  working  on  site  model  construction  and  updating 
on  an  ongoing  basis.  The  solution  to  site  model  construc¬ 
tion  assumes  that  several  overlapping  coverage  images 
are  available.  We  will  initially  construct  a  site  model 
using  the  RCDE  site  model  rendering  system.  Image- 
to-model  registration  algorithms  will  be  used  to  register 
each  image  to  the  site  model.  When  two  or  more  im¬ 
ages  confirm  the  same  hypotheses  about  the  underlying 
object,  the  initial  assertions  about  the  object  will  be  re¬ 
placed  by  image-derived  assertions.  This  will  be  done 
in  an  incremental  fashion.  During  the  early  stages,  the 
errors  due  to  incomplete  specification  of  site  models  may 
be  handled  by  allowing  more  tolerance  in  the  predicted 
positions  of  features  and  their  computed  attributes.  As 
more  images  become  available,  the  representation  error 
will  decrease. 

2.3 Truth Maintenance

Algorithms  for  CD  and  image  registration  need  to  per¬ 
form  model  based  search.  Any  search  method  used  for 
these  tasks  should  have  the  following  characteristics: 


1)  It  should  be  able  to  handle  uncertainties  in  the  lo¬ 
cations  and  attributes  of  features  due  to  errors  in 
feature  extraction,  incomplete  site  models,  etc. 

2) It should be efficient. Since the high resolution RADIUS
images may produce large numbers of acceptable
features, the search space may be prohibitively
large. Efficiency is achieved by avoiding futile backtracking
and eliminating rediscovery of inferences.

3) It should permit differential diagnosis so that
solutions  can  be  directly  compared  with  one  another 
at  different  points  in  the  search  space  and  the  “best” 
interpretation  chosen.  (Note  that  more  than  one 
reasonable  solution  may  be  found  in  real  life  situa¬ 
tions.) 

4)  It  should  be  able  to  make  use  of  geometric  and  pho¬ 
tometric  constraints  derived  from  site  models. 

5)  It  should  be  able  to  accept  guidance  from  the  user. 

The  second  requirement  eliminates  exhaustive  search 
and  depth-first  search,  also  known  as  chronological  back¬ 
tracking.  In  chronological  backtracking  [14],  when  the 
search  backtracks  because  it  encounters  contradictions, 
it  forgets  any  inferences  made  in  that  portion  of  the 
search  space.  These  inferences  will  be  rediscovered  at 
other places. Also, the underlying reasons for the contradictions
are not stored, so that these contradictions may
be  encountered  again  and  again.  As  a  solution  to  this 
“forgetting”  problem,  one  may  consider  dependency- 
directed  backtracking  [15].  In  this  scheme,  dependency 
records  are  maintained  for  each  inference.  These  records 
link  each  inference  to  its  antecedents.  When  a  contra¬ 
diction  is  encountered,  the  dependency  records  are  used 
to  backtrack  to  the  most  recent  selection  which  actu¬ 
ally  contributed  to  the  contradiction,  instead  of  resorting 
to  chronological  backtracking.  This  avoids  futile  back¬ 
tracking.  Also,  the  dependency  records  are  bidirectional 
and  are  used  to  reinstate  previously  derived  information 
in  different  portions  of  the  search  space,  thus  avoiding 
rediscovery  of  inferences.  When  a  contradiction  is  en¬ 
countered  the  underlying  assumptions  that  actually  led 
to  the  conflict  are  stored  for  future  reference  as  nogoods. 
These  nogoods  are  used  to  prevent  rediscovery  of  contra¬ 
dictions,  and  hence  improve  efficiency. 
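
The role of nogoods can be illustrated with a small search sketch. The Python fragment below is not the registration search itself; it is a simplified, hypothetical example (the function name and the pairwise-conflict interface are assumptions) that enumerates consistent assignments and caches every contradiction as a nogood, so that the same clash is never re-derived. It does not implement full dependency-directed backjumping.

```python
def search_with_nogoods(variables, domains, conflicts):
    """Enumerate all consistent assignments, caching contradictions as nogoods.

    conflicts(a, b) returns True if the assignments a=(var, val) and b=(var, val) clash.
    Returns (solutions, nogoods, checks), where checks counts calls to conflicts().
    """
    nogoods = set()     # frozensets of two assignments known to be contradictory
    solutions = []
    checks = 0

    def clashes(new, chosen):
        nonlocal checks
        for old in chosen:
            pair = frozenset((old, new))
            if pair in nogoods:
                return True          # known contradiction: not rediscovered
            checks += 1
            if conflicts(old, new):
                nogoods.add(pair)    # store the assumptions that led to the conflict
                return True
        return False

    def extend(i, chosen):
        if i == len(variables):
            solutions.append(dict(chosen))
            return
        var = variables[i]
        for val in domains[var]:
            new = (var, val)
            if not clashes(new, chosen):
                extend(i + 1, chosen + [new])

    extend(0, [])
    return solutions, nogoods, checks


# Example: match three model features to three image features, one-to-one.
model = ["m1", "m2", "m3"]
image = {m: ["a", "b", "c"] for m in model}
same_target = lambda p, q: p[1] == q[1]   # two model features cannot share an image feature
solutions, nogoods, checks = search_with_nogoods(model, image, same_target)
```

In this toy problem each contradictory pair is evaluated once and afterwards recognized from the stored nogoods.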

Truth maintenance systems (TMSs) of various kinds
employ dependency-directed backtracking. A justification-based
truth maintenance system (JTMS) [7] works
with nodes, each of which corresponds to a problem solver
datum. Associated with each node is a justification
which summarizes how the node originated. Truth maintenance
is a procedure that decides which problem solver
data are to be believed (in) and which disbelieved (out). It uses
dependency-directed backtracking for this purpose.

In our registration algorithms we plan to use an ATMS
[8, 16, 17] for search. An ATMS works by manipulating
assumption sets, which are primitive data from which
all other data are derived. It works on the principle that
assumption sets can be manipulated much more conveniently
than the data sets they represent. It simplifies
truth maintenance and eliminates backtracking. When
using an ATMS, all contexts (points in the search space)
are simultaneously visible to the problem solver. This
permits differential diagnosis. An ATMS will find all
the solutions that pure enumeration would. At the same
time, it improves the efficiency of the problem solver
without sacrificing coherence and completeness.

An important advantage of using an ATMS is its ability
to integrate IA-derived and site-model-derived information.
The nogoods mentioned above can be set by
the IA. For example, if the sunlight is diffuse, so that
crisp shadow information is unlikely to be available, the
use of shadows can be declared as a nogood condition.
All inferences that use shadow information will then be
avoided and the final result will be one which did not
use any information about shadows. As an example of
site-model-derived constraints, the relative positions of
features in the two images to be registered must be consistent
with those predicted by the site model and any
violations can be declared as nogoods. Topological constraints
derived from site models can also be used to
define nogoods.
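
The shadow example can be sketched in a few lines of Python. The fragment below is a hypothetical illustration of ATMS-style label computation, not an implementation of any particular ATMS: environments are sets of assumptions, a datum's label is the set of minimal environments under which it holds, and any environment containing a declared nogood is discarded. The assumption names are invented for the example.

```python
from itertools import product


def minimal(envs):
    """Keep only environments that have no proper subset in the collection."""
    envs = set(envs)
    return {e for e in envs if not any(other < e for other in envs)}


def consistent(env, nogoods):
    """An environment is inconsistent if it contains any nogood."""
    return not any(ng <= env for ng in nogoods)


def derive_label(antecedent_labels, nogoods):
    """Label of a datum justified by the conjunction of its antecedents."""
    combined = {frozenset().union(*combo) for combo in product(*antecedent_labels)}
    return minimal(e for e in combined if consistent(e, nogoods))


# Each assumption node is labeled by the environment containing only itself.
edge = {frozenset({"edge_match"})}
shadow = {frozenset({"shadow_cue"})}
corner = {frozenset({"corner_match"})}

# The IA declares shadows unusable (diffuse sunlight).
nogoods = {frozenset({"shadow_cue"})}

# "building_present" has two justifications: edge & shadow, or edge & corner.
via_shadow = derive_label([edge, shadow], nogoods)
via_corner = derive_label([edge, corner], nogoods)
building_label = minimal(via_shadow | via_corner)
```

With the nogood in place, the only surviving support for the hypothesis is the environment that does not use shadow information.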

The  ATMS  will  also  be  useful  for  handling  the  be¬ 
lief  revisions  involved  in  site  model  updating.  In  an 
ATMS uncertainty is dealt with by making assumptions
(hypotheses) whose beliefs can be later revised as new
evidence  accumulates.  Another  advantage  in  using  an 
ATMS  is  that  it  provides  mechanisms  for  the  automatic 
construction  of  the  inference  network.  A  very  simple 
version  of  a  JTMS  was  used  in  3D  MOSAIC  [18]  for 
incremental  reconstruction  of  site  models  in  a  simple  do¬ 
main.  Since  an  ATMS  is  much  more  versatile  than  a 
JTMS,  we  will  be  able  to  handle  the  more  complicated 
site  models  that  arise  in  RADIUS  applications. 

2.4 Usability Analysis

One of the key contributions of TASC to our project will
be to assist us in developing IU capabilities which are
amenable to use by IAs. This assistance will be critical
in two main areas: support of the ATMS-based registration
and site-model updating algorithm development,
and assistance in developing the specification interface
for goal-directed CD.

The  expectation  is  that  ATMS-based  techniques  will 
be  developed  in  such  a  way  that  they  can  be  integrated 
into  an  interactive  environment.  Such  interaction  must 
satisfy  the  following  constraints: 

1)  The  algorithms  developed  must  provide  rapid  feed¬ 
back  to  the  user  about  their  current  state  and  the 
quality  of  the  product  being  generated. 

2) Any global algorithm parameters required to be set
by the IA must have an intuitive interpretation. The
appropriate values of these parameters should depend
in a direct manner on factors easily accessible
to the IA, for example, sun angle and general image
quality.

3) The effect of this parameterization on algorithm behavior
should also be intuitive.

4) An interactive capability should be available to interject
additional IA guidance as the algorithm progresses.

5) The ability of the algorithms to backtrack is crucial.
The IA should be able to assert new information or
controvert results obtained by the IU system.

A  similar  constructive  role  will  be  played  by  the  TASC 
team  in  supporting  the  development  of  the  specification 
interface for goal-directed CD. The technology is not ripe
for automatic translation of IA queries in natural language
to define the set of IU tasks that need to be performed.
Even preliminary efforts and successes in bringing
IU algorithms a step closer to the eventual users in
the intelligence community will be worthwhile. There
are several issues to be addressed from an IA point of
view:

1)  The  interface  must  be  powerful  enough  to  commu¬ 
nicate  a  broad  variety  of  objectives. 

2)  The  primitives  must  relate  directly  to  the  analysis 
paradigm  employed  by  the  analyst. 

3)  The  analyst  must  be  confident  that  the  specification 
interface  has  encoded  all  the  constraints  that  he/she 
has  imposed. 

An IA must have significant confidence in the system in
order  to  be  willing  to  use  it.  It  is  impossible  to  verify 
that  all  changes  of  relevance  have  been  detected  without 
human  review. 

3  Accomplishments  to  Date 

During  our  first  four  months  under  contract  (as  of  the 
time  of  the  writing  of  this  paper)  we  have  made  consid¬ 
erable  progress  on  several  fronts:  (1)  We  have  developed 
a  novel  image-to-image  registration  algorithm  that  can 
automatically  register  two  off-nadir  images.  When  addi¬ 
tional  information  about  camera  parameters  is  available, 
the  algorithm  becomes  considerably  simpler.  More  de¬ 
tails  about  the  algorithm,  and  experimental  results  ob¬ 
tained  on  model  board  and  real  images,  are  given  in  the 
remainder  of  this  section.  (2)  We  have  tested  the  line 
detector  developed  by  Venkateswar  and  Chellappa  [19] 
on  model  board  imagery.  The  line  outputs  will  be  useful 
for  building  detection  [2]  algorithms. 

In the following subsections, we first discuss the transform
between images taken from two off-nadir orientations.
We then present our image registration algorithm,
including details such as feature point extraction
and  match  verification.  Experimental  results  on  model 
board  and  real  aerial  images  provided  by  the  sponsor  are 
presented. 

3.1  3-D  Transform 

Some notation. In the following discussion we use the
notation given below:

$$T_x(\theta) = \begin{bmatrix} 1 & 0 & 0 \\ 0 & \cos\theta & \sin\theta \\ 0 & -\sin\theta & \cos\theta \end{bmatrix} \tag{1}$$

$$T_z(\theta) = \begin{bmatrix} \cos\theta & \sin\theta & 0 \\ -\sin\theta & \cos\theta & 0 \\ 0 & 0 & 1 \end{bmatrix} \tag{2}$$

$$\epsilon = \frac{e}{f} \tag{3}$$

where $e$ is the resolution when digitising an image, and
$f$ is the focal length of the camera.

Off-nadir to nadir transform. We begin by considering
the transformation from off-nadir camera coordinates
$x_c y_c z_c$ to the world coordinate system $x_w y_w z_w$ with the
$x_w$-$y_w$ plane parallel to the ground plane. The camera
coordinates are characterised by the triplet $(\alpha, \beta, \gamma)$, where
$\alpha$ is the angle from the north direction to the positive
direction of the camera rotation axis, $\beta$ is the angle from
$z_w$ to $z_c$, and $\gamma$ is the angle from the $x$ axis to the north
direction.* Figure 2 shows the definitions of $\alpha$, $\beta$ and $\gamma$.


Let the triplet $(\alpha, \beta, \gamma)$ be $(\alpha_w, \beta_w, \gamma_w)$ when measured
in world coordinates and $(\alpha_c, \beta_c, \gamma_c)$ when measured in
camera coordinates. For real applications, normally $\alpha$
and $\beta$ are available in the world coordinate system while
$\gamma$ is defined with respect to the camera (image) coordinate
system. The relationship between the two triplets
is

$$\beta_c = \beta_w = \beta \tag{4}$$

$$\alpha_c = \arctan\frac{\sin\alpha_w \cos\beta}{\cos\alpha_w} = \arctan(\tan\alpha_w \cos\beta) \tag{5}$$

$$\alpha_c + \gamma_c = \alpha_w + \gamma_w, \quad \text{i.e.} \quad \gamma_c = \alpha_w + \gamma_w - \arctan(\tan\alpha_w \cos\beta) \tag{6}$$

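The 3-D transform from the world coordinates to the
camera coordinates can be written in terms of rotations
with respect to the $x$ and $z$ axes defined in (1) and (2)
as

$$\begin{aligned} V_c &= T_z(-\alpha_c - \gamma_c + \alpha_w + \gamma_w)\, T_z(-\alpha_w - \gamma_w)\, T_x(\beta)\, T_z(\alpha_w + \gamma_w)\, V_w + V_{wc} \\ &= T_z(-\alpha_c - \gamma_c)\, T_x(\beta)\, T_z(\alpha_w + \gamma_w)\, V_w + V_{wc} \end{aligned} \tag{7}$$

where $V_{wc}$ accounts for the 3-D translation.

The transformation from an off-nadir system to the
world system is

$$V_w = T_z(-\alpha_w - \gamma_w)\, T_x(-\beta)\, T_z(\alpha_c + \gamma_c)\, V_c + V_{cw} \tag{8}$$

where

$$V_{cw} = -T_z(-\alpha_w - \gamma_w)\, T_x(-\beta)\, T_z(\alpha_c + \gamma_c)\, V_{wc}.$$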

*The reason for using the triplet $(\alpha, \beta, \gamma)$ to represent the
camera orientation is that in real applications, the north
direction is used as the reference to define the camera rotation
axis.


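Off-nadir to off-nadir transform. Consider two off-nadir
cameras characterised by triplets $(\alpha_{c1}, \beta_1, \gamma_{c1})$ and
$(\alpha_{c2}, \beta_2, \gamma_{c2})$ respectively. From (7) and (8), the
transformation from one off-nadir system to the other is

$$\begin{aligned} V_{c2} &= T_z(-\alpha_{c2} - \gamma_{c2})\, T_x(\beta_2)\, T_z(\alpha_{w2} + \gamma_{w2})\, T_z(\gamma_{w1} - \gamma_{w2}) \left[ T_z(-\alpha_{w1} - \gamma_{w1})\, T_x(-\beta_1)\, T_z(\alpha_{c1} + \gamma_{c1})\, V_{c1} + V_{c1w} \right] + V_{wc2} \\ &= T_z(-\alpha_{c2} - \gamma_{c2})\, T_x(\beta_2)\, T_z(\alpha_{w2} - \alpha_{w1})\, T_x(-\beta_1)\, T_z(\alpha_{c1} + \gamma_{c1})\, V_{c1} + V_{12} \\ &= R\, V_{c1} + V_{12} \end{aligned} \tag{9}$$

where

$$\varphi_2 = \alpha_{c2} + \gamma_{c2} = \alpha_{w2} + \gamma_{w2} \tag{10}$$

$$\varphi_1 = \alpha_{c1} + \gamma_{c1} = \alpha_{w1} + \gamma_{w1} \tag{11}$$

$$\theta = \alpha_{w2} - \alpha_{w1} \tag{12}$$

$$V_{12} = T_z(-\varphi_2)\, T_x(\beta_2)\, T_z(\alpha_{w2} + \gamma_{w1})\, V_{c1w} + V_{wc2} \tag{13}$$

$$R = T_z(-\varphi_2)\, T_x(\beta_2)\, T_z(\theta)\, T_x(-\beta_1)\, T_z(\varphi_1) = \begin{bmatrix} r_{11} & r_{12} & r_{13} \\ r_{21} & r_{22} & r_{23} \\ r_{31} & r_{32} & r_{33} \end{bmatrix} \tag{14}$$

$$\begin{aligned} r_{11} ={}& \cos\theta \cos\varphi_1 \cos\varphi_2 - \sin\theta \sin\varphi_1 \cos\varphi_2 \cos\beta_1 + \sin\theta \cos\varphi_1 \sin\varphi_2 \cos\beta_2 \\ &+ \cos\theta \sin\varphi_1 \sin\varphi_2 \cos\beta_1 \cos\beta_2 + \sin\varphi_1 \sin\varphi_2 \sin\beta_1 \sin\beta_2 \end{aligned} \tag{15}$$

$$\begin{aligned} r_{12} ={}& \cos\theta \sin\varphi_1 \cos\varphi_2 + \sin\theta \cos\varphi_1 \cos\varphi_2 \cos\beta_1 + \sin\theta \sin\varphi_1 \sin\varphi_2 \cos\beta_2 \\ &- \cos\theta \cos\varphi_1 \sin\varphi_2 \cos\beta_1 \cos\beta_2 - \cos\varphi_1 \sin\varphi_2 \sin\beta_1 \sin\beta_2 \end{aligned} \tag{16}$$

$$r_{13} = -\sin\theta \cos\varphi_2 \sin\beta_1 + \cos\theta \sin\varphi_2 \sin\beta_1 \cos\beta_2 - \sin\varphi_2 \cos\beta_1 \sin\beta_2 \tag{17}$$

$$\begin{aligned} r_{21} ={}& \cos\theta \cos\varphi_1 \sin\varphi_2 - \sin\theta \sin\varphi_1 \sin\varphi_2 \cos\beta_1 - \sin\theta \cos\varphi_1 \cos\varphi_2 \cos\beta_2 \\ &- \cos\theta \sin\varphi_1 \cos\varphi_2 \cos\beta_1 \cos\beta_2 - \sin\varphi_1 \cos\varphi_2 \sin\beta_1 \sin\beta_2 \end{aligned} \tag{18}$$

$$\begin{aligned} r_{22} ={}& \cos\theta \sin\varphi_1 \sin\varphi_2 + \sin\theta \cos\varphi_1 \sin\varphi_2 \cos\beta_1 - \sin\theta \sin\varphi_1 \cos\varphi_2 \cos\beta_2 \\ &+ \cos\theta \cos\varphi_1 \cos\varphi_2 \cos\beta_1 \cos\beta_2 + \cos\varphi_1 \cos\varphi_2 \sin\beta_1 \sin\beta_2 \end{aligned} \tag{19}$$

$$r_{23} = -\sin\theta \sin\varphi_2 \sin\beta_1 - \cos\theta \cos\varphi_2 \sin\beta_1 \cos\beta_2 + \cos\varphi_2 \cos\beta_1 \sin\beta_2 \tag{20}$$

$$r_{31} = \sin\theta \cos\varphi_1 \sin\beta_2 + \cos\theta \sin\varphi_1 \cos\beta_1 \sin\beta_2 - \sin\varphi_1 \sin\beta_1 \cos\beta_2 \tag{21}$$

$$r_{32} = \sin\theta \sin\varphi_1 \sin\beta_2 - \cos\theta \cos\varphi_1 \cos\beta_1 \sin\beta_2 + \cos\varphi_1 \sin\beta_1 \cos\beta_2 \tag{22}$$

$$r_{33} = \cos\theta \sin\beta_1 \sin\beta_2 + \cos\beta_1 \cos\beta_2 \tag{23}$$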
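
As a numerical check on the composition in (14), the following Python sketch builds $R$ from the elementary rotations (1) and (2) and compares two entries against the closed forms (17) and (23). The function names and the test angles are arbitrary choices made for this illustration.

```python
import math


def t_x(theta):
    """Rotation about the x axis, equation (1)."""
    c, s = math.cos(theta), math.sin(theta)
    return [[1.0, 0.0, 0.0], [0.0, c, s], [0.0, -s, c]]


def t_z(theta):
    """Rotation about the z axis, equation (2)."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, s, 0.0], [-s, c, 0.0], [0.0, 0.0, 1.0]]


def matmul(a, b):
    """Product of two 3x3 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def rotation_between(phi1, beta1, theta, beta2, phi2):
    """R = T_z(-phi2) T_x(beta2) T_z(theta) T_x(-beta1) T_z(phi1), equation (14)."""
    r = t_z(-phi2)
    for m in (t_x(beta2), t_z(theta), t_x(-beta1), t_z(phi1)):
        r = matmul(r, m)
    return r


# Arbitrary test angles in radians (illustrative values only).
phi1, beta1, theta, beta2, phi2 = 0.3, 0.5, 0.2, 0.4, -0.1
R = rotation_between(phi1, beta1, theta, beta2, phi2)

# Closed forms (17) and (23) for comparison.
r13 = (-math.sin(theta) * math.cos(phi2) * math.sin(beta1)
       + math.cos(theta) * math.sin(phi2) * math.sin(beta1) * math.cos(beta2)
       - math.sin(phi2) * math.cos(beta1) * math.sin(beta2))
r33 = (math.cos(theta) * math.sin(beta1) * math.sin(beta2)
       + math.cos(beta1) * math.cos(beta2))
```

Because $R$ is a product of rotations it is orthonormal, which gives an additional sanity check on any implementation.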

3.2 Image Transform

For  nadir  images,  the  ground  plane  is  parallel  to  the 
image  plane,  leading  to  a  simple  relationship  between 
the  image  plane  coordinates  and  the  3-D  x  and  y  coor¬ 
dinates.  When  dealing  with  off-nadir  images,  there  is 
a  slant  component  in  the  depth,  depending  on  the  dis¬ 
tance  between  the  ground  point  and  the  camera  rotation 
axis. In our triplet $(\alpha, \beta, \gamma)$ camera rotation model, the
equations for the camera rotation axis are

$$x_{c1} \sin\varphi_1 - y_{c1} \cos\varphi_1 = 0, \qquad z_{c1} = z_0 .$$

Hence the distance from a point $(x_{c1}, y_{c1}, z_{c1})$ to the
camera rotation axis in the $O_c x_c y_c$ plane is

$$d = x_{c1} \sin\varphi_1 - y_{c1} \cos\varphi_1 \tag{24}$$

and the depth $z_{c1}$ can be determined by

$$z_{c1} = z_0 + (x_{c1} \sin\varphi_1 - y_{c1} \cos\varphi_1) \tan\beta_1 = z_0 + A x_{c1} + B y_{c1} \tag{25}$$

where

$$A = \sin\varphi_1 \tan\beta_1, \qquad B = -\cos\varphi_1 \tan\beta_1 ,$$

and $z_0$ is the distance from the camera to the center of the
scene.

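Let $(X, Y)$ be the image plane coordinates (indexes);
we then have

$$X\epsilon = \frac{Xe}{f} = \frac{x}{z}, \qquad Y\epsilon = \frac{Ye}{f} = \frac{y}{z} .$$

The depth in the first camera coordinate system can
be represented in terms of the image plane coordinates:

$$z_{c1} = A x_{c1} + B y_{c1} + z_0 = A X_1 \epsilon_1 z_{c1} + B Y_1 \epsilon_1 z_{c1} + z_0 = \frac{z_0}{1 - A X_1 \epsilon_1 - B Y_1 \epsilon_1} .$$

The relationship between the two image planes, with
$(r_{14}, r_{24}, r_{34})^T$ denoting the components of the translation $V_{12}$ in (9), is

$$\begin{aligned} X_2 \epsilon_2 = \frac{x_{c2}}{z_{c2}} &= \frac{r_{11} x_{c1} + r_{12} y_{c1} + r_{13} z_{c1} + r_{14}}{r_{31} x_{c1} + r_{32} y_{c1} + r_{33} z_{c1} + r_{34}} \\ &= \frac{r_{11} X_1 \epsilon_1 + r_{12} Y_1 \epsilon_1 + r_{13} + \frac{r_{14}}{z_0}(1 - A X_1 \epsilon_1 - B Y_1 \epsilon_1)}{r_{31} X_1 \epsilon_1 + r_{32} Y_1 \epsilon_1 + r_{33} + \frac{r_{34}}{z_0}(1 - A X_1 \epsilon_1 - B Y_1 \epsilon_1)} \end{aligned} \tag{27}$$

$$\begin{aligned} Y_2 \epsilon_2 = \frac{y_{c2}}{z_{c2}} &= \frac{r_{21} x_{c1} + r_{22} y_{c1} + r_{23} z_{c1} + r_{24}}{r_{31} x_{c1} + r_{32} y_{c1} + r_{33} z_{c1} + r_{34}} \\ &= \frac{r_{21} X_1 \epsilon_1 + r_{22} Y_1 \epsilon_1 + r_{23} + \frac{r_{24}}{z_0}(1 - A X_1 \epsilon_1 - B Y_1 \epsilon_1)}{r_{31} X_1 \epsilon_1 + r_{32} Y_1 \epsilon_1 + r_{33} + \frac{r_{34}}{z_0}(1 - A X_1 \epsilon_1 - B Y_1 \epsilon_1)} \end{aligned} \tag{28}$$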

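When the camera parameters $(\alpha_1, \beta_1, \gamma_1)$, $(\alpha_2, \beta_2, \gamma_2)$,
$\epsilon_1$ and $\epsilon_2$ are available, $r_{11}$, $r_{12}$, $r_{13}$, $r_{21}$, $r_{22}$, $r_{23}$,
$r_{31}$, $r_{32}$, $r_{33}$, $A$, and $B$ are determined. The image
registration problem is then equivalent to estimating the
three translation parameters $r_{14}/z_0$, $r_{24}/z_0$ and $r_{34}/z_0$ from the
correspondences between matched points, which can be
achieved by solving a set of linear equations with three
unknowns $r_{14}/z_0$, $r_{24}/z_0$ and $r_{34}/z_0$:

$$\begin{aligned} &\frac{r_{14}}{z_0}(1 - A X_{1i} \epsilon_1 - B Y_{1i} \epsilon_1) - \frac{r_{34}}{z_0}(1 - A X_{1i} \epsilon_1 - B Y_{1i} \epsilon_1) X_{2i} \epsilon_2 \\ &\qquad = (r_{31} X_{1i} \epsilon_1 + r_{32} Y_{1i} \epsilon_1 + r_{33}) X_{2i} \epsilon_2 - (r_{11} X_{1i} \epsilon_1 + r_{12} Y_{1i} \epsilon_1 + r_{13}) \end{aligned} \tag{29}$$

$$\begin{aligned} &\frac{r_{24}}{z_0}(1 - A X_{1i} \epsilon_1 - B Y_{1i} \epsilon_1) - \frac{r_{34}}{z_0}(1 - A X_{1i} \epsilon_1 - B Y_{1i} \epsilon_1) Y_{2i} \epsilon_2 \\ &\qquad = (r_{31} X_{1i} \epsilon_1 + r_{32} Y_{1i} \epsilon_1 + r_{33}) Y_{2i} \epsilon_2 - (r_{21} X_{1i} \epsilon_1 + r_{22} Y_{1i} \epsilon_1 + r_{23}) \end{aligned} \tag{30}$$

for $i = 1, \ldots, N$, where $N$ is the number of matched
points.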
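
A minimal sketch of this estimation step is given below, assuming the rotation entries, $A$, $B$, $\epsilon_1$ and $\epsilon_2$ are already known. Each matched pair contributes one row for (29) and one for (30), and the three unknowns are found by least squares through the normal equations. The function names, the solver, and the synthetic data used to exercise it are all assumptions of this illustration and are not taken from the implementation described in this paper.

```python
def solve_linear(m, rhs):
    """Solve m x = rhs by Gauss-Jordan elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(row) + [b] for row, b in zip(m, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                f = aug[r][col] / aug[col][col]
                aug[r] = [x - f * y for x, y in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def least_squares(rows, rhs):
    """Least-squares solution of an overdetermined system via normal equations."""
    n = len(rows[0])
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(n)] for i in range(n)]
    atb = [sum(r[i] * b for r, b in zip(rows, rhs)) for i in range(n)]
    return solve_linear(ata, atb)


def project(R, A, B, eps1, eps2, t, X1, Y1):
    """Equations (27)-(28); t = (r14/z0, r24/z0, r34/z0)."""
    x, y = X1 * eps1, Y1 * eps1
    q = 1.0 - A * x - B * y
    den = R[2][0] * x + R[2][1] * y + R[2][2] + t[2] * q
    X2 = (R[0][0] * x + R[0][1] * y + R[0][2] + t[0] * q) / den / eps2
    Y2 = (R[1][0] * x + R[1][1] * y + R[1][2] + t[1] * q) / den / eps2
    return X2, Y2


def estimate_translation(R, A, B, eps1, eps2, matches):
    """Stack equations (29)-(30) for all matches and solve for the three unknowns."""
    rows, rhs = [], []
    for (X1, Y1), (X2, Y2) in matches:
        x, y = X1 * eps1, Y1 * eps1
        q = 1.0 - A * x - B * y
        den = R[2][0] * x + R[2][1] * y + R[2][2]
        rows.append([q, 0.0, -q * X2 * eps2])                                 # (29)
        rhs.append(den * X2 * eps2 - (R[0][0] * x + R[0][1] * y + R[0][2]))
        rows.append([0.0, q, -q * Y2 * eps2])                                 # (30)
        rhs.append(den * Y2 * eps2 - (R[1][0] * x + R[1][1] * y + R[1][2]))
    return least_squares(rows, rhs)


# Synthetic check with made-up parameters (not data from the experiments).
R = [[0.9, -0.1, 0.05], [0.1, 0.95, -0.02], [-0.03, 0.04, 0.98]]
A, B, eps1, eps2 = 0.2, -0.1, 0.001, 0.001
t_true = (0.05, -0.02, 0.1)
pts = [(-200, -150), (180, -120), (150, 200), (-170, 160), (0, 0), (60, -40)]
matches = [(p, project(R, A, B, eps1, eps2, t_true, *p)) for p in pts]
t_est = estimate_translation(R, A, B, eps1, eps2, matches)
```

With noise-free synthetic matches the estimate reproduces the translation parameters used to generate them; with real matches the least-squares formulation averages over matching errors.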

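When the camera information is not available, note
that $r_{33} z_0 + r_{34}$, the depth of the nadir point in the
second camera coordinate system, is always nonzero. We
can then rewrite (27-28) as

$$\begin{aligned} X_2 ={}& \frac{r_{13} + \frac{r_{14}}{z_0}}{(r_{33} + \frac{r_{34}}{z_0}) \epsilon_2} + \frac{(r_{11} - \frac{r_{14}}{z_0} A) \epsilon_1}{(r_{33} + \frac{r_{34}}{z_0}) \epsilon_2} X_1 + \frac{(r_{12} - \frac{r_{14}}{z_0} B) \epsilon_1}{(r_{33} + \frac{r_{34}}{z_0}) \epsilon_2} Y_1 \\ &+ \frac{(-r_{31} + \frac{r_{34}}{z_0} A) \epsilon_1 \epsilon_2}{(r_{33} + \frac{r_{34}}{z_0}) \epsilon_2} X_1 X_2 + \frac{(-r_{32} + \frac{r_{34}}{z_0} B) \epsilon_1 \epsilon_2}{(r_{33} + \frac{r_{34}}{z_0}) \epsilon_2} Y_1 X_2 \end{aligned} \tag{31}$$

$$\begin{aligned} Y_2 ={}& \frac{r_{23} + \frac{r_{24}}{z_0}}{(r_{33} + \frac{r_{34}}{z_0}) \epsilon_2} + \frac{(r_{21} - \frac{r_{24}}{z_0} A) \epsilon_1}{(r_{33} + \frac{r_{34}}{z_0}) \epsilon_2} X_1 + \frac{(r_{22} - \frac{r_{24}}{z_0} B) \epsilon_1}{(r_{33} + \frac{r_{34}}{z_0}) \epsilon_2} Y_1 \\ &+ \frac{(-r_{31} + \frac{r_{34}}{z_0} A) \epsilon_1 \epsilon_2}{(r_{33} + \frac{r_{34}}{z_0}) \epsilon_2} X_1 Y_2 + \frac{(-r_{32} + \frac{r_{34}}{z_0} B) \epsilon_1 \epsilon_2}{(r_{33} + \frac{r_{34}}{z_0}) \epsilon_2} Y_1 Y_2 \end{aligned} \tag{32}$$

Let

$$a_1 = \frac{r_{13} + \frac{r_{14}}{z_0}}{(r_{33} + \frac{r_{34}}{z_0}) \epsilon_2} \tag{33}$$

$$a_2 = \frac{r_{23} + \frac{r_{24}}{z_0}}{(r_{33} + \frac{r_{34}}{z_0}) \epsilon_2} \tag{34}$$

$$a_3 = \frac{(r_{11} - \frac{r_{14}}{z_0} A) \epsilon_1}{(r_{33} + \frac{r_{34}}{z_0}) \epsilon_2} \tag{35}$$

$$a_4 = \frac{(r_{21} - \frac{r_{24}}{z_0} A) \epsilon_1}{(r_{33} + \frac{r_{34}}{z_0}) \epsilon_2} \tag{36}$$

$$a_5 = \frac{(r_{12} - \frac{r_{14}}{z_0} B) \epsilon_1}{(r_{33} + \frac{r_{34}}{z_0}) \epsilon_2} \tag{37}$$

$$a_6 = \frac{(r_{22} - \frac{r_{24}}{z_0} B) \epsilon_1}{(r_{33} + \frac{r_{34}}{z_0}) \epsilon_2} \tag{38}$$

$$a_7 = \frac{(-r_{31} + \frac{r_{34}}{z_0} A) \epsilon_1 \epsilon_2}{(r_{33} + \frac{r_{34}}{z_0}) \epsilon_2} \tag{39}$$

$$a_8 = \frac{(-r_{32} + \frac{r_{34}}{z_0} B) \epsilon_1 \epsilon_2}{(r_{33} + \frac{r_{34}}{z_0}) \epsilon_2} \tag{40}$$

The transform from image 1 to image 2 can be determined
in terms of the eight parameters $a_i$, $i = 1, \ldots, 8$ as

$$X_2 = \frac{a_3 X_1 + a_5 Y_1 + a_1}{-a_7 X_1 - a_8 Y_1 + 1} \tag{41}$$

$$Y_2 = \frac{a_4 X_1 + a_6 Y_1 + a_2}{-a_7 X_1 - a_8 Y_1 + 1} \tag{42}$$

The eight parameters are obtained by solving the linear
equations

$$a_1 + X_{1i} a_3 + Y_{1i} a_5 + X_{1i} X_{2i} a_7 + Y_{1i} X_{2i} a_8 = X_{2i} \tag{43}$$

$$a_2 + X_{1i} a_4 + Y_{1i} a_6 + X_{1i} Y_{2i} a_7 + Y_{1i} Y_{2i} a_8 = Y_{2i} \tag{44}$$

for $i = 1, \ldots, N$, where $N$ is the number of matched
points.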
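
The eight-parameter case can be sketched in the same way. The Python fragment below stacks (43) and (44) for every matched pair, solves for $a_1, \ldots, a_8$ by least squares, and applies (41)-(42) to map a point of image 1 into image 2. The function names, the normal-equation solver and the synthetic parameter values are assumptions made for illustration; a production implementation would likely prefer a numerically more robust solver.

```python
def solve_linear(m, rhs):
    """Solve m x = rhs by Gauss-Jordan elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(row) + [b] for row, b in zip(m, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                f = aug[r][col] / aug[col][col]
                aug[r] = [x - f * y for x, y in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def least_squares(rows, rhs):
    """Least-squares solution of an overdetermined system via normal equations."""
    n = len(rows[0])
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(n)] for i in range(n)]
    atb = [sum(r[i] * b for r, b in zip(rows, rhs)) for i in range(n)]
    return solve_linear(ata, atb)


def apply_eight_parameters(a, X1, Y1):
    """Equations (41)-(42): map image-1 coordinates to image-2 coordinates."""
    a1, a2, a3, a4, a5, a6, a7, a8 = a
    den = 1.0 - a7 * X1 - a8 * Y1
    return (a3 * X1 + a5 * Y1 + a1) / den, (a4 * X1 + a6 * Y1 + a2) / den


def fit_eight_parameters(matches):
    """Stack equations (43)-(44); unknowns are ordered a1..a8."""
    rows, rhs = [], []
    for (X1, Y1), (X2, Y2) in matches:
        rows.append([1.0, 0.0, X1, 0.0, Y1, 0.0, X1 * X2, Y1 * X2])   # (43)
        rhs.append(X2)
        rows.append([0.0, 1.0, 0.0, X1, 0.0, Y1, X1 * Y2, Y1 * Y2])   # (44)
        rhs.append(Y2)
    return least_squares(rows, rhs)


# Synthetic check with made-up parameters (not data from the experiments).
a_true = [5.0, -3.0, 1.02, 0.05, -0.04, 0.98, 0.01, -0.02]
pts = [(-10, -8), (9, -7), (8, 10), (-9, 9), (0, 0), (3, -4), (-5, 2), (6, 6)]
matches = [(p, apply_eight_parameters(a_true, *p)) for p in pts]
a_est = fit_eight_parameters(matches)
```

At least four matched pairs in general position are needed to determine the eight unknowns; additional pairs over-determine the system and reduce the influence of individual matching errors.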

3.3  Algorithm 

Overview. Figure 3 illustrates the image registration
algorithm. Given two images, we first locate two control
points to get the north direction in each image. A small
number of feature points are then located using a Gabor
wavelet model for detecting local curvature discontinuities
[20]. The feature points extracted from different
frames are matched using area correlation. Three match
verification tests are used to exclude false matches. After
the initial matching is achieved, a multiresolution
transform-and-correct matching is implemented to obtain
a high accuracy registration. At each resolution, the
second frame $t_2$ is first transformed to the coordinates of
frame $t_1$ using the estimated matching parameters and
then matching refinement is performed on the feature
points of frame $t_1$.


Figure  3:  Block  diagram  of  image  registration  algorithm 

There  are  several  variations  in  the  implementation  of 
the above algorithm. First of all, we use two control
points  to  determine  the  north  direction  in  each  image. 
In  real  applications,  this  data  is  available  from  the  gy¬ 
rocompass  and  flight  diary.  For  applications  where  such 
data  is  not  available,  we  use  an  illuminant  direction  esti¬ 
mator  [21]  to  get  the  illuminant  directions  in  each  image 
and  then  align  the  images  according  to  known  illuminant 
difference.  The  use  of  the  illuminant  direction  estimator 
is  indicated  by  dashed  lines  in  Figure  3.  Secondly,  when 
camera  parameters  are  available,  only  three  translation 
parameters  are  required  for  determination  of  the  image 
transform; we solve (29-30) to get $r_{14}/z_0$, $r_{24}/z_0$ and $r_{34}/z_0$ and
transform  the  second  image  to  the  coordinates  of  the  first 
camera  using  (27-28).  For  applications  where  camera 
information  is  not  available,  we  formulate  the  transform 
between  the  two  images  in  terms  of  eight  parameters  and 
determine  the  transform  parameters  by  solving  (43-44). 
The  second  image  is  then  transformed  to  the  coordinates 
of  the  first  camera  using  (41-42). 

Feature  point  detection  For  feature  point  extrac¬ 
tion  we  use  a  Gabor  wavelet  decomposition  and  local 
scale  interaction  based  algorithm  reported  in  [20].  The 
basic wavelet function used in the decomposition is of
the form

$$\Phi(x, y, \theta) = e^{-(x'^2 + y'^2) + i\pi x'} \qquad (45)$$

$$x' = x\cos\theta + y\sin\theta \qquad (46)$$

$$y' = -x\sin\theta + y\cos\theta \qquad (47)$$

where $\theta$ is the preferred spatial orientation. In our
experiments $\theta$ is discretized into four orientations. The
feature points are extracted as the local maxima of the
energy measure

$$I(x, y) = \max_{\theta} \{ \| W_{j_1}(x, y, \theta) - \gamma W_{j_2}(x, y, \theta) \| \} \qquad (48)$$

where

$$W_j(x, y, \theta) = f \otimes \Phi^*(2^{-j}x, 2^{-j}y, \theta), \quad j \in \{j_1, j_2\}.$$

Here $j_1$ and $j_2$ are two dilation parameters, and $\gamma =
2^{(j_1 - j_2)}$ is a normalizing factor. In implementing the
above algorithm, we further require the energy measure
for a feature point to be the maximum in a neighborhood
with radius equal to 10 and above a threshold.
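The selection rule just stated (a point is kept only if its energy is the maximum within a neighborhood of radius 10 and exceeds a threshold) can be sketched in plain Python. This is an illustrative sketch only: the function name, the square neighborhood shape, the nested-list energy map, and the default threshold are assumptions, since the text does not specify them.

```python
def select_feature_points(energy, radius=10, threshold=0.0):
    """Return (row, col) positions that are local energy maxima.

    energy    -- 2-D list of floats holding the energy measure I(x, y)
    radius    -- half-width of the square neighborhood (assumed shape)
    threshold -- minimum energy a feature point must exceed (assumed value)
    """
    rows, cols = len(energy), len(energy[0])
    points = []
    for r in range(rows):
        for c in range(cols):
            value = energy[r][c]
            if value <= threshold:
                continue
            # Clip the neighborhood to the image borders.
            r0, r1 = max(0, r - radius), min(rows, r + radius + 1)
            c0, c1 = max(0, c - radius), min(cols, c + radius + 1)
            neighborhood_max = max(
                energy[i][j] for i in range(r0, r1) for j in range(c0, c1)
            )
            # Ties are kept: every point equal to the local maximum passes.
            if value >= neighborhood_max:
                points.append((r, c))
    return points
```

A larger radius yields fewer, better separated feature points, which reduces the number of candidate matches in the later correlation step.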

Match  verification  In  our  algorithm,  the  initial 
matching  is  implemented  on  2-D  rotation  compensated 
images.  Since  no  further  knowledge  about  the  camera 
parameters  is  used  in  the  initial  matching,  false  matches 
due  to  perspective  deformation  and  similarities  between 
similar  objects  are  inevitable.  Automatic  exclusion  of 
these  false  matches  is  a  key  to  success  in  image  registra¬ 
tion.  We  have  used  three  tests  to  exclude  less  reliable 
matches. 

1. Distance Test: The translation between the
rotation-compensated images should not be larger
than a certain fraction of the image size. A valid
matching pair $i$ should satisfy

$$d_{ix} = |X_{ir} - X_{il}| < \lambda L_x,$$
$$d_{iy} = |Y_{ir} - Y_{il}| < \lambda L_y, \qquad (49)$$
$$|X_{ir} - X_{il}| + |Y_{ir} - Y_{il}| < \kappa \max\{L_x, L_y\}$$

For  example,  A  =  ^  and  k  =  |A. 

2. Variation Test: The translations used in the correct
matches should support each other, i.e.

$$|d_i - \bar{d}| < \mu\sigma \qquad (50)$$

where $d_i$ is the distance between the matching pair
$i$, $\bar{d}$ and $\sigma$ are the mean and standard deviation of
the distances for all the matched feature pairs, and
$\mu$ is a threshold, for example $\mu = \sqrt{3}$ for the uniform
distribution.

3.  Outlier  Exclusion:  The  matched  feature  pairs 
should  satisfy  the  image  transform  model.  Can¬ 
didate  matching  pairs  with  large  residual  errors 
should be excluded. This test also helps to exclude
matches  on  building  roofs,  etc. 
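The three tests can be sketched as a small filtering pipeline. The code below is a hedged illustration, not the authors' implementation. The function names, the default values of `lam` and `kappa` (placeholders for the fractions in the distance test), the non-strict comparison in the variation test, and especially the pure-translation model used for outlier exclusion are all assumptions; the paper's own transform models are the three-parameter and eight-parameter forms referenced earlier.

```python
import math
import statistics


def distance_test(pairs, width, height, lam=0.5, kappa=0.75):
    """Keep pairs whose translation is a small fraction of the image size.

    pairs is a list of ((xl, yl), (xr, yr)) matched points.
    lam and kappa are placeholder fractions, not values from the paper.
    """
    kept = []
    for (xl, yl), (xr, yr) in pairs:
        dx, dy = abs(xr - xl), abs(yr - yl)
        if (dx < lam * width and dy < lam * height
                and dx + dy < kappa * max(width, height)):
            kept.append(((xl, yl), (xr, yr)))
    return kept


def variation_test(pairs, mu=math.sqrt(3)):
    """Keep pairs whose distance lies within mu standard deviations of the mean.

    The comparison is non-strict (an assumption) so that a set of identical
    distances, where sigma is zero, is not rejected wholesale.
    """
    if len(pairs) < 2:
        return list(pairs)
    dists = [math.hypot(xr - xl, yr - yl) for (xl, yl), (xr, yr) in pairs]
    mean = statistics.fmean(dists)
    sigma = statistics.pstdev(dists)
    return [p for p, d in zip(pairs, dists) if abs(d - mean) <= mu * sigma]


def outlier_exclusion(pairs, max_residual=2.0):
    """Drop pairs with a large residual under a stand-in translation model.

    The median translation replaces the paper's image transform model here,
    purely to keep the sketch short.
    """
    if not pairs:
        return []
    tx = statistics.median(xr - xl for (xl, _), (xr, _) in pairs)
    ty = statistics.median(yr - yl for (_, yl), (_, yr) in pairs)
    return [((xl, yl), (xr, yr)) for (xl, yl), (xr, yr) in pairs
            if math.hypot(xr - xl - tx, yr - yl - ty) <= max_residual]
```

Applying the tests in the order given, cheapest first, means the transform fit in the last step sees only matches that already agree roughly in magnitude.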


3.4  Experimental  results 

Figure  4  shows  the  registration  of  a  stereo  pair  of  model 
board  images.  Figure  4.a  is  the  image  taken  by  the  first 
camera  with  parameters  a„  =  40*  and  /?  =  15*.  The 
north  direction,  estimated  using  two  control  points,  is 
Oe  =  93.52*.  Figure  4.b  is  the  image  taken  by  the  sec¬ 
ond  camera  with  parameters  a«,  =  —71*  and  —  15*. 
The  north  direction,  estimated  using  two  control  points, 
is  tte  =  91.40*.  Figure  4.c  shows  the  registration  of 
the  second  image  to  the  orientation  of  the  first  camera 
when  the  given  camera  parameters  were  used.  Figure  4.d 
shows  the  registration  of  the  second  image  to  the  orien¬ 
tation  of  the  first  camera  when  no  camera  parameters 
were  assumed. 

Figure  5  shows  the  registration  of  two  off-nadir  model 
board  images.  Figure  5.a  is  the  image  taken  by  the 
first  camera  with  parameters  =  14*  and  =  45*. 
The  north  direction  estimated  using  two  control  points 
is  Ue  —  274.89*.  Figure  5.b  is  the  image  taken  by  the 
second  camera  with  parameters  a„  =  50*  and  ^  =  30*. 
The  north  direction  estimated  using  two  control  points  is 
ttc  =  10.35*.  Figure  5.c  shows  the  registration  of  the  sec¬ 
ond  image  to  the  orientation  of  the  first  camera  when  the 
given camera parameters were used. Figure 5.d shows
the  registration  of  the  second  image  to  the  orientation 
of  the  first  camera  when  no  camera  parameters  were  as¬ 
sumed. 

Figure  6  shows  the  registration  of  two  aerial  images. 
Figure  6.a  is  the  image  taken  by  the  first  camera,  and 
Figure  6.b  is  the  image  taken  by  the  second  camera. 
Since  information  about  the  north  directions  was  not 
available,  we  used  the  illuminant  direction  estimator  [21, 
22]  to  get  an  initial  estimate  of  the  camera  orientation 
change.  For  these  two  images,  no  information  about  the 
cameras is available. Figure 6.c shows the registration
of  the  second  image  to  the  orientation  of  the  first  camera. 

Figure  7  shows  the  registration  of  another  set  of  aerial 
images.  Figure  7.a  is  the  image  taken  by  the  first  cam¬ 
era.  Figure  7.b  is  the  image  taken  by  the  second  camera. 
Again  the  illuminant  direction  estimator  was  used  to  get 
an initial estimate of the camera orientation change. Figure
7.c shows the registration of the second image to the
orientation  of  the  first  camera. 
Abstract

Contextual  information  is  often  essential  for  visual  recog¬ 
nition,  but  the  design  of  image-understanding  systems 
that  effectively  use  context  has  remained  elusive.  We 
describe  some  of  our  experiences  in  attempting  to  em¬ 
ploy  contextual  information  in  computer  vision  systems. 
By  making  explicit  the  built-in  assumptions  inherent  in 
all  computer  vision  algorithms,  an  architecture  can  be 
designed  in  which  context  can  influence  the  recognition 
process.  This  paper  describes  such  an  architecture  for 
context-based  vision  (CBV). 

1  Introduction 

It  is  generally  accepted  that  the  surroundings  of 
an  object  may  have  a  profound  influence  on,  and  in 
some  cases,  may  be  necessary  for,  visual  recognition 
of  the  object.  What  is  not  so  well  established  is  how 
to  design  computer  vision  systems  that  can  exploit 
such  contextual  information. 

When  a  human  observes  a  scene,  or  even  stud¬ 
ies  a  photograph,  he  normally  has  at  his  disposal  a 
wealth  of  information  that  is  not  captured  by  the 
image  alone.  For  example,  if  Bob  shows  Alice  some 
photographs  he  took,  her  knowledge  that  Bob  re¬ 
cently  vacationed  in  Hawaii  may  help  her  to  recog¬ 
nize  that  the  photos  were  taken  there.  Any  knowl¬ 
edge  that  Alice  has  about  Hawaii  may  be  useful 
for recognizing the content of the scene (e.g., that
the  amorphous  landform  is  actually  Diamond  Head, 
and  that  the  vegetation  is  palmetto  bushes  and  not 
agave  cacti). 

An observer can also infer information about the
scene that is then useful for interpreting other parts
of the image. For example, given an outdoor scene,
usually one can readily determine where the sky is,
which direction is vertical, what the weather conditions
are, and whether any man-made objects are
visible. This information forms part of the context
that is available for interpreting the remainder of
the scene.

Figure 1: An image in which the use of context is
critical to the recognition of some objects.

*The work reported here was sponsored by DARPA and
monitored by the US Army Topographic Engineering Center
under Contract DACA76-92-C-0034.

An  image  such  as  shown  in  Figure  1  illustrates 
the  power  of  contextual  information.  The  inset,  a 
magnified  portion  of  the  larger  image,  displays  an 
object  that  is  difficult  to  recognize.  When  the  same 
object  is  viewed  in  the  context  of  the  intersection 
of  city  streets  (as  in  the  large  image),  it  is  readily 
recognized  as  an  articulated  bus. 
In this paper, we describe some of our experiences
in attempting to employ contextual information
in computer vision systems. By making explicit
the  built-in  assumptions  inherent  in  all  computer 
vision  algorithms,  an  architecture  can  be  designed 
in  which  context  can  influence  the  recognition  pro¬ 
cess.  This  paper  describes  such  an  architecture  for 
context-based  vision  (CBV). 

The  first  half  of  the  paper  summarizes  the  types 
of  contextual  information  that  are  available  in 
image-understanding  systems  and  describes  some 
roles  that  context  can  play  in  the  interpretation 
process. The second half reviews a previously constructed
context-based architecture, CONDOR; describes
some extensions that are necessary to extend
its applicability to semiautomated image understanding
(IU); and presents some empirical results
of its use in extracting cartographic features.

2  Context-Based  Vision 

We  use  the  term  contextual  information,  or  context 
for  short,  in  the  broadest  sense  —  to  denote  any 
and  all  information  that  may  influence  the  way  a 
scene  is  perceived.  Thus,  the  camera  geometry,  the 
image  type,  the  availability  of  related  images,  the 
urgency  of  observation,  and  the  purpose  of  image 
analysis,  are  all  part  of  the  context.  A  computer 
vision  system,  like  a  human,  should  be  able  to  use 
all  types  of  context. 

Many  authors  have  used  contextual  information 
either implicitly or explicitly in their IU systems,
but  few  have  made  the  representation  and  use  of 
context  a  central  design  feature  [4,  5,  7,  13,  21]. 

The  effective  use  of  contextual  information  can 
be  addressed  by  considering  the  design  of  an  over¬ 
all  system  architecture,  rather  than  by  focusing  on 
individual  algorithms.  In  our  view,  this  can  be  ac¬ 
complished  by  structuring  a  computer  vision  sys¬ 
tem  as  a  composite  of  many  individual  algorithms. 
The  contextual  information,  including  the  percep¬ 
tual  task  and  the  available  imagery,  can  be  used 
to  choose  the  algorithms  most  appropriate  for  each 
subtask,  and  can  form  the  basis  for  evaluating  their 
results.  The  algorithms  can  perform  independently, 
but  are  able  to  interact  through  the  context  that  all 
are  controlled  by  and  all  contribute  to. 

The concepts described in this paper are illustrated
by examples from two architectures we have
designed:

•  CONDOR  [17,  18,  19]  is  a  system  that  analyzes 
ground-level  outdoor  imagery  of  natural  en¬ 
vironments  in  the  context  of  a  mobile  robot 
application. CONDOR contains an elaborate
mechanism  for  recognizing  and  labeling  nat¬ 
ural  objects  automatically.  Because  natural 
objects,  unlike  man-made  objects,  are  difficult 
to  recognize  without  consideration  of  context, 
analysis  of  these  scenes  demands  an  architec¬ 
ture  that  makes  strong  use  of  contextual  infor¬ 
mation. 

•  The  second  architecture  is  being  developed 
as  part  of  a  system  for  site-model  construc¬ 
tion  using  overhead  imagery  in  the  RADIUS 
Project [8]. Unlike CONDOR, this system is designed
to be semiautomated, a fact that has
implications  for  both  the  way  in  which  con¬ 
text  can  be  employed,  and  for  the  availability 
of  contextual  information.  Being  a  semiauto¬ 
mated  design,  it  relies  upon  a  human  operator 
to  replace  some  of  the  machinery  incorporated 
in CONDOR and exploits additional contextual
constraints  supplied  by  the  operator. 

3  The  Need  for  Context 

The  technical  problems  in  using  context  involve  the 
identification  of  appropriate  representations  for  the 
relevant  knowledge  and  the  design  of  an  architec¬ 
ture  that  can  effectively  invoke  this  knowledge.  A 
context-based  architecture  for  image  understanding 
must have (among other things) a means for enforcing
the assumptions of IU algorithms and a means
for  accessing  relevant  information. 

3.1  Enforcing  Assumptions 

Every  image-understanding  algorithm,  by  necessity, 
contains  numerous  built-in  assumptions  that  limit 
its  range  of  applicability.  For  example,  some  edge- 
finders  work  only  on  binary  images,  some  stereo  al¬ 
gorithms  cannot  handle  occlusions,  and  some  road- 
finders  are  confounded  by  strong  shadows. 

If  the  results  of  these  algorithms  are  to  be  re¬ 
lied  upon,  the  algorithms  must  not  be  employed  in 
situations  for  which  their  designers  did  not  intend 
them  to  be  used.  It  is  the  context  of  invocation  that 
dictates the suitability of an algorithm for a particular
task. By explicitly encoding the assumptions
and inherent limitations of IU algorithms, one has
the  potential  to  control  the  algorithms  by  reason¬ 
ing  about  the  context.  Representing  assumptions 
explicitly  and  matching  them  to  the  particular  cir¬ 
cumstances  is  one  of  the  keys  to  using  contextual 
information  in  a  computer  vision  system. 

3.2  Accessing  Nonlocal  Information 

Most IU algorithms also require the use of nonlocal
information  —  data  outside  the  immediate  sphere 
of  computation  —  to  assist  the  interpretation  or  to 
control  the  processing  flow.  Examples  include  pixel 
data  that  are  outside  some  local  processing  window, 
additional  images  of  the  same  scene,  prior  facts  or 
expectations  that  are  stored  in  a  map  or  database, 
and generic knowledge about the appearance, function,
or purpose of objects in a scene. Such information
is used by many IU algorithms to compute
parameters,  to  guide  search,  to  cue  recognition  pro¬ 
cesses,  or  to  reason  about  the  consistency  of  an  in¬ 
terpretation. 

IU algorithms must have access to nonlocal information
to aid interpretation. Providing direct
access  to  relevant  nonlocal  information  is  another 
key  to  using  contextual  information  in  a  computer 
vision  system. 


[Figure 2 diagram labels: task, world data, world knowledge, images, parameters, scene description]

Figure 2: A schematic diagram of an IU algorithm
embedded in a vision system.

4 Types of Context

Before describing how contextual information can
be represented and used, it is useful to take inventory
of the kinds of context that could be considered.

Figure 2 depicts a schematic view of an IU algorithm
as a black box. Its explicit inputs are a set
of images and some parameters, but it is invoked in
the context of an assigned task, a database of facts
about the world, and a knowledge base from which
additional information about the world can be deduced.
Some of its outputs are symbolic descriptions
that can also be used to augment the database
or knowledge base, or to assign additional tasks for
realizing behaviors.

We have found it convenient to divide the range
of contextual information into three categories. Additional
semantic knowledge may involve contextual
information from all three categories.

Physical context - information about the visual
world that is independent of any particular
set of image acquisition conditions. Physical
context encompasses a range of specificity
from the very precise “There is a tree at (342,
124)” to the more generic “This area contains
a mixed, deciduous forest.” Physical context
may also include information about the appearance
of scene features in previously interpreted
imagery and dynamic information, such
as weather conditions and seasonal variations.

Photogrammetric context - information surrounding
the acquisition of the image under
study. This includes both internal camera parameters
(e.g., focal length, principal point,
field of view, color of filter) as well as external
parameters (e.g., camera location and orientation).
We also include the date and time of
image acquisition as well as the images themselves.

Computational context - information about
the internal state of processing. The computational
context can be used to control the processing
sequence based on partial recognition
results. Different strategies can be used when
first initiating the analysis of an image versus
filling in the details of a largely completed analysis.
The assigned task, the level of automation
required, and the available hardware processes
are all construed as part of the computational
context.
It is worth noting that context may be either
established  or  hypothetical.  Tentative  conclusions 
such  as  “The  sky  is  not  visible  in  this  image,”  and 
hypothesized  facts  about  the  world  such  as  “Assum¬ 
ing  that  no  buildings  with  peaked  roofs  are  at  this 
site”  can  be  treated  as  ordinary  context  to  generate 
hypothetical  conclusions. 
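One way to picture the three categories together with the established/hypothetical distinction is a small record type in which any conclusion drawn from a hypothetical premise is itself marked hypothetical. This is only a sketch under stated assumptions: the class name `ContextFact`, the `conclude` helper, and the string encoding of statements are invented for illustration and are not taken from CONDOR or RADIUS.

```python
from dataclasses import dataclass

CATEGORIES = ("physical", "photogrammetric", "computational")


@dataclass(frozen=True)
class ContextFact:
    """One item of context (hypothetical representation)."""
    statement: str
    category: str
    established: bool = True  # False marks a tentative or assumed fact

    def __post_init__(self):
        if self.category not in CATEGORIES:
            raise ValueError("unknown context category: " + self.category)


def conclude(statement, category, premises):
    """Derive a new fact; it is established only if every premise is."""
    return ContextFact(statement, category,
                       all(p.established for p in premises))


# A tentative conclusion is used like ordinary context, but what follows
# from it stays hypothetical.
no_sky = ContextFact("sky-is-not-visible", "physical", established=False)
color = ContextFact("image-is-color", "photogrammetric")
skip = conclude("skip-sky-finder", "computational", [no_sky, color])
```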

Just  what  constitutes  contextual  information  is 
highly  dependent  upon  the  domain  of  application 
and  the  goals  of  the  image-understanding  system. 
CONDOR  and  RADIUS  both  involve  the  delineation 
and  recognition  of  features  of  the  outdoor  world 
from  multiple  images.  Tables  1-3  detail  the  types 
of  context  used  or  usable  in  these  applications. 

The  information  in  the  tables  was  compiled  by 
examining about one hundred IU algorithms embedded
in CONDOR. That list was then augmented
by considering additional algorithms that appear
to be relevant to the RADIUS site-model construction
application. The algorithms considered range
from  edge-finders  [1]  to  image-segmentation  [12],  to 
stereo  compilation  [2],  to  snakes  [10],  to  complete 
object  recognition  systems  [3,  20].  The  associated 
parameters  and  implicit  assumptions  for  each  algo¬ 
rithm  were  tabulated. 

Contextual  information  may  come  from  a  variety 
of  sources,  depending  on  the  nature  of  the  appli¬ 
cation.  Some  representative  sources  of  contextual 
information  are 

•  Database  -  Information  for  use  by  a  vision  sys¬ 
tem  may  have  been  previously  compiled  and 
stored.  Geometric  object  models,  map  data, 
and  iconic  texture  maps  are  examples. 

•  Image  header  -  Information  about  the  im¬ 
age  acquisition  is  often  stored  with  the  image. 
Camera models, image size and type, and time
and  date  of  acquisition  are  examples. 

•  Derived - Results of earlier IU computation
are a valuable source of additional information
about  a  scene. 

•  User  -  In  an  interactive  or  semiautomated  sce¬ 
nario,  the  human  operator  is  also  a  source  of 
information that can provide context to IU algorithms.
This information could range from
a general characterization of the image (e.g.,
urban  environment)  to  a  precise,  manual  ex¬ 
traction  of  individual  features. 


Table 1: Physical Context

Geometry                Geometric models of roads, trails, fences,
                        trees, rocks, buildings, railroads, towers,
                        fields, etc.
                        3D outline
                        Location
                        Orientation
Photometry/Radiometry   Albedo
                        Material type
                        Reflectance
                        Surface properties
                        Previous image snippets
Illumination            Sun (azimuth, elevation angles)
                        Haze
                        Cloud cover
                        Shadow contrast
Weather                 Temperature
                        Current precipitation
                        Recent precipitation
                        Wind speed and direction
                        Season
Geography               Site
                        Terrain type (tundra, desert, ocean, ...)
                        Land use (urban, rural, agricultural, ...)
                        Topography (e.g., Digital Elevation Model)
                        Environmental events (fire, flood,
                        earthquake, war, ...)
Other                   Semantic properties (name, use, history, ...)

5  Uses  of  Context 

When an IU algorithm is viewed as a black box as
in Figure 3, it is apparent that there are only two
opportunities to use contextual information to influence
its behavior. At the input end, context can
be used to select the best match of image data with
IU algorithms and their parameters. At the output
end, context can be used to analyze and filter the
results.

Choosing algorithms and their parameters:
Given an image and a task to be performed, it is
necessary to determine the most appropriate algorithm
or set of algorithms for accomplishing the
task. When the assumptions and limitations of each
algorithm have been coded explicitly, it is possible
to match their requirements with the context of the
present situation, and choose the ones that have
(at least) the potential to achieve the desired result.
Similarly, a mechanism can be constructed to
compute the parameters associated with those algorithms
from the available context, although it may
be difficult to identify the appropriate computations
in advance.

Table 2: Photogrammetric Context

Date and time
Look angle           Azimuth, elevation, roll
Footprint            Portion of ground observed
Modality             Infrared, color, radar, ...
Multiplicity         Monocular, binocular stereo, multiple, ...
Image size           Pixel dimensions
Image element type   Binary, scalar, vector, complex, ...
Resolution           Ground sample distance (GSD)
Camera model         Focal length, principal point,
                     nonperspective, ...

Table 3: Computational Context

Task                 Interpret everything, find tanks,
                     model all buildings, ...
Interactivity        Fully automatic, manual, semiautomatic,
                     batch, continuous interaction, ...
Urgency              Acceptable processing time
Hardware             Uniprocessor, special-purpose hardware,
                     multiprocessor, ...
Processing state     Just starting, already looked, detailed
                     search, ...

Choosing  image  data:  In  some  applications, 
including  the  CONDOR  and  RADIUS  scenarios,  a  mul¬ 
titude  of  imagery  is  available  for  analysis.  Choos¬ 
ing  the  subset  of  images  to  use  can  be  as  critical 
as  the  selection  of  appropriate  algorithms.  When 
an  algorithm  is  being  considered  for  invocation,  the 
explicitly  coded  assumptions  can  be  used  to  select 
the  images  that  are  best  suited  to  the  extraction 
task  being  given  to  that  algorithm. 
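The input-side uses of context described above, selecting suitable images and deriving parameters from context, can be illustrated with a short sketch. Everything named here is hypothetical: the header keys (`modality`, `gsd_m`, `strong_shadows`), the road-finder requirements, and the rule that converts a road width in meters into a window size in pixels are assumptions chosen to echo Table 2, not part of CONDOR or RADIUS.

```python
def select_images(images, requirements):
    """Return the names of images whose header meets every requirement.

    images       -- list of dicts describing each image (hypothetical keys)
    requirements -- list of predicates encoding an algorithm's assumptions
    """
    return [image["name"] for image in images
            if all(check(image) for check in requirements)]


def road_width_pixels(road_width_m, gsd_m):
    """Derive a parameter from context: expected road width in pixels."""
    return max(1, round(road_width_m / gsd_m))


# Explicitly coded assumptions of a hypothetical road-finder.
ROAD_FINDER_REQUIREMENTS = [
    lambda image: image["modality"] == "color",
    lambda image: image["gsd_m"] <= 1.0,       # needs fine resolution
    lambda image: not image["strong_shadows"],  # confounded by shadows
]

IMAGES = [
    {"name": "a", "modality": "color", "gsd_m": 0.5, "strong_shadows": False},
    {"name": "b", "modality": "infrared", "gsd_m": 0.5, "strong_shadows": False},
    {"name": "c", "modality": "color", "gsd_m": 2.0, "strong_shadows": False},
    {"name": "d", "modality": "color", "gsd_m": 0.5, "strong_shadows": True},
]

suitable = select_images(IMAGES, ROAD_FINDER_REQUIREMENTS)
```

Keeping the requirements as data rather than burying them in the algorithm is what lets the same check serve both algorithm selection and image selection.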

Evaluating results: When IU algorithms have
completed their processing, the system has produced
a set of results that are best considered as
hypotheses. Analysis of the results with the benefit
of relevant contextual information can lead to
improved interpretations of the imagery. This analysis
can take place in several ways: by ranking the
hypotheses, by comparing them, by checking their
consistency with other hypotheses or with the established
context, and so on. In each case, if the
analysis software is encoded as a collection of algorithms
with explicitly encoded assumptions, one
can use the context to choose the algorithms and
control their invocation. Not only does this approach
reduce unnecessary computation, but it also
simplifies software construction because each algorithm
need work only in some narrowly defined context.

Figure 3: A schematic diagram of a context-based
vision system.

6  An  Architecture  for  Context- 
Based  Vision 

In the context-based vision paradigm, the invocation
of all algorithms is governed by context. Rather
than hard-wiring the control structure and the control
decisions, the process is driven
by context.

CONDOR was designed as the perceptual architecture
for a hypothetical outdoor robot. Given an
image and a possibly extensive database describing
the robot's environment, the system is to analyze
the image and to augment the world model.
CONDOR's recognition vocabulary consists mainly
of natural objects such as trees, bushes, trails, and
rocks. Because of the difficulty of recognizing such
objects individually, CONDOR accepts an interpretation
only if it is consistent with its world model.
CONDOR recognizes entire contexts, rather than
individual objects [17, 18, 19].

6.1  Context  Sets 

We associate a data structure called a context set
with each IU algorithm. The context set identifies
those conditions that must be true for that algorithm
to be applicable. Efficient and effective visual
recognition can be achieved only by invoking
the IU algorithms in those contexts in which they
are likely to succeed.

Formally, a context set is a collection of context
elements  that  are  sufficient  for  inferring  some  rela¬ 
tion  or  applying  some  algorithm.  A  context  element 
is  a  predicate  involving  any  number  of  terms  that 
refer  to  the  physical,  photogrammetric,  or  compu¬ 
tational  context  of  image  analysis. 

Each  algorithm  has  an  associated  context  set,  and 
is  invoked  only  if  its  context  set  is  satisfied.  A  con¬ 
text  set  is  considered  to  be  satisfied  only  if  all  its 
context  elements  are  satisfied. 

As  an  example,  consider  a  simple  operator  that 
extracts  blue  regions  to  find  areas  that  could  be 
labeled  “sky.”  A  context  set  for  this  operator  might 
be 

{  image-is-color,  camera-is-horizontal,  sky-is-clear, 
time-is-daytime }

The  blue-sky  algorithm  would  be  unreliable  if  it 
were  employed  in  anything  but  this  context. 
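The satisfaction rule (an algorithm is invoked only if all of its context elements hold) maps naturally onto a subset test. The sketch below makes a simplifying assumption: context elements are treated as ground propositions named by strings, whereas the text defines them as predicates over terms. The names `is_satisfied`, `applicable`, and the `night-lights` entry are hypothetical.

```python
# Context set for the blue-sky operator, taken from the example above.
BLUE_SKY_CONTEXT = frozenset({
    "image-is-color",
    "camera-is-horizontal",
    "sky-is-clear",
    "time-is-daytime",
})


def is_satisfied(context_set, current_context):
    """A context set is satisfied only if all its elements currently hold."""
    return context_set <= current_context


def applicable(algorithms, current_context):
    """Return the names of algorithms whose context sets are satisfied."""
    return sorted(name for name, context_set in algorithms.items()
                  if is_satisfied(context_set, current_context))


ALGORITHMS = {
    "blue-sky": BLUE_SKY_CONTEXT,
    # Hypothetical second operator, included only to show selection.
    "night-lights": frozenset({"image-is-color", "time-is-night"}),
}

current = frozenset({
    "image-is-color", "camera-is-horizontal", "sky-is-clear",
    "time-is-daytime", "season-is-summer",
})
```

Extra facts in the current context, such as `season-is-summer`, do no harm; a single missing element is enough to keep the operator from being invoked.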


6.2  Approach 

The  CONDOR  architecture  employs  three  types  of 
algorithms  controlled  by  context  sets,  as  illustrated 
in  Figure  4: 


•  Type I context sets control IU algorithms that
produce  candidate  (hypothetical)  labeled  re¬ 
gions. 

•  Type  II  context  sets  control  algorithms  that 
compare  two  candidates  and  determine  if  one 
should  be  preferred  over  the  other.  This  step 
is  mainly  necessary  to  limit  the  combinatorics 
of finding mutually consistent candidates.

•  Type III context sets control algorithms that
check  if  a  candidate  is  consistent  with  an 
emerging  world  model. 

For  each  class  in  the  active  recognition  vocab¬ 
ulary,  all  Type  I  context  sets  are  evaluated.  The 
operators  associated  with  those  that  are  satisfied 
are  executed,  producing  candidates  for  each  class. 
Type  II  context  sets  that  are  satisfied  are  then  used 
to  evaluate  each  candidate  for  a  class,  and  if  all 
such  evaluators  prefer  one  candidate  over  another, 
a  preference  ordering  is  established  between  them. 
These  preference  relations  are  assembled  to  form 
partial  orders  over  the  candidates,  one  partial  order 
for  each  class.  Next,  a  search  for  mutually  coher¬ 
ent  sets  of  candidates  is  conducted  by  incrementally 
building cliques of consistent candidates, beginning
with  empty  cliques.  A  candidate  is  nominated  for 
inclusion  into  a  clique  by  choosing  one  of  the  can¬ 
didates  at  the  top  of  one  of  the  partial  orders.  Al¬ 
gorithms  associated  with  Type  III  context  sets  that 
have  been  satisfied  are  used  to  test  the  consistency 
of  a  nominee  with  candidates  already  in  the  clique. 
A consistent nominee is added to the clique; an
inconsistent one is removed from further consideration
with that clique. Further candidates are added
to  the  clique  until  none  remain.  Additional  cliques 
are  generated  in  a  similar  fashion  as  computational 
resources  permit.  Ultimately,  one  clique  is  selected 
as  the  best  interpretation  of  the  image  on  the  basis 
of  the  portion  of  the  image  that  is  explained  and 
the  reliability  of  the  operators  that  contributed  to 
the  clique. 

The  interaction  among  context  sets  is  significant. 
The  addition  of  a  candidate  to  a  clique  may  provide 
context  that  could  trigger  a  previously  unsatisfied 
context  set  to  generate  new  candidates  or  estab¬ 
lish  new  preference  orderings.  For  example,  once 
one  bush  has  been  recognized,  it  is  a  good  idea 
to look specifically for similar bushes in the image.
This tactic is implemented by a candidate-generation
context set that includes a context element
that is satisfied only when a bush is added to
a  clique. 
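The clique-building step can be sketched as a greedy loop. This is a simplified illustration, not CONDOR's actual control structure: the partial orders are replaced by ranked lists per class, the Type III consistency algorithms by a single caller-supplied function, and the re-triggering of context sets when a candidate is added is omitted. The name `build_clique` and the region-overlap check in the example are assumptions.

```python
def build_clique(ranked_candidates, consistent):
    """Greedily grow one clique of mutually consistent candidates.

    ranked_candidates -- dict mapping class name to a list of candidates,
                         best first (a total order standing in for the
                         partial orders described in the text)
    consistent        -- function(nominee, clique) -> bool, standing in for
                         the algorithms controlled by Type III context sets
    """
    queues = {cls: list(cands) for cls, cands in ranked_candidates.items()}
    clique = []  # begin with an empty clique
    while any(queues.values()):
        for cls in sorted(queues):
            if not queues[cls]:
                continue
            nominee = queues[cls].pop(0)  # top of this class's ordering
            if consistent(nominee, clique):
                clique.append(nominee)
            # An inconsistent nominee is dropped for this clique only.
    return clique


def no_overlap(nominee, clique):
    """Example check: labeled regions in a clique may not share pixels."""
    _, region = nominee
    return all(region.isdisjoint(other) for _, other in clique)


candidates = {
    "sky": [("sky", frozenset({1, 2})), ("sky", frozenset({3}))],
    "tree": [("tree", frozenset({2, 3})), ("tree", frozenset({4}))],
}
clique = build_clique(candidates, no_overlap)
```

Generating several cliques, for instance by varying which class nominates first, and then scoring them by explained image area would correspond to the final selection step described above.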

6.3  Representation  of  Context 

We  have  outlined  a  paradigm  in  which  the  require¬ 
ments  of  algorithms  are  matched  against  the  con¬ 
text  of  a  given  situation.  To  employ  this  paradigm, 
it  is  necessary  to  have  representations  for  the  vari¬ 
ous  categories  of  contextual  information  that  are  to 
be  employed. 

The  CONDOR  system  employs  the  Core  Knowl¬ 
edge  System  (CKS),  an  object-oriented  knowl- 
edge/database  that  was  specifically  designed  to 
serve  as  the  central  information  manager  for  a  per¬ 
ceptual  system  [15].  The  CKS  provides  the  abil¬ 
ity  to  store  contextual  information,  and  to  retrieve 
it  through  a  vocabulary  of  spatial  and  semantic 
queries.  It  has  the  further  ability  to  accommodate 
conflicting data from multiple sources without corrupting
the inference channels. CONDOR uses CKS
to  store  a  persistent  model  of  the  world,  and  then 
uses  that  model  as  context  for  image  understanding. 
Image-understanding  results  are  stored  in  the  CKS 
and hence are available as context for subsequent
processing. 

The  SRI  Cartographic  Modeling  Environment 
(CME)  provides  the  primitive  representations  for 
modeling  the  physical  objects  and  their  at¬ 
tributes  [9].  CME  is  also  used  for  geometric  op¬ 
erations,  including  coordinate  transformation,  and 
for  display  of  imagery  and  synthetically  generated 
scenes. 

6.4  Results 

Figure 5(a) depicts an image that typifies those analyzed
by CONDOR. After several thousand IU algorithm
invocations and construction of 20 cliques,
CONDOR's best clique correctly identified six of the
trees  visible  in  the  image.  A  perspective  view  of 
the  grass  and  trees  in  the  3D  model  produced  by 
CONDOR  is  shown  in  Figure  5(b). 

CONDOR  was  able  to  achieve  similar  results  from 
processing  more  than  100  images  of  natural  scenes 
taken  within  a  limited  2-square-mile  area.  When 
tasked  to  analyze  images  from  other  natural  areas, 
CONDOR's performance degrades because its contextual
knowledge is not totally relevant. This simultaneously
illustrates the power of using context, as
well as the need to encode all contextual constraints
that are likely to arise.

7 RADIUS - Site Model Construction

We now turn our attention to the RADIUS project,
which is concerned with constructing site models of
cultural objects from overhead imagery. Although
the specific algorithms to be employed in RADIUS
are likely to differ greatly from those in CONDOR,
their  demands  for  contextual  information  are  very 
similar. 

The biggest difference between CONDOR and RADIUS
is the fact that RADIUS is being designed as
a semiautomated system. Accordingly, our design
leaves the evaluation of IU results to the
human operator. As a result, the Types II and III
context sets employed in CONDOR are not necessary.
Instead, we concentrate on the construction
of Type I context sets for controlling the invocation
of IU algorithms. This is particularly appropriate
for RADIUS given the wide variety of features to be
extracted and the large number of IU laboratories
expected to contribute algorithms.

The  examples  presented  here  are  drawn  from  an 
architecture  that  is  being  designed  to  support  site 
model construction for the RADIUS application. The
architecture  incorporates  a  large  number  of  generic 
cartographic  feature  extraction  algorithms;  it  uses 
contextual  information  to  identify  those  most  likely 
to  succeed  at  a  given  task  and  to  set  their  associated 
parameters. 

7.1  Model-Based  Optimization 

While the architecture we have designed is capable
of enforcing the contextual constraints of almost
any IU algorithm, our initial experiences have
focused  primarily  on  employing  algorithms  from 
a  paradigm  known  as  Model-Based  Optimization 
(MBO). 

Specializations of MBO have been referred to
by various other terms, including dynamic programming [6],
regularization [14], deformable surfaces [22],
and snakes [10]. The approach underlying
MBO is to express the solution to a feature-extraction
problem as a mathematical function of
some variables, and then to extract the feature from
imagery by adjusting the values of the variables
to minimize the function. Typically the objective
function includes terms that bias the feature’s geometry
as well as its match with image data. As
we have posed it, MBO operators require four parameters:
topological primitive, objective function
to be minimized, source of initial conditions, and
the optimization procedure to be employed. The
Context-Based Vision architecture must set these
parameters on the basis of known contextual information
or (in some cases) human input.

Figure 5: Example of Processing Results by CONDOR.

7.2 Context Sets

In CONDOR, Type I context sets are used to specify
the conditions that must be met for a given algorithm
to be applicable. The context set can also
specify the conditions that must be met for a given
parameter setting to be useful. For example,

MBO(closed-curve, rectangular-corners,
manual-entry, gradient-descent)

specifies the parameters for an MBO algorithm
that could be used to extract roof boundaries under
some circumstances. The following context set
encodes conditions that are required for the extraction
of roofs using that algorithm:

{ image-is-bw, image-resolution < 3.0,
interactivity-is-semiautomated }

This context set gives the requirements that must
exist for the above MBO algorithm to be applicable,
and it specifies the suitable parameter values. In the
example above for detecting roofs, these parameters
have been specified as a closed-curve topology, an
objective function preferring rectangular corners,
initial boundary provided by manual entry, and the
use of a gradient-descent optimization procedure.
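
As a rough illustration of how such a context set might be tested, the following sketch assumes that the current context is held in a Python dictionary. The field names (`modality`, `resolution_m`, `interactivity`) and the helper `roof_mbo_applicable` are hypothetical and are not taken from CONDOR or RADIUS.

```python
# Hypothetical encoding of the roof context set
# { image-is-bw, image-resolution < 3.0, interactivity-is-semiautomated }.
# Each context element is a predicate over an assumed context dictionary.
ROOF_MBO_CONTEXT_SET = [
    lambda ctx: ctx["modality"] == "bw",                   # image-is-bw
    lambda ctx: ctx["resolution_m"] < 3.0,                 # image-resolution < 3.0
    lambda ctx: ctx["interactivity"] == "semiautomated",   # interactivity-is-semiautomated
]

# Parameters licensed by the context set, in the order used in the text:
# topology, objective function, initial conditions, optimization procedure.
ROOF_MBO_PARAMETERS = ("closed-curve", "rectangular-corners",
                       "manual-entry", "gradient-descent")


def roof_mbo_applicable(ctx):
    """Return the MBO parameters if every context element holds, else None."""
    if all(element(ctx) for element in ROOF_MBO_CONTEXT_SET):
        return ROOF_MBO_PARAMETERS
    return None
```

The context set is satisfied only when all of its elements hold, so a single failing element (for example, an image coarser than 3 meters) makes the algorithm inapplicable.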

In  practice,  a  large  number  of  context  sets  gov¬ 
erning  the  application  of  MBO  algorithms  as  well 
as  other  algorithms  could  be  constructed  and  used 
to  implement  a  cartographic  feature-extraction  sys¬ 
tem  suitable  for  site-model  construction.  It  is  clear 
that  such  a  collection  could  be  unwieldy  and  diffi¬ 
cult  to  maintain.  A  more  structured  representation 
of  the  context  set  concept  is  needed. 

7.3  Context  Tables 

One  alternative  representation  for  context  sets  is 
the  context  table  —  a  data  structure  that  tabulates 
the  context  elements  in  a  more  structured  fashion. 
An IU algorithm is associated with each row in the
table;  each  column  represents  one  context  element. 

The  context  table  is  equivalent  to  a  collection 
of  context  sets.  Conceptually,  it  provides  a  more 
coherent  view  of  the  contextual  requirements  of 
related algorithms. Applicable algorithms are selected
by finding rows for which all conditions are
met. Table 4 contains an excerpt of a context table
for use in cartographic feature extraction which
illustrates the representation.

Table 4: Excerpt of a context table for cartographic feature extraction.

n  feature  interactivity  images     resolution   geography     algorithm
1  roof     semiautomated  single BW  < 3 meters   -             MBO(topology=closed-curve,
                                                                     obj-fn=rectangular-corners,
                                                                     init=manual-entry, opt=gradient-descent)
2  roof     manual         single     < 10 meters  -             CME(primitive=closed-curve)
3  road     semiautomated  single BW  < 1 meter    hilly         MBO(topology=ribbon-curve,
                                                                     obj-fn=(smoothness(0.5), continuous, parallel),
                                                                     init=manual-entry, opt=gradient-descent)
4  road     semiautomated  single BW  < 10 meters  hilly         MBO(topology=open-curve,
                                                                     obj-fn=(smoothness(0.5), continuous),
                                                                     init=manual-entry, opt=gradient-descent)
5  road     semiautomated  single BW  < 1 meter    flat ∨ urban  MBO(topology=ribbon-curve,
                                                                     obj-fn=(smoothness(0.8), continuous, parallel),
                                                                     init=manual-entry, opt=gradient-descent)
6  road     semiautomated  single BW  < 10 meters  flat ∨ urban  MBO(topology=open-curve,
                                                                     obj-fn=(smoothness(0.8), continuous),
                                                                     init=manual-entry, opt=gradient-descent)
7  road     manual         single     < 10 meters  -             CME(primitive=open-curve)
8  road     manual         single     < 1 meter    -             CME(primitive=ribbon-curve)
9  road     semiautomated  single     < 2 meters   -             ROAD-TRACKER(control=bidirectional-search,
                                                                     init=manual-entry)

One  drawback  to  the  table  representation  is  its 
potentially  large  size.  Each  algorithm  may  require 
many rows to capture the contextual constraints
of  its  various  parameter  combinations.  Its  chief 
value  is  its  organization  of  contextual  information 
for  knowledge-base  construction. 
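
The row-selection step can be sketched as follows. This is only an illustration under stated assumptions: three road rows of Table 4 are transcribed by hand into a hypothetical list of dictionaries, `None` stands for an unconstrained column, and the function name `applicable_rows` and the context fields are invented for the example.

```python
# Three road rows of Table 4, transcribed by hand (an abbreviated subset).
# None means the column places no constraint on the context.
CONTEXT_TABLE = [
    {"n": 3, "feature": "road", "interactivity": "semiautomated",
     "modality": "bw", "max_resolution_m": 1.0, "geography": {"hilly"},
     "algorithm": "MBO(ribbon-curve, smoothness(0.5))"},
    {"n": 5, "feature": "road", "interactivity": "semiautomated",
     "modality": "bw", "max_resolution_m": 1.0, "geography": {"flat", "urban"},
     "algorithm": "MBO(ribbon-curve, smoothness(0.8))"},
    {"n": 9, "feature": "road", "interactivity": "semiautomated",
     "modality": None, "max_resolution_m": 2.0, "geography": None,
     "algorithm": "ROAD-TRACKER"},
]


def applicable_rows(table, ctx):
    """Return the row numbers whose every condition is met by the context."""
    selected = []
    for row in table:
        if row["feature"] != ctx["feature"]:
            continue
        if row["interactivity"] != ctx["interactivity"]:
            continue
        if row["modality"] is not None and row["modality"] != ctx["modality"]:
            continue
        if not ctx["resolution_m"] < row["max_resolution_m"]:
            continue
        if row["geography"] is not None and ctx["geography"] not in row["geography"]:
            continue
        selected.append(row["n"])
    return selected
```

For a hypothetical semiautomated road-extraction context with a 0.5-meter monochrome image of an urban site, this subset yields rows 5 and 9; changing the geography to hilly yields rows 3 and 9.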

7.4  Context  Rules 

A  third  alternative  for  representing  context  sets  is 
to  encode  them  as  rules  whose  antecedent  is  the 
context  set,  and  whose  consequent  is  the  applicable 
algorithm. 

For  example, 

{ image-is-bw, image-resolution < 3.0,
interactivity-is-semiautomated } =>

MBO(closed-curve, rectangular-corners,
manual-entry, gradient-descent)

One advantage of encoding the rules as a logic
program is that using the logic program interpreter
eliminates the need to devise special machinery to
test  satisfaction  of  context  sets.  The  context  table 
of  the  previous  section  (Table  4)  can  be  recoded  as 
the  roughly  equivalent  Prolog  program  given  in  the 
Appendix. 

A  further  representational  efficiency  is  possible 
by  collapsing  rules  with  common  context  elements. 
For  example,  the  only  difference  between  rules  gov¬ 
erning  Algorithms  3  and  4  and  rules  governing  Al¬ 
gorithms  5  and  6  is  the  geography  term  and  the 
value  of  the  smoothness  parameter.  This  depen¬ 
dence  could  be  generalized  by  additional  rules  that 
relate  smoothness  to  geography. 
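
One way such a generalization might look is sketched below, assuming a semiautomated setting with a single monochrome image (the conditions shared by rows 3 through 6 of Table 4). The lookup table, the function name `road_mbo_parameters`, and the choice to prefer the ribbon-curve topology whenever the resolution is better than 1 meter are assumptions made for illustration only.

```python
# Hypothetical auxiliary rule relating geography to the smoothness weight,
# using the two values that appear in Table 4.
SMOOTHNESS_BY_GEOGRAPHY = {"hilly": 0.5, "flat": 0.8, "urban": 0.8}


def road_mbo_parameters(resolution_m, geography):
    """Collapse rows 3-6 of Table 4 into one rule plus the lookup above.

    Assumes interactivity is semiautomated and the image is single BW.
    Returns a parameter dictionary, or None if no row would apply.
    """
    smoothness = SMOOTHNESS_BY_GEOGRAPHY.get(geography)
    if smoothness is None or resolution_m >= 10.0:
        return None
    if resolution_m < 1.0:
        # Assumption: prefer the more specific ribbon-curve rows (3 and 5).
        topology, terms = "ribbon-curve", ("continuous", "parallel")
    else:
        topology, terms = "open-curve", ("continuous",)
    return {"topology": topology, "smoothness": smoothness,
            "obj_fn_terms": terms, "init": "manual-entry",
            "opt": "gradient-descent"}
```

With this factoring, the four road rows differ only through the geography lookup and the resolution test, which is the dependence the text proposes to capture with additional rules.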

Whatever  representation  is  chosen,  it  is  clear  that 
context  sets  can  be  employed  in  either  direction.  In 
the  forward  direction,  the  context  sets  are  used  to 
find  applicable  algorithms.  In  the  opposite  direc¬ 
tion,  the  sets  can  be  used  for  several  purposes,  in¬ 
cluding  the  selection  of  images  on  which  to  invoke  a 
given algorithm. For example, Table 4 shows that
the  use  of  an  MBO  algorithm  for  finding  a  roof  (Row 
1)  requires  the  existence  of  a  monochrome  image 
with  3-meter  resolution  or  better. 
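
The opposite direction can be sketched in the same spirit. The image records, their field names, and the function `images_for_algorithm` below are hypothetical; the requirements dictionary is a hand transcription of Row 1 of Table 4, using the strict "< 3 meters" bound shown in the table.

```python
# Row 1 of Table 4 read "backwards" as image requirements
# (hypothetical encoding of the roof MBO algorithm's context).
ROOF_MBO_IMAGE_REQUIREMENTS = {"modality": "bw", "max_resolution_m": 3.0}


def images_for_algorithm(images, requirements):
    """Return the names of images on which the algorithm may be invoked."""
    return [img["name"] for img in images
            if img["modality"] == requirements["modality"]
            and img["resolution_m"] < requirements["max_resolution_m"]]
```

Given a pool of candidate images, the same table entry that selects an algorithm for an image thus also selects images for an algorithm.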

7.5 Results

Although  the  architecture  we  have  described  for  the 
RADIUS  application  is  not  yet  fully  functional,  we 
can  illustrate  its  application  using  the  example  Ta¬ 
ble  4. 

Figure  6  compares  the  results  of  applying  an 
MBO  algorithm  both  within  and  outside  its  inher¬ 
ent  contextual  constraints.  Figure  6(a)  shows  an 
overhead  view  of  a  portion  of  the  Mall  in  Washing¬ 
ton,  DC  —  a  flat  park  area  in  an  urban  setting. 
Figure  6(b)  shows  an  overhead  image  of  a  hilly  area 
in  the  foothills  of  the  Rocky  Mountains  in  Colorado. 
Both  images  are  shown  at  approximately  the  same 
scale. 

The  context  table  in  Table  4  can  be  used  to  se¬ 
lect  an  algorithm  suitable  for  extracting  roads  in  a 
semiautomated  setting.  In  the  context  of  the  anal¬ 
ysis  of  the  Washington  DC  image,  both  Algorithm 
5  and  Algorithm  9  are  applicable,  but  we  ignore  Al¬ 
gorithm  9  in  this  example.  Algorithm  5  calls  for 
manual  entry  of  the  initial  curve,  which  is  shown  in 
Figure  6(a).  Optimization  of  this  curve  using  the 
specified  objective  function  and  optimization  pro¬ 
cedure  results  in  the  model  depicted  in  Figure  6(c) 
—  a  reasonably  accurate  extraction  of  the  road. 

This  algorithm  is  not  applicable  to  the  Rocky 
Mountain image, because of the different geographical
context. If it were applied anyway, optimization
of the initial curve shown in Figure 6(b) would
result in the curve shown in Figure 6(d), an extraction
that does not follow the road boundaries
well.

The  context  table  shows  that  Algorithm  3  (with 
its  lower  smoothness  parameter)  is  applicable  for 
the  Rocky  Mountain  image.  Applying  it  to  the  same 
initial  curve  gives  the  result  depicted  in  Figure  6(f), 
a  significant  improvement  over  that  obtained  by  Al¬ 
gorithm  5. 

Had Algorithm 3 been applied to the Washington
DC image (where its context is violated), the
result shown in Figure 6(e) would have been obtained,
a noticeably poorer delineation of the road
than that obtained with a higher smoothness parameter.
It is not surprising that the choice of parameters
can have a critical effect on the output
of an IU algorithm. More important, this example
illustrates that contextual information can be
successfully used to choose parameter settings.
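
The sensitivity to the smoothness parameter can be illustrated with a toy example. The sketch below is not the snake algorithm used in the experiments; it assumes a one-dimensional curve, a made-up objective that weights a data term by (1 - smoothness) and a squared-difference smoothness term by smoothness, and plain gradient descent from a straight initial curve. All names are hypothetical.

```python
def optimize_curve(data, smoothness, steps=500, rate=0.1):
    """Gradient descent on a toy objective (an assumption, not the paper's):

        E(y) = (1 - s) * sum_i (y[i] - data[i])**2
             +      s  * sum_i (y[i+1] - y[i])**2
    """
    y = [0.0] * len(data)   # stands in for a manually entered straight curve
    n = len(y)
    for _ in range(steps):
        grad = []
        for i in range(n):
            g = 2.0 * (1.0 - smoothness) * (y[i] - data[i])
            if i > 0:
                g += 2.0 * smoothness * (y[i] - y[i - 1])
            if i < n - 1:
                g -= 2.0 * smoothness * (y[i + 1] - y[i])
            grad.append(g)
        y = [yi - rate * gi for yi, gi in zip(y, grad)]
    return y


def roughness(y):
    """Sum of squared differences between neighbouring curve points."""
    return sum((b - a) ** 2 for a, b in zip(y, y[1:]))


def misfit(y, data):
    """Sum of squared differences between the curve and the data."""
    return sum((a - b) ** 2 for a, b in zip(y, data))
```

On a winding input such as `[0, 1, 0, 1, 0, 1, 0, 1]`, a smoothness of 0.8 produces a straighter curve that follows the data less closely than a smoothness of 0.5, which mirrors the qualitative trade-off between the flat and hilly settings.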


7.6  Knowledge-Base  Construction 

The  context  sets  (or  context  table  or  context  rules) 
constitute  the  knowledge  base  employed  by  the  sys¬ 
tem.  It  is  clear  that  the  performance  of  the  system 
will  be  limited  by  the  accuracy  and  completeness  of 
the  knowledge  base.  The  context  sets  employed  in 
CONDOR and the context rules being constructed for
the  RADIUS  application  are  hand-crafted  based  on 
ad  hoc  experimentation  with  available  imagery.  It 
is  clear  that  a  more  automated,  or  at  least  a  bet¬ 
ter  grounded  procedure  for  constructing  the  context 
rules  is  desirable,  both  for  accommodating  a  poten¬ 
tially  large  knowledge  base  and  for  extending  the 
domain  of  competence  beyond  that  originally  con¬ 
ceived. 

There  are  several  approaches  by  which  the  sys¬ 
tem  could  learn  the  most  effective  context  rules. 
Perhaps  the  most  enticing  one  for  interactive  inter¬ 
pretation  is  one  in  which  the  system  learns  through 
experience. Whenever a situation arises for which
there is no applicable algorithm, or for which all the
applicable algorithms give unacceptable results, the
human operator has no choice but to edit the result
or model the feature by hand, and then continue
the site-model construction. Such a manual extraction
can serve as the “correct” answer in a supervised
learning process. By capturing the context
that failed initially, the learning procedure can theoretically
compare the results of many algorithms
with the “correct” one; whenever there is a sufficiently
accurate match, a new context rule can be
added. One can also imagine finding a better set of
parameters by posing the problem as one for MBO:
the algorithm’s parameters can be varied systematically
until the best match with the “correct” answer is
obtained. If the match is sufficiently close, a new
context rule with the corresponding parameter settings
can be installed.
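
A minimal sketch of this learning step is given below. It is speculative, in keeping with the text: extractions are represented as sets of pixel identifiers, the match score is a Jaccard overlap chosen purely for illustration, and the names `jaccard` and `learn_context_rule` are hypothetical.

```python
def jaccard(result, correct):
    """Illustrative match score: overlap of two pixel sets, in [0, 1]."""
    union = result | correct
    return len(result & correct) / len(union) if union else 1.0


def learn_context_rule(context, correct, candidates, score, threshold):
    """Propose a new rule (context, algorithm name) or return None.

    candidates maps an algorithm name to a callable that takes the context
    and returns an extraction; score compares an extraction with the
    operator's manual ("correct") extraction.
    """
    best_name, best_score = None, threshold
    for name, run in candidates.items():
        s = score(run(context), correct)
        if s >= best_score:
            best_name, best_score = name, s
    if best_name is None:
        return None
    return (dict(context), best_name)
```

The same loop could be run over a grid of parameter settings for a single algorithm, which corresponds to the systematic parameter variation mentioned above.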

Automating  the  construction  of  the  context  rules 
is  both  important  and  difficult.  There  are  many 
promising  approaches,  but  none  have  yet  been  seri¬ 
ously  tried. 

8  Summary 

We have described some of our experience in applying
the CONDOR architecture to the site-model
construction task of RADIUS. The semiautomated
nature of RADIUS obviates the need for some of the
machinery employed in the fully automated design
of CONDOR. The availability of a human operator
permits access to some kinds of context that were
not available to CONDOR, such as the level of interactivity
desired, and manual sketches of individual
features. The existence of a human to review and
edit the IU results offers the opportunity to use a
supervised learning scheme to improve the quality
of the knowledge base or to extend its range of competence.

The large number of features and wide range of
imaging conditions that must be considered for site-model
construction in RADIUS stress the context set
representation employed in CONDOR. While context
sets were adequate for the knowledge base of
CONDOR, it has been necessary to consider more
effective representations that will extend to the requirements
of site-model construction. Two new
constructs, context tables and context rules,
offer a more systematized organization for the context
knowledge base that should facilitate its construction.
These representations offer additional
economies in both storage and computation that
may be vital to implementation of large systems. The
symmetry  of  context  tables  and  rules  encourages 
their  use  in  either  direction:  to  select  algorithms 
and  set  their  parameters,  or  to  describe  the  condi¬ 
tions  that  must  be  satisfied  for  a  given  algorithm  to 
be  applicable.  This  final  capability  raises  the  pos¬ 
sibility  of  using  context  rules  to  choose  the  most 
appropriate  images  for  interpretation. 

Acknowledgments 

I  am  indebted  to  Marty  Fischler  for  the  numerous 
discussions  that  motivated  and  shaped  much  of  this 
work.  Thanks  also  to  Pascal  Fua  for  the  use  of  his 
snake  algorithms  and  to  Lynn  Quam  for  supplying 
the Cartographic Modeling Environment, which facilitated
the implementation and experimentation
enormously. 

Appendix 

The following Prolog program¹ is roughly equivalent
to the context table depicted in Table 4.

¹ More compact programs are possible.


% alg(Name, Parameters)

% alg specifies the applicable functions and their
% appropriate parameter settings for use in a
% prescribed context

% Name is a symbol denoting the function to be invoked

% Parameters is a sequence of parameters whose format
% depends on the function

alg(mbo, [closed-curve, obj-fn(rectangular-edges),
          manual-entry, gradient-descent]) :-
    object-type(roof),
    site(Site),
    interactivity(semiautomated),
    image-site(Image, Site),
    modality(Image, bw),
    image-resolution(Image, GSD),
    GSD =< 3.0 .

alg(cme, [closed-curve]) :-
    object-type(roof),
    site(Site),
    interactivity(manual),
    image-site(Image, Site),
    image-resolution(Image, GSD),
    GSD =< 10.0 .

alg(mbo, [ribbon-curve,
          obj-fn(smoothness(0.5), continuous, parallel),
          manual-entry, gradient-descent]) :-
    object-type(road),
    site(Site),
    interactivity(semiautomated),
    image-site(Image, Site),
    modality(Image, bw),
    image-resolution(Image, GSD),
    GSD =< 1.0,
    site-geography(Site, hilly) .

alg(mbo, [open-curve,
          obj-fn(smoothness(0.5), continuous),
          manual-entry, gradient-descent]) :-
    object-type(road),
    site(Site),
    interactivity(semiautomated),
    image-site(Image, Site),
    modality(Image, bw),
    image-resolution(Image, GSD),
    GSD =< 10.0,
    site-geography(Site, hilly) .

alg(mbo, [ribbon-curve,
          obj-fn(smoothness(0.8), continuous, parallel),
          manual-entry, gradient-descent]) :-
    object-type(road),
    site(Site),
    interactivity(semiautomated),
    image-site(Image, Site),
    modality(Image, bw),
    image-resolution(Image, GSD),
    GSD =< 1.0,
    site-geography(Site, flat) .

alg(mbo, [open-curve,
          obj-fn(smoothness(0.8), continuous),
          manual-entry, gradient-descent]) :-
    object-type(road),
    site(Site),
    interactivity(semiautomated),
    image-site(Image, Site),
    modality(Image, bw),
    image-resolution(Image, GSD),
    GSD =< 10.0,
    site-geography(Site, flat) .

alg(cme, [open-curve]) :-
    object-type(road),
    site(Site),
    interactivity(manual),
    image-site(Image, Site),
    image-resolution(Image, GSD),
    GSD =< 10.0 .

alg(cme, [ribbon-curve]) :-
    object-type(road),
    site(Site),
    interactivity(manual),
    image-site(Image, Site),
    image-resolution(Image, GSD),
    GSD =< 1.0 .

alg(road-tracker,
    [bidirectional-search, manual-entry]) :-
    object-type(road),
    site(Site),
    interactivity(semiautomated),
    image-site(Image, Site),
    image-resolution(Image, GSD),
    GSD =< 2.0 .

Abstract

This  paper  presents  an  overview  of  our  program  of 
research  toward  the  automated  analysis  of  remotely 
sensed  imagery.  Research  results  in  the  areas  of 
photogrammetry, stereo analysis, automatic
building extraction, digital terrain modeling,
knowledge-based systems, and multispectral image
analysis are presented.

1.  Introduction 

The  automated  compilation  of  man-made  and 
natural  terrain  in  urban  or  built-up  areas  has  been 
the  focus  of  our  research  for  a  number  of  years. 
Built-up  areas  provide  some  of  the  most  difficult 
and  time  consuming  tasks  for  the  cartographic 
community,  and  provide  a  rich  environment  of 
varied  structures  and  natural  terrain  features  to  test 
the  robustness  of  new  approaches  to  computer 
vision.  The  theme  of  our  research  is  to  understand 
the  computational  aspects  of  automated  recovery 
of  three-dimensional  scene  information  using  a 
variety  of  image  domain  cues.  These  cues  include 
the  analysis  of  cast  shadows,  stereo  matching, 
geometric  models,  and  structural  descriptions 
based  upon  analysis  and  combination  of  low-level 
image-based  features.  We  look  for  opportunities  to 
augment  traditional  computational  vision 
techniques  with  domain  knowledge  since  it  is 
clearly used by both cartographers and imagery
analysts  in  a  variety  of  tasks  ranging  from  mapping 
to  environmental  land  use  analysis  to  natural 
resource  inventory. 

In  this  paper  we  provide  a  status  report  in  a  variety 
of  research  activities.  Section  2  describes  recent 
work  in  the  application  of  rigorous 
photogrammetric  methods  to  image  orientation  and 
building  extraction  within  the  context  of  the 
DARPA  RADIUS  program.  We  also  describe  recent 
results  using  our  stereo  matching  systems 
[McKeown and Hsieh 92, Hsieh et al. 92] on
modelboard imagery. A companion paper in these
proceedings describes the incorporation of
vanishing  point  geometry  into  the  BABE  building 
extraction  system  [McGlone  and  Shufelt  93].  Such 
geometric  analysis  is  a  key  requirement  for 
cartographic  feature  analysis  for  oblique  imagery, 
particularly  in  the  fusion  of  results  from  multiple 
views. 

Section  3  describes  new  research  in  developing 
symbolic  and  geometric  descriptions  for  buildings 
using  multiple  cues  such  as  stereo,  shadow 
analysis,  and  monocular  building  detection.  The 
goal  is  to  detect  those  regions  in  the  disparity  map 
created  by  stereo  matching  that  correspond  to 
buildings.  Given  that  structures  may  appear  on 
rolling  terrain,  and  that  stereo  analysis  rarely 
constructs  an  error-free  model  of  the  scene,  simple 
techniques  based  upon  region  analysis  must  be 
augmented  with  other  sources  of  information. 

Section  4  details  an  experiment  in  measuring  the 
effectiveness  of  human-computer  interaction  to  aid 
in  building  detection.  While  most  of  our  research 
has  focused  upon  automated  end-to-end  analysis, 
there  is  a  role  for  user  interaction  at  various  stages 
of  the  cartographic  process.  Some  preliminary 
results  are  presented  in  interactive  building 
selection  using  a  simple  pointing  method. 

Section 5 describes a continuation of previously
reported  research  toward  the  development  of 
improved  techniques  for  a  terrain  representation 
called a triangular irregular network (TIN) [Polis
and McKeown 92]. A new point selection
algorithm is described and results for the
generation of a 2500 square kilometer area of the
National Training Center (NTC), Fort Irwin,
California, are shown.

Research on knowledge acquisition and refinement
for  rule-based  systems  used  for  the  interpretation 
of  aerial  imagery  is  reported  on  in  Section  6. 
Using  a  large  production  system  architecture, 
SPAM,  we  report  on  ongoing  research  in  the 
evaluation  of  the  utility  of  various  sources  of 
knowledge  for  airport  scene  analysis. 

Finally,  in  Section  7  we  briefly  describe  current 
research  in  multispectral  analysis  to  determine 
surface  material  properties  as  a  knowledge  source 
of  built-up  area  segmentation  and  cartographic 
feature  extraction.  A  detailed  description  of 
performance  evaluation  for  two  classification 
techniques  can  be  found  in  a  companion  paper 
[Ford  et  al.  93]  in  this  volume. 

2. Photogrammetric approach

Our goal is to apply rigorous photogrammetric
methods in several areas of research, particularly to
improve the extraction of geometric cues, and to
relate partial object descriptions across multiple
images. One of our first applications has been the
incorporation of vanishing point geometry into the
BABE building extraction system, as discussed in
[McGlone and Shufelt 93] in this volume.

In  this  section  we  discuss  our  research  using  the 
RADIUS  modelboard  imagery.  Under  the  RADIUS 
program  a  set  of  images  of  a  synthetic  industrial 
site,  represented  by  a  scale  model,  were  created 
and  distributed  to  the  RADIUS  community.  These 
images  differ  from  more  typical  mapping 
photography  used  in  cartographic  research  in  that 
they are taken from oblique angles rather than by
near-nadir (down-looking) mapping cameras.
Along  with  the  imagery  a  set  of  ground  control 
points  with  known  modelboard  coordinates  were 
distributed. 

Using  this  imagery  we  have  addressed  two  major 
areas.  The  first  is  the  implementation  of  a  rigorous 
central  projection  camera  model  and  the  solution 
for  the  camera  parameters  for  the  modelboard 
imagery.  The  second  is  the  evaluation  of  our 
current  stereo  matching  techniques  developed  for 
traditional  mapping  imagery  using  the  modelboard 
imagery. 


Image ID   # Ground Points   # Image Points   RMS Image Residuals (pixels)
J3         76                76               2.2
J4         67                67               1.8
J5         65                65               2.4
J6         77                77               2.3
J7         81                81               2.4
J3-J7      111               366              2.3

Table 1: RADIUS modelboard resection results.


2.1.  Modelboard  image  resection 

Our work to date has been focused on an initial set
of eight images, J1 through J8, of the modelboard
industrial site. Figures 1 and 2 are two overlapping
areas,  from  images  J5  and  J4  respectively,  and 
illustrate  typical  scene  content.  In  order  to  obtain 
valid  position  and  orientation  parameters  for  the 
images  we  implemented  a  standard 
photogrammetric  resection  procedure. 

A  relatively  large  number  of  modelboard  control 
points  (70-80)  were  measured  in  each  of  five 
images,  J3  through  J7.  We  scaled  the  RADIUS 
modelboard  control  point  coordinates  into  world 
units  using  the  modelboard  scale  information, 
given  as  1:500.  In  order  to  better  integrate  with 
our  existing  landmark  database  software 
[McKeown  87]  we  transformed  the  modelboard 
coordinates  into  pseudo  geodetic  (latitude- 
longitude)  coordinates.  The  coordinate  origin  of 
the modelboard imagery was taken to be
somewhere  in  central  Kansas.  For  each  of  the 
images  we  performed  an  individual  image 
resection  to  establish  an  error  measure  based  upon 
RMS  image  displacement  of  the  measured  points. 
In  addition  we  performed  a  simultaneous  block 
adjustment  of  the  modelboard  images  using  all  of 
the  measured  points.  Simultaneous  resection  of  the 
images  in  the  same  adjustment  allows  better  error 
detection,  due  to  the  higher  redundancy  in  the 
solution,  and  gives  orientation  parameters  that  are 
more  consistent  between  images. 
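
The error measure can be sketched as follows. The exact definition used in the resection software is not given here, so this example assumes the RMS of the Euclidean distances, in pixels, between each measured image point and the corresponding control point reprojected through the solved camera model; the function name is hypothetical.

```python
import math


def rms_image_residual(measured, projected):
    """RMS distance in pixels between measured and reprojected image points.

    measured and projected are equal-length lists of (x, y) pixel tuples.
    """
    if len(measured) != len(projected) or not measured:
        raise ValueError("need equal-length, non-empty point lists")
    total = sum((mx - px) ** 2 + (my - py) ** 2
                for (mx, my), (px, py) in zip(measured, projected))
    return math.sqrt(total / len(measured))
```

In a simultaneous adjustment the same computation would be applied to the pooled points from all images, which is how a single figure can summarize a block of several images.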

Results  of  the  individual  and  simultaneous 
adjustments  of  images  J3  through  J7  are  shown  in 
Table  1.  One  can  see  a  fairly  consistent  residual 
error  of  about  2.3  pixels  in  these  resections. 
Further  refinements  of  our  camera  model  may 
improve  this  situation,  but  at  the  current  scale  of 
the  modelboard  photography  these  errors 
correspond  to  about  a  three  foot  displacement  in 
ground  position.  Sources  of  error  include 
uncertainty  in  the  modelboard  ground  control 
locations,  errors  in  the  measurement  of  these  points 
in each of the modelboard images, and unmodeled
distortions resulting from the image formation
process. It remains to be seen how such errors will
affect the accuracy of cartographic feature
descriptions,  such  as  buildings,  whose  models  are 
composed  from  partial  object  descriptions  acquired 
from  multiple  views  of  the  scene. 

An  important  output  of  the  resection  solution  is  the 
precision  information  obtained  on  the  orientation 
parameters,  which  can  be  propagated  to  estimate 
the  precision  on  calculated  ground  coordinates, 
distances,  or  heights.  These  precision  estimates 
will  in  turn  allow  us  to  more  meaningfully  control 
and  merge  various  operations. 

The  image  resection  parameters  are  used 
pervasively  in  our  research.  For  building 
extraction  from  the  oblique  aerial  imagery,  the 
vanishing  point  information  is  directly  calculated 
and  exploited  as  described  in  [McGlone  and 
Shufelt  93].  In  the  stereo  processing  the  image 
orientation  parameters  are  used  to  precisely 
resample  the  images  into  epipolar  geometry  and  to 
calculate  elevations  and  heights  in  the  scene.  The 
incorporation  of  precise  camera  models,  resection 
information,  and  precision  information  into  other 
applications  is  now  in  progress. 

2.2.  Stereo  analysis  of  the  RADIUS 
modelboard  imagery 

The  RADIUS  modelboard  imagery  presents  several 
new  complexities  in  the  interpretation  of  aerial 
imagery.  The  emphasis  on  oblique  views  breaks 
some  of  the  basic  assumptions  built  into  processes 
that  analyze  and  interpret  near-vertical  stereo  pairs. 
In  order  to  establish  a  performance  baseline  for  our 
stereo  analysis  systems  we  processed  the  two 
stereo pairs (J4-J5) and (J6-J7) found in the initial
release  of  the  RADIUS  modelboard  dataset.  We 
used  our  standard  orientation  methods  developed 
for  near-vertical  imagery  taken  along  a  single 
flightpath  [Perlant  and  McKeown  90]  as  well  as 
our  new  image  orientation  system  based  upon  the 
resection  results  reported  in  the  previous  section. 
Both pairs, (J4-J5) and (J6-J7), are relatively wide
angle  stereo,  with  convergence  angles  of  60  and  25 
degrees,  respectively.  For  the  examples  in  this 
section  we  show  results  using  the  60  degree  pair 
because  it  represents  an  extreme  case  with  respect 
to  our  previous  work. 

2.2.1.  Stereo  matching  and  refinement 

Most  stereo  systems  in  cartographic  analysis 
assume  that  the  stereo  pair  is  in  a  collinear  epipolar 
geometry.  We  use  two  independent  stereo 
matching systems of this type. The first, S1, is an
area-based method that provides good figural
continuity and captures a sense of foreground and
background. It works in a hierarchical coarse-to-fine
fashion in order to capture as much global
continuity as possible while retaining a locally
based process. Its results are best in smooth
textured areas, but it tends to smooth (blur) abrupt
changes in depth.

The  second,  S2,  is  a  feature-based  method  that 
provides  a  more  accurate  estimate  at  a  few 
points — especially  near  depth  discontinuities,  but 
requires  interpolation  to  “fill  in  the  gaps.”  This 
process  also  uses  a  hierarchical  coarse-to-fine 
approach,  but  matches  “waveform”  features 
across  (epipolar)  scanlines  rather  than  a  correlation 
window.  To  remove  false  matches  this  process 
uses an inter-/intra-scanline consistency check
[McKeown  and  Hsieh  92]. 

The  results  of  the  two  stereo  processes  are  refined 
using  a  monocular  segmentation  of  the  original 
intensity  image  into  homogeneous  regions.  This 
process  first  merges  the  disparity  results  from  each 
stereo  method  using  a  common  estimate  of 
“goodness”  to  select  the  best  match;  however,  if 
there  is  a  large  disagreement  between  the  two 
methods,  then  both  estimates  are  suppressed. 
Within  each  region  of  the  segmentation,  which  is 
assumed  to  represent  a  single  continuous  patch  of 
surface,  the  disparity  values  are  averaged  and  the 
outliers  are  removed.  Two  different  segmentations 
are  used  to  limit  the  formation  of  artifacts  during 
this process [McKeown and Perlant 92].
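
The merging and per-region refinement described above can be sketched as follows. This is a minimal illustration, not the implementation of [McKeown and Perlant 92]: the function names, the disagreement threshold, and the outlier rule (values more than k standard deviations from the region mean) are all assumptions made for the example, and images are flattened to simple lists.

```python
from statistics import mean, pstdev

def merge_disparities(d1, g1, d2, g2, max_disagreement=2.0):
    """Merge two per-pixel disparity estimates; None marks a suppressed pixel.

    d1, d2 are disparities from the two matchers; g1, g2 are their
    "goodness" scores on a common scale (higher is better).
    """
    merged = []
    for a, ga, b, gb in zip(d1, g1, d2, g2):
        if abs(a - b) > max_disagreement:
            merged.append(None)                  # large disagreement: suppress both
        else:
            merged.append(a if ga >= gb else b)  # keep the better-scored match
    return merged

def refine_by_region(merged, labels, k=1.5):
    """Replace each region's disparities by the mean of its inliers.

    labels gives the segmentation region of each pixel; each region is
    treated as a single continuous surface patch.
    """
    by_region = {}
    for value, label in zip(merged, labels):
        if value is not None:
            by_region.setdefault(label, []).append(value)
    region_value = {}
    for label, values in by_region.items():
        mu, sigma = mean(values), pstdev(values)
        inliers = [v for v in values if abs(v - mu) <= k * sigma] or values
        region_value[label] = mean(inliers)
    # Suppressed pixels inherit the value of their region, if it has one.
    return [region_value.get(label) for label in labels]
```

In this sketch a suppressed pixel is filled from its region; whether the original system fills or leaves such pixels empty is not stated in the text.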

2.2.2.  Generation  of  epipolar  geometry 

Epipolar  resampling,  that  is,  resampling  a  stereo 
pair  of  images  so  that  the  epipolar  lines  run  along 
the  rows  of  the  image,  is  a  requirement  for  our 
stereo  matchers,  as  it  is  for  most  existing  computer 
vision systems applied to aerial imagery.
Unfortunately,  the  resampling  that  is  typically 
performed  uses  approximate  warping  techniques 
that  may  be  adequate  for  vertical  images  but  may 
fail  for  imagery  with  severe  obliquity.  We  have 
implemented  a  rigorous  epipolar  reprojection 
routine  that  transforms  a  given  stereo  pair  into  the 
required  geometry  using  the  full  orientation 
parameters  for  the  images. 

As  an  experiment  we  generated  epipolar  aligned 
imagery  using  two  different  techniques.  First,  we 
established  a  baseline  registration  by  performing  a 
relative  orientation  of  the  RADIUS  modelboard 
imagery  by  finding  common  scene  points  in  each 
of  the  stereo  pairs.  A  polynomial  orientation  was 
performed  giving  an  approximately  collinear 
epipolar  alignment  [Perlant  and  McKeown  90]. 
The second orientation was performed using a
rigorous epipolar reprojection based upon the
modelboard resection.

Figure 3: Refined S1 disparity results using
polynomial orientation.

2.2.3.  Modelboard  stereo  results 

Both S1 and S2 were run using both the polynomial
orientation  and  the  resection  reprojection  on  the 
RADIUS  J4  and  J5  stereo  pair.  Figures  1  and  2 
show  the  left  and  right  image  pairs.  Figures  3  and 
4  show  the  stereo  disparity  results  after  refinement 
using  the  polynomial  orientation.  The  disparity 
results  are  encoded  such  that  bright  areas  are 
higher  than  dark  areas. 

Using the polynomial orientation one can easily
see areas of mismatch, especially for the area-based
S1 process. This was mostly due to the lack of a
precise alignment of the epipolar lines. In addition,
the large number of occluded regions caused several
mismatches by the feature-based S2 matcher. In many
cases the correct match between features had an
opposite intensity contrast, which violates one of the
current S2 constraints. This was less an issue of
registration and due more to the imaging geometry.

Figure 4: Refined S2 disparity results using
polynomial orientation.

Figure 5 shows the reprojection of the original left
image in Figure 1 such that the camera axis is
perpendicular to the stereo baseline. Both the left
and right images were reprojected into epipolar
alignment. One can notice a change in shape of
the scene due to the obliquity and the angle
between the original camera axis and the stereo
baseline. In this case the change was minimal
since the baseline was nearly horizontal.

Figure 5: Resampled Image J5 (left view) using
full resection.
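
One common way to obtain a camera orientation whose axis is perpendicular to the stereo baseline is sketched below. It is a textbook construction shown for illustration only, and is not claimed to be the reprojection routine used here; the function names and the choice of world coordinates are assumptions. The new x-axis is placed along the baseline, and the new optical axis is the component of the old axis that is perpendicular to the baseline.

```python
import math

def _normalize(v):
    n = math.sqrt(sum(c * c for c in v))
    return tuple(c / n for c in v)

def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def rectifying_rotation(center_left, center_right, old_axis):
    """Return the rows (x, y, z axes) of a rotation aligned with the baseline.

    center_left, center_right: camera centers in world coordinates.
    old_axis: original optical axis of one camera (must not be parallel
    to the baseline, otherwise the cross product below is zero).
    """
    baseline = tuple(r - l for l, r in zip(center_left, center_right))
    x_axis = _normalize(baseline)                   # image rows follow the baseline
    y_axis = _normalize(_cross(old_axis, x_axis))   # perpendicular to both
    z_axis = _cross(x_axis, y_axis)                 # new axis, perpendicular to baseline
    return (x_axis, y_axis, z_axis)
```

Resampling both images under the same rotation places corresponding epipolar lines on the same image rows; the resampling itself, which needs the full orientation parameters, is omitted from this sketch.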

Figure 6 shows the result of S2 matching and
refinement using the resection orientation.
Although the results in Figures 4 and 6 are not
directly comparable due to the reprojection, one
can see significantly more structure in the
buildings in the upper left corner of the scene.
Figures 7 and 8 show perspective views of the
modelboard reconstruction and also highlight some
of the differences between the two orientation
techniques. Quantitative analysis of the stereo
accuracy along the lines of [Hsieh et al. 92] will be
performed over a set of test cases.

2.3.  Open  issues 

As discussed in Section 2.1 we have applied a
rigorous  photogrammetric  approach  to  the  problem 
of  obtaining  a  more  exact  collinear  epipolar 
alignment  of  the  stereo  images.  We  still  need  to 
determine  how  much  tolerance  our  stereo 
algorithms  exhibit  with  oblique  imagery.  With  the 
larger  angle  oblique  images,  we  will  need  to 
consider  methods  to  deal  with  the  large  occluded 
areas  and  the  large  baseline-to-range  ratio. 

We have observed that the stereo refinement
process greatly improved the disparity results in
both of the modelboard image tests. We plan to
introduce additional sources of information in the
stereo process, such as wall and roof hypotheses
generated by monocular analysis, in order to help
guide and refine the stereo matching.

Figure 6: Refined S2 disparity results using
full resection.

3.  Building  extraction  from  stereo 

The interpretation of stereo disparity maps to
detect and delineate manmade structures contained
within is a difficult problem. Our recent research
has been addressing the detection and extraction of
buildings using stereo analysis together with
monocular cues. The goal is to produce full
three-dimensional models of complex buildings for site
model construction and update. Our approach is to
apply the cooperative-methods paradigm starting
with the results generated by the stereo analysis of
a pair of aerial images and, together with
monocular cues, mark those areas of the image that
appear to be structures.

Our first step is to obtain a set of refined stereo
estimates of the scene. This is obtained by using
the S1 and S2 stereo matching systems coupled with
disparity map refinement as described by
[McKeown and Perlant 92]. In the course of the
stereo refinement process an intensity
segmentation is produced. This segmentation is
used as the basis for subsequent processing.

The  second  step  is  to  merge  those  segmented 
regions  that  have  approximately  the  same  disparity 
and  that  are  adjacent.  Next,  those  (merged) 
regions  having  a  significantly  greater  disparity  than 
their neighbors are selected. The rule applied in
this step is liberal in the sense that we would rather
produce a few false positives than miss buildings at
this point.

Figure 9: DC38008 Intensity Image.


Figure  11:  Fine  Intensity  Segmentation. 


In  order  to  remove  some  or  all  of  the  false 
positives,  we  apply  heuristics  and  constraints 
derived  from  monocular  cues. 

• Clusters of regions whose collective size is small
or large with respect to possible building size are
discarded.

• Regions that exist in areas marked as shadows
[Irvin and McKeown 89, Shufelt and McKeown 93]
are removed.

• Multi-spectral classification of regions may be
used to remove elevated non-structural regions
such as tree canopies [Ford and McKeown 92a].

Finally, the remaining clusters of buildings are
re-analyzed by searching for a best fit building model
for each cluster. In addition, the shadow regions
are again used to hypothesize the location of the
shadow casting sides of each potential building,
acting as a final cluster hypothesis verification.

Figure 10: Refined S2 Stereo Result.

Figure 12: Building Hypotheses.

3.1.  Extraction  test  results 

Some initial results are shown for a complex
industrial scene, DC38008. Figure 9 shows the
original intensity image of the left view of the
stereo pair of aerial images, while Figure 10 shows
the refined stereo result produced by the S2
matcher and used as the initial input to the building
extraction process.

Figure 11 shows the segmentation of the scene in
Figure 9 that was generated during the stereo
refinement process in which the range of intensity
within  each  region  is  i.'S. 


These regions are used as an initial over-segmented
view of the aerial scene and adjacent
regions are merged using the mode stereo disparity
value within the region. The threshold for merger
of adjacent regions is ±1 pixel. Next, individual
regions are marked as potential buildings based on
their relationship to adjacent regions, according to
the following heuristic rule: (1) If the mode disparity
within the region is less than the lowest of its
neighbors, then it is not considered a building; (2) If
its disparity is greater than the highest of its
neighbors, then it is given a likelihood value of 1.0
(from a range of 0-1.6); (3) If the mode disparity is
equal to both the high and low values of its
neighbors, it is allowed to be considered a building
hypothesis and assigned a value of 1.6; Otherwise,
its likelihood is calculated by the formula:

\[ \frac{1.5 \times D - L - 0.5 \times M}{H - L} \]

where:

D is the disparity of the local region,
M is the mode of the surrounding regions,
H is the maximum disparity of the surrounding regions, and
L is the minimum disparity of the surrounding regions.
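
The heuristic can be written out as a small function. This is a sketch under stated assumptions rather than the code used in the experiments: the function and parameter names are hypothetical, the two constant values are taken as they appear in the text, and rule (1) is represented here by a likelihood of 0.0 since the text only says the region is not considered a building.

```python
from statistics import mode

def building_likelihood(d, neighbor_disparities, above_all=1.0, all_equal=1.6):
    """Likelihood that a region with mode disparity d is a building.

    neighbor_disparities holds the mode disparity of each adjacent region.
    above_all and all_equal are the constants quoted in the text for
    rules (2) and (3).
    """
    low = min(neighbor_disparities)    # L
    high = max(neighbor_disparities)   # H
    m = mode(neighbor_disparities)     # M
    if d == low == high:
        return all_equal               # rule (3): flat neighborhood
    if d < low:
        return 0.0                     # rule (1): assumed encoding of "not a building"
    if d > high:
        return above_all               # rule (2)
    # Otherwise H > L here, so the division is safe.
    return (1.5 * d - low - 0.5 * m) / (high - low)
```

For example, a region of disparity 5 whose neighbors have disparities 3, 3, and 7 gives (7.5 - 3 - 1.5) / 4 = 0.75.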

Figure 12 shows the result of accepting regions
rated at 0.5 or higher according to the above heuristic.
Clusters of regions that are very large (more than
6000 pixels) or small (less than 100 pixels) are
removed.^ This is followed by a further restriction
that all clusters that do not have a hypothesized
shadow region to their non-sunward edges are
removed. Figure 13 shows the final result after
these restrictions.
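
The clustering and size filtering can be sketched with a union-find structure, taking a cluster to be the transitive closure of region adjacency. The names, the input format (region identifiers, a list of adjacent pairs, and a table of region areas), and the use of union-find are assumptions for illustration; the shadow restriction is not modeled.

```python
def cluster_regions(region_ids, adjacent_pairs):
    """Group regions into clusters: the transitive closure of adjacency."""
    parent = {r: r for r in region_ids}

    def find(r):
        while parent[r] != r:
            parent[r] = parent[parent[r]]   # path halving
            r = parent[r]
        return r

    for a, b in adjacent_pairs:
        parent[find(a)] = find(b)           # union the two clusters

    clusters = {}
    for r in region_ids:
        clusters.setdefault(find(r), set()).add(r)
    return list(clusters.values())

def filter_by_size(clusters, area, min_pixels=100, max_pixels=6000):
    """Keep clusters whose total pixel count lies within the size bounds."""
    return [c for c in clusters
            if min_pixels <= sum(area[r] for r in c) <= max_pixels]
```

With regions 1, 2, 3 chained by adjacency and areas of 50, 40, and 30 pixels, the cluster of 120 pixels is kept, while an isolated 20-pixel region and an isolated 7000-pixel region are both removed.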

As a result of this process many of the significant
buildings are detected, with various degrees of
accurate delineation. One way to visualize the
results is to look at the differences between a
three-dimensional ground truth description, the refined
stereo disparity map, and the three-dimensional
scene that results from using building extraction.
Figure 14 shows the original scene rendered using
a hand-generated stereo ground truth estimate
14(a), the refined S2 stereo result 14(b),
and the building hypotheses generated by
this technique 14(c).


^ Here a “cluster” is defined as the transitive closure of
adjacent (within 2 pixels) regions.


Figure  13:  Filtered  Building  Hypotheses. 


3.2. Open issues

Although our initial results are promising, we feel
that no approach will reliably detect and delineate
manmade structures solely by using stereo
disparity. As a part of the cooperative-methods
paradigm we plan to include other sources of
information such as BABE building hypotheses
[McKeown 90] and surface material classification
[Ford and McKeown 92b, Ford and McKeown
92a]. In rugged terrain or in areas with significant
tree canopy additional cues will be necessary for
both the selection and the filtering of building
hypotheses. In addition, we expect that
monocular cues, such as those generated by BABE,
will play an important role in the verification and
re-analysis of region clusters during the model
fitting and labeling process.

4. Manual selection of building hypotheses

Automated  feature  extraction  from  aerial  images  is 
a  complex  problem,  and  research  in  this  domain 
has  illustrated  the  difficulties  in  reliably  detecting 
and  verifying  building  structure.  Although  the 
ultimate  goals  of  our  work  in  this  area  are  systems 
which  will  accurately  detect  and  precisely 
delineate man-made features in aerial photography
without  human  intervention,  it  is  clear  that  a 
combination  of  current  extraction  techniques  with 
some  degree  of  user  guidance  has  the  potential  to 
exhibit  improved  performance  on  complex 
imagery. 

To date, many of the semi-automated systems
require a large portion of the detection and
delineation tasks to be performed by the user of the
system. In such systems, the user interactively
manipulates a variety of models over features in
the image, fitting the models to the features
[Hanson et al. 87, Kass et al. 87]. An alternative
paradigm suggests that another approach for
developing a high-performance system is to allow
the user to lend a guiding hand during the
execution of the feature extraction algorithms.

(c) Visualization after Building Extraction

Figure 14: DC38008 Building Extraction.

4.1. Manual selection of building hypotheses
We have been exploring possibilities for the
application of human interaction in the extraction
process. Our current testbed for this research is
BABE, a line-corner intensity based feature
extractor [McKeown 90]. In brief, BABE proceeds
through four major phases to incrementally
generate building hypotheses. The first phase
constructs corners from lines, under the
assumption that buildings can be modeled by
straight line segments linked by (nearly)
right-angled corners. The second phase constructs
chains of edges which are linked by corners, to
serve as partial structural hypotheses. The third
phase uses these line-corner structures to
hypothesize boxes, parallelepipeds which may
delineate man-made features in the scene. The
fourth phase evaluates the boxes in terms of size
and line intensity constraints, and the best boxes
for each chain are kept, subject to shadow intensity
constraints similar to those proposed by [Nicolin
and Gabler 87] and [Huertas and Nevatia 88].

In recent work, we have addressed the possibility
of replacing the hypothesis evaluation routine with
a simple form of user verification, in which a
person uses the mouse to drop points on each
individual structure in the scene. Then, hypothesis
evaluation reduces to determining which boxes
produced by BABE contain points placed by the
user. This level of interaction does not place great
demands on the user, and makes effective use of
the hypothesis generation capabilities of BABE;
thus, it serves as an interesting test for an
intermediate level of man-machine interaction in
this domain.
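
The user verification step amounts to a point-in-polygon test against each hypothesized box. The sketch below uses the standard ray-casting test; the representation of a box as a list of image-plane vertices and the function names are assumptions, not part of BABE.

```python
def point_in_polygon(point, polygon):
    """Ray-casting test: True if point lies inside polygon, a list of (x, y)."""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge cross the horizontal line through the point?
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside   # crossing lies to the right of the point
    return inside

def verify_boxes(boxes, user_points):
    """Keep the hypothesized boxes that contain at least one user point."""
    return [box for box in boxes
            if any(point_in_polygon(p, box) for p in user_points)]
```

Placing several points per building, as in the three-point experiment, simply makes it more likely that at least one point falls inside a box whose edges are slightly misplaced.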

Figure 15 shows a ground-truth hand segmentation
of a suburban scene in Washington, DC. Figure 16
shows the complete set of hypotheses generated by
BABE for this scene, and Figure 17 shows the
hypotheses verified by the shadow intensity
constraint algorithms invoked in the fully
automatic version of BABE. Figure 18 illustrates
the results of a semi-automated BABE execution, in
which the shadow verification algorithm was
replaced by user selection of three points on each
building, followed by intersection of these points
with the full set of hypotheses in Figure 16. Each
of these results was then compared on a
pixel-by-pixel basis with the ground-truth hand
segmentation to generate the statistics in Table 2.
Note that we give data for a single-point user
selection example as well; we omit the
corresponding figure for brevity.


Results       Building    Background   Total
              Detection   Detection    Detection

All Boxes     82.1%       63.0%        65.8%
Best Boxes    57.4%       97.7%        91.9%
1 Point       67.7%       94.1%        90.2%
3 Point       77.1%       92.9%        90.6%

Table 2: Building/background detection statistics
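
A pixel-by-pixel comparison of this kind can be sketched as below. The definitions used are an assumed reading of the column headings: building detection is the percentage of ground-truth building pixels labeled as building, background detection is the percentage of ground-truth background pixels labeled as background, and total detection is the percentage of all pixels labeled correctly. The function name and the flattened boolean masks are hypothetical.

```python
def detection_statistics(truth, result):
    """Compare two equal-length boolean masks (True = building pixel).

    Assumes the scene contains at least one building pixel and at least
    one background pixel in the ground truth.
    """
    building_total = sum(1 for t in truth if t)
    background_total = len(truth) - building_total
    building_hits = sum(1 for t, r in zip(truth, result) if t and r)
    background_hits = sum(1 for t, r in zip(truth, result) if not t and not r)
    return {
        "building": 100.0 * building_hits / building_total,
        "background": 100.0 * background_hits / background_total,
        "total": 100.0 * (building_hits + background_hits) / len(truth),
    }
```

Under this reading, each row of Table 2 is consistent with buildings covering roughly 14 to 15 percent of the scene, since the total is the area-weighted average of the other two columns.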


With the simple mechanism of multiple point
selection, a user interacting with BABE can achieve
a marked improvement in building detection, at the
slight expense of accumulating errors in
background classification. This is due to line
placement errors in BABE hypotheses that are
otherwise accurate descriptions of man-made
structure. Note also, however, that the total scene
classification rate remains essentially the same in
each of the three examples. This suggests that user
interaction at the verification level trades detection
rate against overall classification precision.

4.2. Open issues

Given that the initial hypothesis data produced by
BABE still fails to detect 18% of the building
structures in the scene, it should be clear that more
work is necessary on the basic feature extraction
algorithms, and we intend to continue our pursuits
in this area. User interaction at an intermediate
level appears to be a fruitful avenue for further
exploration, however, and we intend to investigate
this topic further. One key issue is the
determination of the appropriate level of
interaction between a user and a feature extraction
algorithm. We also plan to experiment with user
input at other phases in the extraction algorithms,
such as corner detection, line linking, and structure
generation.

5. Terrain Modeling and Visualization

Terrain modeling is becoming an increasingly
important issue with the advent of large-scale
distributed simulations for training, mission
rehearsal, and mission planning. Such systems
rely on efficient representations for natural terrain,
as well as manmade features such as buildings,
roads, and bridges. Our recent work in this area
has focused on the development of visualization
tools for three-dimensional data, and on the
continuation of our research in triangular irregular
networks (TINs).

5.1.  Visualization  tools 
With the increasing availability of a variety of
digital spatial data, ranging from map databases and
object model descriptions to digital elevation
models and geo-referenced imagery, there is a
need to conveniently view image and vector data to
support various aspects of our research. In many
cases these datasets are best visualized in three
dimensions. Figure 19 demonstrates the difference
between viewing terrain as an intensity mapped
height field and as an overhead shaded relief
rendering. The latter process takes into account
shading from a light source and tends to make the
surface structure more apparent. Small changes in
terrain detail are enhanced and surface slope and
aspect appear more pronounced. To support our
need for 3D display of spatial data we have
developed an X/Motif application, XRELIEF, to
allow us to visualize digital elevation models and
TINs, manual ground truth segmentations,
automated stereo results, multi-spectral results, and
digital map data (ITD, DLMS) overlaid on terrain.

Figure 15: Ground truth segmentation for DC37405.

Figure 16: Hypothesis generation for DC37405.

Figure 17: Automatic verification for DC37405.

Figure 18: User-assisted point verification.
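
The shaded relief rendering mentioned above can be illustrated with simple Lambertian shading of a gridded elevation model. This is a generic sketch, not the display code of the application described here; the function name, the finite-difference slope estimate, and the row and column axis conventions are assumptions.

```python
import math

def shaded_relief(dem, spacing, light):
    """Lambertian shading of a gridded DEM.

    dem: list of rows of elevations (at least 2 x 2).
    spacing: ground distance between adjacent posts.
    light: unit vector pointing from the surface toward the light source.
    """
    rows, cols = len(dem), len(dem[0])
    shade = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            # Central differences inside the grid, one-sided at the border.
            r0, r1 = max(r - 1, 0), min(r + 1, rows - 1)
            c0, c1 = max(c - 1, 0), min(c + 1, cols - 1)
            dz_dx = (dem[r][c1] - dem[r][c0]) / ((c1 - c0) * spacing)
            dz_dy = (dem[r1][c] - dem[r0][c]) / ((r1 - r0) * spacing)
            # Unit normal of the surface z = f(x, y).
            norm = math.sqrt(dz_dx ** 2 + dz_dy ** 2 + 1.0)
            n = (-dz_dx / norm, -dz_dy / norm, 1.0 / norm)
            # Brightness is the cosine of the angle to the light, clamped at 0.
            shade[r][c] = max(0.0, sum(a * b for a, b in zip(n, light)))
    return shade
```

Because brightness depends on slope and aspect rather than on height, small changes in terrain detail produce visible changes in shade, which is the effect noted in the comparison with the intensity mapped height field.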

Figure 20 shows a sample control panel used to
specify imagery, terrain, map overlay, and viewing
parameters. Users can create and store multiple
ordered sets of camera parameters in order to
compare results from different stages of an
extraction process. They can also compare results
from different analysis methods from a single
known viewpoint. XRELIEF has an intuitive
graphical interface for control and creation of these
camera parameters, as well as positioning of an
illumination source used for shading calculations,
and simple animation support. This interface is
shown at the lower right of Figure 20. The large
and small circles represent camera lookfrom and
lookat points, and the image displayed underneath
corresponds to the image being overlaid on the
terrain.

Figure 19: Comparison of 2D and 3D display of Digital Elevation Models.

5.2.  A  new  TIN  generation  method 

A digital elevation model (DEM) is a terrain model
consisting of elevation data regularly spaced on a
grid. A triangular irregular network consists of
elevation data that are irregularly spaced and are
connected into triangular facets to form a surface.
The ability of the TIN to place points irregularly
permits point density to adapt to terrain
complexity, and allows points to be placed
precisely on peaks and valley floors. The TIN
terrain model is ideally suited to real-time
rendering, as it consists of a reduced set of
polygons tailored to the underlying terrain
complexity. Previous research in TIN generation
using point selection from the DEM was described
in an earlier paper [Polis and McKeown 92]. In
this section we give a brief update on our
development of a new (and improved) method for
point selection.

Our point selection method relies on the iterative
selection of points based upon successive
approximation to the actual terrain surface. At
each iteration a dense DEM is constructed by
interpolation from the current TIN. This
approximation is compared to the actual DEM
and the point or points having the greatest error in
elevation are determined. These DEM points are
added to the TIN as correction points, and a new
triangulation is generated. The triangulation is
evaluated based on a global point budget and the
residual global error. If the point budget is not
exceeded and the RMS error is still greater than a
user specified error goal, then the process is
repeated. In practice the point budget controls the
stopping conditions for the TIN generation
process.
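
The structure of this loop can be illustrated with a one-dimensional analogue, in which an elevation profile stands in for the DEM and a piecewise-linear curve through the selected points stands in for the TIN. This is a deliberately simplified sketch under those assumptions: it adds one point per iteration, uses linear interpolation in place of triangulation, and all names are hypothetical.

```python
import math

def interpolate(n_samples, knots, heights):
    """Dense profile rebuilt by linear interpolation between selected knots."""
    dense = [0.0] * n_samples
    for a, b in zip(knots, knots[1:]):
        for i in range(a, b + 1):
            t = (i - a) / (b - a)
            dense[i] = (1 - t) * heights[a] + t * heights[b]
    return dense

def select_points(heights, point_budget, rms_goal):
    """Greedy selection: add the worst-approximated sample until a stop test fires.

    Returns the selected sample indices and the final RMS error.
    rms_goal is assumed to be non-negative.
    """
    knots = [0, len(heights) - 1]                  # start from the two end points
    while True:
        approx = interpolate(len(heights), knots, heights)
        errors = [abs(h - a) for h, a in zip(heights, approx)]
        rms = math.sqrt(sum(e * e for e in errors) / len(errors))
        if len(knots) >= point_budget or rms <= rms_goal:
            return knots, rms
        worst = max(range(len(errors)), key=errors.__getitem__)
        knots = sorted(knots + [worst])            # 1-D stand-in for re-triangulation
```

For the profile 0, 1, 2, 3, 10, 3, 2, 1, 0 with a budget of three points, the first correction point chosen is the peak at index 4, after which the budget stops the process.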

Our previous point selection method relied on the
generation of error contours and associated medial
axes. Points would be selected from the set of
maximal contours generated at each iteration. The
new point selection is based solely upon a measure
calculated at each point in the approximation
DEM. The new process chooses fewer correction
points per iteration and as a result many more
iterations are required. However, the points
chosen are of higher quality in terms of our RMS
error metric. As a result of this a fast triangulation
is necessary, so the modified greedy triangulation
has been replaced with a Delaunay triangulation.
However, the improvement in point selection
appears to outweigh the loss in triangulation
accuracy, especially since the iterative process will
naturally add points to correct poor triangulations.


Figure 20: User control panel for XRELIEF.


In the following section we will describe the use of
this new triangulation method to generate
large-scale TINs suitable for use in SIMNET.

5.2.1.  TIN  generation  for  SIMNET 

A digital elevation model constructed to support a
SIMNET training exercise was provided to us by the
U.S. Army Topographic Engineering Center
(USATEC). The DEM covered an area 50
kilometers on a side (2500 square km) including
the National Training Center (NTC), Fort Irwin,
California. This area is primarily desert, with
some highly eroded mountainous areas and
intricate alluvial fans running to the desert floor.
The sheer size of the area presents significant
problems. The DEM consists of 1979x1979
points, nearly 4 million elevation posts. To
maintain the desired polygon density for the
SIMNET computer image generation systems, only
90,000 points were to be selected for the TIN.

An additional complication for the TIN generation
process was the desire for reduced fidelity in the
mountainous areas with increased detail in the
areas of alluvial fans and on the desert floor. This
was primarily driven by the fact that mountainous
areas are not accessible to ground vehicles
(simulated or otherwise) yet, due to their height
and complexity, they tend to accumulate a large
number of TIN points. This decreases the budget
available for other areas of the terrain. An overlay
indicating the mountainous areas was provided by
USATEC, and was used to produce an importance
grid. Our initial experiment was to make the
maximum error in the mountains approximately
one fifth as large as that in the low lying areas. We
smoothed the importance grid to avoid problems
that might result from a discontinuity at the
boundary of the mountainous area. Since point
selection under our new method was based solely
upon a measure calculated at each point, it was
now possible to use the importance grid to apply a
weight to each point based upon its subjective
importance.
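
Because the selection measure is computed independently at each point, weighting it by an importance grid is a small change. The sketch below shows one possible form: a box filter to smooth the importance grid and a search for the largest importance-weighted error. The filter, the function names, and the weight values in the usage note are assumptions for illustration; the text does not specify the smoothing method or the weights used.

```python
def smooth(grid, radius=1):
    """Box-filter a 2-D importance grid so that weights change gradually."""
    rows, cols = len(grid), len(grid[0])
    out = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            window = [grid[i][j]
                      for i in range(max(r - radius, 0), min(r + radius + 1, rows))
                      for j in range(max(c - radius, 0), min(c + radius + 1, cols))]
            out[r][c] = sum(window) / len(window)
    return out

def worst_weighted_point(errors, importance):
    """Return (row, col) of the largest importance-weighted elevation error."""
    best, best_rc = -1.0, None
    for r, row in enumerate(errors):
        for c, e in enumerate(row):
            score = importance[r][c] * abs(e)
            if score > best:
                best, best_rc = score, (r, c)
    return best_rc
```

With errors of 10, 3, 1, and 4 meters on a 2 x 2 grid, uniform weights select the 10 meter point; giving that point a hypothetical weight of 0.2 moves the selection to the 4 meter point, so the point budget is spent elsewhere.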

Figure 21 shows a shaded relief representation of
the western part of the National Training Center.
The left half shows the terrain relief using the
original digital elevation model. The right half
shows the same area using the TIN representation
for the underlying surface structure. The TIN was
generated using selective fidelity in the
mountainous areas. Using approximately 2.5% of
the original DEM points we were able to construct
a TIN with an RMS elevation error of 3.1 meters
when compared to the original DEM. The range of
elevations in the DEM was approximately 1500
meters. From a qualitative standpoint it appears
that the major topographic features are generally
preserved and that detail in the alluvial fans and
desert floor areas is also quite good. This
impression was confirmed using the SIMNET
system at USATEC and driving an M1 tank
(simulated) through the terrain.

5.3. Open issues

We have shown the utility of our new TIN
construction method for a large-scale digital
elevation model. Research issues remain in
determining how to factor more detailed mobility
information into the point selection process. We
are also interested in addressing how to integrate
small scale cartographic features, particularly roads,
into a TIN, while maintaining a limited polygon
budget.

From a pragmatic standpoint, the generation of a
TIN directly from the NTC digital elevation model
using our new point selection method would take
weeks, even on a fast (20 MIPS) workstation. Our
initial solution was to divide the DEM into tiles
and then generate a TIN for each tile. We
maintained a restriction that TINs must match
along common boundaries. The execution time is
divided by the number of tiles, and can be further
reduced since tiles which have no common
boundary can be generated in parallel. Using this
method we were able to generate the NTC TIN
overnight using three workstations. There are
limits to this technique since as the number of tiles
is increased, the global behavior of point selection
is greatly reduced. This can defeat the overall goal
of placing points wherever their utility is the
greatest.
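
One simple way to lay out tiles that share their boundary posts, so that neighboring TINs can be made to agree along the common edge, is sketched below for one axis of the grid. The function name and the even split are assumptions; the tile layout actually used for the NTC data is not described in the text.

```python
def tile_ranges(n_posts, n_tiles):
    """Split post indices 0..n_posts-1 into n_tiles inclusive ranges.

    Adjacent ranges share their boundary post.
    """
    step = (n_posts - 1) / n_tiles
    edges = [round(k * step) for k in range(n_tiles + 1)]
    return [(edges[k], edges[k + 1]) for k in range(n_tiles)]
```

For a 1979-post axis split in two, this gives the ranges (0, 989) and (989, 1978), with post 989 belonging to both tiles. Applying the same split to rows and columns yields a grid of tiles.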

6. Toward Knowledge Refinement for
Large Rule-Based Systems
Knowledge refinement is a central problem in the
field  of  expert  systems  [Buchanan  and  Shortliffe 
84].  It  refers  to  the  progressive  refinement  of  the 
initial  knowledge-base  of  an  expert  system  into  a 
high-performance  knowledge-base.  For  rule-based 
systems,  refinement  implies  the  addition,  deletion 
and  modification  of  rules  in  the  system  so  as  to 
improve  the  system’s  empirical  adequacy,  i.e.,  its 
ability  to  reach  correct  conclusions  in  the  problems 
it  is  intended  to  solve  [Ginsberg  et  al.  88]. 

The goal of our research effort is to understand the
methodology for refining large rule-based systems,
as well as to develop tools that will be useful in
refining such systems. The vehicle for our
investigation is SPAM, a production system
(rule-based system) for the interpretation of aerial
imagery [McKeown et al. 89, McKeown et al. 85].
It is a mature research system having over 600
productions, many of which interact with complex
geometric algorithms. A typical scene analysis
task requires between 50,000 and 400,000
production firings and an execution time of the
order of 2 to 4 cpu hours.^

Figure 21: SIMNET 50km NTC DEM (left) juxtaposed with 50km NTC TIN (right)

Large, compute-intensive systems like SPAM
impose some unique constraints on knowledge
refinement. First, the problem of
credit/blame-assignment is complicated. It is extremely difficult
to isolate a single culprit production (or a set of
culprit productions) to blame for an error observed
in the output. Second, given the large run-time, it
is not possible to rely on extensive experimentation
for knowledge refinement.


^ Using a highly optimized C OPS5 [Kalp et al. 88]
running on a DEC 5000/200.


As a result, the methodology adopted in
well-known systems such as SEEK and SEEK2 [Politakis
and Weiss 84, Ginsberg et al. 88], or KRUST [Craw
and Sleeman 91], cannot be directly employed to
refine knowledge in SPAM. Our approach is to
address this problem in a bottom-up fashion, i.e.,
begin by understanding SPAM’s individual phases,
and then attempt to understand the interactions
between the phases. A different set of tools is
required to allow the user to focus attention on
individual modules responsible for intermediate
results and refine them. In our work so far, we
have focused on the second phase in SPAM, local
consistency (LCC), which applies constraints to a
set of plausible hypotheses and prunes the
hypotheses that are inconsistent with those
constraints. Furthermore, we have narrowed this
focus to refining SPAM’s distance and orientation
constraints.

In  working  toward  refining  these  constraints,  we 
posed  several  questions  to  help  guide  our  analysis: 

1.  Does  this  constraint  play  a  positive  (helpful) 
or  negative  (unhelpful)  role  in  the 
interpretation  process? 

2.  If  the  role  is  positive,  are  there  improved 
constraint  bounds  values? 

3.  What  is  this  constraint’s  impact  on  run  time? 

4.  Is this constraint uniformly applicable or
should it be applied selectively? If
selectively, in what cases should we apply the
constraint?

In  the  following  sections  we  describe  some  of  our 
current  efforts  toward  addressing  these  questions. 

6.1.  The  refinement  methodology 

We  have  begun  our  investigation  on  knowledge 
refinement  by  focusing  on  SPAM’s  second  phase  of 
processing,  LCC.  This  phase  was  chosen  because 
most  of  SPAM’s  time  is  spent  in  this  phase,  and  it 
showed  the  most  potential  for  future  growth.  LCC 
performs a modified constraint satisfaction
between  hypotheses  generated  in  SPAM’s  first 
phase.  In  LCC,  a  successful  application  of  a 
constraint  provides  support  for  a  pair  of 
hypotheses,  and  an  unsuccessful  application  goes 
towards  filtering  out  that  pair  of  hypotheses.  The 
distance  constraint  specifies  allowable  distance 
ranges  between  different  pairs  of  hypothesized 
objects,  e.g.,  two  hangar  buildings  must  occur 
between  20  and  200  meters  apart,  while  a  parking 
apron  and  a  hangar  building  must  be  between  0 
and 50 meters apart. In essence, each constraint in
the  LCC  phase  classifies  the  pairs  of  hypotheses  — 
either  the  constraint  supports  that  pair,  or  it  does 
not. 
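
As an illustration only, the pairwise classification performed by a distance constraint can be sketched as follows. The two bounds entries are the examples quoted above; the dictionary layout, the function name, and the use of centroid-to-centroid distance are assumptions made for this sketch and are not taken from SPAM itself.

```python
from math import hypot

# Hypothetical bounds table (meters), keyed by an unordered pair of class labels.
# The two entries are the examples quoted in the text.
DISTANCE_BOUNDS = {
    frozenset(["hangar-building"]): (20.0, 200.0),
    frozenset(["parking-apron", "hangar-building"]): (0.0, 50.0),
}


def distance_supports(hyp_a, hyp_b, bounds=DISTANCE_BOUNDS):
    """Return True if the constraint supports the pair, False if it does not,
    and None when no bounds are defined for this pair of classes."""
    key = frozenset([hyp_a["label"], hyp_b["label"]])
    if key not in bounds:
        return None
    low, high = bounds[key]
    (xa, ya), (xb, yb) = hyp_a["centroid"], hyp_b["centroid"]
    # Assumption: distance is measured between centroids.
    return low <= hypot(xa - xb, ya - yb) <= high
```

Returning None for class pairs without bounds keeps "constraint not applicable" distinct from "constraint does not support the pair".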

Our  refinement  methodology  consists  of  three 
parts:  intermediate  result  evaluation,  constraint 
optimization,  and  embedded  evaluation.  The  first 
two  methods  allow  the  isolation  and  improvement 
of  individual  constraints,  while  the  third  method 
allows us to evaluate the performance of the new
knowledge  in  the  context  of  the  overall  system 
output. 

6.1.1.  Intermediate  result  evaluation 

In  order  to  measure  the  effect  of  various  spatial 
constraints  we  needed  to  establish  a  database  of 
correct inputs and outputs for LCC. For each of the
sets of data that we run through SPAM we have a
ground-truth database containing all the objects in
the scene with their correct hypothesis labels. An
“ideal” input to the LCC phase, a set of hypotheses
that  are  100%  correct,  can  easily  be  manually 
generated  and  run  through  the  system.  Any  errors 
in  the  output  are  then  directly  attributable  to  the 
constraints. 

A  set  of  constraint  results  can  be  generated  by 
allowing  a  user  (the  expert)  to  enumerate  those 
constraints  that  should  exist  between  each  pair  of 
objects  in  the  ideal  input.  This  is  equivalent  to  an 
“ideal”  output  for  LCC. 

Once  SPAM  has  processed  the  ideal  input,  the 
generated output can then be compared to the ideal
output.  Such  a  comparison  is  informative  as  it 
allows  a  quantitative  measure  of  error  to  be 
computed.  We  can  produce  this  comparison  as  a 
set  of  confusion  matrices  where  each  matrix 
represents  the  results  for  a  single  constraint  and  a 
single  pair  of  classes.  These  matrices  contain  the 
usual cells (true-positives, false-positives,
true-negatives, false-negatives). An example confusion
matrix is shown in Figure 22.

A  true-positive  entry  in  the  confusion  matrix 
indicates  situations  where  the  expert  and  LCC  both 
conclude  that  the  constraint  supports  a  pair  of 
hypotheses.  A  true-negative  entry  indicates 
situations  where  the  expert  and  LCC  both  conclude 
that  the  constraint  does  not  support  a  pair  of 
hypotheses. A false-positive entry is one where
LCC  concludes  support,  while  the  expert  does  not. 
A  false-negative  entry  is  one  where  the  expert 
concludes  support,  while  LCC  does  not. 
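
The four cells defined above can be tallied with a few lines of code. This is a minimal sketch: it assumes the expert's judgments and the LCC results are both available as dictionaries mapping a hypothesis pair to True (support) or False (no support), which is a representation chosen here for illustration.

```python
def confusion_matrix(expert, lcc):
    """Tally the confusion-matrix cells for one constraint and one pair of classes.

    expert, lcc: dicts mapping a hypothesis pair to True (support) or False.
    """
    cells = {"TP": 0, "FP": 0, "TN": 0, "FN": 0}
    for pair, expert_says in expert.items():
        lcc_says = lcc[pair]
        if expert_says and lcc_says:
            cells["TP"] += 1   # both conclude support
        elif lcc_says:
            cells["FP"] += 1   # LCC concludes support, the expert does not
        elif expert_says:
            cells["FN"] += 1   # the expert concludes support, LCC does not
        else:
            cells["TN"] += 1   # both conclude no support
    return cells
```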

6.1.2.  Constraint  optimization 

By  examining  the  overlap  of  each  histogram,  we 
can  tell  if  SPAM’s  distance  constraint  is  working 
properly.  The  overlap  of,  for  example,  true 
positives  with  false  positives  can  tell  us  how  the 
constraint  can  be  modified  to  achieve  the  greatest 
number  of  true  positives  without  introducing  too 
many  false  positives.  For  numeric  constraints, 
such  as  distance,  we  have  developed  an  automatic 
process  for  adjusting  the  constraint  bounds  to 
generate  an  improved  set  of  ranges. 

Automated  bounds  selection  is  achieved  by  doing 
an  exhaustive  search  through  the  space  of  possible 
bounds  settings,  evaluating  each  setting  with  an 
objective function. Currently, this objective
function weights all cells in the confusion matrix
equally and seeks to maximize the number of
elements in the diagonal cells of the matrix
(true-positives, true-negatives).
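
A minimal sketch of such an exhaustive search is given below. It assumes each sample is a measured distance paired with the expert's judgment and that the candidate bound values are supplied by the caller; both are assumptions made for illustration, and the function name is hypothetical. The score is the number of diagonal entries (true-positives plus true-negatives), with all cells weighted equally.

```python
from itertools import combinations_with_replacement


def best_bounds(samples, candidates):
    """Exhaustively search (low, high) settings for a numeric constraint.

    samples: list of (distance, expert_supports) tuples.
    candidates: iterable of candidate bound values.
    Returns (low, high, score); the first best setting found wins ties.
    """
    best = None
    for low, high in combinations_with_replacement(sorted(candidates), 2):
        # A sample is on the diagonal when the constraint agrees with the expert.
        score = sum((low <= d <= high) == truth for d, truth in samples)
        if best is None or score > best[2]:
            best = (low, high, score)
    return best
```

The search grows quadratically with the number of candidate values, which is acceptable here because it runs on the stored comparison data and does not re-run SPAM.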
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Figure  22:  Confusion  matrix  showing  correctness  of  SPAM’s 
hangar-building/road  distance  constraint 


6.1.3.  Embedded evaluation

Both methods described above evaluate and
improve the performance of isolated constraints.
However, it is most important that the system’s
overall output improve with the adjusted constraint
embedded within it.

We want to evaluate embedded
performance at a place within the system where the
intermediate results have been used, but where
only a small amount of processing has been done,
so that the credit assignment problem is avoided.
We chose to evaluate performance at the end of
SPAM’s third phase, FA. This phase does grouping
based on the results of the constraints applied in
LCC. These groups of supporting hypotheses are
called functional-areas (FAs).

SPAM’s long run times prohibit iterative refinement
if the number of iterations required can be large.
This limitation can be avoided by appropriately
choosing experiments to run and observing the
system’s behavior. In this way, we sample the
space of possible bounds settings and hence
sample the system’s output behavior.

6.2.  Analysis  of  results 

We ran the LCC phase of SPAM with a set of
hand-labeled hypotheses and compared this to our
ground-truth. The resulting comparison histogram
was used by our bounds adjustment procedure,
which generated a set of optimal settings for this
constraint. These experiments were performed on
four data sets, with each data set corresponding to
a different airport scene.

Next, we ran five experiments for each data set,
allowing SPAM to execute through its third phase.
Each experiment corresponded to a modification of
the distance constraint bounds, as follows:

off: constraint disabled;

low: bounds set to minimum value;

optimized: optimal bounds setting arrived at by the
adjustment procedure;

original: original bounds setting, before the
optimization process;

high: bounds set to maximum value.

Those  table  entries  labeled  orient-*  are  orientation 
constraint modifications. For each run, we
compiled  statistics  on  run-time,  number  of 
production  firings,  number  of  functional-areas 
generated,  and  number  of  correct  and  incorrect 
hypotheses  included  in  those  functional-areas. 
Evaluation  was  done  by  comparing  each  run  to  the 
original  bounds  settings. 
Moffett

Bounds        Time          No. Rule   No. of   No. of Correct
Setting       (cpu hours)   Firings    FAs      (diff.)

off           0.26           97051     16       -21
low           0.29          103779     16       -21
optimized     0.30          113107     16       -10
original      0.30          115012     17         0
high          0.35          135978     19        16
orient-off    0.27          104672     17        -1
orient-low    0.30          114783     18        -1
orient-high   0.32          124697     18         1

DC National

Bounds        Time          No. Rule   No. of   No. of Correct
Setting       (cpu hours)   Firings    FAs      (diff.)

off           0.52          161685     27       -25
low           0.57          170801     27       -26
optimized     0.57          180060     28        -9
original      0.58          183273     28         0
high          0.72          240726     28        68
orient-off    0.54          167618     28       -16
orient-low    0.57          186919     28         3
orient-high   0.62          203931     28        27

Table 3: Sampled SPAM performance, measured along several
dimensions while changing constraint settings.


Our  results  are  presented  in  Table  3.  From  this 
table,  we  can  make  several  observations.  First, 
from the increase in run time (from off to original),
it  can  be  noted  that  the  distance  constraint  is 
having  some  impact  on  the  results.  The  increase  in 
the  number  of  correct  hypotheses  and  the  drop  in 
the  number  of  incorrects  reveals  that  this  constraint 
is  playing  a  positive  role. 

Finding  the  best  setting  for  the  bounds  of  the 
constraints  is  a  more  difficult  problem.  The 
evaluation  function  for  this  task  seems  very 
complex,  taking  into  account  relationships  between 
numbers  of  corrects/incorrects,  sizes  of  functional- 
areas,  and  run  time.  For  the  Moffett  data  set,  the 
number  of  corrects  increases,  while  the  number  of 
incorrects  increases,  but  at  a  slower  pace.  From 
this  we  would  conclude  that  the  bounds  should  be 
set  to  the  maximum  value.  However,  the  same 
analysis  for  DC  National  implies  that  the 
optimized  value  would  be  best.  Other  larger  data 
sets,  such  as  those  for  San  Francisco  National 
Airport,  show  a  similar  trend.  This  suggests  that 
the  bounds  for  the  distance  constraint  should  be 
chosen  on  a  case  by  case  basis. 


An interesting phenomenon is observed as the
constraints are selectively turned off. The
generated functional-area groups get smaller, but
they do not radically change in area of coverage.
This implies that the distance constraint is
selectively applicable, i.e., it largely overlaps with
the other system constraints, but it is necessary for
the inclusion of some subset of hypotheses.
Because optimizing and then coupling these two
constraints does not produce a dramatic
improvement in results, it appears that more
constraints may be required to do a better job of
interpretation.

6.3.  Open issues

With the recent emphasis on performance
evaluation of vision systems focused upon low and
intermediate level vision tasks, this work
establishes a data point in the area of high level
vision systems. Though our goal is to improve the
interpretations generated by SPAM, we have begun
by improving our understanding of how the
individual components of SPAM operate, and how
they interact. This will provide the foundation for
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understanding  the  effects  of  modifying  or  adding 
knowledge  to  the  system. 

We  have  been  able  to  show  that  SPAM’s  distance 
constraint  plays  a  positive  role  in  the  interpretation 
task.  However,  choosing  an  “optimal”  value  is 
difficult,  and  seems  to  be  scene  dependent. 
Finally,  we  have  determined  that,  of  the  two 
constraints  considered  thus  far  (distance  and 
orientation),  the  applicability  of  both  overlaps  a 
great  deal. 

There  is  still  much  to  be  done.  In  the  short  term, 
there  are  several  obvious  problems  that  we  have 
not addressed. First, we need to look more closely
at  the  applicability  of  the  constraints  and 
characterize,  if  possible,  the  cases  where  each 
constraint  can  be  applied.  Second,  it  is  unclear  if 
the  bounds  optimization  procedure  we  developed 
will extend to non-numeric constraints. Finally,
we  wish  to  extend  the  analysis  to  simultaneously 
perform  validation  across  multiple  constraints. 

Overall,  we  believe  that  it  will  be  possible  to  build 
a  heuristic  system  that  would  automate  the 
knowledge  refinement  process,  similar  to  the 
automated system in [Ginsberg et al. 88, Politakis
and Weiss 84]. Within such a system, we would
like  to  discover  ways  not  only  to  improve  the 
current  constraints,  but  to  automate  methods  for 
determining  what  new  knowledge  may  be  needed. 

7.  Multi-spectral  classification 

Our work in multispectral analysis to determine
surface material properties has been focused on
basic research on demonstrating the utility of such
data for cartographic feature extraction. For many
tasks  in  traditional  remote  sensing  it  is  clear  that 
having  surface  material  information  drives  many 
tasks  in  land  use,  environmental  monitoring,  and 
natural  resource  management.  Our  hypothesis  is 
that  such  data  can  aid  in  manmade  object 
detection,  delineation,  and  identification. 
However,  getting  multispectral  imagery  at  spatial 
resolutions  that  are  comparable  with  the  high 
resolution  panchromatic  imagery  has  been 
difficult. 

Initial  work  has  demonstrated  the  utility  of  the 
refinement  of  multispectral  classification  using 
monocular  panchromatic  imagery,  and  the  fusion 
of  stereo  disparity  maps  with  surface  material 
information  [Ford  and  McKeown  92b,  Ford  and 
McKeown  92a].  One  issue  is  maintaining  accurate 
registration  between  the  multispectral  scanner  data 
(8  meter  gsd)  and  the  panchromatic  imagery  (1.3 
meter  gsd).  Once  this  is  accomplished  a  unique 
hybrid  three  dimensional  multispectral  dataset  can 
be  created  and  utilized  for  further  analysis. 

Our recent research has evaluated the
performance of two classification
techniques, Gaussian maximum likelihood and
differential radial basis function, for surface
material classification. In order to do this
evaluation  we  have  created  several  highly  detailed 
ground  truth  segmentations  based  upon  manual 
analysis  of  the  multispectral  imagery,  as  well  as  by 
inspection  of  panchromatic  imagery  acquired  over 
the  same  area.  Details  of  this  work  can  be  found  in 
a  companion  paper  [Ford  et  al.  93]  in  this  volume. 

Our  overall  conclusions  are  that  multispectral 
imagery  with  moderate  spatial  resolution  has  great 
potential  to  provide  scene  domain  cues  necessary 
to  improve  the  performance  of  cartographic  feature 
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extraction  based  on  panchromatic  imagery  with 
high  spatial  resolution. 
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Abstract 

We describe a system for detection and description of
buildings in aerial scenes. This is a difficult task as the aerial
images contain a variety of objects. Low-level segmentation
processes give highly fragmented segments for a number
of reasons. We use a perceptual grouping approach to collect
these fragments and discard those that come from other
sources. We use shape properties of the buildings for this. We
use shadows to verify the hypotheses generated by the grouping
process. This step also provides 3-D descriptions of the
buildings. Our system has been tested on a number of examples
taken from the imagery supplied by the RADIUS program,
and the results have generally been very good. The
current system is largely limited to overhead views; we are
currently working on extensions to oblique views.

1  Introduction 

The goal of this work is to detect and describe buildings
from monocular views of aerial scenes. This is a difficult
but important task for many applications such as
photo-interpretation and cartography. There have been
many previous attempts to solve this problem, in our
group [Huertas, 1983, Huertas and Nevatia, 1988, Mohan
and Nevatia, 1989] and elsewhere [Irvin and
McKeown 1989, Liow and Pavlidis, 1990, Venkateswar
and Chellappa, 1990]. These systems have
shown interesting performance but on limited examples.
The technique we describe in this paper, we believe,
significantly extends the range of scenes that can
be analyzed, though many problems remain. We show
several examples taken from the images provided by the
RADIUS program to demonstrate the effectiveness of
our technique.

Building detection is difficult for several reasons.
The contrast between the roof of a building and surrounding
structures such as curbs, parking lots, and
walkways can be low. The contrast between the roofs of
Figure 1. A building from Ft Hood, Texas

various wings, typically made of the same material, may
be even lower. Low contrast alone is likely to cause
low-level segmentation to be fragmented. In addition,
small structures on the roof and objects, such as cars and
trees adjacent to the building, will cause further fragmentation
and give rise to “noise” boundaries. Roofs
may also have markings on them caused by dirt or variations
in material. Shadows and other surface markings
on the roof cause similar problems.

There are other characteristics of these images which
may cause problems. Roofs have raised borders which
sometimes cast shadows on the roof. This results in
multiple close parallel edges along the roof boundaries
and often these edges are broken and disjoint. At roof
corners and at junctions of two roofs, multiple lines
meet, leading to a number of corners and making it difficult
to choose a corner for tracking. A roof casts a shadow
along its side and often there are objects on the ground
such as grass, trees, tracks, pavement, etc., which lead
to changes in the contrast along the roof sides.

Consider the building in Figure 1, from a scene of Ft
Hood in Texas. The building is easy for humans to see
and describe, but it is difficult for computer vision systems.
Figure 2 shows the line segments detected in the
image, using LINEAR, our linear feature extraction
software [Nevatia and Babu, 1980, Canny, 1986]. We
are still able to see the roof structures of the buildings
readily and easily, but the complexity of the task now
becomes more apparent. The building boundary is fragmented;
there are gaps and missing segments. There are
also many extraneous boundaries caused by other structures
in the scene.


Figure 2. Line segments extracted from image


Much of the previous work to resolve these problems
has used some form of a contour tracing technique;
see for example [Huertas, 1983, Huertas and Nevatia,
1988, Venkateswar and Chellappa, 1990]. These are essentially
local techniques that must make a decision of
which path to trace at each local junction. Of course, all
paths could be traced using backtracking but then the
search space may become prohibitively large.

We propose, instead, to use a perceptual grouping
approach. Cultural features such as buildings represent
structures that are not random but have specific geometric
properties. In this work, we restrict the shapes of
buildings to be rectangular or compositions of rectangular
shapes (thus allowing L, T and I shapes, for example).
We also assume that the viewpoint is more or less
overhead. Thus, primarily, we see roofs which project
as rectangles or compositions of rectangles. This property
can be used to organize the detected line segments
into roof hypotheses. We believe that this approach
leads to many fewer hypotheses than would be generated
by a complete contour tracing scheme.

We can choose among the many hypotheses by utilizing
other properties of the image. Specifically, under
favorable imaging conditions 3-D structures cast shadows
that allow verification of roof hypotheses and further
provide us an estimate of the height, allowing us to
generate 3-D descriptions of the detected buildings. An
alternative would be to utilize more than one image of
the scene to infer heights of the features of the roof and
to separate them from features on the ground; another
project in our group has explored this approach [Chung
and Nevatia, 1992]. The advantage of using only one
image, of course, is that such imagery is much easier to
acquire.

Our approach combines several of the techniques
from our previous work. Our perceptual grouping approach
comes from the work described in [Mohan and
Nevatia 1989]; however, we use a very different selection
technique. The earlier work, in fact, used perceptual
grouping for stereo analysis; here we apply it to monocular
analysis. Our shadow analysis method is an extension
of the approach first described in [Huertas,
1983, Huertas and Nevatia, 1988].


2  Overview  of  the  System 
The diagram in Figure 3 shows the main components in
our system. The system uses the line segments approximating
the intensity boundaries to compute lines and
relevant junctions among them. A hierarchy of features
including parallel relationships and portions of rectangles
leads to the formation of building hypotheses.
These consist of instances of rectangular shapes that potentially
correspond to building roofs and parts of building
roofs (see section 3). Next, promising rectangles are
selected and verified to correspond to building structures.

Our philosophy in the design of this system has been
to make only those decisions that can be made confidently
at each level. Thus, we choose to generate as
many hypotheses as seem feasible at the first level. Our
selection process too is conservative and favors keeping
hypotheses that may be viable. The verification process
has the most global information and can make stronger
decisions. Even here, if our system is to be embedded in
a larger system, some of the decisions would be deferred
to that system where more context is available for
decision making.
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Figure  3.  Block  Diagram  of  the  System 

3  Generation  of  Hypotheses 
The process of hypotheses formation is essentially the
one described in [Mohan and Nevatia, 1989]. This system
has been applied to building detection but using a
stereo pair of images. In this process we construct a feature
hierarchy which encodes the structural relationships
specific to rectangular shapes: lines, parallels,
U-contours, and rectangles.

Lines  and  Junctions 

A group of close parallel lines represents a linear structure
at a higher granularity level than the edges (see the
common boundary between the building wings in
Figure 2). The resulting lines have a length and an orientation
derived from the contributing elements. Figure
4 shows the lines obtained from grouping the segments
in Figure 2. We use these lines to detect L-junctions and
T-junctions, also shown in Figure 4.


Figure  4.  Linear  structures  and  junctions 

Parallels and U-structures
Structures in urban scenes like buildings, roads and
parking lots are often organized in regular grid-like patterns.
These structures are composed of parallel sides.
As a consequence, for each significant line-structure detected
in the scene, there is not one but many lines parallel
to it. For each line, we find lines that are parallel
and satisfy a number of reasonable constraints. Note
that the formation of a parallel structure also aids in the
formation of new lines, as they suggest extension and
contraction of the parallels to achieve full overlap.

When the two lines in a parallel structure have their
ends aligned, they strongly suggest the presence of a
line with which the parallel structure would form a U-structure.
Even if the third line does not exist in the set
of lines, we hypothesize it and generate the U-structure.

Rectangles 

Rectangle structures are generated from the U-structures.
The rectangles formed in our example are shown
in Figure 5. In practical applications this number can be
reduced by restricting the formation of rectangles on the
basis of size, as a function of image resolution, for example.
Rectangles are also generated from matching
junctions along the direction of illumination (see strong
junctions in section 3). We hypothesize the missing portions
of a rectangle having a corner with a matching
shadow corner.

4  Selection  of  Hypotheses 
After the formation of all reasonable rectangles, a selection
process is applied to choose rectangles having
strong evidence of support and having minimum conflict
among them. Our previous system used a Constraint
Satisfaction Network (CSN) [Mohan and Nevatia,
1989]. Here, we use a different method which seems
to give much more predictable results.


Figure 5. Rectangle hypotheses generated

Our new system uses two kinds of criteria: local selection
criteria and global selection criteria. Local selection
criteria determine whether or not a rectangle is
“good” based on the local supporting evidence. Only
good rectangles are retained for global selection. It is
possible that some of the good rectangles retained after
the local selection are mutually contained or duplicated
or overlapped with some other good rectangles. Global
selection criteria select the best consistent rectangles
from good rectangles.

We apply local selection criteria and global selection
criteria differently. Local selection criteria (evaluation
criteria) work together to evaluate the goodness of a
rectangle, while global selection criteria work separately.
Each global selection criterion acts like a filter. The
set of retained rectangles passes through all filters and the
set of rectangles coming out from the last filter is
the set of rectangles selected by the selection process.

The local selection criteria are used to remove rectangles
formed using weak evidence. For each rectangle
the evaluation criteria compute a goodness value. If this
value exceeds a given threshold, the rectangle is selected;
otherwise the rectangle is removed.

Every evaluation criterion is weighted according to
its importance. The goodness of a rectangle is then measured
by the sum of the weighted values returned by the
evaluation criteria. The problem of measuring the goodness
of a rectangle now becomes a problem of finding
and formulating good evaluation criteria, and how to assign
appropriate weights.

Whether a rectangle is good or not depends on evidence
of support. We distinguish between positive evidence
and negative evidence of support for a rectangle.
The positive evidence we use includes the presence of
edges, corners, parallels, and shadows. The negative evidence
includes the presence of lines crossing any side
of a rectangle, existence of L-junctions or T-junctions in
any side of a rectangle, existence of overlapping gaps on
opposite sides of a rectangle, and displacement between
the four sides of a rectangle and its corresponding edge
support. Negative evidence is as important as positive
evidence, because they help us to remove those rectangles
which are less likely to be part of buildings.
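
The local selection step described above, a weighted sum of evidence compared against a threshold, can be sketched as follows. The evidence names, the weights, and the threshold are hypothetical values chosen for illustration; the text does not give the ones actually used. Negative evidence is modelled here simply as criteria with negative weights.

```python
def goodness(evidence, weights):
    """Weighted sum of the values returned by the evaluation criteria."""
    return sum(weights[name] * value for name, value in evidence.items())


def select_locally(rectangles, weights, threshold):
    """Keep only rectangles whose goodness exceeds the threshold."""
    return [r for r in rectangles if goodness(r["evidence"], weights) > threshold]


# Hypothetical weights: positive evidence adds support, negative evidence removes it.
WEIGHTS = {"edges": 1.0, "corners": 0.5, "shadow": 1.0, "crossing_lines": -1.0}
```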

Good rectangles surviving local selection may compete
with each other. For example, some rectangles
could share the same edge or corner support and some
rectangles might overlap with each other. The goal of
global selection criteria is to select a minimum set of
rectangles which best describe the rectangular composition
of the scene.

Global selection criteria examine overlapping rectangles
and choose one if appropriate. The selection is
based on relative properties of each rectangle, the
amount and kind of overlap, and whether they share
support or not. Note that a rectangle fully contained in
another is not necessarily removed. If a rectangle does
not overlap with any other rectangles then it is not in
competition, and it remains. If available, some of the
shadow evidence is used in this process.

The rectangles selected in our Ft Hood example after
both the local and global selection criteria have been
applied are shown in Figure 6.


Figure  6.  Selected  Rectangles 

5  Verification  of  Hypotheses 
The purpose of verification is to validate that the selected
hypotheses correspond to buildings. Our validation
step segments the objects, generates a description of the
shape of the structures and derives a 3-D model.

5.1  Shadow  Analysis 

By  shadow  analysis  we  mean  the  establishment  of  cor¬ 
respondences  between  shadow  casting  elements  and 
shadows  cast,  and  the  use  of  these  correspondences  to 
verify and model 3-D structures. We assume that the
sun angles are given and that the ground surface in the
immediate  neighborhood  of  the  structure  is  fairly  flat 
and  level.  The  shadow  casting  elements  are  given  by  the 
sides  and  junctions  of  the  selected  rectangle  hypotheses. 
The  shadow  boundaries  are  located  among  the  lines  and 
junctions  computed  earlier  from  the  image. 

There are a number of difficulties that prevent the
accurate establishment of correspondences, however.
Building sides are usually surrounded by a variety of
objects such as loading ramps and docks, grass areas
and sidewalks, trees, plants and shrubs, vehicles, light
and dark areas of various materials. Nearby structures
may reflect light into the shadowed areas making the
objects in them more visible, and so on. To deal with these
problems we have adopted the following definitions,
criteria and geometric constraints to analyze the shadows
adjacent to rectangles (see Figure 7):
Figure 7. Shadow features

Strong  Junctions 

Matching  junctions  along  the  direction  of  illumination, 
having a consistent shape and a consistent attitude.
These junctions constitute the strongest monocular cue
to the presence of a 3-D structure. We use them also to
form and select rectangle hypotheses.

Strong  Lines 

Vertical  building  edges  cast  shadow  lines  in  a  direction 
similar  to  the  direction  of  the  projection  of  the  sun  rays. 
We  use  this  evidence  also  during  hypotheses  selection. 

Medium  Lines 

The  rectangle  sides  that  are  supposed  to  cast  shadows 
must have corresponding shadow lines.

Medium  Junctions 

The  junctions  formed  by  strong  and  medium  lines, 
found  along  the  direction  of  the  strong  lines. 

Weak  Junctions  and  Lines 

Junctions  and  breaks  in  the  shadow  boundaries  between 
the  strong  and  weak  junctions. 

Strong  Regions 

Dark regions surrounded by strong and medium junctions.
We require that this region be darker than the rectangle
region regardless of their gray level.

Weak  Regions 

In the absence of geometric correspondences of junctions
and lines, a dark region adjacent to the rectangle,
consistent with the direction of illumination.

5.2  Shadow  Process 

The  shadow  process  consists  of  four  steps: 

Extraction  of  Potential  Shadow  Evidence 
Potential shadow evidence consists of lines, junctions
and intensity statistics. We extract the following:

•  Lines  parallel  to  the  projection  of  the  sun  rays.  They 
represent  potential  shadow  lines  cast  by  vertical  edg¬ 
es  of  3-D  structures  in  the  image. 

•  Lines  having  their  dark  side  on  the  side  of  the  illumi¬ 
nation  source  are  potential  shadow  lines. 

•  Junctions  among  the  lines  above. 

•  Pixel  statistics  to  compare  relative  brightness. 

The potential shadow lines and junctions extracted
from the lines in our Ft Hood example are shown solid
in Figure 8. The underlying edges are shown in gray.


Figure 8. Potential shadow lines and junctions

Search  for  Shadow  Evidence 
For each rectangle we look in a search window (dashed
lines in Figure 9) and collect all the potential shadow
evidence in it. The search distance is arbitrarily chosen
as a function of the maximum expected building height
and the sun incidence angle. There is the possibility that
lines not relevant to the current rectangle will be included.
They have, however, a reduced effect in the presence of
the real evidence.
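
One plausible way to derive the search distance from the maximum expected building height and the sun angle is sketched below. The paper does not give its formula, so the assumptions are stated plainly: the ground is flat, the sun angle is taken as the elevation above the horizon, and a ground sample distance (metres_per_pixel, a hypothetical parameter) converts the shadow length to pixels.

```python
import math


def shadow_search_distance(max_height_m, sun_elevation_deg, metres_per_pixel):
    """Longest shadow, in pixels, that a building of the given height
    can cast on flat ground for the given sun elevation."""
    shadow_length_m = max_height_m / math.tan(math.radians(sun_elevation_deg))
    return shadow_length_m / metres_per_pixel


# A 20 m building with the sun 45 degrees above the horizon casts a
# shadow about 20 m long, which is about 40 pixels at 0.5 m per pixel.
d = shadow_search_distance(max_height_m=20.0, sun_elevation_deg=45.0, metres_per_pixel=0.5)
print(round(d, 3))
```

A lower sun lengthens the shadow, so the window grows as the elevation angle decreases.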


Figure  9.  Windows  to  search  for  shadows 

We favor medium and weak lines that are parallel to
the rectangle side. In some cases there may be various
sets of lines, all parallel to the building side but at various
distances from the rectangle side. This is actually a
common occurrence since many sidewalks, grass areas,
streets, vehicles and so on, will be found to be arranged
or located parallel to building sides. In this case we
choose those shadow lines at the distance from the rectangle
side such that the sum of their lengths is greatest,
but not exceeding the length of the rectangle. We determine
the width of the shadow by averaging the distance
to the lines selected. The selected evidence is then considered
to surround the shadow region. We compute the
mean intensity of this region and compare it to the rectangle
region. The evidence collected for both sides is
combined to give the evidence for the rectangle.
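
The choice among several sets of parallel lines can be sketched as below. This is a simplified reading of the rule above, not the authors' code: each candidate line is reduced to a hypothetical (distance, length) pair, lines are grouped into distance bins of an assumed size, and the bin with the largest total length that does not exceed the rectangle side length is kept.

```python
def select_shadow_lines(lines, side_length, bin_size=2.0):
    """lines: (distance, length) pairs for lines parallel to a rectangle side.

    Groups the lines into distance bins, keeps the bin whose total length is
    largest without exceeding side_length, and returns (selected, width),
    where width is the mean distance of the selected lines.
    Returns ([], None) when no bin qualifies.
    """
    bins = {}
    for distance, length in lines:
        bins.setdefault(int(distance // bin_size), []).append((distance, length))

    best, best_total = [], 0.0
    for group in bins.values():
        total = sum(length for _, length in group)
        if best_total < total <= side_length:
            best, best_total = group, total

    if not best:
        return [], None
    width = sum(distance for distance, _ in best) / len(best)
    return best, width


candidates = [(4.0, 10.0), (5.0, 12.0), (9.0, 30.0), (9.5, 25.0), (15.0, 6.0)]
chosen, width = select_shadow_lines(candidates, side_length=40.0)
print(chosen, width)
```

Here the lines near distance 9 total 55 units, longer than the 40-unit side, so they are rejected; the lines at distances 4 and 5 are chosen and the shadow width is their mean distance, 4.5.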

Evaluation  of  Shadow  Evidence 
We evaluate the shadow evidence and give a confidence
value as a weighted sum of the evidence of strong junctions,
medium junctions, strong lines, weak lines, strong
and weak regions. We designated five levels of confidence.
Each level of confidence requires that a minimum
amount of the different kinds of evidence be
present. High confidence requires that every kind
of evidence be detected. Very low evidence is reported
when no geometric correspondences can be established
but the presence of a region, adjacent to and darker than
the rectangle region itself, is found.

The  rectangles  selected  on  the  basis  of  shadow  evi¬ 
dence  are  shown  in  Figure  10. 


Figure 10. Rectangles with evidence of shadows

Use  of  Shadow  Evidence 

The  rectangles  validated  by  shadows  are  used  to  give 
the  footprint  of  buildings  or  portions  of  buildings.  The 
shadow widths are used to estimate their height. The result
is an elevation map encoding the height computed
for each rectangle. A 3-D rendered view computed from
the rectangles verified in our Ft Hood example is
shown  in  Figure  11. 
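
The step from shadow width to building height can be illustrated as follows. This is a sketch under assumptions the paper does not spell out: the ground is flat and level, the shadow width is measured on the ground along the direction of illumination, the sun angle is the elevation above the horizon, and footprints are simplified to axis-aligned boxes in a small grid.

```python
import math


def height_from_shadow(shadow_width_m, sun_elevation_deg):
    """Building height implied by a shadow of the given width on flat ground."""
    return shadow_width_m * math.tan(math.radians(sun_elevation_deg))


def elevation_map(rows, cols, buildings):
    """buildings: (row0, col0, row1, col1, height) boxes, end-exclusive.
    Returns a rows x cols grid holding the height of each cell."""
    grid = [[0.0] * cols for _ in range(rows)]
    for r0, c0, r1, c1, h in buildings:
        for r in range(r0, r1):
            for c in range(c0, c1):
                grid[r][c] = max(grid[r][c], h)
    return grid


# A 6 m wide shadow with the sun 45 degrees above the horizon
# implies a building about 6 m high.
h = height_from_shadow(shadow_width_m=6.0, sun_elevation_deg=45.0)
grid = elevation_map(4, 5, [(1, 1, 3, 4, h)])
```

Cells outside every footprint stay at ground level (0.0), and overlapping footprints keep the larger height.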


Figure 11. 3-D view from another viewpoint

6  Results 

We have tested our technique on many images from the
Ft Hood site and from a modelboard site. We selected
a few to demonstrate the performance of our system. In
the remaining figures (except Figure 16), (a) is the image,
(b) the line segments, (c) the lines and junctions, (d)
the rectangles, (e) the selected hypotheses, and (f) the
hypotheses verified by shadows. In particular, note in
figure (e) the excellent performance of the new selection
technique. In the absence of shadow information, the
selected rectangles can be matched by our system if stereo
views are available, thus providing verification and
a 3-D model.

Figure 12 shows a set of four buildings and part of
another. The difficulty here is with the building with the
patterned arrangement of small objects on the roof. The
shadows cast by these reach one side of the building
causing it to be fragmented. The shadow occluding the
top left corner of the building and the poor boundary
definition on the top right are also a source of difficulty.
The strong shadow cues however help form rectangle
hypotheses for most of the building.


Figure  12.  Modelboard  -  Scene  1 


In Figure 13 the small building on the top left corner
of the large one is detected separately. Note that portions
of buildings not in full view are also detected.

Figure 14 shows two dark buildings. The boundaries
between buildings and shadows in cases like this have
low contrast and are difficult to detect.

Figure 15 shows a complex building with numerous
rectangular components on the roof. We are able to exploit
the presence of strong shadow evidence. It allows
the system to form a hypothesis for the entire building
in spite of the broken and fragmented boundaries. Note
that the selection mechanism is able to select most of
the rectangular components on the roof as well. In this
example the shadows are well defined, and it is possible
to measure their width accurately. Figure 16 shows a
3-D rendered view of the building.

Figure 17 shows the same building from an oblique
view (30°). The roof remains a rectangle with small perspective
effects that the system is able to tolerate. Processing
oblique views is the subject of our current work.

Figure 18 shows a building in Ft Hood, where some
of the details of one of its sides are visible, apparently
doors. These and the vehicles parked on the other side
result in highly fragmented boundaries. The rectangles
verified by shadows include one that is formed from
various aligned parked trailers which collectively cast a
shadow. The small rectangle on the bottom has a strong
shadow junction corresponding to an actual narrow
shadow cast by a vehicle. The lower wing of the building
has a strong line and a corresponding medium junction.
The rest of the shadow is diffused and only gives a
“dark” region adjacent to the building wings.


Figure 13. Modelboard - Scene 2


Figure 15. Modelboard - Scene 4


Figure 16. 3-D view from another viewpoint

In Figure 19 the I-shaped building has no strong
evidence of shadows. The rectangles are weakly validated
on the basis of a strong region which, up to a given maximum
search distance, remains “strongly” dark.

Figure 20 shows a not uncommon situation. The
building is surrounded by rectangles representing
grassy areas and sidewalks parallel to the building. The
only shadow evidence is the dark region adjacent to the
building. Note that the other rectangles have adjacent
bright regions, namely, the building itself and its wings.


Figure 17. Modelboard - Scene 4 (oblique)


Figure 18. Fort Hood - Scene 2

Figure 21 shows a group of small buildings arranged
in a parallel fashion, and surrounded by other parallel
structures. In spite of the large number of hypotheses the
system is able to select the relevant ones.

Figure 19. Fort Hood - Scene 3


Figure 20. Fort Hood - Scene 4


Figure 21. Fort Hood - Scene 5


7 Current Work: Oblique Views
In the scenes analyzed, many of the objects have restricted
shapes and often the viewpoint is restricted. For
some applications, it is necessary to compare information
extracted from images of a scene acquired from
various viewpoints or acquired through various types of
sensors. We are currently performing extensive testing
of our system and reviewing our methods to determine
the feasibility of relaxing viewpoint restrictions. We
have begun investigating orthogonal trihedral vertices
(OTVs) in oblique views. If we continue to assume that
we restrict the shape of the objects to rectangles, the
most significant change is that right angles in the real
world no longer necessarily project onto right angles in
the image.
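
The loss of right angles under oblique viewing can be checked with a small calculation. The sketch below uses a deliberately simplified camera model (orthographic projection, tilted about the image x axis), which is an assumption made for illustration and not the model used by the authors.

```python
import math


def project(direction, tilt_deg):
    """Orthographic projection of a ground-plane direction for a camera
    tilted tilt_deg away from the vertical about the x axis."""
    x, y = direction
    return (x, y * math.cos(math.radians(tilt_deg)))


def corner_angle_deg(u, v):
    """Angle in degrees between two 2-D direction vectors."""
    dot = u[0] * v[0] + u[1] * v[1]
    return math.degrees(math.acos(dot / (math.hypot(*u) * math.hypot(*v))))


# Two perpendicular rectangle sides, rotated 45 degrees from the tilt axis.
side_a, side_b = (1.0, 1.0), (1.0, -1.0)

nadir = corner_angle_deg(project(side_a, 0.0), project(side_b, 0.0))
oblique = corner_angle_deg(project(side_a, 30.0), project(side_b, 30.0))
print(round(nadir, 2), round(oblique, 2))
```

Viewed from directly above the corner measures 90 degrees, while at a 30 degree tilt it projects to roughly 82 degrees. Sides aligned with the tilt axis would still project to a right angle, which is why the effect depends on the orientation of the rectangle.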
Abstract

The  results  of  the  “JISCT”  Stereo  Evaluation  (named 
after the five groups contributing imagery: JPL, INRIA
(in  France),  SRI,  CMU,  and  Teleos)  are  presented.  The 
goals  of  this  evaluation,  which  was  the  first  phase  of  a 
multiphase  evaluation  process,  were  (1)  to  get  an  initial 
estimate  of  the  effectiveness  of  current  stereo  techniques 
applied  to  Unmanned  Ground  Vehicle  (UGV)  tasks,  (2) 
to  identify  key  problems  for  future  research,  and  (3)  to 
debug  the  evaluation  process  so  that  it  can  be  repeated 
with  a  larger  group  of  participants.  SRI  collected  49 
pairs  of  images,  distributed  them  to  the  five  participants, 
and  received  complete  results  from  three  groups  —  IN¬ 
RIA,  SRI,  and  Teleos.  SRI  compared  the  results  by  in¬ 
teractively  analyzing  them  and  automatically  gathering 
statistics. 

We  were  surprised  by  the  completeness  of  everyone’s 
results.  On  the  eight  image  pairs  that  we  thought  were 
the  most  representative  of  UGV  tasks,  the  techniques 
computed  disparities  for  as  much  as  87%  of  the  points 
with  only  a  few  “spike”  errors  and  some  scattered  regions 
of  points  without  matches.  Although  the  missing  points 
(and  mistakes  in  the  reported  matches)  could  cause  prob¬ 
lems  for  vehicle  navigation,  this  level  of  completeness  is 
an  indication  that  there  is  a  solid  basis  for  building  a 
passive  ranging  system  for  an  outdoor  vehicle.  On  the 
other  hand,  none  of  these  techniques  have  “solved  the 
stereo  problem”  —  we  selected  a  number  of  important 
areas  for  future  research,  including  filtering  out  gross  er¬ 
rors  and  handling  the  wide  dynamic  range  of  intensities 
common  in  outdoor  imagery. 

1  Introduction 

Stereo  analysis,  which  for  a  long  time  had  been 
viewed  as  an  interesting,  but  too-costly-to-be-practical 
technique,  has  emerged  as  a  viable  tool  for  realtime  ap¬ 
plications  such  as  vehicle  navigation.  This  has  happened 
for two reasons. First, advances in hardware have made
it  practical  to  compute  stereo  matches  “in  real  time.” 
And  second,  advances  in  algorithm  development  have 
made  it  possible  to  correctly  match  large  portions  of  out¬ 
door  scenes. 

An  important  next  step  in  the  development  and  use  of 
practical  stereo  systems  is  the  characterization  of  their 
capabilities.  Potential  users,  such  as  system  integrators 
and automatic task planners, need to know their computational
requirements, their speeds, their precision, their
mistakes,  and  so  forth,  in  order  to  model  their  behav¬ 
ior  and  reason  about  their  use.  With  this  in  mind,  SRI, 
JPL,  and  Teleos  began  a  multiphase  evaluation  process 
last  year  within  the  Unmanned  Ground  Vehicle  (UGV) 
Project.  The  first  phase  of  that  evaluation  has  been 
completed,  and  the  second  phase  has  begun.  This  paper 
describes  the  results  of  the  first  phase. 

The overall plan for our complete evaluation process is
to  pursue  a  three-pronged  approach,  including  analytic 
models,  qualitative  “behavioral”  models,  and  statistical 
performance  models.  The  analytic  models  would  be  used 
to  estimate  such  things  as  the  expected  depth  precision 
computable  with  a  specific  camera  configuration.  The 
qualitative  models  would  be  used  to  identify  key  prob¬ 
lems  for  future  research,  for  example,  detection  of  holes, 
analysis  of  shadowed  regions,  and  measurement  of  bland 
areas.  The  statistical  models  would  be  used  to  produce 
quantitative estimates of such key factors as the smallest
obstacle  detectable  at  a  specified  distance.  SRI  has  taken 
the  lead  in  the  qualitative  evaluation;  JPL  has  taken  the 
lead  in  the  quantitative  analysis. 

For  the  qualitative  analysis,  we  decided  to  start  by  ex¬ 
amining  a  small  number  of  techniques  in  order  to  debug 
the  process,  and  then  expand  the  evaluation  to  include 
a  much  larger  set  of  participants.  The  goals  of  the  first 
phase  were  to  get  an  initial  estimate  of  the  effectiveness 
of  current  stereo  techniques  applied  to  UGV  tasks  and, 
from  this,  to  identify  key  problems  for  future  research. 

One  of  the  high-level  guidelines  we  adopted  was  to  de¬ 
velop  and  maintain  an  atmosphere  of  cooperation  and 
constructive  criticism  among  the  researchers  participat¬ 
ing  in  the  evaluation.  Without  this  we  would  not  be 
able to focus on our ultimate goal of producing a sequence
of increasingly capable stereo systems. To help
establish  a  cooperative  atmosphere,  we  decided  to  con¬ 
centrate  on  the  positive  aspects  of  each  algorithm  and 
highlight ways to strengthen existing techniques, realizing
that they were developed for different domains and
different  applications.  We  also  decided  to  share  all  the 
raw  results  with  the  participants  so  they  could  duplicate 
our  analysis  or  develop  their  own. 

For  the  first  phase  of  the  qualitative  evaluation, 
SRI  collected  imagery  from  five  groups,  JPL,  INRIA 
(in  France),  SRI,  CMU,  and  Teleos  (hence  the  name 
“JISCT” for the first evaluation phase); selected 49 pairs
for  analysis;  converted  them  into  a  standard  format;  dis¬ 
tributed  the  dataset  to  the  five  groups  for  processing, 
along  with  an  extensive  set  of  instructions;  collected  the 
results;  characterized  them;  and  finally  distributed  the 
results  and  the  associated  report  to  the  participants. 

We  intentionally  asked  each  group  to  process  a  large 
number  of  pairs  (10  training  pairs  and  45  “test”  pairs 
...  6  pairs  were  in  both  the  training  and  test  sets;  we 
made  an  administrative  mistake  on  one  of  the  test  pairs, 
reducing  the  total  to  44),  because  we  wanted  to  force 
them  to  establish  a  standard  algorithm  that  was  auto¬ 
matically  applied.  As  a  result  of  this,  there  are  now  four 
groups  around  the  world  that  can  readily  apply  end-to- 
end  stereo  techniques  to  new  data  and  compare  their 
results.  As  part  of  the  second  phase  we  hope  to  expand 
this  community  to  10  or  more  groups.  This  process  is 
opening  up  a  new  form  of  interaction  within  the  com¬ 
puter  vision  community  that  we  feel  will  help  stimulate 
advances  and  reduce  redundant  development. 

In  the  instructions  to  the  participants,  we  asked  each 
group  to  produce  several  results  for  each  matched  point 
in  addition  to  its  computed  disparity.  For  each  point 
we  asked  for  an  x  and  a  y  disparity,  an  estimate  of  the 
precision  associated  with  each  reported  disparity,  an  es¬ 
timate  of  the  confidence  associated  with  each  match,  and 
an  annotation  for  each  unmatched  point,  indicating  why 
the  technique  could  not  find  a  match.  Possible  explana¬ 
tions  for  no  match  included  “area  too  bland,”  “multiple 
choices,”  and  “inconsistent  with  neighbors.”  Although 
none  of  the  groups  produced  all  this  additional  informa¬ 
tion  (they  all  produced  some  of  it),  we  felt  that  it  was 
important  to  begin  the  process  with  the  goal  of  produc¬ 
ing  this  auxiliary  information,  which  will  be  invaluable 
for  the  higher-level  routines  using  the  stereo  results.  We 
foresee  a  time  in  the  not  too  distant  future  when  the 
calling  routine  will  use  the  precisions,  confidences,  and 
annotations  to  actively  control  the  sensor  parameters  for 
the next data acquisition step. For example, if the current
stereo results contain a large region of points without
disparities and the image region is quite dark, the
controlling  routine  could  open  the  irises  or  increase  the 
integration  time  to  reexamine  these  dark  regions. 

Four groups returned results and write-ups to SRI:
Teleos, SRI, and two from INRIA. One of the INRIA
sets was from a technique that locates linear features
and then matches these features. Since this technique
reports only disparities along the matched edges, it was
not possible to directly compare its results to the others.
Therefore, we concentrated our analysis on the three
correlation-based algorithms.

Each  participating  group  analyzed  its  own  results.  In 
addition,  Harlyn  Baker  and  Marsha  Jo  Hannah  of  SRI 
analyzed  the  results  from  all  the  groups  on  all  44  pairs 
and  wrote  short  reviews  of  them.  In  the  full  report 
[Bolles,  Baker,  &  Hannah],  their  comments  are  included 
as  appendices.  These  comments,  plus  the  automatically 
compiled  statistics,  form  the  core  of  this  evaluation. 

Initially,  we  were  a  little  reluctant  to  compute  and 
publish  statistics  that  may  be  taken  out  of  context.  On 
the  other  hand,  statistics,  if  reported  with  sufficient 
caveats,  can  provide  a  convenient  basis  for  comparing 
techniques.  In  this  paper,  we  summarize  the  qualitative 
results  and  quantitative  statistics.  The  validities  of  both 
are  limited  by  the  dataset,  which  implicitly  defines  the 
range of data for which the conclusions directly apply,
and  by  the  analyzers,  who  naturally  focused  on  issues 
they  were  most  interested  in. 

This  paper  is  organized  as  follows.  In  Section  2,  we 
briefly  describe  the  key  strategies  and  parameters  of  the 
three  principal  techniques,  highlighting  their  similarities 
and  differences.  In  Section  3,  we  describe  our  experi¬ 
mental  procedure.  In  Section  4,  we  present  the  auto¬ 
matically  gathered  statistics,  which  we  refer  to  as  the 
believe-everything-they-tell-you  statistics  because  they 
are  based  on  the  number  of  “reported”  disparities  in 
specified  regions  of  the  test  data,  not  on  the  number 
of  “correct”  disparities.  In  Section  5,  we  summarize  our 
qualitative  analysis  and  briefly  discuss  open  issues  for  fu¬ 
ture  research.  In  Section  6,  we  conclude  with  an  evalua¬ 
tion  of  the  JISCT  evaluation  and  make  some  suggestions 
for  the  next  step  in  the  evaluation  process. 

2  Technique  Summaries 

We  evaluated  three  techniques,  whose  key  aspects  are 
highlighted  below. 

2.1 INRIA

This  technique  was  originally  implemented  as  part  of 
a  European  space  project  to  produce  three-dimensional 
models  of  scenes  containing  rocks  and  sand.  It  is  im¬ 
plemented  in  C  on  a  Sun.  A  similar  technique  is  im¬ 
plemented  on  a  Connection  Machine  (by  Pascal  Fua)  at 
SRI.  Key  aspects  are 

• The algorithm computes a disparity for every pixel
in an image by matching patches (usually 11x11 pixels)
at one or two image resolutions, independently.
The basic algorithm “INRIA-1” matches only at one
resolution.

•  The  technique  uses  an  approximation  to  normalized 
correlation,  referred  to  as  C5,  because  it  can  be  im¬ 
plemented  efficiently  using  a  sliding  computation  of 
the basic sums.

•  The  algorithm  searches  only  along  epipolar  lines, 
which  are  assumed  to  be  horizontal. 

•  The  algorithm  expects  a  range  of  disparities  to  be 
specified  for  each  image  pair  to  be  analyzed. 

•  The  technique  verifies  all  matches  by  independently 
matching  patches  from  the  left  image  in  the  right 
image  and  patches  from  the  right  image  in  the  left 
image.  If  the  match  for  a  patch  from  the  left  image  is 
not  mapped  back  to  within  a  pixel  of  its  location  in 
the  left  image,  the  point  is  not  assigned  a  disparity. 

•  The  technique  computes  a  subpixel  location  for  each 
match  by  fitting  a  second-order  curve  to  the  corre¬ 
lation  values  surrounding  the  best  match. 

•  After  computing  disparities  for  as  many  pixels  in 
the  left  image  as  possible,  the  algorithm  filters  out 
isolated  matches  by  morphologically  shrinking  the 
regions  of  matches.  It  typically  shrinks  the  regions 
three  times,  grows  the  result  three  times,  and  then 
ANDs  this  result  with  the  original  image  of  results. 
This  process  can  erase  regions  as  large  as  6x6  pixels. 

•  The  algorithm  computes  a  confidence  value  for  each 
disparity  by  differencing  the  heights  of  the  two  high¬ 
est  matching  peaks. 

•  The  technique  estimates  the  precision  of  a  disparity 
value  by  fitting  a  Gaussian  to  the  matching  peak, 
using  its  standard  deviation  as  the  precision  mea¬ 
sure. 

•  The  technique  does  not  attempt  matches  near  the 
edges  of  an  image. 

•  The  second  set  of  results  provided  for  this  evaluation 
often  was  produced  by  matching  at  two  image  reso¬ 
lutions  and  picking  the  highest  resolution  for  which 
there  was  a  valid  match. 
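
The morphological filtering step in the list above (shrink three times, grow three times, AND with the original) can be sketched in a few lines. The sketch assumes a 3x3 neighbourhood and treats pixels outside the image as unmatched; the source does not state which structuring element INRIA used, so both choices are assumptions.

```python
def _neighbourhood(mask, r, c):
    """Values in the 3x3 neighbourhood of (r, c); outside the image counts as unset."""
    rows, cols = len(mask), len(mask[0])
    return [
        0 <= r + dr < rows and 0 <= c + dc < cols and mask[r + dr][c + dc]
        for dr in (-1, 0, 1)
        for dc in (-1, 0, 1)
    ]


def shrink(mask):
    """A pixel stays set only if its whole 3x3 neighbourhood is set."""
    return [[all(_neighbourhood(mask, r, c)) for c in range(len(mask[0]))]
            for r in range(len(mask))]


def grow(mask):
    """A pixel becomes set if any pixel in its 3x3 neighbourhood is set."""
    return [[any(_neighbourhood(mask, r, c)) for c in range(len(mask[0]))]
            for r in range(len(mask))]


def filter_isolated(mask, times=3):
    """Shrink `times` times, grow `times` times, then AND with the original."""
    opened = mask
    for _ in range(times):
        opened = shrink(opened)
    for _ in range(times):
        opened = grow(opened)
    return [[m and o for m, o in zip(mrow, orow)] for mrow, orow in zip(mask, opened)]


mask = [[False] * 20 for _ in range(20)]
for r in range(1, 7):
    for c in range(1, 7):
        mask[r][c] = True    # a 6x6 region of matches
for r in range(10, 17):
    for c in range(10, 17):
        mask[r][c] = True    # a 7x7 region of matches
kept = filter_isolated(mask)
print(sum(map(sum, kept)))   # 49: only the 7x7 region survives
```

With these assumptions the 6x6 region disappears entirely while the 7x7 region is restored in full, which agrees with the statement that the process can erase regions as large as 6x6 pixels.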

2.2  SRI 

This  stereo  system  has  evolved  over  20  years,  begin¬ 
ning  with  early  Martian  Rover  research,  migrating  into 
the  aerial  mapping  domain,  and  now  coming  back  to 
ground-level  analysis.  Its  goal  has  been  to  produce  a 
set  of  high-quality  matches  from  a  wide  range  of  (pos¬ 
sibly  uncalibrated)  imagery.  The  algorithm  is  a  multi¬ 
stage  process  that  uses  one  matching  technique  to  get  a 
few solid matches at high-information points, and then
uses these matches to guide another matching technique,
whose  results  become  anchors  for  yet  another  technique, 
etc,  with  culling  of  mistakes  occurring  at  many  levels. 
At  each  stage,  the  algorithm  acquires  more  supporting 
matches  to  suggest  limits  for  the  disparity  search,  so  the 
algorithm  can  attempt  to  match  points  that  have  less 
“interesting”  information,  using  less  hierarchy.  For  this 
evaluation,  code  was  added  to  produce  “dense”  matches; 
this  included  stages  that  grow  regions  of  matches  around 
previously  matched  points,  and  fill  in  a  regular  grid  of 
matches.  In  total,  the  standard  algorithm  for  this  evalu¬ 
ation  involved  seven  stages  of  matching  and  three  filter¬ 
ing  steps.  The  algorithm  is  implemented  in  C  on  a  Sun; 
speed  has  not  been  a  priority. 

Some  key  aspects  are 

•  The  algorithm  applies  a  version  of  hierarchical 
matching  for  each  point  that  it  analyzes.  At  the 
early  stages  of  the  process,  it  uses  all  available  im¬ 
age  resolutions,  starting  at  the  coarsest,  using  the 
match  found  at  that  level  to  predict  the  location  of 
the  match  at  the  next  finer  level,  then  refining  it, 
and  so  forth.  At  the  final  stage,  where  the  dense 
grid  of  points  is  computed,  the  algorithm  uses  only 
one  or  two  levels. 

•  At  each  image  resolution  (level),  the  algorithm  does 
a  two-dimensional  search  near  the  epipolar  line  and 
then  hill-climbs  around  the  best  match.  The  epipo¬ 
lar  lines  can  be  at  any  angle  in  the  second  image, 
and if there is no camera model (due to bad matches
at early stages, or because the camera isn’t modelable
by a pinhole camera), the algorithm searches
over areas, that is, (dx,dy) boxes, defined by surrounding
matches.

•  The  algorithm  uses  normalized  cross  correlation 
(correcting  for  a  linear  intensity  change  from  im¬ 
age  to  image)  on  11x11  patches  typically.  Later 
stages,  such  as  the  region-growing  step,  can  use 
smaller  patches.  The  final  match  includes  a  sub¬ 
pixel  estimate  of  the  disparity,  computed  by  fitting 
two  parabolas  to  the  nearby  correlation  values. 

•  Each  match  from  one  image  to  another  is  verified 
by  applying  the  same  technique  to  match  back  into 
the  original  image.  If  the  return  match  is  not  within 
a  pixel  of  the  original  point,  the  match  is  discarded 
as  unreliable. 

• The algorithm applies several other “filters” to
weed  out  mistakes,  including  a  threshold  on  interest 
value,  thresholds  on  relative  and  absolute  correla¬ 
tion  values,  tests  for  matches  outside  an  image,  and 
tests  for  unusual  disparity  values  within  a  region  of 
the  image. 

•  Later  stages  of  the  algorithm  use  previously  com¬ 
puted  disparities  in  the  neighborhood  of  a  new  point 
to be matched, to specify the range of disparities
to be considered. The neighborhoods are typically
large, beginning at 1/4th of the image area, and
gradually reducing to 1/64th of the image for this
experiment.  This  technique  assumes  that  the  scene 
is  composed  of  relatively  large  continuous  surfaces. 

•  Since  a  confidence  for  each  match  was  requested  for 
this  experiment,  one  was  supplied  by  computing  the 
ratio  of  the  correlation  value  to  the  autocorrelation 
threshold. 
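
The match-back verification used by both the INRIA and SRI techniques can be illustrated on a single scanline. In this sketch the two matching passes are represented by hypothetical precomputed mappings (column in one image to matched column in the other, or None where no match was found); it shows only the consistency test, not the matching itself.

```python
def cross_check(left_to_right, right_to_left, tolerance=1.0):
    """Validate matches along one scanline.

    left_to_right[x] is the column in the right image matched to left
    column x (or None); right_to_left is the reverse mapping. A match is
    kept only if matching back lands within `tolerance` pixels of x.
    """
    verified = []
    for x, xr in enumerate(left_to_right):
        keep = False
        if xr is not None:
            i = int(round(xr))
            if 0 <= i < len(right_to_left) and right_to_left[i] is not None:
                keep = abs(right_to_left[i] - x) <= tolerance
        verified.append(xr if keep else None)
    return verified


left_to_right = [None, 0, 1, 2, 5, 4]
right_to_left = [1, 2, 3, None, 5, 0]
print(cross_check(left_to_right, right_to_left))
# [None, 0, 1, 2, None, 4]: left column 4 maps back to column 0 and is discarded
```

The test is symmetric in spirit: a point is trusted only when the left-to-right and right-to-left searches agree to within a pixel.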

2.3  TELEOS 

This  technique  has  been  designed  for  efficient  imple¬ 
mentation  and  recently  has  been  geared  toward  active 
vision  in  which  the  basic  stereo  process  matches  100  to 
200 selected points in 1/30th of a second. It is implemented
on a combination of two special boards and
a  Datacube  system.  For  this  evaluation,  however,  the 
hardware  was  not  available  and  so  a  Lisp  version  of  the 
algorithm  (running  on  a  Lisp  Machine)  was  used.  Some 
key  aspects  are 

•  The  algorithm  uses  large  correlation  windows  (rang¬ 
ing  from  24x24  to  96x96  pixels). 

•  The  algorithm  computes  binary  correlation  values 
from  the  Laplacian  of  Gaussian  of  the  original  im¬ 
ages. 

•  The  algorithm  analyzes  the  data  only  at  one  reso¬ 
lution.  It  automatically  selects  the  size  of  the  con¬ 
volution  operators  by  analyzing  the  peak  shapes  of 
matches  at  25  points  in  each  new  image  pair.  It  se¬ 
lects  the  smallest  window  size  that  produces  a  sig¬ 
nificant  difference  between  the  heights  of  the  top 
two  highest  peaks. 

•  At  each  point  in  the  image,  the  algorithm  starts 
with  the  disparity  computed  for  the  neighboring 
pixel  and  tries  to  locate  a  match  at  a  similar  dis¬ 
parity.  A  serpentine  search,  which  analyzes  the  first 
row  from  left  to  right,  the  second  row  from  right 
to  left,  and  so  forth,  is  used  in  order  to  reduce  the 
computation  time  on  the  Lisp  Machine. 

•  The  algorithm  searches  off  the  epipolar  line  for  the 
best  match. 

•  The  algorithm  also  examines  the  effect  of  skewing 
the  patch  being  matched.  It  analyzes  skews  ranging 
from -0.5 pixels per line to +0.5 pixels per line. This
analysis  is  applied  only  at  the  end  of  the  search  when 
the  best  match  has  been  selected. 

•  The  algorithm  estimates  a  subpixel  disparity  value 
by  fitting  a  quadratic  function  to  the  best  peak. 

•  The  algorithm  does  not  try  to  match  points  near 
the  edges  of  an  image. 
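Two of the aspects listed above, the serpentine scan order and the quadratic subpixel fit, can be illustrated as follows. This is a minimal sketch, not the Teleos code: the function names are hypothetical, and the subpixel fit assumes a parabola through the correlation scores at the best integer disparity and its two neighbors, which is one common way to fit a quadratic to a peak.

```python
def serpentine_order(rows, cols):
    """Yield (row, col): even rows left to right, odd rows right to left."""
    for r in range(rows):
        cs = range(cols) if r % 2 == 0 else range(cols - 1, -1, -1)
        for c in cs:
            yield (r, c)


def subpixel_peak(d, c_minus, c_peak, c_plus):
    """Refine integer disparity d with a parabola through three scores.

    c_minus, c_peak, c_plus are correlation scores at d-1, d, d+1.
    Returns d unchanged when the three points have no curvature.
    """
    denom = c_minus - 2.0 * c_peak + c_plus
    if denom == 0:
        return float(d)
    return d + 0.5 * (c_minus - c_plus) / denom
```

With a serpentine scan, the pixel visited just before the current one is always an immediate neighbor, so its disparity is available as the starting guess even at the start of a new row.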


3  Experimental  Procedure 

The  goal  of  this  initial  evaluation  was  to  produce  a 
qualitative  characterization  of  the  capabilities  of  current 
stereo  techniques  applied  to  UGV  tasks.  The  intent,  as 
stated  in  the  instructions  distributed  to  each  participant, 
was to produce a description such as the following:

On  the  44  image  pairs  in  the  database 
our  techniques  correctly  measured  disparities 
to 65% of the points on the ground and 40% of
the points on obstacles, such as trees, bushes,
and  rocks.  The  top  five  problems  for  our  tech¬ 
niques  were  dynamic  range,  holes,  bland  areas, 
repeated  structure,  and  poor  range  resolution. 

We  estimate  that  these  problems  occur  in  the 
UGV  scenarios  with  frequencies  of  ... 

The  idea  was  to  produce  a  characterization  that  would 
focus  future  work  on  key  UGV  problems. 

Our  basic  approach  to  developing  this  type  of  charac¬ 
terization  was  to  apply  the  techniques  to  a  large  dataset, 
visually  display  the  results  in  ways  to  highlight  unusual 
events,  gather  basic  statistics,  and  where  possible,  sum¬ 
marize  our  observations  in  descriptions  that  link  ob¬ 
served  behaviors  to  aspects  of  the  techniques. 

To start the process, SRI compiled a database of 49
image  pairs  from  JPL,  INRIA,  SRI,  CMU,  and  Teleos. 
We  converted  the  images  into  a  standard  format  and 
then  distributed  them  to  the  five  contributing  groups  for 
analysis.  The  groups  were  instructed  to  use  10  pairs  as  a 
training  set,  “freeze”  their  algorithm,  and  then  process 
the  whole  set  of  45  pairs.  Results  and  commentary  from 
four  stereo  systems  were  returned  to  SRI  —  Teleos,  SRI, 
and  two  from  INRIA.  One  of  the  INRIA  sets,  using  edge- 
based  feature  analysis,  could  not  easily  be  compared  with 
the  others.  We  concentrated  our  analysis  on  the  three 
correlation-based  system  results. 

To  assist  in  the  analysis  of  the  results,  SRI  developed 
two  sets  of  routines,  one  to  gather  statistics  and  one  to 
display  the  disparities  in  a  variety  of  ways.  Since  we 
did  not  have  ground  truth  for  the  distributed  imagery, 
we  were  not  able  to  compare  the  computed  disparities 
with  objective  values.  However,  we  were  able  to  gather 
statistics  on  two  of  the  three  types  of  mistakes  that  we 
are  interested  in  by  outlining  special  regions  in  the  im¬ 
agery  and  counting  the  occurrence  of  results  within  these 
regions. 

We  made  a  distinction  between  the  following  three 
types  of  mistakes: 

False  Negatives:  No  disparities  computed  for  points 
that  should  have  results. 

False Positives in Unmatchable Regions: Disparities
reported for points that don’t have matches in the
second image, for example, points occluded in one
image or points out of the field of view of one of the
images.

False  Positives  in  Matchable  Regions:  Incorrect  dis¬ 
parities  reported  for  matchable  points. 

By  interactively  outlining  regions  of  occluded  points, 
regions  of  points  out  of  the  field  of  view  of  the  second 
image,  and  regions  of  points  in  the  sky,  we  were  able 
to  directly  measure  statistics  for  the  first  two  types  of 
mistakes.  In  addition,  we  outlined  regions  corresponding 
to  expected  problems,  such  as  dark  shadows,  foliage,  and 
bland  areas.  In  this  way  we  could  gather  statistics  on  the 
behavior  of  the  algorithms  on  these  special  problems. 
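The counting of the first two types of mistakes can be sketched as below. This is an illustration only, not SRI's statistics routine: the function name and the representation of images as nested lists of booleans are assumptions. The third type of mistake (incorrect disparities at matchable points) cannot be counted this way because it requires reference disparity values.

```python
def tally_mistakes(reported, matchable):
    """Count two mistake types from per-pixel boolean grids.

    reported[r][c]  -- True if the algorithm reported a disparity there
    matchable[r][c] -- True if the pixel lies outside all outlined
                       unmatchable regions (occluded, out of view, sky)
    """
    false_neg = 0
    false_pos_unmatchable = 0
    for rep_row, ok_row in zip(reported, matchable):
        for rep, ok in zip(rep_row, ok_row):
            if ok and not rep:
                false_neg += 1              # matchable, but nothing reported
            elif rep and not ok:
                false_pos_unmatchable += 1  # reported where no match exists
    return {"false_negatives": false_neg,
            "false_positives_unmatchable": false_pos_unmatchable}
```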

As  part  of  the  initial  instructions  we  asked  each  group 
to  extend  its  algorithm  to  produce  an  image  of  annota¬ 
tions  that  summarizes  the  result  of  the  analysis,  pixel 
by  pixel.  At  each  pixel  we  asked  for  a  code  from  the 
following  list: 

0:  no  match  attempted 
1:  matched  fine 

NO MATCH BECAUSE

2: too bland, no information to key on
3: low match value (e.g., correlation value)
4: multiple choices (i.e., repeated structure)
5: back-match inconsistency
6: point out of camera’s field of view
7: point occluded by an object in the scene
8: point too far off the epipolar line
9: point inconsistent with neighbors
10:  other 
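One way to represent these annotation codes in a program is sketched below. The enumeration reads the list as the sequence 0 through 10; the identifier names and the `summarize` helper are hypothetical and are not part of the instructions distributed to the participants.

```python
from collections import Counter
from enum import IntEnum


class MatchCode(IntEnum):
    NO_MATCH_ATTEMPTED = 0
    MATCHED_FINE = 1
    TOO_BLAND = 2
    LOW_MATCH_VALUE = 3
    MULTIPLE_CHOICES = 4
    BACK_MATCH_INCONSISTENCY = 5
    OUT_OF_FIELD_OF_VIEW = 6
    OCCLUDED = 7
    OFF_EPIPOLAR_LINE = 8
    INCONSISTENT_WITH_NEIGHBORS = 9
    OTHER = 10


def summarize(annotation_image):
    """Return (attempted, matched, histogram) for a 2-D grid of codes."""
    hist = Counter(MatchCode(v) for row in annotation_image for v in row)
    total = sum(hist.values())
    attempted = total - hist[MatchCode.NO_MATCH_ATTEMPTED]
    matched = hist[MatchCode.MATCHED_FINE]
    return attempted, matched, hist
```

A higher-level module could consult the histogram for a region, for example to notice that most failures ahead of the vehicle are TOO_BLAND, before deciding what to do next.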

The  reason  for  requesting  these  codes  is  to  encourage 
future  algorithms  to  provide  this  additional  information, 
which  can  be  used  by  the  higher-level  vision  techniques 
to  decide  what  should  be  done  next.  For  example,  if  no 
results  are  reported  for  a  region  directly  ahead  of  the 
vehicle  and  the  region  is  too  bland  and  very  dark,  one 
option  might  be  to  open  the  irises  on  the  cameras  (or 
increase  the  integration  time)  in  order  to  see  into  the 
dark  area. 

INRIA reported codes of 1 and 10; SRI reported all
codes except for 4 and 7; and Teleos reported codes of 0,
1,  2,  and  3.  Therefore,  we  were  able  to  count  the  number 
of  matches  attempted  in  each  region  and  the  number  of 
disparities  reported. 

To  estimate  the  frequency  of  incorrectly  reported  dis¬ 
parities  (the  third  type  of  mistake),  we  either  compared 
them  to  interactively  selected  values  or  located  an  aber¬ 
ration  in  the  local  pattern  of  disparities  when  they  were 
displayed  on  the  screen.  We  experimented  with  a  variety 
of  display  techniques,  including  displaying  the  dispari¬ 
ties  as  color-coded  dots  in  stereo,  heights  above  a  three- 
dimensional  “ground”  plane,  and  disparity-displaced 
vertical lines. We are continuing to look for better ways
to display three-dimensional results, because most current
techniques encourage the human eye to “smooth
over”  differences,  making  the  results  look  better  than 
they  actually  are. 

4  Statistics  Summary 

The  statistics  that  we  refer  to  as  believe-everything- 
they-tell-you  statistics  are  based  on  the  number  of  re¬ 
ported  disparities  in  specified  regions  of  the  test  data. 
These  statistics  do  not  distinguish  between  “correct”  and 
“incorrect”  disparity  values,  just  reported  values  and  un¬ 
reported  values.  They  do,  however,  provide  enough  in¬ 
formation  to  estimate  three  important  quantities,  the 
number  of  false  negatives  (matchable  points  that  were 
not  assigned  a  disparity),  the  number  of  false  positives 
occurring  in  unmatchable  regions,  and  the  number  of 
matchable  pixels  that  were  assigned  disparities. 

To help focus attention on key areas of the test data,
we  interactively  outlined  regions  in  the  left  images  of 
20  of  the  44  image  pairs  (see  Figure  1  and  Figure  2). 
One  of  the  most  important  regions  is  what  we  called 
“matchable-data.”  It  eliminates  several  types  of  points 
that  do  not  have  matches  in  the  right  image,  including 
null  bands  that  do  not  contain  grayscale  data  (but  are 
included  in  the  images  to  fill  them  out  to  a  standard  size, 
such  as  512  by  512  pixels)  and  pixels  that  are  out  of  the 
field  of  view  of  the  right  camera.  In  the  20  images  we 
examined,  the  percentage  of  unmatchable  points  ranged 
from  4.3%  to  46.0%  and  averaged  12.3%. 

The  statistics  were  gathered  by  a  program  that 
counted  the  number  of  disparities  (dx  disparities)  re¬ 
ported  in  the  specified  region  (or  the  whole  image,  if 
that  was  appropriate). 
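A counting routine of this kind might look like the following sketch. It is not the program SRI used: the function name and the use of None for "no disparity reported" are assumptions made for illustration.

```python
def coverage_percent(disparity, region):
    """Percentage of pixels inside `region` that carry a reported disparity.

    disparity[r][c] -- a number, or None where nothing was reported
    region[r][c]    -- True for pixels inside the outlined region
    """
    inside = 0
    reported = 0
    for d_row, m_row in zip(disparity, region):
        for d, m in zip(d_row, m_row):
            if m:
                inside += 1
                if d is not None:
                    reported += 1
    if inside == 0:
        return None  # empty region: percentage undefined
    return 100.0 * reported / inside
```

Passing a region that is True everywhere gives the whole-image figure; passing the “matchable-data” region gives the percentage of matchable pixels assigned disparities.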

Figure  3  shows  the  results  on  all  44  image  pairs.  Note 
that 

•  The  dataset  contains  a  wide  variety  of  imagery; 
some  of  it  is  realistic  (containing  dirt  roads  and 
cross-country  scenes)  and  some  is  designed  to  test 
the  algorithms  along  one  dimension,  such  as  base¬ 
line  and  noise  tolerance.  Some  of  the  imagery  is 
even  trick  imagery  (the  shoe  images  from  CMU). 

•  The  numbers  in  parentheses  after  each  group’s  name 
(along  the  top  of  the  table)  indicate  the  number  of 
test  pairs  in  the  dataset  from  that  group. 

•  The  INRIA-2  results  are  in  parentheses  because  dif¬ 
ferent  parameter  settings  were  used  for  different  im¬ 
age  pairs.  However,  the  usual  change  was  for  the 
technique  to  match  at  two  spatial  resolutions  in¬ 
stead  of  just  one,  and  then  combine  the  results.  If  a 
second  set  of  parameters  was  not  tried  for  a  pair,  we 
left the entry blank and used the INRIA-1 results
in our computation of INRIA-2’s average.

Figure 1: Interactively outlined, special-interest regions for the J1 image pair from INRIA.


•  If  we  did  not  outline  a  “matchable-data”  region  for  a 
pair,  we  used  the  full-image  statistics  in  our  compu¬ 
tations.  This  reduces  the  effectiveness  totals  some¬ 
what  (possibly  by  as  much  as  7%). 

Given  the  diversity  of  the  data,  we  were  pleased  with  the 
completeness  of  the  results. 

In  order  to  examine  the  behavior  of  the  techniques  on 
typical  UGV  imagery,  we  selected  the  eight  images  from 
the  dataset  that  were  the  most  appropriate  for  UGV 
tasks  and  collected  statistics  on  that  subset.  Figure  4 
shows  the  results  on  these  data.  The  INRIA-2,  SRI-2, 
and  Teleos-1  techniques  performed  well,  computing  dis¬ 
parities  for  86  or  87%  of  the  matchable  points.  Note, 
however,  that  these  images  did  not  contain  difficult  ob¬ 
stacles,  such  as  holes,  ditches,  and  small  rocks — the  ob¬ 
stacles  were  large  rocks,  bushes,  and  trees. 

Figure  5  shows  the  results  on  the  17  large  obstacles 
in  the  dataset.  The  techniques  did  an  excellent  job  of 
detecting  these  objects,  which  stick  up  above  the  ground 
—  they  only  had  a  little  trouble  in  shadowed  regions  on 
them. 

With  respect  to  shadows,  the  techniques  had  a  signif¬ 
icantly  harder  time  computing  disparities  for  points  in 
shadowed  regions  than  in  sun-lit  regions.  Figure  6  shows 
the  results  for  points  in  shadows. 

The  techniques  also  had  trouble  with  bland  regions, 
as  expected.  Figure  7  shows  the  results  on  these  areas. 
The  techniques  typically  computed  results  around  the 
edges  of  the  regions  —  the  larger  the  correlation  win¬ 
dows,  the  more  points  were  computed,  because  correla¬ 
tion  windows  naturally  extend  matches  into  the  interior 
of  bland  regions  by  about  half  their  diameter. 

There  are  several  potentially  important  problem  areas 
that  were  not  covered  in  this  initial  dataset,  including 
holes,  sand,  small-  to  medium-sized  rocks  and  bushes, 
reflective  surfaces  (water  or  windows),  and  moving  ob¬ 
jects.  One  of  our  goals  for  the  second  phase  of  this  eval¬ 
uation  is  to  include  examples  of  these  problems. 


5  Qualitative  Analysis 

We  were  surprised  by  the  completeness  of  everyone’s 
results.  Even  though  the  dataset  contained  a  wide  range 
of  imagery,  including  some  sequences  designed  to  stretch 
the  analysis  along  specific  dimensions,  such  as  noise  tol¬ 
erance  and  disparity  range,  the  techniques  computed  dis¬ 
parities  for  64%  of  the  matchable  points.  On  the  eight 
image  pairs  that  we  selected  as  the  most  appropriate  for 
UGV  applications,  the  techniques  computed  disparities 
for  as  much  as  87%  of  the  points.  Although  the  miss¬ 
ing  points  (and  mistakes  in  the  reported  matches)  could 
cause  problems  for  vehicle  navigation,  this  level  of  com¬ 
pleteness  is  an  indication  that  there  is  a  solid  basis  for 
building  a  passive  ranging  system  for  an  outdoor  vehicle. 

The  number  of  gross  errors  varied  considerably  from 
image  pair  to  image  pair.  For  most  “realistic”  images  the 
number  was  relatively  small,  ranging  from  a  few  “spike” 
errors  to  small  regions  of  mistakes.  We  estimate  that  for 
these images there were between 1 and 5% gross errors in
the  results.  In  many  cases,  the  worst  errors  cluster  into 
areas  that  are  “breaking  up”  for  one  reason  or  another 
(usually  poor  information  plus  a  poor  “guess”  for  the 
disparity  range);  if  we  can  “fix”  these  areas,  then  the 
remaining  “spike”  errors  should  be  amenable  to  culling 
techniques.  In  any  case,  most  of  these  errors  would  have 
to  be  eliminated  in  order  for  the  data  to  be  used  directly 
for  planning  navigable  routes. 

The  techniques  made  different  mistakes,  most  of  which 
could  be  explained  by  their  correlation  patch  size,  search 
technique,  or  match  verification  technique.  However, 
since  they  made  different  mistakes,  there  is  a  possibil¬ 
ity  of  combining  them  in  a  way  to  check  each  other  and 
fill  in  missing  data. 
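As a purely hypothetical illustration of such a combination, the per-pixel rule below lets two techniques check each other and fill in each other's gaps. The rule, the function name, and the agreement tolerance `tol` are assumptions; no such combination was implemented or tested in this evaluation.

```python
def combine(d_a, d_b, tol=1.0):
    """Merge two disparity estimates for one pixel.

    Returns (value, status). `tol` is a hypothetical agreement tolerance
    in pixels; None means a technique reported no disparity.
    """
    if d_a is None and d_b is None:
        return None, "missing"
    if d_a is None:
        return d_b, "filled"
    if d_b is None:
        return d_a, "filled"
    if abs(d_a - d_b) <= tol:
        return 0.5 * (d_a + d_b), "agree"
    return None, "conflict"
```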

Figure 2: Special-interest regions for the STANFORD image pair from SRI.

Figure 3: Percentage of “matchable” pixels assigned disparities on all 44 image pairs.

All the techniques could be improved significantly with
a relatively small amount of effort. This was the first test
of this type, requiring the analysis of a large dataset, and
it uncovered some weaknesses that can be corrected. One
area to be considered is the development of preanalysis
techniques to automatically set key parameters, such as
patch size and search areas (as Teleos does). Another
place for improvement is in the filtering of the results
to eliminate matches that differ significantly from their
neighbors (as SRI and INRIA do).
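A simple form of neighbor-based filtering is sketched below. It is not the filter used by SRI or INRIA, whose details are not given here; the comparison against the median of the 8-connected neighbors and the `max_diff` threshold are assumptions chosen for illustration.

```python
from statistics import median


def cull_outliers(disparity, max_diff=2.0):
    """Return a copy with values far from their 8-neighbor median set to None."""
    rows, cols = len(disparity), len(disparity[0])
    out = [row[:] for row in disparity]
    for r in range(rows):
        for c in range(cols):
            d = disparity[r][c]
            if d is None:
                continue
            # Collect reported disparities among the (up to) 8 neighbors.
            neighbors = [
                disparity[rr][cc]
                for rr in range(max(0, r - 1), min(rows, r + 2))
                for cc in range(max(0, c - 1), min(cols, c + 2))
                if (rr, cc) != (r, c) and disparity[rr][cc] is not None
            ]
            if neighbors and abs(d - median(neighbors)) > max_diff:
                out[r][c] = None
    return out
```

A filter like this removes isolated “spike” errors but would also erase legitimate narrow obstacles, so the threshold would need to be chosen with the expected scene structure in mind.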

There were a few surprises, such as Teleos’s successful
solution  to  one  set  of  image  pairs  from  CMU  that  in¬ 
cludes  a  carpet  with  a  repetitive  pattern  on  it.  Teleos’s 
large  patches  were  able  to  detect  large  regions  of  subtle 
differences,  which  led  to  the  correct  disparities. 

5.1  Technique-Oriented  Summaries 

No  one  of  these  algorithms  has  completely  solved  the 
stereo  problem,  although  all  can  produce  basically  usable 
results  on  most  reasonable  imagery.  Each  has  strengths 
and  weaknesses — and  very  often  an  algorithm’s  strength 
on  one  dataset  is  its  weakness  on  another! 

INRIA’s algorithms assume that the images are in
epipolar  alignment.  This  makes  their  searches  more  ef¬ 
ficient,  and  keeps  matches  from  wandering  off  of  the 
epipolar  lines  (for  instance,  “climbing”  the  edges  of  tree 
trunks).  However,  when  presented  with  nonepipolar  im¬ 
agery,  INRIA-1  fell  apart;  INRIA-2  did  better,  but  had  a 
persistent  problem,  producing  rough  disparity  contours, 
which  are  apparently  due  to  the  way  the  pyramid  was 
handled.  The  low-resolution  results  were  simply  zoomed- 
out  using  pixel  replication.  This  epipolar  line  constraint 
also  limits  the  usefulness  of  INRIA’s  algorithms  on  im¬ 
agery  from  nonpinhole  cameras. 

SRI’s  algorithm  mostly  disregards  the  epipolar  con¬ 
straint.  Consequently,  it  had  no  particular  problems 
handling  nonepipolar  imagery.  However,  it  failed  to 
match  many  of  the  very  smooth  tree  edges  in  the  EPI 
sequence,  probably  because  its  matches  “slid”  up  the 
linear  sides  of  the  trees. 


INRIA’s  algorithms  search  the  entire  width  of  the 
epipolar  line.  This  helped  them  to  do  well  on  some 
datasets,  but  when  the  ground  texture  was  ambiguous, 
their  technique  tended  to  return  no  match  because  of 
multiple  choices. 

SRI’s  algorithm  depends  on  early  matches  to  “set  the 
context”, so that later searches for matches can be confined
to the disparities in that neighborhood. When
there  is  enough  global  texture  for  the  initial  matches  to 
give  a  good  sampling  of  the  disparities,  this  works  well, 
enabling  SRI-2  to  produce  ground  plane  matches  where 
the  others  couldn’t.  However,  when  lack  of  foreground 
detail  keeps  SRI-2  from  having  the  right  initial  matches, 
it  fails  to  match,  or  finds  random  mismatches. 

Teleos’s  algorithm  uses  very  large  windows  dynami¬ 
cally  skewed  to  accommodate  tilted  planes.  This  causes 
it  to  do  well  on  some  ground  planes  where  it  was  able  to 
disambiguate  the  pattern  through  minor  variations,  but 
not  on  others  where  the  ground  plane  tilt  was  out  of  the 
allowed  range  of  skewing.  Of  course,  these  large  windows 
also cause it to have problems with any scene containing
depth discontinuities: it either finds no match, or tries
to blend the foreground object into the background objects,
or widens the foreground object out onto the background.
In addition, Teleos-1’s scanning heuristic creates
some rather peculiar artifacts, extending objects
in  opposite  directions  on  alternate  scan  lines.  However, 
its  ability  to  “see”  into  low-contrast  situations  is  very 
good. 

The  Teleos  system,  with  its  large  correlation  windows, 
also  produces  smaller  range  images,  because  it  limits 
matching  to  areas  where  the  full  correlation  patch  is 
within  the  image.  In  an  active  vision  system,  the  sensors 
could  be  reoriented  to  center  objects  of  interest  that  may 
initially appear on the boundary of an image.

Both INRIA’s and SRI’s algorithms use fairly small
windows.  This  removes  much  of  the  need  for  win¬ 
dow  skewing  and  warping,  although  on  extremely  tipped 
planes, warping would be helpful. INRIA-1, INRIA-2,
and SRI-2 all do better on tilted planes if the information
is slightly “fuzzy”. These algorithms don’t do nearly
as  well  in  the  presence  of  man-made  ambiguous  patterns. 

SRI's  algorithm  tends  to  leave  more  holes  in  the 
data — low-information  places  that  it  refuses  to  try  to 
match,  ambiguous  places  where  it  can’t  backmatch  suc¬ 
cessfully,  or  error  matches  that  it  has  detected  and  re¬ 
moved.  This  gives  the  data  a  “lacey”  appearance,  and 
it  should  probably  be  followed  by  an  interpolation  step, 
to  fill  in  these  problem  areas.  (The  SRI  technique  is 
capable of interpolation, but it was not used in this evaluation.)
SRI-2 often leaves a nice band of no-matches
outlining  depth  discontinuities,  where  one  doesn’t  really 
want  separate  objects  “smoothed”  together.  SRI-2  also 
often  refuses  to  match  areas  like  the  sky,  which  techni¬ 
cally  don’t  have  a  match. 

None  of  the  algorithms  currently  distinguishes  be¬ 
tween  good  image  data  and  the  “null  data”  areas  caused 
by  image  digitization,  reprojection,  and  so  forth.  This 
can  lead  to  rather  peculiar  mismatches  around  these  ar¬ 
eas  of  null  data.  All  of  the  algorithms  should  add  the 
ability  to  accept  a  mask  telling  what  parts  of  the  image 
not  to  try  to  match.  Better  yet  would  be  a  preprocessing 
step  to  construct  these  masks  automatically. 
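As a sketch of what such a preprocessing step might look like, the function below flags pixels that lie in a row or column consisting entirely of a single null value, such as the padding bands added to fill an image out to a standard size. The function name and the assumption that null data is marked by one constant value (`null_value`) are hypothetical; null areas produced by reprojection would generally need a more elaborate test.

```python
def null_band_mask(image, null_value=0):
    """Return a grid that is True where a pixel lies in an all-null row or column."""
    rows, cols = len(image), len(image[0])
    null_rows = [all(v == null_value for v in row) for row in image]
    null_cols = [all(image[r][c] == null_value for r in range(rows))
                 for c in range(cols)]
    return [[null_rows[r] or null_cols[c] for c in range(cols)]
            for r in range(rows)]
```

Note that an isolated dark pixel inside the scene is not masked; only complete bands are, which keeps legitimately dark image data available for matching.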

It  was  interesting  to  see  how  much  better  all  of  the 
algorithms  did  on  the  imagery  taken  by  JPL  than  on 
the  SRI  imagery.  A  major  factor  is  the  unusual  aspect 
ratio  of  the  SRI  imagery  caused  by  digitizing  individual 
fields,  since  the  vehicle  was  moving  fast  enough  to  show  a 
significant  difference  between  fields.  JPL’s  imagery  was 
taken  while  the  vehicle  was  standing  still.  Other  differ¬ 
ences  that  may  have  contributed  include  image  contrast, 
epipolar  geometry,  and  look  angle  (SRI’s  cameras  were 
looking  far  forward,  whereas  JPL’s  were  looking  down 
a  bit  more).  We  note  that  the  exchange  of  imagery  can 
help  in  algorithm  development  by  avoiding  inadvertently 
“tuning”  one’s  algorithm  to  one’s  particular  style  of  im¬ 
agery. 

5.2  Open  Research  Problems 

After  examining  the  results  from  this  dataset,  we 
have  selected  the  following  topics  for  future  research  in 
the  area  of  low-level  passive  range  sensing: 

1.  Filtering  out  gross  errors  caused  by  erroneous 
matches. 

2.  Handling  the  wide  dynamic  range  in  intensities  com¬ 
mon  in  outdoor  imagery,  from  dark  shadowed  re¬ 
gions  up  to  specularities  off  shiny  surfaces. 

3.  Handling  the  large  range  in  adjacent  disparities  aris¬ 
ing  from  narrow  foreground  obstacles. 


4.  Adjusting  algorithm  parameters  automatically  to 
properly  handle  different  image  regions,  such  as 
bland  areas  and  texture  regions. 

5.  Detecting  multiple  matches  and  selecting  the  cor¬ 
rect  one,  possibly  by  analyzing  multiple  images. 

6.  Providing  validation  and  confidence  estimation 
mechanisms. 

7.  Detecting  occlusion  edges  and  reporting  accurate 
depths  on  both  sides  of  them. 

8.  Detecting  and  characterizing  small-  to  medium¬ 
sized  obstacles,  such  as  rocks  and  bushes. 

9. Detecting “negative” obstacles, such as holes and
ditches. 

Although  the  JISCT  dataset  did  not  include  examples  of 
the last two areas, they are clearly important for cross-country
navigation.

6  Conclusion 

As  a  result  of  this  phase  of  our  stereo  evaluation,  we 
can  make  a  few  general  observations  and  develop  a  few 
ideas  for  the  project’s  next  phase. 

First,  the  time  is  right  for  evaluation.  If  promising 
computer  vision  techniques,  such  as  stereo  analysis  and 
road  following,  are  to  make  the  transition  from  the  re¬ 
search  laboratory  to  practical  systems,  their  characteris¬ 
tics  will  have  to  be  well  enough  documented  that  system 
engineers  can  understand  them  and  predict  their  behav¬ 
ior.  We  view  this  evaluation  as  the  first  tentative  step 
toward  developing  this  type  of  characterization. 

Second,  evaluations  of  this  type  require  a  significant 
effort.  To  give  an  idea  of  what  is  involved  in  such  an 
evaluation, SRI did the following: gathered imagery from
five  groups,  converted  it  into  a  standard  format,  designed 
the  experimental  procedure,  distributed  the  imagery  to 
the  participants,  collected  the  results,  converted  them 
into  a  uniform  format  (correcting  for  a  few  mistakes  in 
the  original  specifications),  developed  visualization  rou¬ 
tines,  used  these  routines  to  interactively  examine  all  the 
results,  developed  statistics  gathering  routines,  applied 
these  routines  to  the  results,  wrote  the  report,  and  finally 
distributed  the  report  and  copies  of  everyone’s  results. 

Third,  ideally  an  evaluation  of  this  type  should  be  per¬ 
formed  periodically  to  provide  estimates  of  the  relative 
improvements  of  the  techniques. 

6.1  Critique  of  the  JISCT  Evaluation 

Some things that were done correctly:

• We developed a cooperative attitude among the participants.
This was the first time our community
had tried establishing an ongoing evaluation process
and we knew that we’d make mistakes. We also
knew  that  the  participants  have  their  egos  involved 
in  their  systems,  and  we  wanted  to  emphasize  the 
constructive  aspects  of  comparing  techniques. 

•  The  experimental  procedure  was  almost  right.  The 
idea  of  distributing  a  large  number  of  stereo  pairs, 
using  some  for  a  training  set,  freezing  the  “official” 
algorithm,  and  then  applying  it  to  45  test  pairs  is 
correct.  The  large  number  of  pairs  virtually  forced 
the  groups  to  implement  an  automatic  technique, 
which  they  could  apply  to  any  image  pair.  As  a 
result,  there  are  now  four  systems  around  the  world 
that can be easily tested on new imagery.

•  The  idea  of  asking  for  precision  estimates,  confi¬ 
dence  estimates,  and  annotations  was  correct.  Al¬ 
though  no  group  produced  them  all,  future  systems 
will  be  expected  to  because  this  information  is  so 
important  for  higher-level  users  of  the  results. 

•  The  basic  idea  of  sharing  data  from  several  groups 
was  good  because  applying  the  algorithms  to  this 
diverse  set  of  images  brought  to  light  several  im¬ 
plicit  and  explicit  assumptions  and  parameters  in 
the  algorithms. 

•  Since  any  evaluation  of  this  type  can  only  include  a 
limited set of imagery that attempts to cover all possible
dimensions, the idea of including several small
controlled  experiments  worked  well.  For  example, 
the  set  of  images  from  Teleos  explored  the  ability  of 
the  algorithms  to  handle  increasing  noise;  the  SRI 
EPI  sequence  tested  a  range  of  baselines. 

Some things that should be changed:

•  The  lack  of  ground  truth  significantly  limited  the 
types  of  automatic  “objective”  evaluations  possible. 
Ground  truth  is  expensive,  but  there  is  no  substitute 
for  assessing  quantitative  issues. 

•  For  this  initial  phase  we  built  our  dataset  primar¬ 
ily  from  existing  data.  In  the  future  we  need  to 
gather  data  that  is  more  realistic  and  appropriate 
to  the  task.  In  particular,  for  UGV  tasks,  the  data 
should  be  from  the  demonstration  sites  and  include 
examples  of  the  common  “obstacles,”  such  as  ruts, 
bushes,  rocks,  ditches,  and  water.  Future  datasets 
should  also  include  sequences  of  images  and  trinoc- 
ular  data,  not  just  individual  pairs. 

•  The  whole  process  took  too  long  (almost  a  year). 
Techniques  can  change  faster  than  that.  To  be  rel¬ 
evant,  the  results  should  be  returned  within  a  few 
months.  This  turnaround  time  is  more  possible  now 
that  we  have  been  through  the  process  once  and 
have  developed  routines  for  analyzing  the  data. 


•  More  auxiliary  data  (e.g.,  calibration  information) 
should  be  supplied  with  the  dataset.  Some  tech¬ 
niques  rely  on  this  information  to  reduce  search  and 
set  key  parameters.  Also,  it  will  generally  be  avail¬ 
able  in  most  applications. 

6.2  Plans  for  the  Next  Evaluation  Phase 

We  plan  to  include  three  types  of  imagery  in  the  next 
dataset: demonstration-related pairs and sequences, a
few  image-intensified  pairs,  and  some  synthetic  pairs 
that  are  less  artifactual  than  previous  ones.  One  of  our 
goals for this phase is to explore more rugged off-road
scenes, including deep ruts, tall grass, and ditches, so
we  are  including  several  examples  of  each  in  the  new 
dataset.  The  image-intensified  data  will  provide  our  first 
look  at  applying  our  techniques  to  night-vision-type  im¬ 
agery.  The  synthetic  data  is  formed  from  real  pairs  by 
modifying a set of computed disparities, and then forming
a new right image based on these disparities. This
data, although still not completely realistic, is significantly
better than previous versions and provides complete
ground truth.
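The way the new right image is formed is not detailed here, so the following is only a sketch of one plausible approach: forward-warping each scanline of the left image by the modified disparities. The function name, the integer disparities, the sign convention (x_right = x_left - d), and the rule that the larger disparity wins when two pixels collide are all assumptions.

```python
def synthesize_right_row(left_row, disparities):
    """Forward-warp one scanline of the left image into a right-image row.

    Assumes integer disparities with x_right = x_left - d. When two left
    pixels land on the same right pixel, the larger disparity (nearer
    surface) wins. Unfilled pixels are left as None.
    """
    width = len(left_row)
    right = [None] * width
    best = [None] * width  # disparity of the pixel currently stored
    for x, (value, d) in enumerate(zip(left_row, disparities)):
        xr = x - d
        if 0 <= xr < width and (best[xr] is None or d > best[xr]):
            right[xr] = value
            best[xr] = d
    return right
```

Because the disparities used to build the right image are known exactly, they serve as ground truth for every pixel that was filled; the unfilled pixels correspond to regions with no match in the left image.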

We  plan  to  distribute  the  dataset  to  10  or  15  research 
groups  for  analysis.  After  debugging  the  process,  we  are 
in  a  position  to  open  up  the  evaluation  to  include  a  wider 
group of participants.


Abstract


The  Department  of  Defense  Unmanned  Ground  Vehicle 
Program is a multi-year effort involving the Services,
OSD and DARPA. The objective of the program is to
first field a teleoperated unmanned ground vehicle
system,  followed  by  preplanned  product  improvements 
leading  to  self-navigating  systems  performing 
reconnaissance,  surveillance,  target  acquisition  and 
designation  missions.  This  paper  describes  the  recent 
DARPA  procurement  for  research  in  reconnaissance, 
surveillance,  and  target  acquisition  (RSTA)  for  the 
Unmanned  Ground  Vehicle  (UGV).  The  research  topics 
of  interest  are  described,  followed  by  a  description  of 
the  UGV  program  structure. 

1.0  Background 

In recent years, Congress “has been concerned about the
direction and composition of the many diverse robotics
projects undertaken by the armed services and defense
agencies.”¹ Congress therefore directed the establishment
of a DoD robotics master plan in 1989. The diversity of
robotics projects that were described in the 1989 plan led to
a Congressional request in 1990 to consolidate all of the
ground  robotics  vehicle  projects  “under  OSD  policy  and 
program  direction.” 


In  response  to  this  Congressional  mandate,  previously 
separate  ground  vehicle  related  robotics  efforts  were 
consolidated  in  a  single  program  element,  under  the 
direction  of  the  Tactical  Warfare  Programs  (TWP)  office  of 
the  Director  for  Defense  Research  and  Engineering 
(DDR&E).  Since  1990,  the  TWP  office  has  been 
responsible  for  reporting  on  the  activities  of  this  program 
element  to  Congress,  providing  direction,  allocating 
appropriated  funds  to  projects,  and  carefully  monitoring 
the  progress  of  all  DoD  Unmanned  Ground  Vehicle 
activities.  The  Services  and  the  Defense  Advanced 
Research Projects Agency (DARPA) are responsible for the
conduct and daily management of the projects.

1.1  Introduction 

This paper describes a recent DARPA solicitation for
unclassified research projects in Autonomous Systems
Technology, focusing on Image Understanding technology
needed by Unmanned Air Vehicles, by the Surrogate
Semi-autonomous Vehicle (SSV), and later by the Tactical
Unmanned Ground Vehicle (TUGV) reconnaissance,
surveillance, and target acquisition (RSTA) function. Both the
SSV and the TUGV are currently under development. The
RSTA requirements include the detection, tracking,
model-based identification, and location of military ground
vehicles at ranges from 200 m to 3 km. The RSTA system is
designed to use a wide field of view (wfov) forward looking
infrared (FLIR) sensor, approximately 15 degrees x 8
degrees, for target detection. It will then use a narrow field of
view (nfov) Laser Radar (LADAR) to gather three-dimensional
information from detected target locations for target
identification. Target tracking functions will be performed
using either the wfov FLIR or video imagery.


1. Report 101-132 from the Senate Committee on Appropriations
on the Department of Defense Appropriations Bill, 1990.
Quotations in this section are from Report 101-132.

Research topics of interest specifically solicited in the
Broad Agency Announcement included (1) natural scene
understanding, (2) model-based object recognition in Laser
Radar (LADAR) imagery, (3) motion compensation
(digital image stabilization), (4) target detection in Forward
Looking Infrared (FLIR) imagery with natural scene clutter
rejection, and (5) multi-sensor fusion using FLIR and
LADAR identification and accurate target location.

We sought research projects whose results could be
integrated and demonstrated in SSV RSTA operational
scenarios in 1995. Emphasis was on IU techniques embedded in
an end-to-end system, rather than on isolated techniques.
The solicitation required that the proposed IU research be
more than a concept, resulting in programs that could be
directly transferred to a Unix-based workstation such as the
Sun or the Silicon Graphics for testing, evaluation, and
SSV RSTA system integration. RSTA algorithms
dependent on special purpose hardware were specifically
excluded from this solicitation.

Two possible avenues for technology transfer to the SSV
RSTA system were identified. Model-based object
identification technology could be optimally integrated into the
SSV RSTA target identification system, a modular
model-based object identification system based on
C/Unix/X-Windows. Where possible, contractors performing
research in other technology areas would use C++, CLOS,
and the objects and methods defined in the Image
Understanding Environment (IUE) described by Mundy [1]. This
target platform/environment is meant to assure technology
transfer of the final research results, as well as easy
interchange for testing, upgrading, and evaluation within the
university, government laboratories, and the contractor
community. The SSV program has preliminary plans to do
all of its system development in C and C++ and will make
available to the community at large critical interface
designs as the project progresses.

Close coordination of contractors with the existing
SSV/TUGV projects was required, as well as a series of
short-term assignments (several weeks) of SSV/TUGV team
members to their research sites to aid in technology
transfer. Bidders were additionally encouraged to plan for
short-term assignments of their technical personnel to the SSV
contractor’s site in Denver, CO, to support modification,
integration, and evaluations of their systems on the SSV.
Contractors were expected to attend and to present results
at two of the four annually scheduled UGV Demo II
Workshop meetings and to prepare an annual paper describing
progress for the DARPA Image Understanding Workshop.

Projected total funding for the IU research is approximately
$7,500,000 over three years. Approximately ten three-year
contracts will result from this solicitation.
Industry/research center teaming was encouraged to support the use
and extension of existing sensor and scene simulation
technology into novel IU research approaches to SSV RSTA
problems.

1.2  RSTA  Research  Areas 

Analysis of the RSTA subsystem requirements was
conducted to identify technological shortcomings in the
existing program. The following potential research areas were
identified for the IU tech-base effort:

1.2.1  Natural  scene  understanding. 

The results from this general research area will supply
guided search process information to the sensor needed to
govern efficient sensor search for threats. In the absence of
high resolution elevation and cultural map data, the RSTA
system itself must use sensor data to determine which areas
of the scene are likely to contain enemy threats. This
capability may also support long range navigation planning -
the current SSV system uses only map information to
perform route planning. This research area includes the
following sub-components:

Scene component identification (sky, trees, fields,
roads, occlusion ridges) in FLIR and RGB imagery that has
been acquired from forward-looking, ground-based sensors.

The use of known and monitored host vehicle motion,
feature extraction, and image sequence processing, to
create 3-D scene maps of SSV operations areas. For example,
host motion may be useful in identifying terrain occlusion
boundaries that conceal targets.

The  use  of  active  vision.  Focus-of-attention  processing 
using  FLIR  or  video  imagery  of  natural  scenes  for  target 
cueing. 

1.2.2  Model-based  object  recognition. 

This research topic supports the SSV target recognition
requirements of target vehicle recognition. The SSV
model-based target recognition approach has a modular design
which allows the incorporation of research approaches (for
indexing, feature matching, belief propagation, etc.)
developed in the IU community. IU research could focus on
complete recognition approaches or sub-component
technology. Research topics in this area include:

Recognition of tactical targets in ladar imagery. A
particular challenge is object recognition in sub-optimal
imagery: lower resolution, noisy, or attenuated by
atmospheric conditions.

The  recognition  of  occluded  targets  in  ladar  imagery. 
This  includes  the  model-based  recognition  of  whole  targets 
via  the  recognition  of  distinctive  unoccluded  sub-parts. 

Ladar sensor, target and scene modeling. Algorithms
for simulating target and background characteristics,
geometries, and probabilities are key elements of the SSV
model-based target identification system.

1.2.3 Motion compensation: digital image
stabilization

This research is for techniques that use processing to
maintain a stable view of the world from a moving platform.
Such approaches must exceed the current technology base,
which relies on large frame-to-frame scene overlap.
Additionally, approaches in this area must utilize general purpose
scalable parallel computation, as opposed to dedicated
hardware for this single function.

1.2.4 Target detection in FLIR imagery
with natural scene clutter rejection

Open research issues in this field include the use of object,
terrain, sensor and atmospheric transmission models to
predict target/background contrast at the operational site, and
the use of such data to select optimal target detection
module parameters.

1.2.5 Multi-sensor fusion using FLIR and
LADAR identification.

Research here investigates the cooperative combination of
approaches of image segmentation from infrared images
and depth maps from LADAR to facilitate object
recognition.

1.2.6  LADAR/FLIR  Multisensor  Fusion. 

LADAR/FLIR multisensor fusion research deals with
model-based target identification processes that use
multiple sensor information to increase algorithm performance.
Emphasis will be placed on the combination of information
from FLIR and LADAR imagery to increase classification
performance.


1.3  UGV/TUGV  Program  Structure 

The technology associated with unmanned systems¹ is
maturing faster than the concepts of employment are being
developed. The structure of the DoD Unmanned Ground
Vehicle Program reflects this reality, in a coordinated
evaluation and development program with the objective of
first fielding an unmanned system by 1998.

The  TUGV  is  the  principal  effort  of  the  current  UGV 
advanced  development  program.  The  TUGV  program  is 
being  planned  and  managed  with  an  awareness  that  it 
represents  an  initial  step  in  the  evolution  and  fielding  of 
UGVs  for  combat  applications  and  that  its  success  or 
failure  may  have  far-reaching  consequences. 

Three principal foci encompass the DoD UGV program.
First,  several  Surrogate  Teleoperated  Vehicles  (STVs)  will 
be  developed  and  used  to  support  Early  User  Test  and 
Evaluation  (EUTE)  of  UGV  concepts.  Second,  a  full  scale 
Engineering  and  Manufacturing  Development  (EMD) 
program  will  develop  the  first  fielded  Teleoperated 
Unmanned  Ground  Vehicle.  Third,  the  Unmanned  Ground 
Vehicles  Technology  Enhancement  and  Exploitation 
(UGVTEE)  program  will  focus  on  maturing  those  robotics 
technologies  of  particular  interest  to  UGV  systems.  The 
UGVTEE program is a demonstration-directed effort,
including Demo-I, whose principal aims are to mature and
transition near-term technology, and Demo-II, whose goal
is to develop semi-autonomous navigation technology.

2.0  The  Surrogate  Teleoperated  Vehicle 
Program 

The  STV  program,  managed  by  the  Joint  Unmanned 
Ground  Vehicles  Office  (JUGVO)  at  the  U.S.  Army  Missile 
Command  (MICOM)  in  Huntsville,  AL  will  develop  14 
Surrogate  Teleoperated  Vehicles  (STVs).  These  will  be 
used  to  conduct  Early  User  Test  and  Evaluation  (EUTE),  by 
placing six STVs in a USMC infantry brigade, and six in a
U.S.  Army  brigade  for  a  period  of  one  year,  starting  in 
1992. 


1. The term UGV is used in a general sense to include a range of
applications. The term Tactical Unmanned Ground Vehicle
(TUGV) will refer to a specific project, which is developing one
class of UGVs.

FIGURE 1. The Surrogate Teleoperated Vehicle


The  STV  is  a  six-wheel-drive,  fully  amphibious  platform. 
It  contains  all  automotive  and  navigational  components, 
including  sensors  and  control  for  teleoperated  driving 
under  day,  night,  and  adverse  environmental  conditions. 
The  platform  is  powered  by  a  hybrid  25  horsepower  (HP) 
diesel  engine  and  a  3  hp  electric  motor,  for  silent 
locomotion  when  required.  The  STV  will  be  able  to 
traverse  roads  at  35  miles  per  hour  (mph)  and  travel  off¬ 
road  at  25  mph.  Its  remote  driving  speeds  will  depend:  1) 
on  the  skill  of  the  operator  in  teleoperated  mode  or,  2)  on 
the  sophistication  of  the  software  as  semi-autonomous 
capabilities  are  added.  By  placing  these  teleoperated 
vehicles  into  the  hands  of  soldiers  and  marines,  the  JUGVO 
will acquire direct access to employment concepts created
by  users  in  tactical  environments. 

3.0 TUGV Engineering and
Manufacturing Development

In the Engineering and Manufacturing Development phase,
the selected system contractor(s) will be responsible for
fabricating a production-ready TUGV. The Government will conduct
Developmental Test and Evaluation and Initial Operational
Test and Evaluation of the contractor-provided TUGV
prototypes. These tests will determine readiness for production
of a first generation TUGV. Milestone III is planned for the
end of 1997.

4.0 UGV Technology Enhancement and
Exploitation 

UGVTEE consists of technology base efforts supporting
current and future UGV projects, and involves participation
from academe, industry, DoD, DoE and NASA
laboratories. The UGVTEE program is directed to exploit
robotics advances and mature those technologies that are
critical to the robotization of UGV systems. The near-term
focus of this program is on providing the mission
capabilities and technological enhancement required for
the TUGV. This part of the program (Demo-I) will
conclude in FY1992. The long term focus is on image
understanding, planning and control technologies that will
enhance operational capability and survivability. This part
of the program (Demo-II) was initiated in FY1991,
with the main focus of developing autonomous navigation
under battlefield conditions. Technology development will
include support for RSTA functions while the UGV is
moving, distributed artificial intelligence supporting
automated communication with other vehicles, and
workload partitioning between vehicles to accomplish mission
objectives.

5.0 UGVTEE Demo-I

The principal purpose of Demo-I is to mature critical
system  component  technologies  for  first  generation 
teleoperated  UGVs  and  demonstrate  their  readiness  for 
acquisition  programs.  Based  on  the  results  of  Demo-I, 
selected  technologies  will  be  integrated  into  the  basic  STV 
for  the  development  of  a  complete  TUGV  prototype.  The 
emphasis  is  on  reducing  operator  work-load  while 
enhancing  performance  of  the  RSTA  mission. 

6.0 UGVTEE Demo-II

The purpose of Demo-II is to develop and mature those
navigation  technologies  that  are  critical  to  evolving  UGVs 
from  labor  intensive  teleoperated  systems  requiring  fibre- 
optic  cables  for  communication  to  supervised  autonomous 
systems  utilizing  low-bandwidth  non-line-of-sight 
communication.  The  objective  of  the  program  will  be  to 
demonstrate  four  semi-autonomous  cooperating  unmanned 
ground  vehicles  performing  navigation,  reconnaissance, 
surveillance,  target  acquisition  and  target  designation. 

As  shown  in  Figure  2,  Demo-II  is  a  four-phase  five  year 
program  with  three  interim  demonstrations  directed  at 
transitioning  research  results  onto  the  surrogate  vehicles. 

6.1 Demo-II Technologies

Realization of the Demo-II objectives will require
moderate  to  substantial  increases  in  capabilities  from 
current  state-of-the-art  in  Image  Understanding, 
Navigation,  Planning,  Control,  and  Distributed  Artificial 
Intelligence.  The  recommendation  for  approving  the 
Demo-II  program  was  based  on  research  results  developed 
under support by the DARPA Image Understanding,
Planning, and Robotics Science Programs, and others.

FIGURE 2. Demo-II Overview

7.0 Conclusion

We have described the recent DARPA procurement for
research in RSTA for the UGV. These efforts, to begin in
mid-1993, will lead to needed advancements in natural
scene understanding, model-based object recognition,
motion compensation, target detection in FLIR imagery, and
LADAR/FLIR multisensor fusion.

Abstract

The Image Understanding Environment (IUE)
project is a five year program, sponsored by
DARPA, to develop a common software
environment for the development of algorithms and
application systems. The ultimate goal of the
project is to provide the basic data structures
and algorithms which are required to carry out
Image Understanding (IU) research and to
develop IU applications. This paper provides an
overview of the IUE and an update on the
status of the project.

Introduction 

What is the IUE?

The Image Understanding Environment (IUE) is an
object-oriented software system which provides the
basic data structures and operations required to
implement Image Understanding (IU) algorithms. These data
structures and operations are based on the classical
mathematical abstractions used in IU research, such as
pointsets, transformations and topology. The primary
purpose of the IUE is to facilitate exchange of research
results within the IU community. The IUE will also
provide a platform for various demonstrations and tools
for DARPA applications. These demonstrations and tools
will become a primary channel for IU technology transfer.
The IUE will also serve as a conceptual standard for IU
data models and algorithms. The availability of standard
implementations for algorithms will facilitate performance
evaluation of new techniques and the tracking of progress
in algorithm improvements. The IUE is designed to support
evolution and testing of IU techniques and provide an
efficient programming environment for rapid prototyping.

History of the IUE

In late 1989, Rand Waltzman of DARPA, then manager
for Image Understanding programs, conceived and
developed a new program called HUS. The decoding of this
acronym is Intelligent Integrated Interactive Image
Understanding. The name has since been shortened to IUE,
for Image Understanding Environment.

*The members of the IUE Committee are: Tom Binford-
Stanford; Terry Boult-Columbia; Bob Haralick, V. Ramesh-
U. Washington; Al Hanson, Chris Connolly-UMass; Ross
Beveridge-Colorado State; Charlie Kohl-AAI; Daryl Lawton-
Georgia Tech; Doug Morgan-ADS; Joe Mundy-GE; Keith
Price-USC; Tom Strat-SRI

The IUE program was announced at a meeting for
DARPA Principal Investigators in Scottsdale, Arizona
at the end of February, 1990. The project goal, as
announced by Rand, was a five year program to design
and implement a common software environment for the
development and demonstration of image understanding
algorithms and techniques.

Rand Waltzman convened and chaired three meetings
during the 1990-1991 period to develop a consensus in
the IU community about the requirements for the IUE.
A number of teams were established to suggest specific
application scenarios and propose skeleton architectures
for the IUE. In April 1991, these team reports were
reviewed, and the IUE committee was formed from
representatives of each team. In June 1991, Oscar Firschein
replaced Rand Waltzman as the IU program manager at
DARPA.

Since April 1991, the IUE committee, along with the
DARPA IU Program Manager, has met eight times to
develop the design and produce a number of documents
which specify the design as well as an IU data exchange
standard. In September 1992, the committee provided
a draft version of the IUE design to DARPA to provide
the basis of a solicitation to select the IUE contractor.
On January 8th 1993, the IUE BAA was published in
the Commerce Business Daily by the contracting agent,
the Topographic Engineering Center.

The basic structure of the IUE is captured by a partial
summary of the class hierarchy shown in Figure 1. The
figure illustrates the central idea of the IUE design and
its relationship to mathematical concepts. The following
sections provide a brief summary of the IUE design.

Image 

The IUE image object class supports many forms of
image data, from intensity images to color images to
complex composites such as pyramids. IUE images fall into
one of two subclasses: simple or composite. Simple
images have two primary dimensions, x and y, with possible
additional dimensions such as color and/or time.

Figure 1: A partial view of the IUE class hierarchy. There are hundreds of class definitions in the current IUE, and
this figure represents only a sketch to illustrate the general nature of the hierarchy.


A simple image can be defined as a mapping from
Z x Z to V, where Z is the set of positive and negative
integers and V is the set of allowable pixel values:

I : Z x Z → V

I may be abstractly viewed as a discrete function of two
variables I(r, c) where r, c ∈ Z. For computational
reasons it is desirable to restrict the domain of the
mapping to a specific subset of Z x Z, usually a rectangular
bounded region of the plane. The set of pixels comprising
the restricted domain is often called a region-of-interest
(or ROI) and, by extension, the logical image defining
the characteristic function is also called an ROI. Note
that there is no restriction on the connectedness of the
pixels in this restricted domain; that is, it is not a true
region as defined in region segmentation and may consist
of a set of completely isolated pixels. In the IUE, there
are two ROIs which affect the domain. While both are
called ROIs, their semantics are slightly different. Every
image in the IUE has an ROI associated with it, called
the image-ROI. This ROI may be explicitly represented
as a logical image having the same extents and index set
as the original image or it may be implicitly represented
as the entire image. In addition, each method has an
optional parameter which specifies an ROI, called the
in-ROI. The set of pixels comprising the pixel domain is
obtained by intersecting the image-ROI and in-ROI.
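
The intersection rule can be illustrated with a minimal Python sketch. It is not IUE code: the names full_roi, pixel_domain, image_roi and in_roi are hypothetical, and an ROI is modeled simply as a set of (row, column) pairs, which also shows that a pixel domain need not be connected.

```python
from typing import Optional, Set, Tuple

Pixel = Tuple[int, int]  # (row, column) index into Z x Z


def full_roi(rows: int, cols: int) -> Set[Pixel]:
    """Implicit image-ROI: every pixel of a rows x cols image."""
    return {(r, c) for r in range(rows) for c in range(cols)}


def pixel_domain(image_roi: Set[Pixel],
                 in_roi: Optional[Set[Pixel]] = None) -> Set[Pixel]:
    """Pixels a method may touch: image-ROI intersected with the optional in-ROI."""
    if in_roi is None:           # no in-ROI supplied: the image-ROI alone applies
        return set(image_roi)
    return image_roi & in_roi


image_roi = full_roi(4, 4)
in_roi = {(0, 0), (3, 3), (9, 9)}   # isolated pixels; (9, 9) lies outside the image
domain = pixel_domain(image_roi, in_roi)
```

Here the resulting domain contains only the two isolated pixels that belong to both ROIs.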

Simple-images can map onto different slices of the
same underlying pixel data, thus avoiding redundant
copies in many common situations. To illustrate, an
RGB image is represented as a 3 dimensional
shared-array. Associated with it are three 2 dimensional arrays:
slices corresponding to each of the three color planes.
These four objects (RGB, red, green, and blue images)
present different views of the same underlying data. In
addition to abstract high level interfaces to pixel data
provided by the image objects, the IUE will provide a
raw-data interface to pixel data. Raw-data objects are
one dimensional containers for large amounts of numeric
data, and they support direct pointer access to numeric
pixel data. The IUE must efficiently support large
images (say 10K x 10K) such as those commonly occurring in
photo-interpretation tasks. Thus, a direct tile-mapping
mechanism must be associated with raw-data objects
which allows large images to be block mapped into
memory on a demand basis; this mechanism must be efficient
enough to support smooth scrolling (roaming) and
zooming.
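
The idea of several image objects sharing one pixel store can be sketched as follows. This is an illustrative Python model only, under the assumption of a row-major flat buffer; the classes SharedArray3D and PlaneView and their methods are hypothetical names, not part of the IUE specification.

```python
class SharedArray3D:
    """Hypothetical stand-in for a 3-D shared array: one flat, row-major buffer."""

    def __init__(self, planes, rows, cols):
        self.shape = (planes, rows, cols)
        self.data = [0] * (planes * rows * cols)   # the single underlying pixel store

    def index(self, p, r, c):
        """Offset of element (plane p, row r, column c) in the flat buffer."""
        _, rows, cols = self.shape
        return (p * rows + r) * cols + c


class PlaneView:
    """A 2-D image that reads and writes one plane of the shared buffer (no copy)."""

    def __init__(self, array, plane):
        self.array, self.plane = array, plane

    def get_pixel(self, r, c):
        return self.array.data[self.array.index(self.plane, r, c)]

    def set_pixel(self, r, c, value):
        self.array.data[self.array.index(self.plane, r, c)] = value


rgb = SharedArray3D(planes=3, rows=2, cols=2)
red, green, blue = (PlaneView(rgb, p) for p in range(3))
green.set_pixel(1, 0, 255)   # the write is visible through the RGB object as well
```

Because the views hold a reference to the shared buffer rather than a copy, a write through the green view changes what the RGB object sees, while the red and blue planes are untouched.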

Image-types commonly occurring in image
understanding include: A color image with red, green and blue
components, where each component is a simple grayscale
image. A range image represents a depth image acquired
from laser triangulation or a time-of-flight range finder.
An image sequence can be viewed as a queue of images
indexed by time. For simple-image-sequences queue length
is fixed when the sequence is created and all elements
must be simple-images of the same size. In contrast,
composite image sequences may contain images of any
type and may dynamically grow and contract.

The class of composite-images is very broad and
the semantics vary considerably between specializations.
Composite-images provide the flexibility required to
develop objects such as image pyramids and mosaics. An
Image-Pyramid is an ordered set of images, each a power
of two reduction of the predecessor. This makes several
restrictions evident. The first image must be square, the
dimension being a power of 2, i.e., 2^n. The depth
of the pyramid, or number of images, is simply n + 1.
An image of size 2^n x 2^n is said to reside at level n of
the pyramid. A Mosaic-Image is a patchwork of images,
possibly overlapping, and partially covering an extended
2D area. The proper metaphor is a stack of photographs
on a table. Hence, the get-pixel method returns the pixel
value for the first image in the set which is defined at the
specified point.
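
Both composite types can be illustrated with a short Python sketch. The assumptions are labeled here and in the comments: each pyramid level halves the side length of its predecessor (the simplest power-of-two reduction), a mosaic is a list of rectangular patches ordered top of the stack first, and the names pyramid_sizes, Patch and mosaic_get_pixel are hypothetical rather than IUE identifiers.

```python
def pyramid_sizes(n):
    """Side lengths of a pyramid whose first image is 2**n x 2**n (halving assumed)."""
    return [2 ** level for level in range(n, -1, -1)]   # levels n, n-1, ..., 0


class Patch:
    """Hypothetical mosaic patch: a rectangle of pixels placed at (row0, col0)."""

    def __init__(self, row0, col0, pixels):
        self.row0, self.col0, self.pixels = row0, col0, pixels

    def defined_at(self, r, c):
        return (0 <= r - self.row0 < len(self.pixels)
                and 0 <= c - self.col0 < len(self.pixels[0]))

    def value(self, r, c):
        return self.pixels[r - self.row0][c - self.col0]


def mosaic_get_pixel(patches, r, c):
    """Value from the first patch defined at (r, c); None if no patch covers it."""
    for patch in patches:             # list order is the stacking order, top first
        if patch.defined_at(r, c):
            return patch.value(r, c)
    return None


top = Patch(0, 0, [[1, 1], [1, 1]])
bottom = Patch(1, 1, [[2, 2], [2, 2]])
mosaic = [top, bottom]                # the two patches overlap at point (1, 1)
```

At the overlapping point the top photograph wins, and a point covered by no patch yields None, reflecting that a mosaic only partially covers its area.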

Spatial  Object 

A key element of the IUE design is the spatial-object
which has the mathematical properties of a pointset in
R^n. The fundamental structures are organized along
classical notions of intrinsic dimension, i.e., point, curve,
surface and volume. Further distinction is made between
implicit and parametric entities. Implicit structures are
defined by the vanishing of systems of equations, usually
polynomials. Implicit forms are useful for determining
incidence or containment. Implicit forms of curves such
as the line and conic are used throughout IU research.
An example of a commonly used implicit surface is the
superquadric.
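
As a small illustration of why implicit forms suit incidence and containment tests, the following Python sketch evaluates the defining polynomial at a query point. The function names and the tolerance are assumptions made for this example; the inside test relies on the sign convention of this particular circle equation, which is negative in the interior.

```python
def implicit_line(a, b, c):
    """Implicit line a*x + b*y + c = 0, returned as a function f(x, y)."""
    return lambda x, y: a * x + b * y + c


def implicit_circle(cx, cy, radius):
    """Implicit conic (a circle): (x - cx)^2 + (y - cy)^2 - radius^2 = 0."""
    return lambda x, y: (x - cx) ** 2 + (y - cy) ** 2 - radius ** 2


def is_incident(f, x, y, tol=1e-9):
    """A point lies on the structure when the polynomial vanishes (within tol)."""
    return abs(f(x, y)) <= tol


def is_inside(f, x, y):
    """Containment for the circle above: the polynomial is negative in the interior."""
    return f(x, y) < 0


diagonal = implicit_line(1, -1, 0)    # the line x = y
circle = implicit_circle(0, 0, 5)
```

No parameter search is needed: a single polynomial evaluation answers each query.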

Parametric structures involve a mapping from a set of
parameters to R^n. The parametric mapping function is
a particular type of relation which defines a mapping
between two sets of n-tuples, the Domain and the Range.
We further restrict the mapping to be order preserving
and one to one. With these properties we can always
find a unique point in the domain for a given point in
the range and the natural dimension and neighborhood
properties are preserved. This is a much stronger
condition than is usually associated with the idea of
parametric curves or surfaces. The curves here are perhaps
more properly called “well-parametrized,” where there is
a unique inverse for each point in the range of the curve.
A typical example of a parametric curve is the spline. A
common parametric surface used in IU is the ribbon.¹

A coordinate system is associated with the base
spatial-object class in order to maintain a consistent
definition of coordinates derived from the equational
definition of geometric structures. Also associated with all
spatial-objects are a bounding-box and centroid point.
These ancillary structures enable efficient processing of
distance and intersection operations.
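
One common way such ancillary structures speed up intersection queries is a cheap bounding-box rejection test applied before any exact geometry is computed. The Python sketch below assumes 2-D point sets and axis-aligned boxes; the function names are illustrative, not IUE methods.

```python
def bounding_box(points):
    """Axis-aligned bounding box (min_x, min_y, max_x, max_y) of 2-D points."""
    xs, ys = zip(*points)
    return (min(xs), min(ys), max(xs), max(ys))


def centroid(points):
    """Mean of the point coordinates."""
    xs, ys = zip(*points)
    return (sum(xs) / len(xs), sum(ys) / len(ys))


def boxes_overlap(a, b):
    """True when two boxes share at least one point (touching counts)."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


triangle = [(0, 0), (2, 1), (1, 3)]
square = [(0, 0), (2, 0), (2, 2), (0, 2)]
far_box = (5, 5, 6, 6)

# Objects whose boxes do not overlap cannot intersect, so the exact
# (and more expensive) intersection test can be skipped for them.
may_intersect = boxes_overlap(bounding_box(triangle), far_box)
```

Only pairs that pass the box test need the exact intersection computation.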

There are many other components of the
spatial-object hierarchy which are described more fully
elsewhere in these proceedings [Ramesh and Committee,
1993].

¹ Only restricted classes of ribbon surfaces satisfy the
unique inverse property.

Coordinates

The geometric relationship between sensors and scenes,
among physical objects, and between pixels and the
world has been a core component of the science of
image understanding since its inception. Nearly every IU
system makes use of coordinate systems and transforms
either implicitly or explicitly. The multitude of
representations that have been devised, some of which
incorporate arbitrary conventions, has been a key obstacle
precluding the transfer and sharing of code and results.
The following are definitions associated with coordinate
spaces.

Coordinate System: A coordinate space, in the
mathematical sense. It is represented in the IUE by an
instance of a coordinate-system class.

Coordinate: The coordinate(s) of a point are
represented by a vector and implemented as a 1D-array.
Coordinates are implicitly associated with a
coordinate system.

Coordinate-Transform: A specification of a mapping
between two coordinate spaces. It is implemented in
the IUE by an instance of a coordinate-transform
class.

A coordinate transform specifies the mapping between
two coordinate systems, and is represented in the IUE
by an instance of a coordinate transform class. There
is an implied directionality to each transform, and any
individual transform may or may not be invertible.

The relationships among coordinate systems and
transforms form a graph. Coordinates can be mapped
between any two coordinate systems by finding a
sequence of transforms that connect them in the graph.
There is a potential problem when more than one path
exists between two coordinate systems. Ideally, all such
paths specify the same transform, but the existence of
numerical errors, and the need for high-performance
approximations preclude the treatment of all such paths as
equivalent. In the IUE, the first path established between
two coordinate systems is treated specially - all others
are considered to be derived transforms. Whenever a
transform is requested from the coordinate transform
graph, the basal (non-derived) transform is retrieved.
Derived transforms can be employed when desired by
specifying them explicitly. This policy ensures that
coordinate transforms are performed consistently and
without introduction of excessive numerical error.
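
The basal/derived policy can be sketched in Python as a directed graph searched breadth-first. This is an illustration under stated assumptions, not the IUE design itself: the class TransformGraph, its methods, the coordinate-system names and the use of plain functions as transforms are all hypothetical. A transform added between two systems that are already connected is stored as derived and is never returned by the default lookup.

```python
from collections import deque


class TransformGraph:
    """Hypothetical coordinate-transform graph with basal and derived transforms."""

    def __init__(self):
        self.basal = {}     # (src, dst) -> function; edges used by default lookups
        self.derived = {}   # (src, dst, name) -> function; explicit alternatives

    def add(self, src, dst, fn, name=None):
        if self.path(src, dst) is None:
            self.basal[(src, dst)] = fn            # first path established: basal
        else:
            self.derived[(src, dst, name)] = fn    # must be requested explicitly

    def path(self, src, dst):
        """Breadth-first search over basal edges; transforms are directed."""
        queue, seen = deque([(src, [])]), {src}
        while queue:
            node, fns = queue.popleft()
            if node == dst:
                return fns
            for (a, b), fn in self.basal.items():
                if a == node and b not in seen:
                    seen.add(b)
                    queue.append((b, fns + [fn]))
        return None

    def transform(self, src, dst, point):
        fns = self.path(src, dst)
        if fns is None:
            raise LookupError(f"no transform path from {src} to {dst}")
        for fn in fns:                             # apply the basal chain in order
            point = fn(point)
        return point


graph = TransformGraph()
graph.add("pixel", "camera", lambda p: (p[0] * 0.5, p[1] * 0.5))
graph.add("camera", "world", lambda p: (p[0] + 10, p[1] + 20))
# A direct shortcut added later is a derived transform, since a path already exists.
graph.add("pixel", "world", lambda p: (round(p[0] * 0.5) + 10, round(p[1] * 0.5) + 20),
          name="fast")
world_point = graph.transform("pixel", "world", (4, 6))
```

The default lookup composes the two basal transforms; the "fast" shortcut is used only when a caller retrieves it from graph.derived by name. Because edges are directed and no inverse was registered, a request from "world" back to "pixel" fails in this sketch.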

Image  Features 

Central to any Image Understanding research or
application program is the extraction and use of image features.
Image features are a part of the general spatial object
hierarchy in that they combine both geometric and image
signal-theoretic concepts.

Image features provide implementations of methods
for extraction, property value computations, display,
spatial indexing operations, input, output, grouping, and
the various iterators over sets of spatial objects (subsets,
all in an area, etc.). Image features are often used as
the basis for the region-of-interest in image processing
operations and thus must support geometric and
topological operations.

There are several reasons why developing image features
as spatial objects is crucial. First, there is a natural
correspondence between the sequence of topological
constructs, e.g. vertex, edge, and face, used for
spatial objects, and the descriptions of image features
for points and junctions, edges, and regions. Access to
this topological representation is especially important
for describing composite image features such as linked
line segments, adjacencies between regions found in
segmentations, and perceptual groups. Second, since image
features generally correspond to the projection of
three-dimensional object models, it is useful to have the same
underlying operations and representations used for both
of them.

Collections of image features can be grouped on a variety
of properties such as proximity, alignment, curvature,
etc. These grouping operations are supported by various
spatial indexing schemes such as K-D trees, quadtrees,
and the Hough transform.

Image  features  are  discussed  in  more  detail  elsewhere 
in  the  proceedings  [Price  and  Committee,  1993]. 


Sensors 

Unlike some other aspects of the IUE, sensors are an area
of active research where few de facto standards exist.

Two important aspects of sensing in the context of IU
are the device and the data produced by the device,
represented by the classes sensor and sensor-model
respectively. The class sensor is the analog of the physical
device and is capable of many operations associated with
sensing and the production of sensor-models. The output
of a sensing operation is stored in an object of type
sensor-model. The sensor-model not only contains a
pointer to the generated data; it also has a copy of the
relevant sensor parameters and provides methods to reason
about the geometry of the sensor mapping and the
uncertainty in data locations and measured values. The
sensor may interact with an external device (e.g. a frame
grabber) to get real data, or may generate synthetic data
either by embellishing a stored image with additional
(assumed) properties or by rendering in conjunction with a
scene object. In addition to getting the data, it is also the
sensor's task to determine, from the various attributes
of the sensor (e.g. lens parameters, digitizer parameters,
etc.), the attributes of the sensor-model (related to its
geometric mappings and its uncertainty measurements).

While one might think of sensors producing only
image-like data, the mapping concept on which sensors
are based is not restricted to physical transducers
of energy. Hence, it naturally extends to include the
production of spatial objects. This allows us to define
a geometric-sensor as something that can have the
same "lens" slot as a camera and that uses the same
sampling pattern as a camera but that maps a scene
with many instances of the class spatial-object into a
sensor-model which contains a collection of instances
of the class spatial-object. For example, a scene full of
polygons might be mapped into a collection of vertices
and/or line segments.

The IUE User Interface

The IUE will make extensive use of graphical interaction
to support the examination of features and to provide
convenient tools for model construction, recognition,
etc. These tools will be constructed within a uniform
user interface methodology and will allow convenient
selection and modification of graphic items.

There are three interfaces for dealing with user
interactions with the IUE: the user interface (IUEUI), the
IUE graphical user interface (IUEGUI), and the programmer
interface.

The Image Understanding Environment User Interface
(IUEUI) is intended to provide flexible, simple, and
powerful tools for exploring data, algorithms, and systems.
In addition to general principles of good interface design,
there are several important objectives which are specific
to image understanding and developing the IUE:

Use existing interface standards The interface
should be supported by ongoing and future
developments in software environments and graphical
user interfaces. The interface components should
be built on top of existing and emerging interface
packages and interface construction toolkits. This
is critical for the long term use of our environment
because we can depend on continuous advances in
these areas that we will want to take advantage of in
terms of capabilities and cost (other issues involved
with this are discussed in the section on graphics
software). A good example is the evolving OpenGL
standard.

A few, powerful interface classes The same general
principles of object oriented design used in the IUE
should be applied to the interface: abstraction over
common operations to provide a small number of
types of interface objects which can be freely
combined by a user. The interface should not involve
understanding a large number of unrelated things.

Consistent interaction with other IUE objects
The interface should make it straightforward to
manipulate and investigate IUE objects. For example,
in displaying a spatial object, a user wants to
control all aspects of how the domain and the range
are displayed. Another aspect of this is providing
intelligent default behavior for interacting with
IUE objects, such as setting up appropriate types
of browsers for different types of objects.

Control of the display and presentation While
intelligent defaults and context-based behavior are
essential, a user should always be able to override
them and have complete control and flexibility.

Support for sophisticated users Naive users will
want support for running tailored applications with
several interaction aids such as menus, while
experienced users will want programmability and
significant compression and abbreviation in specifying
actions.

To realize these objectives the interface of the IUE is
described in terms of three levels. The Graphics Level is
the underlying "machine independent" package for basic
display and graphic operations which tell the screen
what to do. Examples would be X and PostScript. The
Interface Kit Level involves packages for the creation and
rapid prototyping of user interfaces and related tools
which are built on top of graphics level software.
Examples are such things as InterViews, DevGuide, and TAE.
This level also includes the tools found in the selected
software development environment such as editors and
debuggers. It is important that these all be thoughtfully
integrated. It should not feel like starting up completely
different processes when moving from the debugger and
editors for code development to the display and browsing
operations of the interface. The IUEUI Level consists of
the interface objects specialized for image understanding.
This includes such things as object displays, plotting
displays, several types of browsers, and structures
for describing the interface context. The IUEUI consists
of a small set of objects which can be freely combined
for very powerful results.

The IUE User Interface is described in these
proceedings [Lawton and Committee, 1993].

IUE Process Control

Large grain IU operators typically have complex
parameter structures. A significant portion of the time spent
in developing IU applications is spent in the exploration
of the search space defined by the parameters of the
operators. For example, a common type of question that
needs to be answered by an IU researcher is: "What is the
most effective Laplacian radius when performing a Zero
Crossing segmentation on aerial images of cities?" A
great deal of time and effort is spent in determining
appropriate values for parameters of a particular operator
in a particular domain. Once these parameter values
have been determined, it becomes natural to think of
the parameterized operator as a new entity that is
different from the unspecialized generic operator. This view
leads to the concept of an operator as a Task object that
can use inheritance and specialization to represent these
parameter structures.

IU research also involves a very large amount of
processing and data generation; in this type of environment
it becomes important to be able to examine the processing
history. The Task class readily supports the maintenance
of a processing history through the explicit
representation of Task parameters. A Task instance can exist
in one of three states depending on the specification of
the input and output parameters for the Task: it may
be partially specified, fully specified, or completed.
In this way, Task instances describe both the complete
input and output specification and the processing
status of Tasks. The Task objects thus provide a complete
description of the large granularity image understanding
processing that has occurred in a user environment.
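
As a rough illustration of the idea, the sketch below models a Task
whose state follows from which parameters have been supplied, and a
specialized subclass that fixes one parameter by inheritance. The
attribute names (inputs, defaults, params, outputs), the subclass
names, and the radius value are assumptions for illustration and are
not taken from the IUE specification; the computation itself is a
placeholder.

```python
class Task:
    """Hypothetical Task: an operator plus explicitly recorded parameters."""

    inputs = ()     # names of the required input parameters
    defaults = {}   # parameter values fixed by this class or a subclass

    def __init__(self, **params):
        self.params = {**self.defaults, **params}
        self.outputs = None

    @property
    def state(self):
        if self.outputs is not None:
            return "completed"
        if all(name in self.params for name in self.inputs):
            return "fully specified"
        return "partially specified"

    def compute(self, **params):
        raise NotImplementedError

    def run(self):
        if self.state != "fully specified":
            raise ValueError(f"cannot run a task that is {self.state}")
        self.outputs = self.compute(**self.params)
        return self.outputs


class ZeroCrossing(Task):
    inputs = ("image", "radius")

    def compute(self, image, radius):
        # Placeholder: a real operator would process pixel data here.
        return {"edges": f"zero crossings of {image} at radius {radius}"}


class AerialCityZeroCrossing(ZeroCrossing):
    defaults = {"radius": 3}   # illustrative value only


task = AerialCityZeroCrossing()
states = [task.state]            # the image is still missing
task.params["image"] = "city.img"
states.append(task.state)
task.run()
states.append(task.state)
```

Because parameters and outputs stay attached to the instance, a list of
completed Task objects doubles as a processing history.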

Associated  Classes 

Another aspect of large grain IU processes is that they
are used by the IU researcher as a set of tools within
an experimental toolbox. Researchers require a flexible
mechanism for control and data chaining that allows
them to construct experiments that combine individual
Tasks into more complex algorithms. The Task
class hierarchy provides this mechanism through the
CompoundTask and DataflowGraph object classes. With
these classes, the user may chain individual Task objects
together, either through programs or through the use of
an interactive graphical interface.

DataflowGraphs allow the user to specify data pathways
between Task objects. The Tasks in a DataflowGraph
can have some subset of their input parameters
specified dynamically; the input values of the dynamic
parameters are specified by the values of output
parameters of other Task objects. Whenever values have been
specified for all required input parameters, the Task
object is executed. Other DataflowGraph objects, such as
DataflowConditional nodes and DataGenerator nodes,
provide the control constructs that make the
DataflowGraph an effective programming tool. A DataflowGraph
may be constructed in C, in Lisp, or at the interface
level to form these complex processes. In C or Lisp, this
complex Task control can also be implemented through
a message passing paradigm in which Task instances are
parameterized and controlled through messages (generic
function calls) from a controlling program.
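
The firing rule described above (a Task executes once all of its
required inputs have values) can be sketched as follows. This is a
minimal illustration, not the IUE DataflowGraph class: the Node class,
the method names, and the three toy operators are assumptions, and
conditional and generator nodes are omitted.

```python
class Node:
    """One Task in the graph: a function plus named input slots."""

    def __init__(self, name, fn, inputs, **fixed):
        self.name, self.fn, self.inputs = name, fn, inputs
        self.values = dict(fixed)   # statically specified inputs
        self.output = None
        self.done = False

    def ready(self):
        return not self.done and all(k in self.values for k in self.inputs)


class DataflowGraph:
    def __init__(self):
        self.nodes = {}
        self.links = []   # (source node, destination node, input slot)

    def add(self, node):
        self.nodes[node.name] = node
        return node

    def connect(self, src, dst, slot):
        self.links.append((src, dst, slot))

    def run(self):
        """Fire each node as soon as all of its required inputs have values."""
        order, progress = [], True
        while progress:
            progress = False
            for node in self.nodes.values():
                if node.ready():
                    args = {k: node.values[k] for k in node.inputs}
                    node.output = node.fn(**args)
                    node.done = True
                    order.append(node.name)
                    progress = True
                    # Deliver the output to every dynamically specified input.
                    for src, dst, slot in self.links:
                        if src == node.name:
                            self.nodes[dst].values[slot] = node.output
        return order


graph = DataflowGraph()
graph.add(Node("threshold", lambda pixels, level: [p > level for p in pixels],
               ["pixels", "level"], level=10))
graph.add(Node("smooth", lambda pixels: [p // 2 * 2 for p in pixels],
               ["pixels"], pixels=[3, 12, 25]))
graph.add(Node("count", lambda mask: sum(mask), ["mask"]))
graph.connect("smooth", "threshold", "pixels")
graph.connect("threshold", "count", "mask")
order = graph.run()
```

Here "threshold" is added first but cannot fire until "smooth" has
produced its dynamic pixels input, so execution order follows the data
rather than the order of construction.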

Applications 

The Tasks that will be supported by the IUE will cover
a wide range of algorithms and tools. It is expected
that the set of Tasks included with the IUE will
expand rapidly as the IUE begins to receive wide use and
the research groups using the system begin to contribute
their own research tools. The following are examples of
TaskGroups specified by the IUE design.

ImageProcessing Tools that map image data to image
data.

ImageSegmentation Tools that map image data to
symbolic data.

PerceptualOrganization Grouping tools to map
symbolic data to symbolic data.

GeometricFitting Tools that fit geometric entities to
symbolic data.

ObjectMatching Tools mapping object descriptions
to symbolic data.

ModelConstruction Tools for creation and manipulation
of object descriptions.

The Future of the IUE

The IUE is planned to be developed in three phases:
basic, core and version 1.0. The basic version of the
IUE accounts for the elementary classes and methods
required to support development of core IU classes and
algorithms. The core IUE represents a useful intersection
of current practice in IU research. The core will provide
representations for the major structures and methods
used to implement IU algorithms, such as segmentation,
grouping, matching and modeling. Finally, version 1.0
will consist of the core plus selected support for curves
and curved surfaces as well as demonstration algorithms
which illustrate the use of the core class library.


Currently, the IUE committee is developing a data
exchange standard which will be immediately useful for
exchanging research results. The data exchange format
is based on the IUE class hierarchy and provides
an ASCII representation for the construction of class
instances. The syntax is Lisp-like and can be easily parsed
by LEX/YACC. The initial goal of the data exchange
specification is to support technology transfer within the
RADIUS project. The format is discussed in more detail
in these proceedings [Mundy et al., 1993].
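
To give a feel for how little machinery a Lisp-like ASCII syntax needs,
the sketch below parses one parenthesized form into a class name and a
dictionary of slots. The record shown, including the class name
line-segment and its slot names, is invented for illustration; the
actual exchange syntax is defined in the cited paper and is not
reproduced here.

```python
def tokenize(text):
    """Split a parenthesized form into '(' , ')' and atom tokens."""
    return text.replace("(", " ( ").replace(")", " ) ").split()


def parse(tokens):
    """Recursive-descent parse of one form; consumes tokens from the front."""
    token = tokens.pop(0)
    if token == "(":
        form = []
        while tokens[0] != ")":
            form.append(parse(tokens))
        tokens.pop(0)   # drop the closing parenthesis
        return form
    try:
        return float(token)
    except ValueError:
        return token    # a symbol such as a class or slot name


def to_instance(form):
    """Read (class-name (slot value) ...) as a class name and a slot dict."""
    class_name, *slots = form
    return class_name, {name: value for name, value in slots}


# Hypothetical record; not the real IUE exchange format.
record = "(line-segment (x0 0) (y0 0) (x1 3.5) (y1 4))"
class_name, slots = to_instance(parse(tokenize(record)))
```

A production reader would add error handling for unbalanced
parentheses and would dispatch on the class name to construct the
corresponding class instance.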

At this writing, the solicitation process is underway
to select the IUE integrating contractor. It is expected
that the selection will be made in the first half of 1993.
The implementation of the basic and core versions of the
IUE is expected to take about two years from start of
contract, with version 1.0 being released at the end of the
project.

During these project phases there will be continuous
review and consultation to ensure that the IUE meets
the requirements of the IU community. Any suggestions
or comments concerning design or implementation issues
are welcome and may be directed to mundy@crd.ge.com.

Abstract

We describe the design and initial prototyping
of the user interface of the DARPA Image
Understanding Environment (IUE) and tools
for documentation, tutorials, and publication
that will facilitate the use and adoption of the
IUE.

1  Introduction 

The user interface of the Image Understanding
Environment (IUE) is intended to provide flexible, simple,
and powerful tools for exploring data, algorithms, and
systems [Mundy and others, 1992]. The general principles
of object oriented design used in developing the IUE
object hierarchy and programming constructs have also
been applied to the interface: abstraction over common
operations to provide a small number of interface objects
which can be freely combined by a user. The interface
has been designed to have a consistent interaction with
IUE objects and their semantics, especially the
abstraction in the IUE object hierarchy. Thus, the display and
browsing operations are sensitive to the class similarities
for objects such as images, image registered features, and
spatial objects. Using and becoming comfortable with
the interface should not involve understanding a
large number of unrelated things.

An equally important part of the user interface is
what it does not develop. The IUE user interface must
build extensively on existing (and emerging) interface
and graphics packages and standards. The interface
needs to be supported by ongoing and future
developments in software environments and graphical user
interfaces. This is critical for the long term use of the
IUE. We can depend on continuous advances in all these
areas that the IUE will need to take advantage of in
terms of capabilities and cost.

To realize this, the interface is being developed in
terms of three levels. The Graphics Level is the
underlying "machine independent" package for display and
graphic operations which tell the screen what to do.
Examples would be X, GL, OpenGL, and PHIGS. The
Interface Support Level involves packages for the
creation and rapid prototyping of user interfaces and related
tools which are built on top of graphics level software.
This also includes the tools found in the selected
software development environment such as editors and
debuggers. The Image Understanding Environment
User Interface (IUEUI) Level consists of the
interface objects specialized for image understanding. This
includes such things as object displays, plotting displays,
several types of browsers, and structures for describing
the interface context. The IUEUI consists of a small set
of objects which can be freely combined. The
specifications of these objects are relatively independent of the
other two levels, although much of the current prototyping
activity is directed towards understanding how to
best realize the functionality of the IUEUI objects with
respect to these two levels, especially for accessibility
and limiting the eventual cost of the IUE for users.

The basic functional components of the IUE interface
are:

•  Displays: These deal with mapping spatial objects
and images (or sets of spatial objects and images)
onto two-dimensional display windows. There are
types of displays for images and image registered
features, for plotting functional relations between
attributes and components of spatial objects,
and for displaying surfaces.

•  Browsers:  These  deal  with  presenting  textual  and 
symbolic  information  about  objects.  There  are  dif¬ 
ferent  types  of  browsers  for  operations  such  as  in¬ 
specting  the  values  in  a  spatial  object,  for  perform¬ 
ing  interactive  queries  with  respect  to  databases 
and  sets  of  objects,  and  for  inspecting  relational 
graphs  and  networks. 

•  Interface Context Descriptors: These are for
describing the state of the interface and interface
objects. Examples are such things as the current
color look-up table for a given display; the current
display window or browser; and links between
interface objects which describe related views. This
information supports intelligent default behaviors.

•  Command  Language  and  Command  Buffer: 
A  user  can  control  his  interaction  with  objects  using 
an  interactive  command  language.  The  commands 
can  be  used  in  code  and  to  create  scripts.  This  also 
provides  a  complete  description  of  the  functionality 
of  the  user  interface. 

•  Simplified, programmable access to Graphical
User Interface (GUI) objects: This is intended
to provide programmer access to several of
the objects commonly found in Graphical User
Interface Construction Kits such as knobs, sliders,
text buffers, and menus. These can then be used in
applications and to extend the interface.

With  the  exception  of  GUI  objects,  we  now  look  at 
each  of  these  in  more  detail.  Our  focus  here  is  on  the 
functional  components  of  the  user  interface  and  their 
attributes. 

2  Object  Displays 

Object Displays are for viewing and interacting with
objects by mapping them onto two-dimensional display
windows. This involves nearly all IUE objects: images,
curves, regions, object models, surfaces, vector fields,
etc. Object displays involve a wide range of actions,
such as displaying an image and image registered
features; displaying networks of objects such as stereo
images, multi-resolution pyramids, and image sequences
(this can involve having several linked windows for the
different images, cycling through displays of the
different components, or mapping the different components
onto different planes of the display buffer and combining
the images through transparency or color addition);
displaying models and predicted segmentations as overlays;
and interactively inspecting and manipulating displayed
objects and applying operations to them.

There is a strong relation between spatial objects and
displays. Most IUE objects are expressed as relations
between sets. In displaying an object, values from one
set are used to specify a position in a display window
and the corresponding values from another set are used
to specify an attribute such as intensity, color, overlay,
transparency effect, etc., that is displayed at the
position. A basic example is an image, where one set is
described by the indices of the coordinate system of the
image and the other by the color or intensity values
associated with the particular image coordinate. A
discrete curve is a mapping from integer indices onto
two-dimensional positions with respect to an image
coordinate system. Displaying a curve as an overlay on top of
an image involves mapping two-dimensional positions
along the curve onto window positions using the same
transformation that was used for the display of the
image. The color/intensity of the display at these points
can be based upon registered values associated with the
curve (such as approximated curvature). For example, a
user might want to display an intensity image in 8 bits
of grey-level intensity and then overlay extracted curves
on top of this with the display of curvature values along
the curve mapped onto 8 bits of red intensity.

We refer to the operations that are involved with
specifying the mapping onto a screen position as the position
methods. These involve operations such as panning,
zooming, perspective transformations, and warping
operations. The machinery for this comes directly from
the coordinate system transformation methods. The
operations that involve mapping onto a particular window
value are called the value methods, and these involve
such things as setting up the CLUT, point-mapping
functions (functions applied to the value at an object
position prior to display), transparency effects, etc.

The basic processing flow for displays is shown in
figure 1. Displays take place with respect to a context
which involves such things as the current object, the
current description of the transformation from an object to
the display window, the current color look-up table, links
between IUE objects, and several other attributes.
Several display operations involve setting up these
contextual attributes. Displaying an object, such as an image
or image registered features, involves iterating over the
object and applying the specified position methods to
determine the position in a window at which to generate
a display and also applying the specified value methods.
In interactive processing, a selection operation is
performed with respect to the display context to return a
value at a selected location. Graphics are also done with
respect to the display context. (The processing flow in
figure 1 is idealized in several respects. Many display
operations don't involve iterating over an object but
manipulating the color look-up table and display buffer directly.
Rendered objects or displays which involve warping or
interpolation are more naturally expressed as iterating
over the display window itself. Displays can also involve
a discrete sampling of other objects using square pixel
neighborhoods.)

The object display methods are organized into the
following classes:

Value Methods: These are methods that control
how values in the specified object(s) get mapped
onto screen attributes such as color and intensity;
how to configure planes in the screen buffer for the
display of color images; how many planes to use for
overlays; and particular functions and conditions to
apply to object values prior to display. This includes
operations such as overlays, mapping onto different
color bands, and transparency.

Position  Methods:  These  are  methods  that  con¬ 
trol  how  positions  in  specified  object(s)  get  mapped 
onto  a  display  window.  This  includes  operations 
such  as  panning,  zooming,  perspective  views,  and 
some  types  of  warping. 

Display Window Attributes: These involve
controlling attributes of the display window such as
position, size, attributes of the title bar, event
handling for the mouse, and resizing operations.

Link Methods: For linking different displays (and
browsers and GUI widgets). Examples are window
to window zooming and display of stereo and pyramidal
images in multiple windows. This involves creating
links and specifying the operations and transformations
associated with links between interface objects.

Figure 1: Basic Processing Flow for Displays

Interaction Methods: These involve interaction
and manipulation of displayed object(s) in a
display. Operations include object selection, recovering
object positions and values from a mouse click,
and applying functions to selected objects.

Graphics Methods: Display registered graphics
for drawing lines, text, three-dimensional objects,
etc.

History Methods: Many objects can be mapped
to the same display window. These are methods
to coordinate displays over time, such as cycling
through an image sequence or playing an animation
of displays.

Archiving Methods: Printing an object display,
writing an object display to file, and writing an object
display to video tape.

2.1  Value Methods and the CLUT

Value methods are used to specify how values in an
object are mapped onto display window attributes such
as color and intensity. There are several types of
methods for this. CLUT-segment methods involve setting
up a color look-up table (CLUT). They involve creating
named segments and associating some number of bits
with each segment, such as specifying 8 bits for red,
8 bits for green, and 8 bits for blue. CLUT-value
methods involve associating color and intensity
values with CLUT indices. These operations can be
applied to CLUT-segments. For example, the red
component of the color look-up table can be mapped onto red
values in several different ways: by linear interpolation
between specified shades of red or by a spline through
specified color values. Overlay methods involve
setting up overlay planes. Overlay planes can be
displayed and cleared separately from underlying intensity
displays. Object-mapping methods deal with taking
object values and mapping them onto color table indices.
For example, the CLUT-segment for red intensity could
be set for a linearly interpolated 255 shades of red, but the
actual values in an object could range from
-1000 to 2000. The Linear object-mapping
method linearly interpolates from this range
of object values to the available shades of red. There
can be a linear mapping from the object onto color
table indices, but the color table may be set up for a
non-linear mapping onto actual intensities displayed on
the screen. Value-function methods are user specified
functions that are applied to specified object values to
map them onto CLUT indices. Examples are conditional
expressions that determine what value to map a
particular object value onto. In Lisp this is straightforward. In
C and C++, it involves a run-time interpreter which we
want to be of as minimal complexity as possible. Other
value methods deal with transparency, blinking, and
logical operations on bitplanes.
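
The two-stage mapping in the example above (object value to CLUT
index, then CLUT index to displayed color) can be sketched as follows.
The function names and the dictionary representation of the table are
assumptions for illustration; the segment size of 255 and the value
range of -1000 to 2000 follow the example in the text.

```python
def make_clut_segment(start, size, low_rgb, high_rgb):
    """Return {index: (r, g, b)} for `size` entries, linearly interpolated."""
    table = {}
    for i in range(size):
        t = i / (size - 1) if size > 1 else 0.0
        table[start + i] = tuple(round(lo + t * (hi - lo))
                                 for lo, hi in zip(low_rgb, high_rgb))
    return table


def linear_object_mapping(value, vmin, vmax, start, size):
    """Map an object value in [vmin, vmax] onto an index inside a segment."""
    t = (value - vmin) / (vmax - vmin)
    t = min(max(t, 0.0), 1.0)   # clamp values outside the stated range
    return start + round(t * (size - 1))


# CLUT-segment and CLUT-value step: 255 linearly interpolated shades of red.
red = make_clut_segment(start=0, size=255,
                        low_rgb=(0, 0, 0), high_rgb=(255, 0, 0))

# Object-mapping step: object values between -1000 and 2000.
index = linear_object_mapping(500, -1000, 2000, start=0, size=255)
color = red[index]
```

Keeping the two stages separate is what allows a linear object mapping
to be combined with a non-linear table: only make_clut_segment would
need to change.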

2.2  Position  Methods 

Position  methods  are  used  to  specify  mappings  from 
spatial  objects  and  images  onto  two  dimensional  display 
windows.  They  specify  where  to  display  something  in  a 
display  window.  For  example,  position  methods  are  used 
to  specify  pan,  zoom,  and  scaling  parameters  to  relate 
an  image  to  a  display  window. 
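
A minimal sketch of such a position method, assuming a simple pan and
zoom with no rotation or perspective. The function names and the
parameter convention (pan given as the image pixel that lands at the
window origin) are assumptions for illustration.

```python
def make_position_method(pan=(0, 0), zoom=1.0):
    """Return a function mapping image (col, row) to window (x, y)."""
    def to_window(col, row):
        return ((col - pan[0]) * zoom, (row - pan[1]) * zoom)
    return to_window


def in_window(point, window_size):
    """True if a window position falls inside a window of the given size."""
    x, y = point
    return 0 <= x < window_size[0] and 0 <= y < window_size[1]


to_window = make_position_method(pan=(100, 50), zoom=2.0)
corner = to_window(100, 50)   # the panned-to pixel lands at the window origin
other = to_window(110, 60)    # ten pixels away becomes twenty window units
far = to_window(400, 50)      # falls outside a 256 x 256 window
```

The same function can be applied to curve points or other image
registered features so that overlays stay aligned with the image.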

The position methods and transformation networks
used by the interface are defined by the
coordinate-transforms used in the IUE. The interface generally
requires transforms of images and image registered objects
onto display windows and simple types of interpolation
and sampling. More complicated mappings, such as
image warping and the generation of rendered objects for
a specific sensor, use methods from the sensor and scene
classes and image packages for warping. These are either
used to generate an image which is then displayed, or the
object-specific display methods can be invoked through
the interface.

The  coordinate  transformations  and  networks  are  very 
important  for  the  interaction  methods.  In  this  case,  the 
user  indicates  some  position  in  the  display  window  and 
then  the  mapping  from  the  object  to  the  display  win¬ 
dow  is  inverted  so  that  the  corresponding  values  and 
position  in  the  object  can  be  determined  and  accessed. 
The object display is able to do this for images and
invertible geometric transformations. For others, such as
interacting with a potentially complex, rendered solid,
methods from the other classes, such as the sensor and
scene classes, are needed. It is also possible that other
representations of a rendered scene can be generated to
simplify the interactive processing, such as an
image-registered depth map that contains pointers to
all the surfaces that project to a given pixel, ordered by
depth.
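The inversion of an object-to-window mapping can be sketched as follows for the simple case of pan and zoom. This is only an illustration under the assumption that the mapping is a per-axis scale and offset; the class and method names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PanZoom:
    """Hypothetical invertible position mapping: pan, then zoom."""
    zoom_x: float
    zoom_y: float
    pan_x: float
    pan_y: float

    def to_window(self, x, y):
        """Map an object position onto a window position."""
        return (self.zoom_x * (x - self.pan_x),
                self.zoom_y * (y - self.pan_y))

    def to_object(self, wx, wy):
        """Invert the mapping: window position back to object position."""
        return (wx / self.zoom_x + self.pan_x,
                wy / self.zoom_y + self.pan_y)


view = PanZoom(zoom_x=2.0, zoom_y=2.0, pan_x=60.0, pan_y=60.0)
window_pos = view.to_window(70.0, 65.0)
object_pos = view.to_object(*window_pos)   # recovers the object position
```

Mappings that involve warping or rendering have no such closed-form inverse, which is why the text defers them to the sensor and scene classes.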

There  is  often  hardware  supported  pan  and  zoom  for 
images  that  should  be  accessible  through  the  position 
methods,  even  though  this  is  machine  dependent. 

2.3  Interaction  Methods 

The  Interaction  Methods  are  for  interacting  with  and 
inspecting  objects  through  the  context  associated  with  a 
display,  such  as  the  current  object  to  display  transforma¬ 
tion.  The  methods  associated  with  this  are  built  on  top 
of  the  event-handling  mechanisms  of  the  supporting  en¬ 
vironment.  Interacting  with  an  object  through  a  display 
involves  using  the  position  mapping  from  the  object  to 
the  window.  This  is  straightforward  if  the  mapping  is  in¬ 
vertible  and  there  is  no  interpolation  or  warping.  This  is 
usually  the  case  for  images  and  image  registered  objects. 
It  can  also  involve  geometric  intersection  using  the  ray  of 
projection  corresponding  to  a  selected  window  position. 
For  other  objects,  such  as  closed  analytical  surfaces,  the 
reverse  mapping  is  more  complicated  and  involves  gen¬ 
eral  spatial  object  methods  that  need  to  be  accessed  by 
the  interaction  methods. 

The current object to be interacted with can be
explicitly specified or selected. Selection can require
disambiguation if there are multiple overlapping objects or
complex spatial objects. The user may be required to
use a spatial index image (an image of pointers to
objects which occupy a given position) or use geometrical
database operations in the IUE. Both are potentially
expensive and don’t reflect operations specific to the
interface; they are general IUE spatial database operations
that need to be invoked through the interface to return
the selected object(s) and the object position from the
object-to-display mapping.

There  can  be  a  variety  of  interaction  devices  (mini¬ 
mally  it  should  specify  a  screen  location),  but  we  are  as¬ 
suming  a  mouse  with  at  least  two  distinguished  buttons 
and  text-input  from  the  keyboard.  In  the  interactive 
mode: 

•  The current position of the mouse is stored in
variables for the (mouse-x, mouse-y) of the current
display window. Associated with this are the
corresponding positions (current-position) and values
(current-values) in the specified objects. The default
is to deal only with geometry, that is, with inverting
coordinate system transforms, and not with sampling or
interpolation relative to the objects. Since several
objects can be interacted with at the same time,
these variables will consist of lists of positions and
values.

•  When the mouse is clicked, the values for the
window position and the corresponding object positions
and values are stored in separate lists. Interactively
selected functions can then use items in these lists
as parameters.

•  A user can associate functions and
command-sequences with keys and mouse-states so the
functions can be called interactively. The functions
are stored in a table indexed by mouse-state or
keyboard-event. The functions can be a sequence
of specified interface commands and can be
interactively applied to the values in the different
lists. Functions are selected using keyboard input
(numbers) or mouse-state (mouse-down, mouse-up,
mouse-drag for the left, right, and, if it is
available, the middle mouse button). Function selection
may also be based upon a count of the number of
mouse-clicks for a specified mouse button.

•  There  may  be  a  default  spatial  index  associated 
with  a  display  window.  This  is  memory  intensive 
but  can  help  with  a  lot  of  the  interactive  operations, 
especially  the  selection  operations. 
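The interactive mode described in the list above can be sketched as a small event table plus click lists. This is an illustrative sketch only; the class, its attributes, and the event keys are assumptions and do not reflect the actual IUE interface classes.

```python
class InteractionContext:
    """Stores click lists and dispatches functions bound to events."""

    def __init__(self, window_to_object, value_at):
        self.window_to_object = window_to_object  # inverse position mapping
        self.value_at = value_at                  # object position -> value
        self.window_positions = []
        self.object_positions = []
        self.object_values = []
        self.table = {}                           # event -> bound function

    def bind(self, event, function):
        """Associate a function with a key or mouse-state."""
        self.table[event] = function

    def click(self, wx, wy):
        """Record a click in the window, object-position, and value lists."""
        position = self.window_to_object(wx, wy)
        self.window_positions.append((wx, wy))
        self.object_positions.append(position)
        self.object_values.append(self.value_at(position))

    def dispatch(self, event):
        """Call the function bound to an event, if there is one."""
        function = self.table.get(event)
        return None if function is None else function(self)


image = [[5, 7], [9, 11]]
ctx = InteractionContext(
    window_to_object=lambda wx, wy: (wx // 2, wy // 2),   # undo a 2x zoom
    value_at=lambda pos: image[pos[1]][pos[0]],
)
ctx.bind(("key", "1"), lambda c: c.object_values[-1])
ctx.click(3, 0)                       # window (3, 0) -> object (1, 0)
selected = ctx.dispatch(("key", "1"))
```

Bound functions receive the whole context, so they can use any item in the click lists as a parameter.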

2.4  Graphics 

Often a user will want to perform graphical displays of
text, two-dimensional graphics, and three-dimensional
graphics. Examples are annotating a display, indicating
where some action is occurring (the position of an
epipolar line, translational flow paths, etc.), and projecting
a wire-frame of a model onto an image. Much of this
functionality will come directly from existing graphics
packages that the IUE will utilize. The graphic displays
need to take place in three different modes:

•  they  can  occur  in  the  coordinate  system  of  the  dis¬ 
play  window.  In  this  case  displays  only  occur  with 
respect  to  the  window  coordinate  system. 

•  they can occur in the coordinate system of a
displayed object, such as drawing a line with respect
to the inverse mapping from window to object
coordinates.

•  they can generate corresponding IUE objects. Thus, in
drawing a line in image coordinates, a corresponding
instance of an IUE line object would be produced.
When the wire-frame model is displayed,
each line-segment and junction would be created as
an object in the IUE. This is very useful for
producing data for testing routines. This mode can be
coupled with the interactive processing mode to
allow for the interactive creation of data. This may be
restricted to relatively simple objects such as
polygons, curves, and so forth.

2.5  Types  of  Object  Displays 

We have distinguished four types of object displays:

•  The  image  or  pixel  display  is  for  viewing  images 
and  image  registered  features. 

•  The  local  graphics  display  displays  objects  by 
mapping  their  values  onto  parameterized  graphic 
objects  such  as  lines  and  cubes.  Examples  are  dis¬ 
playing  vector  fields  and  edgels. 
•  The surface display is for displaying objects that
get mapped onto mesh or rendered surfaces.

•  The plot display is for displaying functional
relations between objects. Examples are
one-dimensional, two-dimensional, and three-dimensional
graphs, histograms, scattergrams, and views of
functions and tables.

These different types are distinguished by specific
methods, but all inherit a large number of similar
methods from the general display class. For example, overlay
operations are similar for a surface display and for an
image display, although they can look quite different (in
one case it appears like drawing in solid colors in
image-registered coordinates on top of a displayed image, and
in the other it would be rendering the colors onto a
displayed surface).

2.5.1  Local  Graphics 

Local  Graphic  Displays  are  a  subclass  of  object 
displays  which  map  object  values  onto  parameterized 
graphics,  such  as  a  line,  a  square,  a  perspective  view 
of  a  cube,  Chernoff  faces,  or  a  user  specified  function. 
A common example is a vector display, which will map
each component from a pair of images onto the x and y
components of a vector. Using the general display
methods, the vectors can be displayed as an overlay on top of
an image or through indices in a CLUT. For visualizing
three  dimensional  attributes  in  register  across  an  image, 
the  user  can  display  unit  cubes  with  their  orientation 
computed  from  the  specified  components  of  the  display 
object.  The  graphic  display  can  be  a  piece  of  graphics 
code  which  will  be  positioned  to  the  projected  location 
of  the  pixel. 
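A vector display of this kind can be sketched as below: each pixel of a pair of component images becomes one line segment anchored at the pixel. The function name, the list-of-lists image representation, and the `scale` and `step` parameters are assumptions made for illustration.

```python
def vector_segments(x_image, y_image, scale=1.0, step=1):
    """Map a pair of component images onto per-pixel line segments.

    Each segment starts at the pixel position (col, row) and ends at that
    position displaced by the scaled (x, y) components stored there.
    """
    segments = []
    for row in range(0, len(x_image), step):
        for col in range(0, len(x_image[row]), step):
            dx = x_image[row][col] * scale
            dy = y_image[row][col] * scale
            segments.append(((col, row), (col + dx, row + dy)))
    return segments


x_components = [[1, 0]]
y_components = [[0, 2]]
segments = vector_segments(x_components, y_components)
```

The resulting segments are in image-registered coordinates, so the general position methods can place them as an overlay on the displayed image.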

There will be specialized local graphic displays for
vectors and different types of edges because of their heavy
usage. It will be possible to display the horizontal and
vertical edgels in the cracks between pixels or to place a
single edge at the center of a pixel with its orientation
determined by the specified component objects.

2.5.2  Plot  Displays 

There are several different types of plot displays:
one-dimensional, two-dimensional, and three-dimensional
graphs, histograms, scattergrams, perspective views of
functions, and others. Examples of plot displays can be
found in several data visualization packages and
mathematics packages such as Mathematica and GNUPlot
[Wolfram, 1991]. In using such packages in the IUE, it is
important to bear in mind cost limitations on bundled
software and potential problems with data compatibility
and speed. Plots also need to be compatible with the
general display methods for such things as

•  mouse interaction methods: for selecting a position
in a graph and then having access to the domain
point and the range point. An example is
interactive segmentation from a histogram displayed as a
plot

•  links  between  plot  displays  and  other  types  of  dis¬ 
plays 


•  most of the view transformations for such things as
scaling and zooming

•  overlaying  plots  in  different  colors 

We are currently exploring the plotting capabilities
of GNUPlot for use in the IUE. It
is essentially free and all the source code is available.

3  Browsers 

Browsers are used for interacting with text-based or
symbolic descriptions of objects. They are used for actions
such as queries over sets of objects, determining and
inspecting relationships between objects, process
monitoring, and inspecting values in an object. The browser and
browser-related classes are being designed so they can
readily be built on top of existing interface construction
kits.

There  are  two  general  types  of  browsers:  Field- 
Browsers  and  Graph-Browsers  of  which  only  field 
browsers  are  currently  being  implemented.  Field 
browsers  are  built  from  component  objects  which  are 
found  in  several  GUIs: 

•  A field appears as a rectangular box which can be
filled with text, icons, colors, colored text, text in
particular fonts, or user-specified graphics. Fields
can have actions associated with them when they
are selected or a user changes the values in them.

•  Fields can be organized into connected
horizontal or vertical field groups, where each field has
a unique index in the field group. The fields in
field groups will generally have different objects
displayed in them. An example comes from the object
registered browser, where a field group can
correspond to a display of registered values from
different objects. For better visualization, these can be
displayed in different colors, fonts, etc., in addition
to their position in the field group. A field group
can also have a distinct boundary.

•  Field groups can be organized into field matrices,
wherein each group has a unique index set in the
field matrix. Objects and sets of objects can be
mapped onto the matrix.

•  Field matrices can be scrollable as a way to control
the mapping of an object (or object set) onto the
field matrix.

We  distinguish  between  four  types  of  field  browsers 
which  inherit  from  the  general  Field  browser  class: 

•  Object-Registered Browser: This contains
values extracted from a spatial object, such as the
intensity values in some square neighborhood of
an image. Depending on the dimensionality of
the object (or relationships between component
objects), this can be presented as a one-dimensional
array, a two-dimensional array, or multiple
two-dimensional arrays and describe curves, images,
image sequences, and pyramids.

•  Set/Database Browser: This is presented as an
array of fields. Each row of fields corresponds to
selected attributes of a particular object and each
column corresponds to common attributes over the
set (or database) of objects. An example would be
browsing the database which describes the current
active objects in the IUE to find the most recently
created image from some operations.

•  Object  Attribute  Browser:  Each  row  corre¬ 
sponds  to  the  value  of  an  attribute  for  an  object. 
This  is  used  for  inspecting  a  single  object. 

•  Hierarchical  Browser:  Useful  for  text  based  in¬ 
spection  of  graph  structures  and  trees.  When  an 
item  is  selected,  the  related  items  (along  some  rela¬ 
tional  dimension)  are  displayed  in  the  next  column. 

The methods associated with browsers are very
similar to those with displays, suggesting a more general
IUE Interface object class. The position methods for
browsers involve how an object (or set of objects) gets
mapped onto the fields of a browser. For object
registered browsers, these are essentially the same as with
displays (see figure 2). The fields are analogous to
pixels in a display window, although they can be filled
with textual information. For database browsers, the
position methods specify how objects are mapped onto
rows of the browser and how attributes are mapped
onto columns (see figure 4). The position methods for
mapping from graphs and networks onto a hierarchical
browser involve keeping track of different paths through
a network and the nodes and arcs that have been traversed.
Browsers can also be linked to browsers, displays, and
user-specified interface widgets. The following examples
have been implemented using the FORMS GUI kit on
SGIs [Overmars, 1991].

3.1  Object  Registered  Browser 

The  object  registered  browser  is  used  to  inspect  the  val¬ 
ues  in  a  neighborhood  of  a  spatial  object.  A  common 
example  is  inspecting  the  image  values  about  a  selected 
point.  It  is  very  much  like  the  display  of  a  spatial  ob¬ 
ject  in  a  display  window,  but  instead  of  the  values  be¬ 
ing  mapped  onto  window  positions  and  screen  intensities 
and  colors,  values  are  mapped  onto  field  locations  and 
general  field  attributes  such  as  colored  text  in  specified 
fonts, colors, and icons. The attributes of the browser and
the specification of the mapping from an object onto an object
registered browser are shown in figure 2. A set of spatial
objects is mapped onto a matrix of fields in the object
registered browser. This mapping involves several parts:
a coordinate transform from the N-dimensional spatial
objects to the two-dimensional object registered browser;
the type of interpolation to be performed if this mapping
doesn’t involve discrete values; and what to do when
browsing beyond the boundaries of the spatial object. Also
shown  is  a  navigation  tool  to  interactively  access  posi¬ 
tion  methods  to  position  the  object  registered  browser 
with  respect  to  a  set  of  spatial  objects.  The  browser 
is  linked  to  a  display  window  which  shows  the  position 
of  the  browser  with  respect  to  the  bounding  rectangu¬ 
lar  prism  of  the  spatial  object.  This  display  would  be 
updated  when  the  browser  is  moved  with  respect  to  the 
spatial objects. Figure 3 shows an implemented
two-dimensional Object Registered Browser in which two
images and the computed difference of the two images
are displayed, each in separate fields. Each image is
displayed in a different color, and the field containing the
difference image uses the background color to encode
the magnitude of the difference.

Figure 2: Object Registered Browser
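The mapping from a spatial object onto a matrix of fields, including the choice of what to show beyond the object boundary, might be sketched as follows. The function name, the list-of-lists image, and the placeholder string for out-of-bounds fields are assumptions; the sketch also assumes an odd neighborhood size and no interpolation.

```python
def browse_neighborhood(image, center_row, center_col, size=3, outside="--"):
    """Fill a size x size matrix of text fields with values around a point.

    Fields that fall beyond the image boundary receive the `outside`
    placeholder instead of a value.
    """
    half = size // 2
    fields = []
    for r in range(center_row - half, center_row + half + 1):
        row_fields = []
        for c in range(center_col - half, center_col + half + 1):
            inside = 0 <= r < len(image) and 0 <= c < len(image[r])
            row_fields.append(str(image[r][c]) if inside else outside)
        fields.append(row_fields)
    return fields


image = [[1, 2],
         [3, 4]]
fields = browse_neighborhood(image, 0, 0)
```

Moving the browser amounts to calling the function again with a new center, which is the role the navigation tool plays in the text.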


3.2  Set/Database  Browser

The set/database browser is for inspecting the attributes
of a set of objects. It enables interactive queries to be
performed via the browser. This is especially useful for
keeping track of instances of objects (an object selected
in a Set/Database browser should probably default to
the current-object so it could be displayed immediately).
There are two structures used for describing the
mapping from a database onto the browser. One is the set
of selected attributes, which correspond to the columns.
The other is the current set of items which satisfy a
query, together with the indices of the first and last
elements of this set which are displayed in the corresponding
rows of the browser (see figure 4). Figure 5 shows an example
using the Set/Database browser to inspect a set of line
segments and then to sort them by slope.
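The two structures described above (the selected attributes, and the current item set with the range of rows on display) can be sketched as follows. The class and method names are hypothetical, and objects are represented as plain dictionaries purely for illustration.

```python
class SetBrowser:
    """Rows are objects, columns are selected attributes."""

    def __init__(self, items, attributes, n_rows):
        self.items = list(items)            # current set satisfying the query
        self.attributes = list(attributes)  # selected attributes (columns)
        self.n_rows = n_rows                # number of rows on display
        self.first = 0                      # index of the first displayed item

    def query(self, predicate):
        """Restrict the current set to the items satisfying a predicate."""
        self.items = [item for item in self.items if predicate(item)]
        self.first = 0

    def sort_by(self, attribute):
        self.items.sort(key=lambda item: item[attribute])

    def scroll(self, delta):
        """Move the displayed range, clamped to the current set."""
        last_start = max(0, len(self.items) - self.n_rows)
        self.first = min(max(self.first + delta, 0), last_start)

    def rows(self):
        shown = self.items[self.first:self.first + self.n_rows]
        return [[item[a] for a in self.attributes] for item in shown]


segments = [
    {"id": "a", "slope": 2.0, "length": 5},
    {"id": "b", "slope": -1.0, "length": 3},
    {"id": "c", "slope": 0.5, "length": 9},
]
browser = SetBrowser(segments, ["id", "slope"], n_rows=2)
browser.sort_by("slope")
top_rows = browser.rows()
```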
Figure 3: Object Registered Browser applied to an image


3.3  Object  Attribute  Browser 

An Object Attribute Browser is for inspecting the
attributes of an object or several objects with the same
types of attributes. It uses an ordered list of object
attributes to determine which attributes of the object to
display. Figure 6 shows an example of an Object Attribute
Browser applied to the attributes of an image object.

3.4  Hierarchical  Browser 

A hierarchical-browser is for inspecting graphs and
network objects. Instead of one large field-matrix, it
consists of linked Nx1 field-matrices. Each column
corresponds to a set of nodes. When a node is
selected, the types of relations (arcs) are displayed in the
current-arc-browser. When a type of arc is selected,
the nodes with that type of relation are displayed in the
adjacent (right) column. Several structures are used to
describe  (and  update)  the  mapping  from  the  graph  onto 
the  successive  browser  columns.  The  current  node  is  the 
most  recently  selected  node.  The  current  path  is  stored 
as  well  as  the  nodes  that  have  been  visited.  Figure  7 
shows  a  hierarchical  browser  applied  to  an  instance  of  a 
polyhedral  mesh. 
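The bookkeeping described above (columns of nodes, the current node, the current path, and the visited nodes) can be sketched as follows. The graph representation, a dictionary from node to arc type to related nodes, and all names are assumptions for illustration.

```python
class HierarchicalBrowser:
    """Linked columns of nodes over a graph: node -> {arc type: [nodes]}."""

    def __init__(self, graph, roots):
        self.graph = graph
        self.columns = [list(roots)]   # successive browser columns
        self.path = []                 # current path; last entry is current node
        self.visited = set()

    def select_node(self, column, node):
        """Select a node in a column and return its available arc types."""
        if node not in self.columns[column]:
            raise ValueError("node is not displayed in that column")
        del self.columns[column + 1:]  # drop columns to the right
        del self.path[column:]         # truncate the path to this column
        self.path.append(node)
        self.visited.add(node)
        return sorted(self.graph.get(node, {}))

    def select_arc(self, arc_type):
        """Display the nodes related to the current node in the next column."""
        current = self.path[-1]
        self.columns.append(list(self.graph[current][arc_type]))
        return self.columns[-1]


mesh = {
    "mesh": {"faces": ["f1", "f2"]},
    "f1": {"edges": ["e1", "e2"], "vertices": ["v1"]},
}
browser = HierarchicalBrowser(mesh, roots=["mesh"])
arc_types = browser.select_node(0, "mesh")
faces = browser.select_arc("faces")
```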

4  Interface  Context 

There  are  several  data  structures  for  describing  the  con¬ 
text  of  the  interface.  These  are  used  for  intelligent  de¬ 
faulting  and  for  saving  the  state  of  the  interface.  These 
include: 


Figure 4: Set/DataBase browser


•  Object-Display-Mapping:  Structures  which  de¬ 
scribe  the  mapping  from  an  object  onto  a  display. 
This  includes  the  viewing  transformation  between 
an  object  and  a  display  window,  the  value-mapping 
of  how  the  object  is  displayed  and  a  reference  to  a 
particular  GLUT. 

•  Object-Browsing-Mapping: Structures which
describe the mapping from an object(s) or
database onto a browser.

•  Display Context: Structures which describe the
current context for a display for such things as the
current window, the current object, the current
object display mapping, the current display
command, the current mouse-selected object position
and value, and the lists of interactively selected object
values and positions. For example, if neither a
display nor an object is specified, it will default to the
most recently used.

•  Browse Context: Related structures for
browsers, such as the current browser, the
current database, query history, and others.

•  History: The sequence of display or browsing
actions for a particular window or browser is saved
and can be reaccessed and used for creating
animations. In addition, which objects have been
displayed or browsed is also stored.

Figure 7: Hierarchical Browser applied to a description
of a polyhedral mesh from the Data Exchange Format

•  Default layouts for windows and browsers:
The desired layout of windows and browsers can be
saved and be available to a user when he starts using
the IUE. Users may prefer different interfaces
(arrangement and instantiation of the basic interface
objects) depending on the task or level of
sophistication.

•  Object Display Links: A structure which
describes the concatenation of a display or browsing
operation between IUE interface objects.

The  context  description  is  an  extension  to  the  under¬ 
lying  context  usually  provided  by  the  graphics  level.  It 
should  be  possible  to  read  and  save  context  descriptions. 

4.1  Links 

Links  support  operations  such  as  window  to  window 
zooming,  displaying  the  same  object  from  different  views 
or  using  different  value-mappings,  and  controlling  dis¬ 
plays  using  interactive  widgets  like  sliders  and  knobs. 
Linked displays are useful for displaying composite data
such as stereo image pairs or pyramids. When something
happens in a parent display (or browser), another display
will perform an action using information from the parent
display. The action can be a display operation
or the execution of a sequence of commands associated with
the link, such as a set of commands from the interactive
command language.
The mapping between a spatial object and a display
window in one window can be concatenated with the
display specification in another. A common example is
using one window to zoom onto the display in another
or using one window to display a selected portion of
another (panning and zooming are so common that they will
be directly supported via an interactive tool).

We have specified constraints on links to avoid
many complex and pathological things that can
happen. Linked displays and browsers are only updated
when a display action is performed, not when changes
are made to the displayed object. Individual links are
bi-directional, but no cycles are allowed in the graph
formed from all of the links between IUE interface
objects.
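One way to enforce the no-cycles constraint on bi-directional links is a union-find structure: a new link is rejected if both of its endpoints are already connected. This is only a sketch of that standard technique, not the IUE implementation; the class name and the use of window names as identifiers are assumptions.

```python
class LinkGraph:
    """Undirected links between interface objects, kept free of cycles."""

    def __init__(self):
        self.parent = {}   # union-find forest over interface objects
        self.links = []

    def _find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]   # path halving
            x = self.parent[x]
        return x

    def add_link(self, a, b):
        """Add a link, refusing any link that would close a cycle."""
        root_a, root_b = self._find(a), self._find(b)
        if root_a == root_b:
            raise ValueError(f"link {a}-{b} would create a cycle")
        self.parent[root_a] = root_b
        self.links.append((a, b))


graph = LinkGraph()
graph.add_link("*w1*", "*w2*")
graph.add_link("*w2*", "*w3*")
```

With this check in place, an update triggered in one display can be propagated along the links without revisiting any display.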

5  Command  Language 

Users will be able to specify all interface actions through
an interactive command language and be able to access
all the functionality of the interface. Display operations
can be performed interactively through the command
buffer. The command language will have intelligent
defaults and abbreviations (such as displaying to the
current window if none is specified). In addition, the
commands will be usable in non-interactive code for
creating scripts and general display routines.

A concern with the interface command language is
that it becomes another language that people will need
to memorize. This is not an issue for development in
Lisp, since the display operations can be called
interactively like any other function, but it is a significant issue
with C++. We intend for the command language to be
as simple as possible, with a limited syntax. Most
arguments are specified via keywords and correspond directly
to interface object methods. There are also defaults
for command specification, and the IUE will probably
eventually support intelligent prompting to complete the
commands. The general syntax is

IUE-interface-object object-set [keyword arguments]*

For example,

[*w1* image1 :p]

means to display image1 using a pixel-type display in
window *w1* using the current display context
associated with the display in window *w1*. The brackets are
used to indicate separate commands. If the last display
operation was of type :p in *w1*, then only

[image1]

needs  to  be  specified.  More  detailed  examples  are  pre¬ 
sented  below. 
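To make the syntax and the defaulting behavior concrete, the following is a rough sketch of a parser for flat commands of this form. It is an assumption-laden illustration: the function name, the context dictionary, and the rule that the first keyword names the display type are all hypothetical, and nested commands or parenthesized value functions are not handled.

```python
def parse_command(text, context):
    """Parse a flat bracketed command, filling defaults from the context."""
    tokens = text.strip().strip("[]").split()
    positional, keywords, current = [], {}, None
    for token in tokens:
        if token.startswith(":"):
            current = token              # start of a new keyword
            keywords[current] = []
        elif current is None:
            positional.append(token)     # interface object or object-set
        else:
            keywords[current].append(token)

    # A leading *name* token is taken to be the interface object (window).
    first = positional[0] if positional else ""
    if len(first) > 2 and first.startswith("*") and first.endswith("*"):
        window = positional.pop(0)
    else:
        window = context["window"]       # default: the current window

    if not keywords:                     # default: the last display type
        keywords = {context["display_type"]: []}

    context["window"] = window
    context["display_type"] = next(iter(keywords))
    return {"window": window, "objects": positional, "keywords": keywords}


context = {"window": "*w0*", "display_type": ":p"}
full = parse_command("[*w1* image1 :p]", context)
short = parse_command("[image1]", context)   # relies on the defaults
```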


An important operation for displaying spatial objects
is the ability to apply functions to objects prior to
displaying or interacting with them. These operations
almost never involve creating a new object. An
example is manipulating the underlying color look-up table
to perform a thresholding operation. In this case, there
is no thresholded image object produced, only what is
displayed in a window. This goes by many names in
different systems, such as Pixel Mapping Functions,
Dynamic Color, and Generalized Color Look-Up Tables.

There are two aspects to such functions. First, there
are limitations on the types of functions that should be
specified for application to an object when it is displayed.
Operations such as zooming, panning, manipulating the
color look-up table, specifying which planes in the
display buffer are used, and simple point-wise algebra with
limited conditional evaluation are very useful and will
be supported. But it doesn’t make sense for operations
such as generalized warping, detailed processing over
a neighborhood, or generalized intersection to be done
via interface commands. Second, there are also
language-specific aspects for specifying function application
to objects prior to display. In Lisp, it is straightforward
to pass lambda expressions or closures which are applied
to each position or value prior to display. In C, this
requires a library of standard functions and an interpreter.

In the actual operation of the IUE, it is not necessary
that all interactions take place through this command
language: some will be invoked by menus and special
keys and refer to the current display context. An
important part of the design of the IUE interface entails
how commands (and which commands) are mapped onto
menus and other interactive devices. This is especially
important since the interface will support a wide
community of users ranging from naive users who are
interacting with hardened applications to developers. Naive
users may want lots of interactive devices such as
specialized menus, while experienced users will want more
powerful, abbreviated commands. Advanced users will
become very adept at shortcuts that should be provided.

5.1  Examples 

The  following  presents  some  examples  of  specified  dis¬ 
play  operations  using  the  command  language. 

[*w1* image1
   :p
   :linear 0 128 *screen-min* *screen-max*]

This would display to window *w1* using the current
defaults. The range of object values from 0 to 128 is
linearly mapped onto the range of values from *screen-min*
to *screen-max*.

[image :p]
[edge-image :overlay red]

An image is displayed in the current window using a
pixel-type display. The edge image is then overlaid on
top of this. Wherever the edge-image is equal to 0,
nothing is displayed in the red overlay plane, and wherever
the edge-image is equal to 1, the corresponding pixel is
set in the red overlay plane.

[:overlay-colors (red, green, blue, violet)]
[image :p :value-function
   (if (image.value > 10) red blue)]

The  first  command  tells  the  current  display  to  use  the 
specified  overlay  colors.  The  second  will  display  red  in 
the  overlay  plane  at  a  screen  pixel  corresponding  to  an 
image  pixel  if  the  image  value  is  greater  than  10,  other¬ 
wise  it  will  display  blue. 

[spatialIndexImage
   :value-function
   (if (label-image.value == NULL)
       0
       (length (spatialIndexImage.value)))
   :linear 0 20 0 *screen-max*]

This  function  displays  a  spatial  index  image  (an  image 
where  each  pixel  contains  a  list  of  all  the  objects  which 
occupy  that  pixel).  The  value  function  determines  the 
number  of  objects  in  this  list  and  the  linear  function 
maps  this  onto  available  screen  intensities. 
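The effect of this command can be sketched in Python as below: the value function counts the objects stored at each pixel and the linear mapping scales counts in the range 0 to 20 onto screen intensities. The function name, the use of `None` for empty pixels, and the default `screen_max` of 255 are assumptions for illustration.

```python
def count_display(spatial_index, lo=0, hi=20, screen_max=255):
    """Display intensities for a spatial index image.

    Each pixel holds either None or a list of the objects occupying it.
    The object count is mapped linearly from [lo, hi] onto [0, screen_max],
    with counts outside the range clamped.
    """
    display = []
    for row in spatial_index:
        display_row = []
        for cell in row:
            count = 0 if cell is None else len(cell)
            t = min(max((count - lo) / (hi - lo), 0.0), 1.0)
            display_row.append(round(t * screen_max))
        display.append(display_row)
    return display


index_image = [[None, ["region"] * 20, ["region"] * 40]]
intensities = count_display(index_image)
```

No new image object needs to be kept: the counts exist only as displayed intensities, in line with the display-time functions discussed earlier.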

[image1 image2
   :p
   :value-function (image1.val - image2.val)
   :linear -20 20 *min* *max*]

This displays the difference between two images. Other
common value functions would be for type conversion
and display histogram transforms. The user can also
specify functions in the interactive mode to be applied
to the values in the different queues. For example:

[image
   :i
   :1 [p :overlay-plane clear]
      [p image :value-function
         (if (image.value > object-values[1])
             red blue)]

The user has selected an image location with a mouse
click and the corresponding queues have been filled with
the window and object positions and values.
Thereafter, when the user hits the terminal key 1, the
overlay planes will be cleared and all image locations with
a value greater than the value at the selected image
location will be displayed in red, otherwise blue, in the
overlay planes. image.value is a dummy variable that
refers to the current value in image which is being
displayed. object-values[1] refers to the value selected using
a mouse click in the display window and stored in the
object-value queue. red and blue refer to globally defined
overlay colors. Recall that the :value-function specifies
the operation to be applied to an object value to map it
onto a screen intensity or color.

[:link *w1* :zoom 2 2 :pan 60 60]

This links *w1* to the current window and concatenates
a zoom and a pan transformation.

[RegionDB
 :p
 :positions RegionDB.locations
 :values RegionDB.textureMeasure
 :linear 0 100 *min* *max*
 :red-8]

This says to display the RegionDB in the current display
window, with the positions coming from the locations
attribute of the regions in the RegionDB and the values
obtained by taking the RegionDB texture measures and using
a linear mapping from these onto screen intensities in 8
bits of red.

[*W1* histogram :plot1d]

[*W2* image :p]

[*W1* histogram
 :i
 :1 [min = object-values[1].x]
    [max = object-values[2].x]
    [*W2* image
     :p
     :value-function
       (if ((image.value > min) &&
            (image.value < max))
           blue red)]

This is an example of plotting used for interactive
histogram segmentation, in which the interaction methods
let us click on the axis of a plotted function to return
the x coordinate and the y value of the displayed object
and then use these values to specify peaks in a histogram.
Here the user has plotted a histogram in *W1*. He then
selects the range of values by clicking on the displayed
histogram. The current-object-value contains the x and
y value from the displayed histogram. These are stored
in the local values min and max. When the user hits the
key 1, the selected range of values is displayed in the
blue overlay plane in *W2*.
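
The value function used in this interaction amounts to a range test on each pixel. The following sketch restates it outside the interface; the function name and the representation of overlay colors as strings are assumptions made for illustration.

```python
def segment_by_range(image, lo, hi, inside="blue", outside="red"):
    """Label each pixel by whether its value lies strictly between lo and hi,
    mirroring the test (image.value > min) && (image.value < max)."""
    return [[inside if lo < value < hi else outside for value in row]
            for row in image]


overlay = segment_by_range([[10, 120], [200, 90]], lo=80, hi=150)
```

Here lo and hi play the role of the two x coordinates picked from the plotted histogram, and the result is what would be written to the overlay planes of the image window.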

6  Additional  Features 

Even  though  our  focus  has  been  on  developing  the  core 
functionality  of  the  user  interface,  there  are  several  other 
features  that  have  been  considered  for  use  with  the  inter¬ 
face.  Some  of  these  can  be  built  on  top  of  the  interface 
objects  and  operations  described  previously.  These  are 
important candidates as packages and libraries to augment
the core IUE.
One important area involves interactive task management
tools. Examples can be found in the data-flow editor in
the Cantata portion of Khoros and the Task editor in
KBVision. Another area that several people feel is
important is developing graph browsers for the display of
graphs and networks, generally representing objects or
values as nodes and using links to describe relations.
Graph browsers can have difficulties when trying to
display several nodes with arbitrary relations between
them, in that the connections between the nodes can begin
to obscure the overall display. A typical use would be
for the display of a constraint or coordinate transform
network.

There are probably hundreds of nice interactive controls
for displays and visualization that exist in different
environments, such as interactively manipulating the
object-value to screen-intensity function by interactively
shaping a function; selecting color look-up tables;
modifying color look-up tables; interactively building
display commands using templates or command browsers;
floating tool palettes of interactive drawing tools; etc.
In general, such tools can be very useful, but it is
extremely important that there be a consistent look and
feel across the different applications that are based on
the IUE. This will be partially achieved by depending on
the underlying graphical user interface to supply the
basic interface objects.

Other  useful  interface  tools  are: 

• Interactive Selection and Modification of color
lookup tables and display mapping functions; cycling
through different color lookup tables.

• A dialog box for setting up system defaults and
initializing characteristics of the IUE: initial layout,
font selected, level of expertise, etc.

• Access to and Integrated use of Established
Visualization Packages: There are several data
visualization products and it would be nice to have a
modular interface to these.

• Mensuration tools: Such things as rulers, grid
overlays, and the use of multiple cursors to mark off
distances and points of reference. These probably can
be built on top of basic interface capabilities and
the display of IUE objects (in particular, the display
interaction methods and IUE objects such as
bit-mapped regions and line-objects).

• Interactive Object Creation (Draw Objects): It
should be possible to create objects interactively.
This is useful for creating idealized data for testing
and development. This should be supported by
the display interaction methods and access to the
instantiation methods associated with spatial objects.

• Incorporating Hardware Accelerators: The interface
and the IUE in general should be able to modularly
incorporate different hardware accelerators.

• Display Buffer Optimization: The display buffer
itself is a short term memory for manipulating the
view of a displayed object. A useful feature would
be routines to directly access the display buffer or
to perform specific display operations in ways
optimized for particular types of displays.

7  Status  -  Implementation  Trade-offs 

We are currently prototyping many different parts of
the user interface to complete the functional
specification and to answer basic implementation questions
about choices regarding GUIs and user interface toolkits.
This will help to simplify the job of the eventual
integrating contractor. For reasons of rapid development,
current implementation is taking place in C and C++ on
Silicon Graphics machines using the GL graphics library,
Motif, and the FORMS user interface toolkit. We have been
able to put up the general display object and the
different browsers and hope to use these as initial
browsers and displays specialized for use with the Data
Exchange Format. We are exploring extensions to GNUPlot
so it is compatible with the methods associated with the
general display class and can provide an inexpensive
plotting package. We are also evaluating OpenGL as a
possible machine independent graphics package.

8  User  Facilitation  Tools 

The IUE will be supported by on-line documentation
and tutorials. The tools for implementing these will also
be available for enhanced communication and publication
by scientists and developers who use the IUE. While
there is significant activity in developing documentation
and hypermedia toolkits, they remain largely machine
dependent with no clear standardization. We are
developing a simple documentation tool called Knowledge
Weasel (KW) which is based on Lucid Emacs 19 and
existing media editing tools.

Knowledge Weasel (KW) is a presentation and authoring
system designed to support annotation using several
different types of media. A simple analogy for KW is
reading a book or attending a lecture and being able
to make diverse types of comments and annotations on
the material. In reality, such unrestricted annotations
and comments made with respect to real books and lectures
could create a significant mess (especially if made
by several different people), so in developing KW we
have extended this simple metaphor in several ways. The
first is to provide a general format for annotations that
can include several different types of media. An
annotation is a common record structure wrapped around
instances of different types of media such as text files,
sound, drawings, postscript files, GNU-plots, code
running in the GDB debugger, and others. Annotations are
implemented much like property lists in Lisp, with
attributes and values, and are displayed as buttons with
an associated region of support. When an annotation is
selected it performs an operation specific to the type of
annotation selected. Annotations are created using
existing media editing tools for operations such as
recording a sound, drawing packages, calls to other
branched processes, and grabbing a portion of the screen.
The second extension has been to develop different types
of navigation, organization and presentation tools to
keep users from being overwhelmed with a great deal of
possibly irrelevant information. Users can prune the set
of annotations that they want to deal with and also how
they are displayed. Annotations are structured to make
possible intelligent processing, perhaps eventually
including rule-based processing for automatic
presentation and "ferreting" of information (hence the
name).
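
The annotation record described above can be illustrated with a short sketch. KW itself is written mainly in Emacs Lisp; the Python below is only a model of the idea, and every name in it (the handler table, the attribute names, the operations returned as strings) is hypothetical.

```python
def make_annotation(kind, region, **attributes):
    """Build an annotation record: a media type, a region of support,
    and any further attribute/value pairs, much like a property list."""
    return {"type": kind, "region": region, **attributes}


# Hypothetical per-type operations; a real system would launch media tools.
HANDLERS = {
    "text": lambda a: "open " + a["file"],
    "sound": lambda a: "play " + a["file"],
    "gnuplot": lambda a: "plot " + a["file"],
}


def select(annotation):
    """Selecting an annotation runs the operation specific to its type."""
    return HANDLERS[annotation["type"]](annotation)


def prune(annotations, keep):
    """Return only the annotations the user wants to deal with."""
    return [a for a in annotations if keep(a)]


notes = [
    make_annotation("text", (120, 180), file="comment.txt", author="reader-1"),
    make_annotation("gnuplot", (300, 315), file="hist.gp", author="reader-2"),
]
mine = prune(notes, lambda a: a["author"] == "reader-2")
```

Because every annotation shares the same record shape, pruning and conditional display reduce to predicates over attributes, which is also what would make later rule-based processing possible.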

We are implementing KW on Lucid Emacs 19, which
is in turn based on the X window system. Lucid's
implementation of Emacs Lisp provides primitives for
handling display attributes such as windows, fonts, and
colors. Lucid Emacs version 19 has a built-in Lisp
interpreter for Emacs Lisp and this Lisp variant provides
a wide variety of primitives that are useful for
manipulating text, processes, and/or files. It is
available via anonymous FTP on the Internet, and is also
the basis of a commercial product. Knowledge Weasel is
chiefly written in Emacs Lisp, but some parts, such as the
part which interacts with the operating system's lock
daemon (lockd), are in C and communicate via pipes with
the Emacs Lisp portion of the implementation.

Figure 8 shows an example using some of the current
features of KW. A user is reading some text about
histogram equalization from a text file in Emacs. He has
selected some annotations for display (these could be
comments from other users or references to other
materials). One annotation corresponds to bringing up the
corresponding code and then executing it step-by-step in
the GNU debugger. One nice thing about the integration
with GNU-tools and Emacs is that it is possible to
directly annotate running programs for step by step
commentary. The user has selected the button *View
Histogram* which is associated with a GNUPlot-type
annotation. Annotations are displayed in a larger font of
text (which is colored). The actual display of
annotations is controlled by a user. Annotations can be
conditionally displayed and mapped onto different colors
and fonts.

We  have  begun  using  an  initial  version  of  KW  to  de¬ 
velop  an  on-line  version  of  the  Low  Level  Vision  course 
taught  at  Georgia  Tech.  We  also  plan  to  use  it  as  part  of 
a  computer  vision  algorithms  course  where  students  will 
select  a  paper  from  the  literature,  implement  the  corre¬ 
sponding  algorithms  and  use  KW  to  develop  a  tutorial 
presentation  of  the  paper. 

8.1  CD-ROM 

A significant instance of technology transfer is the
DARPA IU Proceedings and workshop. For the next
meeting, we hope to enhance this by having the workshop
proceedings available on CD-ROM, and integrated
with the Data Exchange Format, a documentation and
browsing tool such as Knowledge Weasel, and, possibly,
the IUE itself. This will enable an extraordinary
type of paper which includes data, code, additional
references, animations, and extensive annotations and
cross-references.
Abstract

A major activity of the IUE committee is the
design of a data exchange standard for IU
algorithm results. The exchange standard is
formulated according to object-oriented design
principles and is based on the class hierarchy of
the IUE specification. This paper provides an
overview of the exchange format.

Introduction 

As the design of the IUE progressed, it became clear
that the concepts for IU data structures and operations
could be applied to the formulation of a data exchange
standard for IU research and application results. Such a
standard is badly needed since two application-oriented
programs are now underway at DARPA which involve
the cooperation of a large number of research groups.
One of these projects, RADIUS, involves a number of
university and other research institutions that are
developing IU algorithms to support site modeling and
image analysis. The second project is the Unmanned
Ground Vehicle (UGV) project, which is focussed on
autonomous navigation and reconnaissance. It is clear
that both of these projects can benefit from the
capability to exchange detailed results of algorithms
such as image segmentation, feature grouping and model
matching.

The IUE data exchange standard is based on an
object-oriented representation of the main structures
used in IU research and applications. The primary
emphasis is on the relationship between image signal data
and geometric structures. Much of IU research is involved
with grouping and matching of geometry derived from
images. Another major area of processing and
representation is associated with the derivation of
camera parameters associated with camera calibration and
camera motion.

The design of the exchange format is based on these
design principles: character based formats (ASCII) will
be used for simplicity, object descriptions will be
stored in Lisp-like lists, and the format facilitates the
transfer of data between different systems.

Character format The first design choice eliminates
binary formats, which may be necessary for efficient
storage of some objects, but simplifies transportability
between different systems. Note that this format
is not primarily for the storage of image data, but for
the storage of more geometric and relational IUE object
data.

LISP-like syntax The second principle implies
only that parentheses (or other suitably defined macro
characters and reserved words) surround the data.
Otherwise the format is relatively free-form. Since Lisp
has a simple syntax, this assumption provides a small set
of delimiters to break the data and an easy mechanism to
read the data in Lisp. For the C++ implementation, the
parsing will be straightforward and, through the use of
a few key words, the format can be efficiently parsed by
Lex and Yacc parsing mechanisms.
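
To illustrate how small the set of delimiters is, the following sketch reads parenthesized data of this kind into nested lists. It is not the prototype Lex and Yacc parser; it is a minimal reader written under simplifying assumptions: names are double-quoted strings without embedded quotes, comments start with a semi-colon, and strings and symbols are both returned as plain Python strings.

```python
import re

_TOKEN = re.compile(r'''
    (?P<comment>;[^\n]*)   |   # comment: runs to the end of the line
    (?P<string>"[^"]*")    |   # double-quoted name
    (?P<paren>[()])        |   # the only structural delimiters
    (?P<atom>[^\s()";]+)       # keyword, symbol, or number
''', re.VERBOSE)


def _atom(token):
    """Convert a bare token to int or float when possible, else keep the text."""
    for cast in (int, float):
        try:
            return cast(token)
        except ValueError:
            pass
    return token


def parse(text):
    """Return the list of top-level clauses found in text as nested lists."""
    stack = [[]]
    for match in _TOKEN.finditer(text):
        kind, token = match.lastgroup, match.group()
        if kind == "comment":
            continue
        if kind == "paren" and token == "(":
            stack.append([])
        elif kind == "paren":
            if len(stack) == 1:
                raise ValueError("unbalanced ')'")
            clause = stack.pop()
            stack[-1].append(clause)
        elif kind == "string":
            stack[-1].append(token[1:-1])
        else:
            stack[-1].append(_atom(token))
    if len(stack) != 1:
        raise ValueError("unbalanced '('")
    return stack[0]
```

A reader of this shape makes a single pass over the text and needs no lookahead beyond one token, which is consistent with the aim of keeping the format easy to parse in either Lisp or C.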

Free form output By expecting the user to have
relative freedom in the output sequencing, we are not
required to analyze the data to find relatively efficient
formats. The user will specify what objects are to be
saved, the order of the objects and the set of object
slots that are included. This approach simplifies the
output process, but does require the user to ensure that
all the required object instances are defined before they
are used by other classes. The design assumes that the
format conversion is a single pass operation.

Portability  The  final  principle  requires  a  format  that 
is  easily  read  and  written  in  either  Lisp  or  C,  one  that 
is  not  dependent  on  the  host  machine,  and  one  that  can 
be  transformed  into  other  internal  representations  inde¬ 
pendent  of  the  Image  Understanding  Environment.  The 
syntax  is  also  very  similar  to  the  class  construction  styles 
used  in  both  C++  and  CLOS,  so  the  action  routines  of 
the  parser  do  not  require  much  reconfiguration  of  the 
data  to  form  class  constructors. 

Relation  to  Other  Standards 

There are many standards for binary image data file
formats, such as NITF and TIFF, and even ASCII formats
such as Postscript. The Programmers Image Kernel (PIK)
standard provides additional representations for image
processing operations as well as representations for
various image data types. There are also standards for
the representation of CAD geometry, such as the Initial
Graphics Exchange Specification (IGES). Some aspects of
IU are also captured by standards associated with the
exchange of geographic data, such as the DOD Vector
Product Format (VPF) Standard and the Standard
Interchange Format (SIF) used to represent simulation
database entities. More recently, standards are emerging
for the representation of product design information
under the Product Data Exchange Using STEP (PDES) program
of the Department of Commerce. STEP is an international
standard for the representation of product geometry and
functionality. In addition, some aspects of product
definition and test requirements are being addressed
by the DOD Computer-aided Acquisition and Logistic
Support (CALS) program.

These existing standards do not cover the breadth of
mathematical and physical concepts inherent in IU
algorithms. For example, in the case of image
segmentation, there are many attributes which must be
defined, such as edge strength or edgel orientation. In
order for different research groups to use each other's
results, these attributes must be defined according to a
standard naming convention and associated mathematical
definition. As another example, IU algorithms depend on
many types of grouping operations, some of which, such as
the Hough transform, are quite unique to IU and are not
supported by other exchange formats.

Finally, IU is a rapidly evolving discipline and it is
necessary to have an easily extensible standard and at
the same time maintain the compatibility of existing
data. The IUE object-oriented design approach enables
this flexibility through inheritance and class
definitions which can be provided in the exchange file
itself.

In the remaining sections, we provide a summary of
the ideas behind the development of the standard and
provide the syntax for the current version of the file
format.

Core Exchange Data Structures

The initial scope of the data exchange format is based on
the core data structures in the IUE. The following
summarizes the classes to be supported in the initial
release of the standard.

1. Image Data

(a) 8 and 16 bit intensity data

(b) 8 bit, 3-channel color data

(c) multi-channel Landsat data

(d) range with registered intensity data

2. Spatial Object

(a) basic spatial object

(b) 2d and 3d pointsets

(c) 2d and 3d implicit and parametric lines

(d) 2d and 3d implicit and parametric planes

(e) 2d polygonal topology (e.g., vertex, edge, face)

(f) 3d polyhedral topology

(g) 2d and 3d box neighborhoods

3. Transforms

(a) 2d and 3d Euclidean

(b) 3D quaternion

(c) 4x4 Projective

(d) 4x3 Image Transform

(e) UTM and lat-long earth coordinates

4. Spatial Indices

(a) 2D grid

(b) 2D quadtree

(c) 2D R-tree

(d) 2D Hough array

5. Image Features

(a) edgel

(b) pixel chain

(c) edgel chain

(d) segmentation line segment

(e) connected line segments

(f) image region

6. Sensors

(a) perspective camera

(b) stereo pair

(c) moving linear array camera

(d) range camera

These structures represent only a portion of the IUE
design, but have been selected as an initial
implementation goal and are likely to provide maximum
utility to the RADIUS and UGV projects mentioned in the
introduction.


An Example

The following example is taken from the IUE
specification document, which includes examples for data
exchange. The specification is mainly concerned with
naming and definition of region attributes.

2d-image-region

Description A 2d-image-region is a connected set
of image pixels, registered with a set of images by
operations such as histogram segmentation, region-growing,
and model surface projection. Some attributes and
operations involving regions are based upon the set of
points which comprise the region (e.g., compactness, Euler
number); other operations and attributes are based upon
the image values at these locations (e.g., average
intensity in the image area corresponding to the region).
This concept can be generalized for voxel processing in
analogy to the class block.

Superclasses

image-feature 2d-unordered-pointset face
connected-image-pointset

Pseudo Slots (Attributes)

l-chns: list(image-l-chain)

Multiple interior boundaries as a list of, usually,
image-pixel-chains. These chains can be 4-connected
or 8-connected. The boundary is usually
composed of several pixel-chains which intersect at
image vertices.

area: integer

The size of the region in image pixels; a method of
face specialized for discrete pixel regions.

number-of-holes: integer

The number of holes in the region.

minimum-bounding-rectangle:
2d-aligned-rectangle-neighborhood

centroid: point

The position of the centroid of the region.

scatter-matrix-of-pixel-positions:
vector[2](vector[2](float))

Provides covariance of x and y coordinates of pixels
in the region.

compactness: float

Ratio of perimeter to area.

adjacent-regions: list(2d-image-region)

A list of regions which share a boundary with self.
The shared boundary descriptions are contained in
the inferiors of each region.

intensity-distribution: vector[2](float)

A distribution (assumed Gaussian) with two slots,
mean and variance. These are floats with the values
computed using all the points in the region and
the corresponding intensity image. If the nature of
the image is unknown, then this is the mean of its
values. For other, known, image types such as red,
infra-red, range, etc., other attributes will be used,
but they have the same general form.

red-distribution: vector[2](float)

Distribution for the red component.

green-distribution: vector[2](float)

Distribution for the green component.

blue-distribution: vector[2](float)

Distribution for the blue component.

XXX-distribution: vector[2](float)

Distribution for the XXX component. Since these
are implemented as pseudoslots, any number of such
distributions can be specified according to
application requirements. The general name for the
different distributions is <band-name>-distribution for
the variety of image bands.
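
Several of these attributes follow directly from the pixel set and the registered image. The sketch below shows one way they might be computed; it is not taken from the IUE specification. The function name is hypothetical, pixels are assumed to be (x, y) pairs indexing image[y][x], and both the scatter matrix and the variance are assumed to be population statistics (divided by the number of pixels).

```python
def region_attributes(pixels, image):
    """Compute area, centroid, scatter matrix, and intensity distribution
    for a region given as a list of (x, y) pixel positions."""
    n = len(pixels)
    if n == 0:
        raise ValueError("a region must contain at least one pixel")
    cx = sum(x for x, _ in pixels) / n
    cy = sum(y for _, y in pixels) / n
    # Covariance of the x and y coordinates of the pixels in the region.
    sxx = sum((x - cx) ** 2 for x, _ in pixels) / n
    syy = sum((y - cy) ** 2 for _, y in pixels) / n
    sxy = sum((x - cx) * (y - cy) for x, y in pixels) / n
    # Mean and variance of the image values at the region's locations.
    values = [image[y][x] for x, y in pixels]
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / n
    return {
        "area": n,
        "centroid": (cx, cy),
        "scatter-matrix-of-pixel-positions": [[sxx, sxy], [sxy, syy]],
        "intensity-distribution": [mean, variance],
    }


attrs = region_attributes([(0, 0), (1, 0), (0, 1), (1, 1)], [[1, 3], [5, 7]])
```

Compactness is left out because it depends on how the perimeter of a discrete region is measured, which the text above does not fix. A band-specific distribution such as red-distribution would be obtained by passing the corresponding band as the image.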

An  example  of  the  data  exchange  format  for  region  A  in 
the  figure. 

(make 2d-image-region "reg-A"
  (slot l-chns (list
    (make image-l-chain "lchn-a-b-A"
      (slot edges (list
        (make 2d-pixel-chain "pchn-a-b"
          (slot v0 (make 2d-vertex "vert-a"
            (slot p (vector 2 integer 11 13))))
          (slot v1 (make 2d-vertex "vert-b"
            (slot p (vector 2 integer 13 9))))
          (slot n 24)
          (slot chain-code-sequence
            (vector 23 integer 4 4 4 4 3 2
              4 3 1 3 1 2 2 1 0 0 0 0 7 7 7 6 7)
          )
        )
        (make 2d-pixel-chain "pchn-b-a-1"
          (slot v0 (use "vert-b"))
          (slot v1 (use "vert-a"))
          (slot n 5)
          (slot chain-code-sequence
            (vector 4 integer 1 1 2 1)
          )
        )
      ))
      (slot dir (vector 2 integer 1 1))
      (slot closed-p true)
    )
    (make image-l-chain "lc-c-c-A"
      (slot edges (list
        (make 2d-pixel-chain "pchn-c-c"
          (slot v0 (make 2d-vertex "vert-c"
            (slot p (vector 2 integer 7 7))))
          (slot v1 (use "vert-c"))
          (slot n 14)
          (slot chain-code-sequence
            (vector 13 integer 6 6 7 6 0 0 0
              1 2 2 3 4 4)
          )
        )
      ))
      (slot dir (vector 1 integer 1))
      (slot closed-p true)
    )
  ))
  (slot nghbrhood
    (make 2d-image-pixel-neighborhood "n-S"
      (slot num-nghbrs 7)))
  (slot number-of-holes 1)
  (slot adjacent-regions (list "reg-B"))
  (slot intensity-distribution
    (vector 2 float 135.2 3.4))
)


The  Exchange  Format 

Basic  syntax 

At the most basic level, IUE Exchange Format looks
somewhat like a Lisp file; the format is designed to be
readable by most Lisp readers without much difficulty,
should this be necessary. A prototype C parser generated
by the standard Unix(TM) utilities Lex and Yacc
is available upon request.

File  Organisation 

A file in IUE Exchange Format consists of an IUE
Exchange Format version identifier, followed by a series
of IUE Class instance descriptions, default slot value
settings, and references. The general model is that the
content of an individual IUE file will correspond to one
hierarchy, e.g. one site or one building. IUE files may
reference external objects by identifying their IUE
files, consistent with the notion that the file/hierarchy
relationship is one-to-one. Since an IUE file will
normally contain a flattened series of IUE Class Instance
descriptions and references to those descriptions, the
convention has been adopted that the final IUE Class
Instance description or reference in a file will be
considered the root of the hierarchy. The standard
algorithm for traversing a network of IUE Class Instances
will invariably attempt to make the root object the last
one referenced in the file.

File  Identification 

An IUE File Identifier is constrained to appear at the
very beginning of the file, with no spaces or newline
characters embedded. This is so that a file sniffer may
depend on the first 29 characters of the file being
"(IUE-Exchange-Format-Version ". Case must be strictly
adhered to in this particular instance, again, so that
file sniffers may be as simple and fast as possible. It
will probably be adequate for file sniffers to examine
the first four characters ("(IUE") in most instances.
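
A file sniffer of the kind described here reduces to a prefix comparison. The sketch below is illustrative only; the function name and the quick flag are hypothetical, and nothing is assumed about the text that follows the 29-character prefix.

```python
# The 29-character identifier, including the trailing space.
IUE_PREFIX = b"(IUE-Exchange-Format-Version "


def looks_like_iue(data, quick=False):
    """Return True if the leading bytes of a file identify it as IUE Exchange Format.
    With quick=True only the first four characters are examined."""
    if quick:
        return data.startswith(b"(IUE")
    return data.startswith(IUE_PREFIX)
```

The comparison is case-sensitive and tolerates no leading whitespace, matching the requirement that the identifier appear at the very beginning of the file.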

Comments 

Comments  are  introduced  by  a  semi-colon  (;)  character, 
as  in  Common  Lisp,  and  run  to  the  end  of  line  character, 
again  as  in  Common  Lisp. 


Sequence  numbers  and  Class  instance  names 

When an IUE file writer describes an IUE Class instance
(using a 'make' clause), it is assigned a unique positive
integer. Most writers will probably start with one and
increment the number for each Class instance described,
but as long as the integers are unique, and all integers
referenced are defined somewhere in the file, there is no
other requirement. For files generated by humans, Class
instances may have names instead of numbers, in which
case they are not assigned integers from the sequence.
These names must be unique within the context of the
file, and have no meaning outside of that context; an IUE
file reader may discard them once a file has been read. In
a human-generated IUE file, both class instance numbers
and names may be used.

Using  an  object 

The (use ...) clause is the standard method for
referencing an IUE class instance. A use clause may refer
to an integer sequence number (described in the previous
paragraph), an IUE Class instance name, or an external
object (using an external clause). The object need not
have been defined at the time that a use clause refers to
that object; in such cases an IUE file reader will place
the reference on the list of presently unresolved
references, which are to be cleaned up by the time that
the file has been completely processed. A standard
algorithm for forward reference handling is provided in
an Appendix. Examples of (use ...) clauses:

(use 12)

(use "foo-bar edge")
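
The bookkeeping for unresolved references can be sketched as follows. This is not the standard algorithm from the Appendix, which is not reproduced here; it is one simple scheme, with hypothetical class and method names, in which each forward reference registers a callback that is run once the object is defined.

```python
class ReferenceTable:
    """Track defined objects and references that were used before definition."""

    def __init__(self):
        self.defined = {}      # sequence number or name -> object
        self.unresolved = {}   # key -> callbacks waiting for that object

    def define(self, key, obj):
        """Record a (make ...) result and satisfy any waiting references."""
        if key in self.defined:
            raise ValueError("duplicate sequence number or name: %r" % (key,))
        self.defined[key] = obj
        for patch in self.unresolved.pop(key, []):
            patch(obj)

    def use(self, key, patch):
        """Handle a (use ...) clause: patch now if possible, else defer."""
        if key in self.defined:
            patch(self.defined[key])
        else:
            self.unresolved.setdefault(key, []).append(patch)

    def finish(self):
        """At end of file every reference must have been resolved."""
        if self.unresolved:
            raise KeyError("unresolved references: %s" % sorted(map(str, self.unresolved)))


table = ReferenceTable()
edge = {}
table.use(12, lambda obj: edge.__setitem__("v1", obj))   # forward reference
table.define(12, {"class": "3D-vertex"})                 # resolved here
table.finish()
```

Keeping numbers and names in the same table also enforces the rule that each must be unique within a file.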
External objects

An (external ...) clause may be used inside a (use ...)
clause to include files. Any file included must be in the
same directory as the file containing the (external ...)
clause. The last Class instance listed in the external
IUE file by either a (make ...) clause or a (use ...)
clause will be the one referenced by the (use (external
...)) clause.

Example of an (external ...) clause:

(use 23 (external "my-cube.iue"))

In this example, the file my-cube.iue is processed and
the last IUE Class instance made or used in the file is
returned, and assigned sequence number 23. This
sequence number may be used elsewhere in the file, but
it must not appear in any (make ...) clause in the same
file.

It is expected that IUE file readers will keep a list of
files and the last objects referenced by them; this way,
when an external reference is made, a check can be made
to see if the file has already been read; otherwise
multiple copies of objects might be created.
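
The list of files already read behaves like a cache keyed by file. A minimal sketch, assuming a hypothetical read_file function that processes one IUE file and returns the last Class instance made or used in it:

```python
import os


class ExternalCache:
    """Remember the root object of each external file so it is read only once."""

    def __init__(self, read_file):
        self._read_file = read_file   # hypothetical: path -> root object
        self._roots = {}              # absolute path -> root object

    def resolve(self, including_file, name):
        """Return the root object of the external file `name`, which must sit
        in the same directory as `including_file`."""
        if os.path.basename(name) != name:
            raise ValueError("external files must be in the same directory")
        directory = os.path.dirname(os.path.abspath(including_file))
        path = os.path.join(directory, name)
        if path not in self._roots:
            self._roots[path] = self._read_file(path)
        return self._roots[path]
```

Two references to the same external file then yield the same object rather than two copies of it.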

Making  an  object 

A (make ...) clause consists of the reserved word 'make',
followed by an identifier corresponding to a Class name
from the IUE class hierarchy, followed by either a
sequence number or a string, and finally the slots and
attributes for this IUE object and their contents.
Attributes are items not provided for in the IUE standard
that an IUE user may wish to attach to IUE objects.
Examples of (make ...) clauses and slots appear in the
next section.

Slots 

Slots  may  contain  a  variety  of  data  types;  these  include 
simple  types  like  bit,  int,  and  float,  more  complex  types 
such  as  string  and  vector,  and  may  contain  objects  or 
lists  of  objects. 

A slot clause consists of the word 'slot', followed by the
name of the slot and by the content of the slot. If a slot
refers to an IUE Class instance, then a use clause will
be emitted referencing the object. If the object does not
exist, then in a human-generated IUE file, a make clause
may be inserted describing the object. A human writing
an IUE file may recursively descend through the
structure, writing make clauses as IUE Class instances are
referenced for the first time. A program generating an
IUE file is required to generate a flattened file
description in which make clauses are never nested; this
is done because extremely large, deeply nested files may
otherwise impose unreasonable memory management demands on
IUE file readers.

Example of a simple Make clause for a 3D Vertex located
at [45.3, 23.8, 3.4] (the vector data type will be
detailed later in this document):

(make 3D-vertex 62
  (slot p
    (vector 3 float 45.3 23.8 3.4)
  )
)


Attributes 

Attributes are very similar to slots, but being user-defined
items, they are not described in the IUE document.
It is the responsibility of the IUE user to ensure that any
attributes can be properly written and read.

Attributes also differ from slots in that, since they are
not described in the IUE document, their data type is
not known until they are encountered during input; for
this reason, the attribute type is included in an Attribute
clause.

Example of a nested Make clause for a 3D edge with
2 previously undescribed vertices:

(make edge "my edge"
  (slot v1
    (make 3D-vertex 36
      (slot p (vector 3 float 3.46 -2.34 7.3298))))
  (slot v2
    (make 3D-vertex 36
      (slot p (vector 3 float 6.732 3.21 -2.3))))
  (attribute edge-name
    string "IE Edge of object Foo")
)

Example of the same description, flattened as required
for machine input, and using the hard slots inferiors and
superiors, as is more appropriate with machine-generated
IUE files:

(make 3D-vertex 2
  (slot p
    (vector 3 float 3.46 -2.34 7.3298))
  (slot superiors (list (use 1)))
)

(make 3D-vertex 3
  (slot p (vector 3 float 6.732 3.21 -2.3))
  (slot superiors (list (use 1)))
)

(make edge 1
  (slot inferiors
    (list (use 2) (use 3)))
  (attribute edge-name
    string "IE Edge of object Foo")
)
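
As a rough illustration of flattening, the following Python sketch rewrites a nested description into a sequence of top-level make clauses. It assumes clauses have already been parsed into plain Python lists (for instance ["make", "edge", 1, ...]); that representation and the function name flatten are inventions for this example. It only replaces nested make clauses by use clauses and does not add the superiors and inferiors slots shown above.

```python
def flatten(clause, out=None):
    """Return a list of make clauses in which no make clause is nested.

    Nested make clauses are emitted before the clause that contained them
    and are replaced in place by ["use", label].
    """
    if out is None:
        out = []

    def rewrite(item):
        if isinstance(item, list):
            if item and item[0] == "make":
                flatten(item, out)          # emit the nested object first
                return ["use", item[2]]     # then refer to it by its label
            return [rewrite(sub) for sub in item]
        return item

    head, body = clause[:3], clause[3:]     # ["make", class-name, label]
    out.append(head + [rewrite(part) for part in body])
    return out
```

Because nested objects are appended before their container, lower level objects come out immediately before the higher level objects that refer to them.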

It is recommended that lower level objects be created
immediately before higher level objects, in order to keep
the list of unresolved references reasonably small.

Example of a 2D 1-chain^1 which contains 3 edges:

(make 2D-vertex 3
  (slot p (vector 2 float 1.0 2.3))
  (slot superiors (list (use 2))))

(make 2D-vertex 4
  (slot p (vector 2 float 2.0 2.3))
  (slot superiors (list (use 2) (use 5))))

(make 2D-vertex 6
  (slot p (vector 2 float 7.3 8.2))
  (slot superiors (list (use 7) (use 5))))

(make 2D-vertex 8
  (slot p (vector 2 float -2.0 9.3))
  (slot superiors (list (use 7))))

(make edge 2
  (slot inferiors (list (use 3) (use 4)))
  (slot superiors (list (use 1))))

(make edge 5
  (slot inferiors (list (use 4) (use 6)))
  (slot superiors (list (use 1))))

(make edge 7
  (slot inferiors (list (use 6) (use 8)))
  (slot superiors (list (use 1))))

(make 1-chain 1
  (slot inferiors (list
    (use edge 2) (use edge 5)
    (use edge 7))
  )
)

^1 A 1-chain is a sequence of connected line segments.

Vector types

As may be inferred from the previous examples, the vector
clause is used to describe homogeneous sequences of
class instances; an array representation is presumed. A
vector clause begins with the word vector. The second
item in a vector clause is the number of elements in the
vector; the third is the data type of the vector. Vector
elements are constrained to be of the same type as specified
by the vector type, or in the case of bit types, the
elements must be integers.

IUE Exchange Format provides only 1D vectors and
vectors of vectors; matrices are represented by vectors
of similar vectors (which are generally similar in both
length and data type). It is possible to represent matrices
of arbitrary size and dimensionality using this mechanism
without extension to the grammar for the IUE
Exchange Format.

Example of a vector representation of a
1024x1024x8-bit array:

(vector 1024 vector
  (vector 1024 bit8
    10 11 10 10 14 14 16 ...
  )
  (vector 1024 bit8
    11 11 10 11 14 15 16 ...
  )
)

Note that decimal integers are being used to represent bit
values; a special syntax for bit values is not particularly
necessary.

Lists

A list clause consists of the word list, followed by a sequence
of simple types, vectors, and IUE Class instances.
Lists of simple types will always be homogeneous. Lists
of objects are always constrained so that all objects share
a common superclass. These restrictions are intended to
ease C++ implementation.
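
The vector-of-vectors representation of a matrix can be sketched in Python as follows. The function names vector_clause and matrix_clause are hypothetical, and the sketch assumes that every row has the same length and element type.

```python
def vector_clause(element_type, values):
    """Format a flat sequence as (vector <count> <type> v1 v2 ...)."""
    items = " ".join(str(v) for v in values)
    return f"(vector {len(values)} {element_type} {items})"


def matrix_clause(element_type, rows):
    """Format a 2D array as a vector whose elements are row vectors."""
    inner = " ".join(vector_clause(element_type, row) for row in rows)
    return f"(vector {len(rows)} vector {inner})"


# A 2x2 array of 8-bit values, written with decimal integers.
print(matrix_clause("bit8", [[10, 11], [11, 11]]))
```

Higher dimensionality follows by nesting the same call once more per dimension.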

Characters  and  Strings 

A simple data type for single characters has been intentionally
omitted; this is because the Lisp and the
C/C++ worlds have decidedly different notions of what
is appropriate syntax. A character string containing exactly
one character is more than adequate for representation
of a single character.

String is intentionally limited to printable ASCII characters;
by implication, strings may not presently contain
end-of-line characters or tabs. Strings may not contain
double quote characters. Strings are written between
double quote (") characters.
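
A writer can check these string restrictions before emitting a value. The sketch below is an illustration only; it assumes that "printable ASCII" means the character codes from space (0x20) through tilde (0x7E), and the function names are hypothetical.

```python
def is_valid_iue_string(text):
    """True if text holds only printable ASCII and no double quote."""
    return all(" " <= ch <= "~" and ch != '"' for ch in text)


def quote_iue_string(text):
    """Return text wrapped in double quotes, or raise if it cannot be written."""
    if not is_valid_iue_string(text):
        raise ValueError(f"not representable as an IUE string: {text!r}")
    return f'"{text}"'
```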

Default  Slot  Values 

Default slot values may be specified for classes and subclasses
using the default-slot-value clause. These defaults
will be used whenever the class is instantiated and
a slot value is not explicitly provided; the defaults may
be changed with a new default-slot-value form at the
top level in an IUE file. To set a default neighborhood^2
for 3d-ordered-point-sets, one would use a form such as the
following:

(make 3d-linesegment-neighborhood 23
  (slot span 3.3))

(default-slot-value
  3d-ordered-pointset
  3d-nghbrhood (use 23))

Floats 

The syntax for floats is a subset of those of Common
Lisp, C, and C++, thus permitting it to be parsed by the
standard tools of any of those languages. It is expected
that IUE floats will always be double precision floats.
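
As an illustration, the float productions given in Appendix 2 can be checked with a regular expression before handing the text to a standard parser. This Python sketch is an assumption-laden reading of that grammar: in particular it takes the exponent to be the letter e or E followed by an unsigned digit sequence, and the name parse_iue_float is hypothetical.

```python
import re

# <dotted-digits>: digits "." digits | digits "." | "." digits
_DOTTED = r"(?:[0-9]+\.[0-9]*|\.[0-9]+)"
# <float-exponent>: assumed to be e or E followed by an unsigned digit sequence
_EXPONENT = r"[eE][0-9]+"
_FLOAT = re.compile(rf"[+-]?(?:{_DOTTED}(?:{_EXPONENT})?|[0-9]+{_EXPONENT})")


def parse_iue_float(text):
    """Parse text as a double precision float if it matches the grammar."""
    if _FLOAT.fullmatch(text) is None:
        raise ValueError(f"not an IUE float: {text!r}")
    return float(text)
```

A bare digit sequence such as 45 is rejected here because the grammar classifies it as an integer rather than a float.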

Reserved  Words 

The number of reserved words has been kept to a minimum
and the grammar designed so that changes to the
IUE hierarchy will not necessarily force changes to the
grammar for the exchange format; in particular, IUE
class names and slot names are not reserved words.

User-defined IUE classes

A restricted form of class description is provided so that
IUE users may describe their extensions to the IUE hierarchy.
Such descriptions will be limited to class inheritance
and slot definitions; no provision will be made for
transmitting code fragments. An example follows:

(class my-new-iue-class
  (inherits
    iue-class1
    iue-class2
    iue-class3)
  (slots
    foo float
    bar integer
    baz bit)
)

^2 A 3d-linesegment-neighborhood is a linesegment joining
a pointset in a one-dimensional sequence. Points are not
considered connected if the span distance is exceeded.

A Lisp system can, of course, create such classes on the
fly during IUE file input. A C++ system will have to
take extra steps; a preprocessor will have to locate the
class descriptions in the IUE file and emit a C++ header
file fragment. The person(s) managing the IUE system
at the destination site will be responsible for integrating
this C++ code fragment with their system.

New IUE Classes must be defined before they are referenced.

Appendix  1 

Output  Algorithm 

Using this algorithm to write an IUE class instance will
produce a properly ordered flat file in which the selected
IUE Class Instance appears last, which is the proper
organization for (external ...) clauses. Infinite loops due
to circular references are avoided because objects that have
sequence numbers assigned have either been written already
or are on the stack waiting to be written, and thus do not
need to be revisited. The algorithm is depth-first in character.

Data structure required:

sequence number hash table - key is IUE-Class-Instance,
datum is sequence number

method Output-Object(IUE-Class-Instance, output-stream)

1: [check for previous visitation] if object already has
sequence number stored in hash table then return

2: [assign sequence number] obtain sequence number
and place in sequence number hash table

3: [for all slots] if slot contains object, list of objects,
or vector of objects then for each sub-object recursively
invoke Output-Object on sub-object

4: [for desired attributes] if attribute contains object,
list of objects, or vector of objects then for each sub-object
recursively invoke Output-Object on sub-object

5: [write instance header] “(make class-name
sequence-number ...”

6: [for all slots write] “(slot slot-name slot-value)”

7: [for desired attributes write] “(attribute attribute-name
attribute-type att-value)”

8: [write instance close] “)”
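
The steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not a reference implementation: the Instance class is a hypothetical stand-in for an IUE class instance, sequence numbers are simply handed out from 1 in visiting order, and attributes (steps 4 and 7) are omitted for brevity.

```python
import io
from itertools import count


class Instance:
    """Hypothetical stand-in for an IUE class instance."""

    def __init__(self, class_name, **slots):
        self.class_name = class_name
        self.slots = dict(slots)


def _sub_objects(value):
    """Yield instances held directly or inside a list/tuple slot value."""
    if isinstance(value, Instance):
        yield value
    elif isinstance(value, (list, tuple)):
        for item in value:
            if isinstance(item, Instance):
                yield item


def _format(value, seq_numbers):
    """Render a slot value; instances become (use n) references."""
    if isinstance(value, Instance):
        return f"(use {seq_numbers[value]})"
    if isinstance(value, (list, tuple)):
        return "(list " + " ".join(_format(v, seq_numbers) for v in value) + ")"
    return str(value)


def output_object(obj, stream, seq_numbers=None, counter=None):
    if seq_numbers is None:
        seq_numbers = {}               # instance -> sequence number
    if counter is None:
        counter = count(1)
    if obj in seq_numbers:             # step 1: already visited
        return seq_numbers
    seq_numbers[obj] = next(counter)   # step 2: assign sequence number
    for value in obj.slots.values():   # step 3: write sub-objects first
        for sub in _sub_objects(value):
            output_object(sub, stream, seq_numbers, counter)
    stream.write(f"(make {obj.class_name} {seq_numbers[obj]}\n")   # step 5
    for name, value in obj.slots.items():                          # step 6
        stream.write(f"  (slot {name} {_format(value, seq_numbers)})\n")
    stream.write(")\n")                                            # step 8
    return seq_numbers


# An edge and two vertices that refer to each other (a circular structure).
v1, v2 = Instance("3D-vertex"), Instance("3D-vertex")
edge = Instance("edge", inferiors=[v1, v2])
v1.slots["superiors"] = [edge]
v2.slots["superiors"] = [edge]
buffer = io.StringIO()
numbers = output_object(edge, buffer)
text = buffer.getvalue()
print(text)
```

The edge receives sequence number 1 but is written last, after both vertices, so the selected object closes the file as required for (external ...) clauses.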

Input  Algorithm 

This reader will handle both flat and deep representations
of data structures, correctly restoring circular references.
If names are to be handled as well as sequence
numbers, then a C++ implementation will need to double
up hash tables (this is not necessary in a Common
Lisp/CLOS implementation). Implementation will be
somewhat different in a Lex/Yacc driven implementation,
but details of the restoration of the circular references
will be identical.

This algorithm presumes that an implementation can
support ‘empty shell’ class instances, whose slots have
not yet been filled in. If an implementation cannot support
such IUE Class Instances, but can create an inferior
object without knowledge of its superiors (e.g., create
edges given vertices but not yet knowing 1-chains), then
it will be necessary to provide an intermediate ‘storage
class’ in which to stash Class information such as slot
contents, until the object description has been read in,
allowing the IUE Class Instance to be created. Such a
two stage process may necessitate doubled hash tables
for storage of intermediate information, and cause some
processing steps to be slightly delayed. Such a two stage
process is used in the Yacc/Lex prototype reader for
Geometer Jr.

There are many implementation details, such as handling
slots which contain lists of references (both resolved
and unresolved), that are not handled; some creativity
may be required of the implementor, although there are
no insurmountable problems (just a few irritating ones).

Data structures required:

sequence number hash table - key is sequence number,
datum is empty-shell IUE Class Instance

unresolved reference hash table - key is sequence number
of as yet undefined instance; datum is a list of references
in the form (sequence number, slot-name)

Call function Read-Object for each (make ...) in the
input stream:

Function Read-Object(input-stream) returns
IUE-Class-Instance

R1: [make instance] create appropriate shell of an IUE
Class instance based on IUE class type from (make ...)
clause

R2: [record sequence number] put Class Instance shell
and sequence number in sequence number hash table

R3: [for all slots and attributes] if make clause encountered
then recursively invoke Read-Object on it, and set
slot value to return value; else if a sequence number is in
the sequence number hash table then set slot value; else
put current sequence number, slot name, and sequence
number of undefined object on the appropriate list in the
unresolved reference table

R4: [check for references that can now be resolved] if
sequence number for newly created IUE Class Instance
appears in unresolved reference hash table, then fill the
slots in the appropriate class instances and remove the
associated triple from the table.

R5: [return] newly created IUE Class Instance as result
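
Steps R1 to R5 can be sketched in Python as follows. The sketch makes several assumptions that are not in the text: clauses are taken to be already parsed into plain Python lists, the classes Shell and Reader are hypothetical, pending references store the owning shell itself rather than its sequence number, and a list index is recorded so that references inside a list slot can be patched. Lists nested inside lists are not handled.

```python
class Shell:
    """Empty-shell instance; slots are filled in as clauses are read."""

    def __init__(self, class_name):
        self.class_name = class_name
        self.slots = {}


class Reader:
    def __init__(self):
        self.by_number = {}   # sequence number -> Shell
        self.unresolved = {}  # undefined number -> [(owner, slot name, index)]

    def read_object(self, clause):
        _, class_name, number, *parts = clause
        shell = Shell(class_name)                        # R1: make instance
        self.by_number[number] = shell                   # R2: record number
        for part in parts:                               # R3: slots/attributes
            name, value = part[1], part[-1]
            if isinstance(value, list) and value[:1] == ["list"]:
                # Create the list first so later patches have a place to go.
                items = shell.slots[name] = [None] * (len(value) - 1)
                for i, item in enumerate(value[1:]):
                    items[i] = self._resolve(shell, name, item, i)
            else:
                shell.slots[name] = self._resolve(shell, name, value, None)
        for owner, name, index in self.unresolved.pop(number, []):  # R4
            if index is None:
                owner.slots[name] = shell
            else:
                owner.slots[name][index] = shell
        return shell                                     # R5: return result

    def _resolve(self, owner, name, value, index):
        if isinstance(value, list) and value[:1] == ["make"]:
            return self.read_object(value)
        if isinstance(value, list) and value[:1] == ["use"]:
            target = self.by_number.get(value[1])
            if target is None:
                # Not yet defined: remember where to patch it in later.
                self.unresolved.setdefault(value[1], []).append((owner, name, index))
            return target
        return value
```

Recording the sequence number in R2 before the slots are read in R3 is what lets a nested or later object refer back to its container without looping.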
Appendix 2

IUE Exchange Format grammar:

<IUE-Exchange-Format-File> ::= <IUE-File-Identifier> { <top-object> }

<digit> ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9

<letter> ::= A | B | C | D | E | F | G | H | I | J | K | L | M
           | N | O | P | Q | R | S | T | U | V | W | X | Y | Z
           | a | b | c | d | e | f | g | h | i | j | k | l | m
           | n | o | p | q | r | s | t | u | v | w | x | y | z

<standard-type> ::= vector | slot | float | integer
                  | bit32 | bit24 | bit16 | bit8 | bit
                  | type

<reserved-words> ::= make | use | list | slot | t | nil
                   | <standard-type>
                   | IUE-Exchange-Format-Version
                   | default-slot-value
                   | slots | inherits | external

<digit-sequence> ::= <digit> { <digit> }

<sign> ::= + | -

<float-exponent> ::= e <digit-sequence> | E <digit-sequence>

<dotted-digits> ::= <digit-sequence> . <digit-sequence>
                  | <digit-sequence> .
                  | . <digit-sequence>

<unsigned-float> ::= <dotted-digits>
                   | <digit-sequence> <float-exponent>
                   | <dotted-digits> <float-exponent>

<float> ::= <sign> <unsigned-float>
          | <unsigned-float>

<string> ::= <double-quote> <printable-ascii-characters> <double-quote>

<integer> ::= <digit-sequence> | <sign> <digit-sequence>

<label> ::= <integer> | <string>

<identifier> ::= <letter> { <identifier-char> }
               | <digit> { <hyphen-or-digit> } <letter> { <identifier-char> }

<identifier-or-type> ::= <identifier> | <standard-type>

<identifier-char> ::= <letter> | <digit> | -

<hyphen-or-digit> ::= - | <digit>

<IUE-File-Identifier> ::= ( IUE-Exchange-Format-Version
                            <digit-sequence> . <digit-sequence> )

<make-or-use> ::= <make> | <use> | nil

<obj-list> ::= { <make-or-use> }

<top-object> ::= <make-or-use> | <default-slot-value> | <class-definition>

<list> ::= ( list <list-tail> )

<list-tail> ::= <obj-list> | <int-list> | <float-list>
              | <vector-list> | <list-of-lists>

<int-list> ::= <integer> { <integer> }

<float-list> ::= <float> { <float> }

<string-list> ::= <string> { <string> }

<vector-list> ::= <vector> { <vector> }

<list-of-lists> ::= <list> { <list> }

<vector> ::= ( vector <element-count> <vector-tail> )

<vector-tail> ::= integer <int-list>
                | float <float-list>
                | string <string-list>
                | vector <vector-list>
                | list <list-of-lists>
                | identifier <obj-list>

<element-count> ::= <int>

<class-definition> ::= ( class <class-name> <class-inheritance> <class-slots> )

<class-name> ::= <identifier>

<class-inheritance> ::= ( inherits { <class-name> } )

<class-slots> ::= ( slots { <slot-name> <identifier-or-type> } )

<make> ::= ( make <identifier> <label> <slots-and-attributes> )

<use> ::= ( use <label> )
        | ( use <external> )

<external> ::= ( external <string> )

<default-slot-value> ::= ( default-slot-value <identifier> <identifier>
                           <slot-value> )

<slots-and-attributes> ::= { <slot-or-attribute> }

<slot-or-attribute> ::= <slot> | <attribute>

<slot-descriptor> ::= ( slot <identifier> <slot-value> )

<slot-value> ::= <string> | <integer> | <float> | t | <list>
               | <vector> | <make-or-use> | <identifier-or-type>

<attribute-descriptor> ::= ( attribute <identifier> <identifier-or-type>
                             <slot-value> )
The Image Understanding Environment: Image Features


Keith Price and IUE Committee*
Institute  for  Robotics  and  Intelligent  Systems 
University  of  Southern  California 
Los  Angeles,  California  90089-0273 


Abstract 

An IUE image feature is a spatial object that
represents some information extracted from an
image. These features include image points,
curves, edges, lines, regions, and complex collections
of these kinds of features, such as perceptual
groups. Image features without reference
to the underlying image correspond to
basic objects in the spatial object mathematical
hierarchy. The image feature class shares
a common development with this mathematical
structure. In practice, image features will
be kept in collections (the features extracted
from an image) and will be used as an element
of that collection or through the use of a spatial
index. This paper describes the philosophy
behind the image feature with a few limited
examples of how they are described and used.

1  Introduction 

This paper assumes that the reader is familiar with the
basic concepts used by the Image Understanding Environment
(especially the basics of object oriented development
and the basic classes used by the IUE) and should
be read in conjunction with the other papers on the IUE
included in these proceedings. The complete documentation
on image features in the IUE requires many pages,
and this paper is intended to provide a general description
of the image feature classes, not to be the complete
description.

Central to any Image Understanding research or application
program is the extraction and use of image features.
Users of the IUE include applications users who
are primarily interested in the user interface, applications
developers who will use and extend image features,
and other developers who will implement the more basic
programs. For each of these groups, the image feature
class definitions will be important.


*The members of the IUE Committee are: Tom Binford-
Stanford; Terry Boult-Columbia; Bob Haralick, V. Ramesh-
U. Washington; Al Hanson, Chris Connolly-U. Mass.; Ross
Beveridge-Colorado State; Charlie Kohl-AAI; Daryl Lawton-
Georgia Tech; Doug Morgan-ADS; Joe Mundy-GE; Keith
Price-USC; Tom Strat-SRI.


Image features form a portion of the larger spatial
object hierarchy and are directly related to the mathematical
structure of the hierarchy. IUE users such as
application developers will be interested in how to use
and extend the image feature classes with little consideration
of the larger spatial object hierarchy. The spatial
object hierarchy and the relationship of image features
to it developed through the series of meetings of the IUE
Committee and forms a clean way to describe and implement
the hierarchy. There are several reasons why
developing image features within the spatial object hierarchy
is crucial.

First, there is a natural correspondence between the
sequence of topological constructs, e.g. vertex, edge, and
face used for spatial objects, and the descriptions of image
features for points and junctions, edges, and regions.
Access to this topological representation is especially important
for describing composite image features such as
linked line segments, adjacencies between regions found
in segmentations, and perceptual groups. Second, since
image features generally correspond to the projection of
three dimensional object models, it is useful to have the
same underlying operations and representations used for
both of them. Third, image features are characterized by
having a wide range of possible attributes. We wanted
to be sure that it would be possible for users to easily
define and extend the attributes associated with image
features. Many of the shape attributes associated with
image features are described by fitting a curve or two
dimensional shape corresponding to a spatial object.

2  Relationship  to  Spatial  Objects 

Image features are defined as spatial objects that represent
some information extracted from an image. These
features include simple features such as points, curves,
and regions and complex collections of image features
in perceptual groups. An image feature has very little
meaning separate from its underlying image; without
the relationship to the image, image features fit into the
basic spatial object hierarchy. In practice, image features
will be kept in collections (e.g. a collection containing
all the edge features extracted from one image)
and will be stored, referenced, and manipulated as elements
of that collection. A common alternative for the
collection would be the spatial index that allows image-like
(efficient) access to the features based on position.
Since image features are usually kept in collections, some
of the slot values associated with image features are
only stored once for the collection. For example, the
coordinate-system associated with an individual image
feature is that of the underlying image, and since most
expensive processing with the image feature is fixed to
the coordinate-system of the image, there is no cost associated
with this information.

Methods for image features are inherited mostly from
the spatial-object class. There will be varying needs for
specializations for methods relating to extraction (from
the image), property value computations (e.g. color),
display on the image, spatial indexing operations, input,
output, grouping, and the various iterators over sets of
spatial objects (subsets, all in an area, etc.). Image features
will also be used as the basis for the region-of-interest
in the image processing operations.

Image features are developed following the mathematical
structures used for spatial-object where the vertex,
0-chain, edge, etc. of the spatial-object hierarchy have
analogs in the image feature hierarchy. The most apparent
difference between the mathematical structure of the
spatial-object classes and image features is the inclusion
of image related property values and the use of named
object classes that correspond to the kinds of image features
used in image understanding research and development
programs. An important requirement of the IUE
has always been to support rather than hinder research,
so we do not define these objects in absolute final forms
but indicate what property values (slots) are possible,
the names that will be used, and the associated semantics.
Between different programs, the slots may vary, but
the creation, reading, and writing programs will allow for
missing and extra information without harm. Our first
description of image features will parallel the mathematical
hierarchy of spatial-objects. Most image features fit
clearly into this parallel hierarchy, but some may not be
obvious at first glance.

2.1  Image  Points  as  Special  Cases 

Image points are simple features for representing image
positions. As with point in the spatial object
hierarchy, the simple image-point does not inherit from
the image-feature class, but is a primitive element that
only contains the image positions. Sequences of points
are stored in specialized versions of the ordered-pointset.

2.2  Vertex  Objects  and  Image  Features 

The mathematical object vertex is a 0-D object. A 0-D
image feature is a point in the image, possibly with some
associated neighborhood. This point can be something
extracted directly (e.g. points from an interest operator,
a corner detector, etc.) or the point could be the result
of an intersection of two lines (or line-segments). The
primitive slots for an image feature vertex are the image
positions, but, depending on the feature being modeled,
many other property values are possible.

A common image feature corresponding to vertex is
the edgel, extracted in a neighborhood and representing
the step between two intensity surfaces. A simple edgel
that encodes the yes/no result of an edge operator at
every pixel in the image is treated as a single point with
no associated properties and is thus a vertex. The main
aspect which distinguishes the class edgel from the basic
vertex is that a local model for the geometry of the image
intensity surface is assumed. The local neighborhood is
defined by a disk about each potential edgel location.

In the IUE core, we allow a number of simple models
for the intensity surface in the local neighborhood
around each edgel location. The neighborhoods are characterized
by the local structure of image intensity surfaces.
The neighborhoods are defined as follows in order
of frequency of occurrence in an image:

A  The interior of a single surface.

B  Two surfaces intersect at the edgel.

C  Three or more surfaces. Often corresponds to a corner.

D  The image intensity surface is too complex to be
described by a simple model.

The intent of the cover with small disks is to divide
and conquer, to make small neighborhoods that are sufficiently
simple that it is feasible to describe the surface
adequately over the neighborhood. This is a fine to
coarse to fine approach. In this approach, the first level
of disks is the smallest meaningful. It is simple to describe
patches of class A that include a single continuous
surface. Continuous patches are described in differential
geometry by a tuple: (point, tangent plane, and curvature
tensor).

It is reasonably simple to describe the compound surface
over a disk of type B that includes two surfaces.
This model is the most common representation used to
define an edgel. Two surfaces are bounded either by a
curve at which two surfaces intersect or by a limb, an apparent
boundary. On a small disk, the boundary curve is
locally straight. The model of type C is typically called
a corner and is best represented as an attributed vertex.

In addition to the position slot that is inherited from
vertex, the basic set of attributes associated with
the edgel model is as follows:

Line Segment: The parameters of the edgel line
segment.

Tangent Vector: For efficiency the local line segment
may be represented as the tangent vector.

Tangent Angle: Another alternative is a quantized
tangent angle.

Strength: The local slope of the discontinuous transition
between the two surfaces in the edgel neighborhood.

Left Surface Normal: The surface normal of the
intensity surface on the left of the edge boundary.

Right Surface Normal: The right surface normal.

Covariance Matrix: The variances of the parameters
of the edgel model computed from the actual
intensity distribution in the neighborhood. Gives a
likelihood that the model holds for the local disk.

These point operation results can be grouped in a
number of ways depending on the user’s concept of the
underlying geometry of the boundary or region. The
simplest group is the 0-chain, which is a set of vertices
with an implicit linear ordering. Usually a full topological
description is not applied until a local neighborhood
analysis is done to “link” edgels into connected sets
called edgelchain.

2.3  0-Chains  of  Image  Features 

The mathematical 0-chain describes ordered collections
of objects of class vertex, such as that formed by linking
edgels into a discrete curve. For some purposes the
ordered-pointset class may be used to group edgel sequences
and for others the more complete image-0-chain
is proper. The main difference is that the image-0-chain
is a topological concept which supports algebraic operations
on the points while the ordered-pointset is a set of
feature points with no topological interpretations.

A sequence of corner objects also forms an image-0-chain
but here there is often no local neighborhood relation
assumed between the corners. However, the corners
usually correspond to junctions of two or more edges
and it is reasonable to use the topological structure of
the vertex for the basis of the individual junctions.

2.4  Edge, One-Dimensional Image Features

The name edge has a clear meaning in describing graphs
and in topology, and has a very different meaning in most
image understanding work. In the IUE descriptions, we
use edge for both of these meanings, but usually the
context will indicate which one is meant. The mathematical
edge corresponds to many image features, especially
bounded line segments and curves. The image-line-segment
and image-curve-segment are the most obvious
objects in this class. These features may be computed
directly from the image or derived from other image
features.

The image-line-segment potentially has a large number
of attributes, but the basic set of attributes includes:

V0 and V1: The primary information of the line
segment is the beginning and ending points. These
are vectors of image locations.

0-chain: A 0-chain (ordered list) of the pixel locations
corresponding to the line segment.

Strength: The difference across the line segment,
a float.

Segment-length: The length across the line segment,
a float.

Tangent-angle: The direction (in radians, a float)
of the line segment.

2.5  1-Chain,  Connected  1-D  Features 

The mathematical construction continues with the description
of 1-Chains, or ordered sets of 1-D objects such
as sequences of line segments. 1-Chain structures can
also intersect at junctions represented by objects of the
class vertex and be extended to define closed region
boundaries. In practice, it is often difficult to form complete
topologies of these types in a bottom-up fashion.


An image-0-chain may be analyzed to produce an image-1-chain
composed of image-line-segments that approximate
the original curve, with the individual segments
maintaining the order given by the original edge elements.
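
The conversion from an ordered point list to an ordered list of segments can be sketched as follows. This is only an illustration: the text does not say which approximation method the IUE uses, so the sketch swaps in the Ramer-Douglas-Peucker splitting rule, and the names `approximate_chain`, `to_one_chain` and `tolerance` are hypothetical.

```python
import math


def point_to_segment_distance(p, a, b):
    """Distance from point p to the segment with endpoints a and b."""
    ax, ay = a
    bx, by = b
    px, py = p
    dx, dy = bx - ax, by - ay
    length_sq = dx * dx + dy * dy
    if length_sq == 0.0:
        return math.hypot(px - ax, py - ay)
    # Clamp the projection parameter so the foot point stays on the segment.
    t = max(0.0, min(1.0, ((px - ax) * dx + (py - ay) * dy) / length_sq))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def approximate_chain(points, tolerance):
    """Breakpoints of a polyline that stays within tolerance of the 0-chain."""
    if len(points) < 3:
        return list(points)
    first, last = points[0], points[-1]
    worst_index, worst_distance = 0, -1.0
    for i in range(1, len(points) - 1):
        d = point_to_segment_distance(points[i], first, last)
        if d > worst_distance:
            worst_index, worst_distance = i, d
    if worst_distance <= tolerance:
        return [first, last]
    # Split at the worst point; both halves keep the original point order.
    left = approximate_chain(points[: worst_index + 1], tolerance)
    right = approximate_chain(points[worst_index:], tolerance)
    return left[:-1] + right


def to_one_chain(points, tolerance):
    """Ordered list of (start, end) segments, a stand-in for an image-1-chain."""
    breakpoints = approximate_chain(points, tolerance)
    return list(zip(breakpoints[:-1], breakpoints[1:]))
```

Because the split never reorders points, the resulting segments inherit the order of the original edge elements, which is the property the paragraph above requires.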

2.6  Face,  2-D  Image  Features 

The spatial-object face corresponds to 2-D image features,
most clearly represented as a 2d-image-region in an image.
The low level representation of a region can vary
according to the ultimate use due to time and space efficiency
considerations, and can include point sets, binary
masks, interval codes, boundaries, etc. Each of these can
be derived from the others. A nice aspect of treating regions
as faces is that all of the edge mathematical structures
can be used directly to determine adjacent regions
and represent the geometry of the image structures.

There are several typical attributes associated with
2d-image-regions. Some of these are simple scalar and
matrix attributes for describing shape such as Area (integer
number of pixels), Euler number, Centroid (the 2-D
point of the centroid), Scatter-Matrix-of-Pixel-Positions,
and Compactness. Some shape attributes correspond directly
to instances of 1D and 2D spatial objects which
describe the shape of a region and are instantiated by applying
their corresponding fitting methods to the edgel
chain of a region. Other attributes such as average intensity,
variance, and so forth, are computed using a spatial
index to register a region with an image and to access
the corresponding image values. These attributes are
stored as a gaussian-distribution with two slots, mean
and variance.
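
As an illustration of how a few of these attributes could be computed from a point-set representation of a region, consider the minimal sketch below. The function name, the dictionary keys, and the choice of an unnormalized scatter matrix and a population variance are assumptions made for the example; they are not taken from the IUE specification.

```python
def region_attributes(pixels, image):
    """Compute simple attributes of a region.

    pixels: list of (row, col) positions belonging to the region.
    image:  2-D list of intensity values, indexed as image[row][col].
    """
    area = len(pixels)
    mean_row = sum(r for r, _ in pixels) / area
    mean_col = sum(c for _, c in pixels) / area

    # Scatter matrix of pixel positions: sum of outer products of the
    # centered positions (left unnormalized here, by assumption).
    s_rr = sum((r - mean_row) ** 2 for r, _ in pixels)
    s_cc = sum((c - mean_col) ** 2 for _, c in pixels)
    s_rc = sum((r - mean_row) * (c - mean_col) for r, c in pixels)

    # Intensity statistics, looked up through the pixel positions.
    values = [image[r][c] for r, c in pixels]
    mean = sum(values) / area
    variance = sum((v - mean) ** 2 for v in values) / area

    return {
        "area": area,
        "centroid": (mean_row, mean_col),
        "scatter_matrix": ((s_rr, s_rc), (s_rc, s_cc)),
        "gaussian_distribution": {"mean": mean, "variance": variance},
    }
```

The last entry mirrors the two-slot mean and variance structure described above; a real implementation would go through a spatial index rather than direct list indexing.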

2.7  2-Chains,  Linked  2-D  Features 

A 2-chain is a sequentially linked face feature. It is likely
that the more conventional region adjacency graph is
the more effective data structure to group faces as image
regions, although it is essential that the underlying
mathematical descriptions of edges and faces be used.
Composite regions produced by an image segmentation
procedure are represented as multiply connected faces
corresponding to a set of 1-Chains enclosing pixel areas
in the image plane. Region merging operations are supported
by the topological operations for removing common
edges between faces. Merging operations require
methods to access properties of adjacent regions sharing
a common boundary. These descriptions of region
adjacencies are similar to the attributes used with edgel
models.

2.8  Blocks,  3-D  Image  Features 

Images are typically 2-D objects, but with range sensors
and time sequences, we can expect to deal with extending
the block to image features.

2.9  Other  Collections  of  Image  Features 

Not all collections of image features fit the definitions of
the X-chains. For example, perceptual-groups formed by
clustering some number of image features into perceptually
meaningful structures may result in an image-0-chain
or image-1-chain where points or lines are grouped into
single curves, but frequently they are grouped into either
simple shapes with few features (e.g. two parallel
lines, rectangles, etc.) or clustered into an area feature
where order and linking are unimportant. The IUE will
provide direct support for basic groups of image features
through the 2d-segment-pair, 2d-segment-triple, and
2d-segment-quad classes. The specific shapes will be handled
by classes such as 2d-segment-parallel, 2d-image-corner,
2d-segment-junction, 2d-u-shape, etc., which will
inherit from the appropriate spatial-object class and contain
the links to the image features that contributed to
the object.

Perceptual grouping produces hypotheses that include
a cluster of interior cells of a region and an n-tuple of
edgels along a smooth curve boundary between surfaces.

Basic image features may be grouped by a variety
of properties such as proximity, alignment, curvature,
etc. Many different data structures can be important
in this grouping process, such as K-D trees and
quadtrees. Additionally, the Hough transform performs
directional grouping. Proximity and directional grouping
also often use different forms of spatial indexing. One
example of the grouping process involves the generation
of two sets of hypotheses from a tuple of edgels. The
first set is that there is not a smooth curve between two
surfaces. The second set is that there is a smooth curve
between two surfaces. Hypotheses for individual edgels
are that they are on the curve, that they are random and
spatially-invariant as a result of camera noise, and that
they are "clutter," non-random and not on the smooth
curve, e.g. another edge.

3  An  Example  Image  Feature 

To illustrate the construction of a class in the image
feature hierarchy we will use the 2d-image-line-segment.
This description shows the level of description available
to the system developers and illustrates how slots and
methods are inherited in the construction of objects. It
must be pointed out that the format in the complete
description documents is superior to what is given in
this paper.

3.1  2d-image-line-segment 

A major structure in image segmentation algorithms.
Many of the slots are associated with the orientation
of the line, and there is a standard orientation ambiguity
which arises in image coordinate frames. At times
it is convenient to have the image coordinates with x
along the increasing image column index and y downward
along the increasing row index. This coordinate frame is
left handed, and alters the meaning of the segment orientation.
The sense of the coordinate frame is provided by
its definition at the level of spatial-object. It is assumed
that the inverted coordinate frame is normally used, i.e.,
'y' downward. In order to put the features into a standard
right-handed cartesian coordinate system for later
processing, the 'y' orientations, $\theta_y$, must be transformed
to $\theta_y + 180$, which is a method on coordinate-transform.

•  Superior Class  image-line-segment

•  Pseudo slots - the slots that are added with this
definition.

-  tangent-angle-x  float  Another alternate orientation
specification for the line segment. The
angle in radians of the line segment. The sense
is counter-clockwise with respect to the x-axis.

-  tangent-angle-d-x  float  Same as tangent-angle-x
except the angle is in degrees.

-  tangent-angle-y  float  Another alternate orientation
specification for the line segment. The
angle in radians of the line segment. The sense
is counter-clockwise with respect to the y-axis.

-  tangent-angle-d-y  float  Same as tangent-angle-y
except the angle is in degrees.

-  slope  float  The direction of the line represented
as the slope in the image coordinate space.

-  x-intercept  float  The position where the segment
intersects the X axis, or the column value
when the row is 0.

-  y-intercept  float  The position where the segment
intersects the Y axis, or the row value
when the column is 0.

-  rho  float  Hough Transform representation,
distance.

-  theta-x  float  Hough Transform representation,
angle of the 2d-line-segment normal in radians
with respect to the x-axis.

-  theta-x-d  float  Hough Transform representation,
angle theta-x in degrees.
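
Most of the slots listed above can be derived from the two endpoints alone, which suggests computing them on demand. The sketch below illustrates that idea in Python with `functools.cached_property`: a value is computed on first access and occupies no storage before then. The class name `LineSegment2D`, the attribute names, and the conventions chosen (angles measured with `atan2` in whatever frame the endpoints are given, Hough normal form `x*cos(theta) + y*sin(theta) = rho` with `rho >= 0`) are assumptions for illustration, not the IUE definitions.

```python
import math
from functools import cached_property


class LineSegment2D:
    """Hypothetical stand-in for a 2d-image-line-segment (x = column, y = row)."""

    def __init__(self, v0, v1):
        self.v0 = v0  # beginning point (x, y); must differ from v1
        self.v1 = v1  # ending point (x, y)

    def _delta(self):
        return self.v1[0] - self.v0[0], self.v1[1] - self.v0[1]

    @cached_property
    def tangent_angle_x(self):
        """Direction of the segment in radians, measured from the x-axis."""
        dx, dy = self._delta()
        return math.atan2(dy, dx)

    @cached_property
    def tangent_angle_d_x(self):
        return math.degrees(self.tangent_angle_x)

    @cached_property
    def slope(self):
        dx, dy = self._delta()
        return dy / dx if dx != 0 else math.inf

    @cached_property
    def y_intercept(self):
        """Row value where the supporting line meets column 0 (None if vertical)."""
        dx, _ = self._delta()
        if dx == 0:
            return None
        return self.v0[1] - self.slope * self.v0[0]

    @cached_property
    def x_intercept(self):
        """Column value where the supporting line meets row 0 (None if horizontal)."""
        dx, dy = self._delta()
        if dy == 0:
            return None
        return self.v0[0] - self.v0[1] * dx / dy

    @cached_property
    def hough(self):
        """(rho, theta_x) of the supporting line, with rho >= 0."""
        dx, dy = self._delta()
        length = math.hypot(dx, dy)
        nx, ny = -dy / length, dx / length  # unit normal to the segment
        rho = nx * self.v0[0] + ny * self.v0[1]
        if rho < 0:
            nx, ny, rho = -nx, -ny, -rho
        return rho, math.atan2(ny, nx)
```

Until a property such as `slope` is read, the instance holds only `v0` and `v1`; afterwards the computed value is kept in the instance dictionary.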

The inheritance structure shows how the slots for this
object are derived. Most slots are derived from the objects
earlier in the hierarchy with the direct 2-D image
related slots added at this stage. Additionally, many of
the slots are implemented as "pseudo" slots rather than
as "hard" slots. That is, these behave like slots in terms
of storing the values, but do not take any space if they are
not needed. Additionally, some slots are defined early
in the hierarchy and are refined as the hierarchy is developed
(e.g. centroid changes from the general point to
the more specialized nd-image-point). Slots such as coordsys
(the coordinate system for the spatial object) are
usually the same for all image features corresponding to
one image (and are the same as the image).

Inheritance Structure:

Class where Defined     Slot Name            Type of Slot                   How Implemented
---------------------   ------------------   ----------------------------   ---------------
spatial-object          coordsys             pointer(coordinate-system)     hard
spatial-object          bounding-box         aligned-box-neighborhood       hard
spatial-object          centroid             point                          hard
image-feature           centroid             nd-image-point                 hard
image-feature           image                pointer(image)                 hard
parametric-curve        domain               pointer(spatial-object)        hard
parametric-curve        range                implicit-curve                 hard
parametric-curve        parametric-mapping   pointer(function)              hard
parametric-line         range                implicit-line                  hard
parametric-line         Imat                 vector[n](vector[2](float))    pseudo
2d-parametric-line      range                2d-implicit-line               hard
2d-parametric-line      Imat                 vector[2](vector[2](float))    pseudo
topology-node           superiors            list(pointer(topology-node))   pseudo
topology-node           inferiors            list(pointer(topology-node))   pseudo
edge                    0-chn                0-chain                        pseudo
edge                    1-chn                list(pointer(1-chain))         pseudo
edge                    v0                   pointer(vertex)                hard
edge                    v1                   pointer(vertex)                hard
image-line-segment      tangent-vector       vector[n](float)               pseudo
image-line-segment      fitting-tolerance    float                          pseudo
image-line-segment      edgel-fit            float                          pseudo
2d-image-line-segment   tangent-angle-x      float                          pseudo
2d-image-line-segment   tangent-angle-d-x    float                          pseudo
2d-image-line-segment   tangent-angle-y      float                          pseudo
2d-image-line-segment   tangent-angle-d-y    float                          pseudo
2d-image-line-segment   slope                float                          pseudo
2d-image-line-segment   x-intercept          float                          pseudo
2d-image-line-segment   y-intercept          float                          pseudo
2d-image-line-segment   rho                  float                          pseudo
2d-image-line-segment   theta                float                          pseudo
2d-image-line-segment   theta-d              float                          pseudo

Abstract

A major insight achieved by the IUE committee
is the concept of the "spatial object". Objects to
represent typical IU features such as points, lines,
arcs, volumes, etc., have common spatial attributes
and operations. The concept of "spatial object" is
an effort to abstract these properties into a compact
set of generic classes. In this paper, we provide an
outline of the design of spatial objects in the IUE.

1  Introduction 

A major insight achieved by the IUE committee
is the concept of the "spatial object". Objects
to represent typical IU features such as points,
curves, surfaces and volumes, have common spatial
attributes and operations. The concept of "spatial
object" is an effort to abstract these properties into
a compact set of generic classes. Essentially, a spatial
object is a point set in n dimensions. The point
set has an associated coordinate frame so that projections
and transformations can be applied to the
point set. It is not essential that the point set be a
simple flat set; it can be represented as a hierarchical
group or other relationship among groups of
point sets. A polyhedral object is a complex
spatial object which represents a set of points
in 3D space.

We divide the spatial object concept into major
categories according to the dimension of the
point set and its embedding dimension. The major
categories are point, curve, surface and volume.
These entities correspond to point sets with
dimension 0, 1, 2, 3 respectively. The entities can
be placed in coordinate spaces of their dimension
or higher. For example, a plane can be placed in
spaces of dimension 2 or higher. Composite structures
such as polygonal curves or polyhedrons are
represented by topological networks which contain
pointers to the geometric elements. For example, a
1-chain is a sequence of edges. An edge is a curve
segment, bounded by two vertices. A vertex is a
point with associated topological connections.

*The members of the IUE Committee are: Tom Binford-Stanford;
Terry Boult-Columbia; Bob Haralick, V. Ramesh-U. Washington;
Al Hanson, Chris Connolly-UMass; Ross
Beveridge-Colorado State; Charlie Kohl-AAI; Daryl Lawton-Georgia
Tech; Doug Morgan-ADS; Joe Mundy-GE; Keith
Price-USC; Tom Strat-SRI. The committee effort has been
funded by numerous DARPA grants and associated funding
partners under contract to each institution.


In addition to these basic structures, we also introduce
the concept of neighborhood, which is
taken from the standard theory of point set topology.
The various neighborhoods are needed to establish
continuity and connectedness relations between
points and other spatial objects. Any spatial
object can act as a neighborhood, but we define a
special hierarchy of typical neighborhoods with the
idea that the rather simple intersection tests associated
with neighborhoods will be implemented with
hard coded method implementations and will short
circuit most of the pointer chains associated with
deep inheritance hierarchies.


In this paper, we provide an outline of the spatial
objects in the IUE. The paper is organized in
the following fashion. First we discuss the details
of the class "spatial-object". In subsequent sections,
we discuss implicit and parametric representations
of spatial objects, representations for topological
structures, and representations for solids.

2  Spatial Objects

2.1  Abstract  Geometry 

The theory of geometry is based on abstract concepts
such as point and various point sets, e.g. line,
surface, etc. The theory of these structures can
be presented and manipulated entirely in terms of
formal predicate logic. For example, we can represent
the axiom that there is always at least one
point not on a given line¹:

$$\forall \mathcal{L}\; \exists p \quad \mathrm{Line}(\mathcal{L}) \wedge \mathrm{Point}(p) \wedge \neg\,\mathrm{In}(p, \mathcal{L})$$

However, the theory of geometry and topology is
not very efficiently computed in such formal terms.
For practical applications, it is necessary to provide
a model for these abstract concepts which permits
efficient computation of their properties and relations.
A common model for geometry is the representation
of a point as a tuple in $\Re^n$. A line then
becomes a set of points, where set membership is
defined by

$$\mathrm{In}(p, \mathcal{L}) = (p = \alpha p_1 + \beta p_2) \qquad (1)$$

i.e., two unique points determine a unique line by
linear combination.

Now a particular concept can have many models.
For example, a circle can be defined by $x^2 + y^2 = 1$
or $\rho - 1 = 0$ depending on the use of cartesian or
polar coordinates in the model for the circle. The
models are related by the fact that they represent
the same abstract concept, the unit circle. For most
applications, the most effective model for geometry
and topology represents a point as an n-tuple from
$\Re^n$. This model is called the n-Euclidean model.
The concepts such as curve and surface are sets of
points or n-tuple sets.
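
To make the "many models, one concept" point concrete, the sketch below tests membership in the unit circle under both the cartesian and the polar model and checks that they agree. The function names and the numerical tolerance `tol` are assumptions introduced for the example.

```python
import math


def in_unit_circle_cartesian(x, y, tol=1e-9):
    """Cartesian model: x^2 + y^2 = 1."""
    return abs(x * x + y * y - 1.0) <= tol


def in_unit_circle_polar(rho, phi, tol=1e-9):
    """Polar model: rho - 1 = 0 (the angle phi is unconstrained)."""
    return abs(rho - 1.0) <= tol


def cartesian_to_polar(x, y):
    """Change of model: the same point expressed in polar coordinates."""
    return math.hypot(x, y), math.atan2(y, x)
```

For any point, `in_unit_circle_cartesian(x, y)` and `in_unit_circle_polar(*cartesian_to_polar(x, y))` should give the same answer up to the tolerance, since both predicates describe the same abstract point set.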

In the development to follow, a Spatial Object
will either be a point or a point set which can support
a standard set of methods. Some of the more
common basic methods of the spatial object are:

In(p) [$\Re^n \to \{T, F\}$] - A point p is in the set
of points defined by the spatial object.

¹ There is some confusion over the predicates In(p) and
On(p) in the mathematical literature. For example, Hilbert
uses "On" to mean that a point is an element of a line point
set. On the other hand, common usage uses the predicate
"In" to denote an element of a set. In the discussion here,
we take the predicate symbols In and $\in$ to be equivalent.
We reserve the symbol On to refer to points which are in
the boundary of a set.


Dimension [$\Re^n \to Z$] - The intrinsic dimension
of the point set. For example, a point has dimension
0, a curve dimension 1 and a surface
dimension 2.

On(p) [$\Re^n \to \{T, F\}$] - For point sets which
have a boundary, the method On(p) is true for
points which have a neighborhood containing
at least one point, q, for which In(q) is false.
The concept of neighborhood is defined later.

Intersect(SO) [$\Re^n \to \Re^n$] - Perform a Boolean
intersection with the spatial object, SO.

Compose(SO) [$\Re^n \to \Re^m$] - For parametrized
spatial objects, it is meaningful to use the
range of one spatial object, $O_1$, as the domain
of another, $O_2$. The dimension of the Domain
of $O_1$ is n and the range of $O_2$ is m. For example,
a polygon may be used as a region of
interest in an image.

Nearest-Distance(p) - For point sets which have
a metric defined², it is meaningful to compute
the distance from a point to the nearest point,
p, which is On(p).

Transform(CS) - Transform all the points of the
spatial object from the current coordinate system
of the spatial object to the target coordinate
system, CS.

Boundary - Returns the boundary of a spatial
object. For example, the boundary of a curve
segment is two vertices. Note that the boundary
of a boundary is always empty or NIL.

Surface-Normal(p) - When a point, p, satisfies
On(p) and the spatial object is a differentiable
manifold then it is meaningful to compute the
surface normal at a point. Many other surface
differential properties can be defined at this
level of abstraction.

These methods are used extensively throughout
many IU algorithms and the IUE classes are designed
to efficiently implement these basic methods.
In most cases these efficient implementations
will be based on sets of $\Re^n$ and associated algebraic
operations.
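
A subset of the method list above can be pictured as an abstract interface with one concrete point set behind it. The sketch below is an assumption-laden illustration in Python, not the IUE class definition: the class names `SpatialObject` and `Disk`, the method names, and the tolerance `tol` are all hypothetical, and only In, On, Dimension and Nearest-Distance are shown.

```python
import math
from abc import ABC, abstractmethod


class SpatialObject(ABC):
    """Hypothetical interface covering a few of the basic methods."""

    @abstractmethod
    def dimension(self):
        """Intrinsic dimension of the point set."""

    @abstractmethod
    def contains(self, p):
        """In(p): True when p belongs to the point set."""

    @abstractmethod
    def on(self, p):
        """On(p): True when p lies on the boundary of the point set."""

    @abstractmethod
    def nearest_distance(self, p):
        """Distance from p to the nearest point satisfying On."""


class Disk(SpatialObject):
    """Closed disk in the plane, used as a simple 2-D point set."""

    def __init__(self, center, radius, tol=1e-9):
        self.center = center
        self.radius = radius
        self.tol = tol

    def _distance_to_center(self, p):
        return math.hypot(p[0] - self.center[0], p[1] - self.center[1])

    def dimension(self):
        return 2

    def contains(self, p):
        return self._distance_to_center(p) <= self.radius + self.tol

    def on(self, p):
        return abs(self._distance_to_center(p) - self.radius) <= self.tol

    def nearest_distance(self, p):
        # The boundary is the circle of the given radius.
        return abs(self._distance_to_center(p) - self.radius)
```

Here the disk's membership test reduces to a single inequality on an n-tuple, which is the kind of algebraic implementation on $\Re^n$ that the closing paragraph above refers to.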

² A metric is a function defined on two points which
obeys a set of axioms corresponding to the usual notions of
distance. For example, $d(P_1, P_3) \le d(P_1, P_2) + d(P_2, P_3)$.

Figure 1: A portion of the IUE class hierarchy
which defines the fundamental notion of relations
and point sets.

2.2  The  nDPoint 

The nDPoint is the fundamental element of geometry.
In the IUE the point is modeled as a tuple of
real numbers, i.e. an element of $\Re^n$. Each real number
in the tuple corresponds to a coordinate value
with respect to the coordinate-system associated
with the spatial-object. A simple point can be
considered to be a spatial-object with a default
coordinate-system where each axis is just the real
number line. A set of such points can be constructed
and associated with a global coordinate-system
where the axes have defined units and axis
types, along with relationships to other coordinate
systems. We will often use the mathematical
structure of $\Re^n$ as a model for the theory of points,
curves and surfaces. The hierarchy of Figure 1 illustrates
the relationship of points and attributed
points to the basic classes. In truth, there is little
which distinguishes an nDPoint from the class RealNTuple
and we may be able to dispense with the
extra specialization layer. Arithmetic on elements
of $\Re^n$ provides an algebraic representation for point
set operations. In the example of Equation 1 shown
earlier, a linear combination of two points, represented
as vectors of $\Re^n$, defines the points of a line.
An Attributed-nDPoint is the same as a point, except
that a set of dynamic attributes is associated
with the point. These attributes typically arise
in the context of segmentation and correspond to
properties such as texture, contrast, or differential
geometry properties. For example, an Edgel is considered
to be a point with attributes describing the
local pixel neighborhood around a discontinuity.

2.3  Coordinate  Systems 

When we have a large number of simple objects,
such as an nDPoint, it is not efficient to associate
a coordinate system with each point individually.
Instead, we group the points into a set and then
associate the coordinate system with the set as a
spatial object. In this case, we make use of the
generic $\Re^n$ tuples to save the cost of linking each
point to a coordinate system. Similarly, we do not
attach a coordinate system to $\Re$ for many mathematical
definitions. For example, it is not meaningful
for x to have units. Sometimes it is warranted
to associate a coordinate frame with a single point,
as in a trajectory or track. We do not rule out the
idea of a single point being a spatial object, but to
be an equivalent data type, it must be represented
as a point set with one element.

The process of associating a coordinate system
with a spatial object is achieved by associating attributes
with each element of the $\Re^n$ tuple. Attributes
are attached to define the dimension and
type of each coordinate axis through the class AxisType.
For example, a coordinate might be distance
(dimension) measured in meters (units). Together
these make up the attributes for a component
of the $\Re^n$ tuple. This attributed $\Re^n$ is still a
subclass of $\Re^n$ so it can be used as the range and
domain sets of a spatial object. The relationship to
the classes described in the section on coordinate
systems is shown in Figure 2.
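
The storage argument of this section can be sketched as follows: the coordinate system, with one axis description per tuple component, is held once by the point set, while the points themselves stay plain tuples. The class and field names below (`AxisType`, `CoordinateSystem`, `PointSet`, `dimension`, `units`) are hypothetical simplifications; only the name AxisType appears in the text, and its real slots are not given there.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class AxisType:
    dimension: str  # physical dimension of the axis, e.g. "distance"
    units: str      # units of measurement, e.g. "meters"


@dataclass(frozen=True)
class CoordinateSystem:
    axes: tuple  # one AxisType per component of the n-tuple


@dataclass
class PointSet:
    """Spatial object: many plain tuples sharing one coordinate system."""

    coordsys: CoordinateSystem
    points: list = field(default_factory=list)

    def add(self, p):
        # Each point is a bare n-tuple; it carries no link of its own
        # back to the coordinate system.
        if len(p) != len(self.coordsys.axes):
            raise ValueError("point dimension does not match coordinate system")
        self.points.append(tuple(p))
```

A single tracked point is then simply a `PointSet` holding one element, which matches the remark that a lone point must be represented as a one-element point set to be an equivalent data type.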

3  Implicit  Point  Set 

The class Relation is specialized to form the concept
of an ImplicitPointSet as shown in Figure 3.
The tuples in the relation are instances of $\Re^n$. The
predicate defining the pointset will typically be a
system of algebraic equations. For example, the intersection
of two ellipses in the plane generally defines
up to four real points from $\Re^2$. When the
relational predicate is defined simply by an enumeration
of the points in the relation, there is no
distinction between an Implicit-nDPointSet and
a set of nDPoints.

Figure 2: A portion of the IUE class hierarchy which introduces the use of coordinate systems with
spatial objects.

It should be noted that the computation of implicit
point sets is perhaps the most difficult area
of mathematics. On the other hand, there is
widespread use of simple implicit forms in IU algorithms,
such as the implicit form for a line and
circle, so we will proceed to develop the hierarchy
to provide the necessary general representational
structure.

4  Implicit Curve

The theory of implicit curves and surfaces is not
well developed. For example, it is not easy to tell
the dimension of the solution, or variety, of a system
of polynomials. It is not unusual to get a mixture
of solution sets with different dimension. For
the implicit case, the distinction between curve and
surface becomes fairly abstract. A curve can be
continuously mapped onto a line or circle. A surface
can be locally mapped to a disk and completely
represented by a finite number of such maps. For
example, an invertible mapping between disks and
the surface of a sphere requires at least two maps.

The basic set of curve classes is shown in Figure 3.
Most IU applications will involve either the
implicit form of the line or an implicit conic. Most
curves of interest are defined by implicit polynomial
equations, with the exception of the Superquadric, which is
of the form

$$\cdots - 1 = 0$$

The superquadric representation is appealing, because
a wide variety of shapes can be generated
with relatively few parameters.

4.1  Embedding 

Figure 3: The implicit curve class hierarchy.

As considered so far, these implicit point sets can
be defined on any set of $\Re^n$. However, it is often
useful to consider that a curve or surface is defined
on its natural dimension, i.e. $\Re$ for curves and $\Re^2$
for surfaces. The embedding of the entity in higher
dimensional spaces is carried out by defining a
mapping from the natural dimension onto the target
space. In general, we can define any mapping
from one space to another as long as the natural
dimension and neighborhood properties of the implicit
spatial object are preserved. These embedding
mappings can be used to define a broad class
of parametrized curves and surfaces. For example,
a conic can be defined as a parametrized curve in
the projective plane as $[x_1, x_2, x_3] = [t^2, t, 1]$, which
is the equation of a parabola. Then any conic can
be generated by transforming this projective space
with a homogeneous 3x3 matrix transformation.
The 8 parameters of the matrix define a space of
conic curves³.
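
The construction can be tried numerically. The sketch below assumes the parabola is parametrized in homogeneous coordinates as [t^2, t, 1] (the expression is garbled in the source, so this form is an assumption) and applies a 3x3 matrix to obtain another conic. The matrix `CIRCLE` is one hand-picked example that sends the parabola to the unit circle; all function names are hypothetical.

```python
def parabola(t):
    """Homogeneous point [t^2, t, 1] on the reference parabola (assumed form)."""
    return (t * t, t, 1.0)


def apply_homography(h, v):
    """Multiply a 3x3 matrix h (tuple of rows) by a homogeneous 3-vector v."""
    return tuple(sum(h[i][j] * v[j] for j in range(3)) for i in range(3))


def conic_point(h, t):
    """Euclidean point of the transformed conic at parameter t."""
    x1, x2, x3 = apply_homography(h, parabola(t))
    return (x1 / x3, x2 / x3)


# Maps [t^2, t, 1] to [1 - t^2, 2t, 1 + t^2], the rational
# parametrization of the unit circle.
CIRCLE = (
    (-1.0, 0.0, 1.0),
    (0.0, 2.0, 0.0),
    (1.0, 0.0, 1.0),
)
```

Scaling the matrix by any nonzero factor leaves `conic_point` unchanged, which is why a 3x3 homogeneous matrix contributes 8 rather than 9 parameters.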

5  Parametric  Spatial  Objects 

A class of spatial objects can be defined in terms of
the function. A parametric-entity in the IUE has
the same interface as the type Relation. In this
way, all of the standard class methods which apply
to Relation, such as the predicates In and Intersect,
carry over to the parametric object classes. A
function is a particular type of relation which defines
a mapping between two sets of n-tuples, the
Domain and the Range. The mapping associated
with a function must satisfy the property that
a given element from the domain maps to exactly
one element of the range. A parametric entity
uses a function, and the function is either a component
of the parametric class or is intrinsic to the
code implementing the class methods. A portion
of the IUE class hierarchy illustrating the definition
of the class Function is shown in Figure 4. In
our use of the function⁴ as the basis for parametric
spatial objects we further restrict the mapping
to be order preserving and one to one. With these
properties we can always find a unique point in the
domain for a given point in the range, and the natural
dimension and neighborhood properties are preserved.
This is a much stronger condition than
is usually associated with the idea of parametric
curves or surfaces. The curves here are perhaps
more properly called "well-parametrized," where
there is a unique inverse for each point in the range
of the curve.

³ Actually, the space of conics is only five dimensional so
this projective mapping is not uniquely invertible.

As an example, a parametrized line, $\mathcal{L}$, in the
plane defines a mapping from $\Re$ to $\Re^2$ as follows,

$$x = a_x t + b_x$$
$$y = a_y t + b_y$$

where $t \in \Re$ is the domain and $(x, y) \in \Re^2$ is the
range. In general, parametrized curves are functions
with a domain $\Re$ and represent mappings onto
some nD point space. In this example of the line,
the entire set of $\Re$ is used as the domain of $\mathcal{L}$. The
range of the line in $\Re^2$ can be further restricted by
introducing a set of inequalities on the domain, e.g.,
$0 < t < 1$.

Figure 4: The class hierarchy for parametric curves and surfaces. These classes are based on functional
mappings.

⁴ Initially the IUE hierarchy was designed so that a parametric
entity had the same class interface as a function.
Subsequently, it was realized that a more uniform class interface
is achieved by making parametric entities be a type
of Relation.

5.1  The  method  In(p) 

The predicate In(p) is decided by finding the value
of the parameter, t, in the domain of $\mathcal{L}$ and checking
the inequalities. In the case of our example,

$$t = \frac{(x + y) - (b_x + b_y)}{a_x + a_y}$$

So ultimately, the decision on the predicate In(p)
rests on an implicit point set which is defined by a
set of equalities and inequalities on $\Re^n$. Most often,
these domains are simple balls or boxes in $\Re^n$, as
in our simple example with the line segment.

For curves defined by higher order polynomials,
it is possible that a curve can cross itself, as in
the case of a parametric cubic curve. In order to
make the parametrization uniquely invertible, we
choose to introduce a vertex at the double point so
that there is an invertible parametrization between
endpoints.

This definition of In(p) assumed that the point was on the line. This
backward reference to the domain of the parametric line is not defined if
a point is not on the line. In general, the implicit form of a curve or
surface is necessary to determine if a point is in the set when the set is
embedded in a space with higher than the natural dimension. A natural
approach is to use the implicit form of the curve as the domain of the
parametric form. Then the method In(p) for the range of the parametric
form can be queried to determine if the point is on the curve before
checking the bounds imposed by the domain inequalities. With our design
for the parametric form of a spatial object, the Domain of an implicit
form for a curve can be used as the Range of the parametric form since a
Range slot or attribute is available for the parametric mapping function.
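The two-step test described above (first query the implicit form, then invert the parametrisation and check the bounds) might be sketched as follows. This is only an illustrative Python sketch, not the IUE implementation: the class name `ParametricLine`, its fields, the tolerance `tol` and the use of closed bounds on t are assumptions made for the example, and the coefficients a_x and a_y are assumed not to be both zero.

```python
from dataclasses import dataclass


@dataclass
class ParametricLine:
    """Hypothetical 2D parametric line: x = ax*t + bx, y = ay*t + by."""
    ax: float
    ay: float
    bx: float
    by: float
    t_min: float = 0.0   # assumed bounds on the domain
    t_max: float = 1.0
    tol: float = 1e-9    # assumed absolute tolerance for the implicit test

    def point(self, t):
        """Forward mapping from the domain to the range."""
        return (self.ax * t + self.bx, self.ay * t + self.by)

    def on_implicit_line(self, p):
        """Implicit form: ay*(x - bx) - ax*(y - by) = 0."""
        x, y = p
        return abs(self.ay * (x - self.bx) - self.ax * (y - self.by)) <= self.tol

    def parameter_of(self, p):
        """Invert the mapping, dividing by the larger coefficient."""
        x, y = p
        if abs(self.ax) >= abs(self.ay):
            return (x - self.bx) / self.ax
        return (y - self.by) / self.ay

    def contains(self, p):
        """In(p): check the implicit form first, then the domain bounds."""
        if not self.on_implicit_line(p):
            return False
        return self.t_min <= self.parameter_of(p) <= self.t_max
```

Checking the implicit form first means the inversion is only attempted for points where the backward reference to the domain is defined.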


5.2  Composition 

A function defines a composition operation where the range of one function
is used as the domain of another. In the usual notation, composition is
represented as $f(x) = g(h(x))$, where $f$ is the composition of $g$ and
$h$. Here, the domain of $f$ is the domain of $h$ and the range of $f$ is
the range of $g$. We expect that the composition of functions will play a
central role in the application of spatial objects in the IUE. For
example, consider an image $I(x, y)$ which can be considered as a function
from $\Re^2$ onto $\Re$. We also define a parametric curve, say a circle,
as a function from $\Re$ onto $\Re^2$. The composition of the two
functions is $I(circle_x(t), circle_y(t))$; it is a mapping from $\Re$
onto $\Re$ and represents the image intensities at the points of the
circle, which can be plotted as a one dimensional graph.
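The image-along-a-circle example can be illustrated with a short Python sketch. The helper names `compose`, `circle` and `image_at` are hypothetical, and the intensity function is a made-up stand-in for a real image.

```python
import math


def compose(g, h):
    """Return f with f(t) = g(h(t)); domain of h, range of g."""
    return lambda t: g(h(t))


def circle(cx, cy, r):
    """Parametric circle: a function from R to R^2 (angle parametrisation)."""
    return lambda t: (cx + r * math.cos(t), cy + r * math.sin(t))


def image_at(p):
    """Stand-in image intensity, a function from R^2 to R (assumed)."""
    x, y = p
    return x + 2.0 * y


# Intensity profile along the unit circle: a function from R to R.
profile = compose(image_at, circle(0.0, 0.0, 1.0))
samples = [profile(2.0 * math.pi * k / 8) for k in range(8)]
```

The list `samples` is the one dimensional graph of intensities that the text describes, sampled at eight parameter values.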

For the composition of a number of functions, the predicate In(p) is
computed by a chain of function inversions, until the decision is made on
the point set representing the domain of the first function. Suppose we
have a curve segment in the plane defined by $C(f(g(h(t))))$ where
$0 < t < 1$. To decide In(p), we compute,

$$t = h^{-1}(g^{-1}(f^{-1}(C^{-1}(p))))$$

Then it is easy to check if $0 < t < 1$. Again, the ultimate decision of
parametric point set membership rests on deciding the membership of an
implicit point set, in this case, the domain of the function h.
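A minimal Python sketch of this chain of inversions is given below. The function name `in_by_inversion` and the toy functions are assumptions for illustration, and each inverse is assumed to be well defined for the point being tested (that is, p is taken to lie on the curve, as discussed in Section 5.1).

```python
def in_by_inversion(inverses, p, lo=0.0, hi=1.0):
    """Apply the inverses outermost-first, then test the domain bounds."""
    value = p
    for inverse in inverses:
        value = inverse(value)
    return lo < value < hi


# Toy chain C(g(h(t))) with h(t) = 2t, g(u) = u + 1, C(u) = (u, u).
c_inv = lambda p: p[0]        # valid only for points on the curve y = x
g_inv = lambda u: u - 1.0
h_inv = lambda u: u / 2.0
inverses = [c_inv, g_inv, h_inv]
```

For the toy chain, the point (2, 2) inverts to t = 0.5 and is accepted, while (4, 4) inverts to t = 1.5 and is rejected.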

5.3  Parametric  curves 

In the hierarchy of Figure 4 a number of types of parametric curves are
illustrated. The classes are organised around the type of functions used
to provide the mapping from the domain to range of the curve. The most
common classes are defined in terms of polynomials of various degrees, for
example a parametric cubic curve is shown in the figure. Another common
class of projectively defined curves is constructed from rational
polynomials. For example, a parametric circle can be defined as the ratio
of two second-order polynomials. Similarly, curves can be defined by
trigonometric functions as in the case of a circle parametrized by polar
coordinates.
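One well-known rational parametrisation of the unit circle, given here as an illustration rather than as the form used in the IUE, is x = (1 - t^2)/(1 + t^2), y = 2t/(1 + t^2). The Python function name below is hypothetical.

```python
def rational_circle(t):
    """Unit circle as ratios of second-order polynomials in t.

    Every point of the circle except (-1, 0) is reached for some finite t.
    """
    d = 1.0 + t * t
    return ((1.0 - t * t) / d, 2.0 * t / d)
```

The identity (1 - t^2)^2 + (2t)^2 = (1 + t^2)^2 shows that every generated point satisfies x^2 + y^2 = 1.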

We  introduce  the  class  OrderedPointSet, 
which  is  a  sequence  of  points,  to  represent  sampled 
curves  or  discrete  pixel  chains.  Since  the  point  set  is 
ordered,  it  has  all  of  the  semantics  of  a  parametric 
curve.  That  is,  we  can  define  a  parameter  along  the 
curve  which  maps  to  points  in  the  discrete  point  set. 


Often, the point set is viewed as a set of samples from some continuous
space and the class definition must provide the necessary structure to
represent the neighborhood properties of the samples. For example, in the
case of pixel chains, each point is associated with the square pixel
neighborhood in the image. The computation of intersection or incidence is
carried out by taking into account the intersection or incidence of pixel
neighborhoods.

It is also common to define one-dimensional neighborhoods about each point
by connecting each point with a straight line segment, which is the
minimal assumption about the continuity of the curve. Smoother assumptions
can be introduced, such as higher order derivative continuity, e.g., C1
and C2, as well as global assumptions about the curve. For example, if the
samples are taken from a bandlimited curve, the original curve can be
recovered exactly by sinc(x) interpolation.
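The idea that an ordered point set carries the semantics of a parametric curve can be sketched in Python as below, using the minimal straight-line continuity assumption. The method name `point_at` and the choice of the sample index as the parameter are assumptions for the example, not the IUE interface; at least two points are assumed.

```python
class OrderedPointSet:
    """Sketch of a sampled curve: a sequence of points with a parameter."""

    def __init__(self, points):
        self.points = [tuple(p) for p in points]

    def point_at(self, s):
        """Map s in [0, n-1] to a point by piecewise linear interpolation."""
        n = len(self.points)
        if not 0 <= s <= n - 1:
            raise ValueError("parameter outside the domain of the curve")
        i = min(int(s), n - 2)          # index of the segment containing s
        frac = s - i
        p, q = self.points[i], self.points[i + 1]
        return tuple(a + frac * (b - a) for a, b in zip(p, q))
```

Integer parameter values return the samples themselves; fractional values return points on the connecting line segments.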

5.4  Spline  Curves 

A spline curve is a sequence of polynomial segments with continuity
conditions at each segment breakpoint, or knot. The polynomial segments
are delimited by the breakpoints along the curve. These breakpoints are
represented by the class OrderedPointSet. Each polynomial segment is
represented as a linear combination of basis functions. Depending on what
basis functions are used and the order of continuity, we can categorise
splines as being linear, quadratic, cubic, Bspline etc. A
general-parametric-curve is one where the curve is specified by a linear
combination of user specified basis functions. The different subclasses
for parametric-splines include: 2d-linear-spline, 3d-linear-spline,
2d-quadratic-spline, 3d-quadratic-spline, 2d-cubic-spline,
3d-cubic-spline, and Bspline.

5.5  Parametric  surfaces 
The  covering  problem 

Parametric surfaces are generated by functions with two arguments, i.e.
functions with domain $\Re^2$. The range can be any tuple space but
usually we think of continuous surfaces in $\Re^n$. Again a wide range of
surfaces can be represented by polynomial functions. However these
mappings do not in general cover an entire surface. For example, a
parametric mapping for the surface of a sphere in terms of spherical
coordinates results in a non-invertible mapping. However, two parametric
surface patches with invertible mappings may be defined.

Figure 5: A parametrized surface patch generated by sweeping one curve
segment along another. One approach to ribbon generation is to define a
parametrized transformation.

It is often necessary to use many parametric patches to make up a surface,
since the particular representation is not able to accurately “fit” a
desired shape, even though the topology provided by two or more patches is
sufficient. For example, many industrial components with complex shapes
are represented as composite surfaces made up of a network of BiCubicPatch
instances. The patches are defined on a set of knots in an analogous
manner to the spline curve case. The class OrderedPointSet introduced
earlier is implicitly ordered as a sequence, since this ordering supports
the bulk of applications for ordered point sets. Therefore it is necessary
to add a new class, 2DArrayOrderedPointSet, to maintain the knots for
splined surface representations. A number of examples of spline surfaces
are indicated in the hierarchy of Figure 4.

The  Ribbon 

The Ribbon is an unusual case since it is defined in terms of one curve
segment which is swept along another curve segment by a parametrized
transformation. An example is shown in Figure 5. The ribbon can be
generated by introducing a coordinate system for both curve segments. Then
the generator curve can be “swept” along the axis curve by varying
parameters of the transformation. The target coordinate system is a local
coordinate frame along the axis curve. Usually the generator curve is
maintained perpendicular to the axis curve at the intersection point. A
free parameter of the transformation corresponds to the position along the
axis curve. Taken together, these constraints define a parametrized
transformation matrix which maps the generator curve at each point along
the axis. The final result is a surface patch with two parameters, $\mu$,
position along the generator, and $\nu$, position along the axis. As
usual, the bounds on these parameters define the extent of the ribbon
surface patch.
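A much simplified planar version of the sweep can be sketched in Python. Here the generator is reduced to a signed offset along the local normal of the axis curve, which keeps it perpendicular to the axis; the function and argument names are hypothetical, `u` and `v` stand for the generator and axis parameters, and the tangent of the axis is assumed to be non-zero.

```python
import math


def ribbon_point(axis, axis_tangent, generator, u, v):
    """Map the two ribbon parameters (u, v) to a point in the plane.

    axis(v)         -> (x, y) position on the axis curve
    axis_tangent(v) -> (dx, dy) tangent of the axis curve at v
    generator(u)    -> signed offset measured along the local normal
    """
    ox, oy = axis(v)
    tx, ty = axis_tangent(v)
    length = math.hypot(tx, ty)
    # Local frame: the unit normal is the tangent rotated by 90 degrees.
    nx, ny = -ty / length, tx / length
    d = generator(u)
    return (ox + d * nx, oy + d * ny)


# Example: a straight axis along x and a generator of half-width 1.
straight_axis = lambda v: (v, 0.0)
straight_tangent = lambda v: (1.0, 0.0)
segment = lambda u: 2.0 * u - 1.0      # u in [0, 1] -> offset in [-1, 1]
```

In three dimensions the same construction would use a full local coordinate frame and a transformation matrix, as the text describes.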

Discrete  surfaces 

In a similar manner to the discrete curve, a parametric surface can be
defined by a bit map array where the points on the surface are generated
by the two array indices. The boundary of the surface can be defined by
inequalities on the array index values. Alternatively, the elements of the
array domain which have no corresponding surface point in the range can
map to a range element which is reserved to indicate undefined.

As in the case of discrete curves, we often wish to consider this discrete
representation as associated with a continuous space. To achieve a
continuous representation it is necessary to maintain a neighborhood
description for the sample points. A neighborhood definition is required
to compute intersection or incidence methods. For example, for a discrete
3D surface, all points might have an implicit neighborhood defined by a
sphere of given diameter. Then the method In(p) can be computed by
checking if p lies inside or on any of the spheres surrounding the sample
points. Similarly to the discrete curve case, we can establish minimal two
dimensional neighborhoods about each point by triangulating the mesh of
points. Again, higher order continuous neighborhoods can be defined in
analogy to the curve case.
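The sphere-neighborhood version of In(p) can be written in a few lines of Python. This is a brute-force sketch with a hypothetical function name; a practical implementation would likely use a spatial index rather than testing every sample.

```python
import math


def in_discrete_surface(samples, p, diameter):
    """In(p): true if p lies inside or on a sphere around any sample point."""
    radius = diameter / 2.0
    return any(math.dist(s, p) <= radius for s in samples)
```

Because the test uses `<=`, points lying exactly on a sphere are treated as members, matching the "inside or on" wording above.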

6  Primitive  Spatial  Objects 
and  Neighborhoods 

In any discussion of continuous curves and surfaces as well as the
sampling of these entities, it is necessary to introduce the concept of a
neighborhood. In the current representation a neighborhood is simply a
primitive spatial object such as a line interval, a box, a disk or sphere.
Since these structures are used so pervasively throughout the system, it
is intended that these primitive structures can be included in any class
with minimal cost in space and computation.

For example, in computing connected components on an image segmentation it
is necessary to maintain a description of the square pixel neighborhood
about each point. Since applications will involve millions of pixels for
each image, it is essential that the representation of such neighborhoods
be efficiently processed. On the other hand, it is important that the
resulting topological structures are consistent and compatible throughout
the system. It should be the case that a region boundary, extracted from
an image, will have the same topological structure as a solid model and
that image curves can be simply extruded to form solid surfaces.

In general it is desirable to have any spatial object act as a
neighborhood, since in some applications the geometry of a neighborhood
can be quite complex. For example, the projected neighborhood of two
conjugate pixels in a stereo image pair is a 3D trapezoidal prism in
space. That is, any point within this volume will project to the
corresponding pixel neighborhoods in the image. As another example, very
complex composite neighborhood regions can be developed when the primitive
neighborhoods are circular or spherical.

6.1  Topological  Space 

The  fundamental  concept  of  a  Topological  Space  can 
be  defined  in  terms  of  neighborhoods  as  follows: 

A Topological Space is a set of points S along with the choice of a
class of subsets M of S, each of which is called a neighborhood of
its points, such that,

a) Every point of S is in some neighborhood.

b) The intersection of any two neighborhoods of a point contains a
neighborhood of that point.

The usual definitions of topological spaces use open intervals, disks or
spheres as neighborhoods as shown in Figure 6. It is easy to see that
these neighborhoods satisfy the properties of a topological space. In the
core IUE, we will implement neighborhood methods for line segments, boxes
and rectangular prisms. These neighborhood geometries support efficient
computation of boolean operations and predicates such as In(p). The
concept near is fundamental to defining most of the usual topological
notions.

Figure 6: A topological space is defined in terms of neighborhoods. This
figure illustrates a number of primitive neighborhoods for various
dimensions.

Let A be a subset of S and p a point of S. Then p is near A if every
neighborhood of p contains a point of A.

The definitions Closed, Connected and Boundary can then be expressed in
terms of near. For example, we can define A as Closed if A contains all of
its near points. For more detail on these concepts and further definitions
see Henle [4].
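For a finite point set, the definitions of near and Closed translate almost directly into code. The Python sketch below uses hypothetical function names and a small hand-built family of neighborhoods that satisfies conditions a) and b); it is an illustration of the definitions, not part of the IUE.

```python
def is_near(p, subset, neighborhoods):
    """p is near the subset if every neighborhood of p meets the subset."""
    return all(n & subset for n in neighborhoods if p in n)


def is_closed(subset, space, neighborhoods):
    """A subset is Closed if it contains all of its near points."""
    return all(p in subset for p in space if is_near(p, subset, neighborhoods))


# Four points on a line with interval-like neighborhoods (assumed example).
space = {0, 1, 2, 3}
neighborhoods = [{0, 1}, {1}, {1, 2}, {2}, {2, 3}]
```

In this example the point 0 is near {1}, because its only neighborhood {0, 1} contains 1, so {1} is not Closed, whereas {0, 1} is.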

6.2  The  Neighborhood  Classes 

The IUE designates a small number of basic neighborhood classes which are
designed to be efficient. This portion of the hierarchy is shown in
Figure 7. Neighborhoods are defined for various domain dimensions. Many of
these neighborhoods are defined implicitly. A 0-Dimensional neighborhood
corresponds to a point. The most pervasive 1-Dimensional neighborhood is
the line interval, specified in terms of inequalities.

Also provided is an oriented line segment to provide a local curve
neighborhood as in the case of the Edgel. The neighborhood is specified as
an oriented line segment by specifying a parametrised line and bound
inequalities on the line.

The 2D analogy of the line interval is the rectangle which is aligned with
the coordinate axes to support efficient computation. A special
neighborhood is defined for image pixel locations to support direct
integer indexing of neighboring pixels. A more general 2D neighborhood is
provided by the VoronoiNbrhd where the neighborhood region around a point
is defined as the set of points closer to that point than to any other
point in the set. These concepts carry over directly to three dimensional
neighborhood domains.

Figure 7: The IUE neighborhood classes. These classes are meant to be
efficient for large volume applications and can be “mixed in” to other
classes such as Image.

It is assumed that the neighborhood is defined around each point in a
spatial object as a result of the mixin. For a single point, the process
amounts to constructing a “point-with-neighborhood” class. Often, the same
neighborhood description, with the same parameters, is applied to each
point in a point set. In the case of Edgel neighborhoods, the orientation
and possibly the length of the neighborhood varies from point to point.
Similarly, a Voronoi neighborhood is, in general, a different region for
each point in the set.

The neighborhood should in general be able to support all of the methods
for a spatial object, such as Boolean intersection methods. The additional
methods which each neighborhood class must support are summarised below.
For the purpose of definition it is assumed that the neighborhood is
defined about a point, p.

Near(p) - A predicate which is true if p is near the neighborhood
defined by self.

Neighbors - Returns the adjacent neighborhoods of self. A set of
neighborhoods is returned consisting of the neighborhoods connected
to self. For example, the Neighbors of a pixel neighborhood,
$n_{i,j}$, in a 4-connected image array are the pixel neighborhoods,
$\{n_{(i-1),j}, n_{(i+1),j}, n_{i,(j-1)}, n_{i,(j+1)}\}$.

In general, there are adjacent point sets of various dimensionality. For
example, the point of intersection of a line with a plane has a 1D set of
neighbors on the line and a 2D set of neighbors on the plane. This
situation is handled in a more structured topological representation
below*. At this basic level, it is assumed that the topological space is
everywhere the same dimension.
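The Neighbors method for a pixel neighborhood in a 4-connected array might look like the following Python sketch. The function name and the decision to drop neighbors that fall outside the array are assumptions; the text does not say how image borders are treated.

```python
def pixel_neighbors(i, j, rows, cols):
    """Return the 4-connected neighbors of pixel (i, j) inside the array."""
    candidates = [(i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1)]
    return [(r, c) for r, c in candidates if 0 <= r < rows and 0 <= c < cols]
```

Direct integer indexing of this kind is what makes the pixel neighborhood cheap enough for images with millions of pixels.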


6.3  Why  Separate  Neighborhood 
Classes? 

In the discussion of neighborhoods, there is no real distinction between
the neighborhood classes and the general concept of a spatial object.
Indeed, we intend that any spatial object can act as a neighborhood. On
the other hand, we expect that the efficiency demanded of a neighborhood
will require a special implementation and therefore we define these
separate classes. Conversely, it is expected that the neighborhood classes
are quite satisfactorily used as spatial objects and it can be considered
that the neighborhoods are specialisations of the appropriate spatial
object class.

*In the treatment of Section 8 the neighbors set is divided into a number
of sets called superiors and inferiors. The superiors are one higher
dimension than the given entity and the inferiors are one dimension lower.
Also there can be more than one adjacent pointset, e.g. the edges incident
at a vertex.

For example, a rectangular neighborhood is the set of points interior to a
rectangle, which is a specialisation of a four-sided polygon. However, we
do not need to maintain the extra baggage associated with the bounding
1-Cycle and associated edges and vertices in order to perform neighborhood
methods. Admittedly, we might choose to implement the rectangular planar
face as efficiently as just assumed for the rectangular neighborhood and
then the two concepts would be identical. In any case, the same number of
specialised classes must be implemented.

7  Hierarchical  and  Sequence 
Groups 

In order to proceed with our development of spatial objects, it is
necessary to introduce several classes which define groups of the
geometric primitives we have just discussed. The first group is the
Sequence which is an ordered set of objects. A general approach in the IUE
for implementing the access to groups is through the mechanism of
inheritance. That is, an object becomes part of a group by inheriting the
mechanisms of a generic node class defined for the group. In this case, an
object becomes part of a sequence by inheriting the class SequenceNode as
shown in Figure 8. The basic methods for SequenceNode are Next and
Previous. The intention is that the node classes are designed to be very
efficient at computing the direct access methods required by a node. More
global queries such as the total number of items in the sequence should be
referred to the Sequence class itself.
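The division of labour between a node class and its group class can be sketched in Python as follows. `SequenceNode`, `Sequence`, Next and Previous are named in the text; the `Pixel` class, the attribute names and the `append` method are assumptions added for the example.

```python
class SequenceNode:
    """Mixin: an object joins a sequence by inheriting this class."""

    def __init__(self):
        self._next = None
        self._previous = None

    def next(self):
        return self._next

    def previous(self):
        return self._previous


class Sequence:
    """Owns the global state of the group, such as the item count."""

    def __init__(self):
        self._last = None
        self._count = 0

    def append(self, node):
        node._previous = self._last
        node._next = None
        if self._last is not None:
            self._last._next = node
        self._last = node
        self._count += 1

    def __len__(self):
        return self._count


class Pixel(SequenceNode):
    """Hypothetical spatial object that can be a member of a sequence."""

    def __init__(self, row, col):
        super().__init__()
        self.row, self.col = row, col
```

The node answers only the direct-access queries, while the count is kept by the `Sequence` object, mirroring the split described above.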

The second type of group we need to define is the HierarchicalGroup. The
concept is a hierarchy of superior and inferior elements which form a
tree. A typical example of a hierarchical group is the description of
part-whole relations. For example, if the human body is considered the
root of the group tree, the inferiors are parts like head, trunk, arms,
legs, etc. The inferiors of head are eyes, nose, etc. The superior of leg
is body. The case of multiple superiors occurs when a part is shared. For
example, the wrist may be considered as part of both the hand and arm.

As shown in Figure 8, an example global method of HierarchicalGroup is
Superior-p(Node, Node), which determines if one node is indirectly
superior to another in the hierarchy by tracing up the tree. We define the
class HierarchicalGroupNode, which efficiently implements access to the
list of direct superiors and direct inferiors of a node. We will make
immediate use of HierarchicalGroup in defining the topology of geometric
structures. Also note that the usual notion of part composition is
implemented naturally by a subclass of HierarchicalGroup.
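A minimal Python sketch of the node class and of a Superior-p style query is given below. The names `add_inferior` and `superior_p`, and the decision to count direct superiors as well as indirect ones, are assumptions for illustration.

```python
class HierarchicalGroupNode:
    """Direct access to the superiors and inferiors of a node."""

    def __init__(self, name):
        self.name = name
        self.superiors = []
        self.inferiors = []

    def add_inferior(self, node):
        self.inferiors.append(node)
        node.superiors.append(self)


def superior_p(a, b):
    """True if a is reached by tracing up the hierarchy from b."""
    stack = list(b.superiors)
    seen = set()
    while stack:
        node = stack.pop()
        if node is a:
            return True
        if id(node) not in seen:
            seen.add(id(node))
            stack.extend(node.superiors)
    return False


# Part-whole example from the text: the wrist is shared by hand and arm.
body, arm, hand, wrist = (HierarchicalGroupNode(n)
                          for n in ("body", "arm", "hand", "wrist"))
body.add_inferior(arm)
arm.add_inferior(hand)
arm.add_inferior(wrist)
hand.add_inferior(wrist)
```

The `seen` set keeps the upward trace from revisiting a node when a part has several superiors.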

The approach of inheriting a HierarchicalGroupNode to acquire the methods
of Inferiors and Superiors is limited to one grouping for each entity. It
does not make sense to inherit more than one hierarchical node. This
creates a problem for situations where an entity is a member of a number
of hierarchical groupings. The solution proposed so far is to have a
multi-group node class for those structures which require membership in
more than one hierarchy. A single hierarchy node class is also available
for efficiency reasons.

8  Topology  Groups 

So far we have introduced the concepts involved in defining the primitive
classes of points, curves and surfaces. The next step is to define the
topological structures which are needed to construct composite curves and
surfaces. The fundamental notion of topology is that of “connection” or
adjacency. A simple example is provided by two line segments which share a
common endpoint. The common point of incidence is viewed as a connection
between the two line segments. In order to develop a consistent topology,
the connection between two line segments is restricted to endpoints. That
is, if two line segments intersect at all, they intersect at an endpoint.

8.1  The  Vertex 

These considerations lead to the definition of the class Vertex. The
vertex is both a point in space as well as a connection. In order to
represent these properties, the vertex class is constructed by multiple
inheritance from the class Point and TopologyNode as shown in Figure 9.
The class TopologyNode is a special case of HierarchicalGroupNode which
defines a set of Inferiors and Superiors. In our case, the vertex has no
inferiors but has a set of superiors which are instances of the class
Edge.

Each of these topology elements is illustrated in Figure 9.

Figure 8: The definition of sequence and hierarchical groups.

8.2 The 0Chain

The 0Chain is, strictly speaking, an unordered set of vertices. In most
applications it is useful to establish an order on this set and thus we
consider the 0Chain to be a Sequence. The class Vertex therefore must
inherit from the class SequenceNode. The concept of an ordered set of
vertices is useful for various topology constructions. For example, when
constructing a polygon from the set of its vertices, it is necessary to
maintain a strict ordering so that the boundary of the polygon is
consistently defined. The ordering is also necessary to maintain a
consistent surface normal orientation.

The 0Chain is also useful in the case of a curved edge, such as a spline,
which may have a number of sample points on its interior between the
endpoints. These interior points are not vertices, but they may be used to
define a sequence of line segments (polygonal line) for display or
intersection calculations. For these applications, this set of interior
and boundary points is conveniently represented by a 0Chain.

Figure 9: The various elements of object topology. Note that if a 1Chain
or 2Chain is closed it forms a 1Cycle or 2Cycle. It is not necessary to
introduce a new class but just associate the attribute “closed” with the
chain structures. The structures are ordered top-to-bottom, left-to-right
in the order of the inferior → superior hierarchy.

8.3  The  Edge 

The class Edge is a bounded segment of a curve, where the boundary of the
segment is defined by two vertices. These bounding vertices are
established as the inferiors of the edge. The superiors of an edge are a
set of 1Chains, where a 1Chain is a sequence of edges. The edge is
constructed by multiple inheritance from a curve class as well as the
classes SequenceNode and TopologyNode in order to support the necessary
relationships. Note that an edge can participate in a number of 1Chains,
as in the case of two polygons which are joined at a common edge.
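The construction of Vertex and Edge by multiple inheritance can be sketched in Python. `Point`, `TopologyNode`, `Vertex` and `Edge` are named in the text; the `LineCurve` class, the attribute names, and the omission of `SequenceNode` are simplifications assumed for this example.

```python
class Point:
    def __init__(self, x, y):
        self.x, self.y = x, y


class TopologyNode:
    """Holds the inferiors and superiors of a topology element."""

    def __init__(self):
        self.inferiors = []
        self.superiors = []


class Vertex(Point, TopologyNode):
    """Both a point in space and a connection; it has no inferiors."""

    def __init__(self, x, y):
        Point.__init__(self, x, y)
        TopologyNode.__init__(self)


class LineCurve:
    """Hypothetical curve class: a straight line with t in [0, 1]."""

    def __init__(self, p0, p1):
        self.p0, self.p1 = p0, p1

    def point(self, t):
        return (self.p0.x + t * (self.p1.x - self.p0.x),
                self.p0.y + t * (self.p1.y - self.p0.y))


class Edge(LineCurve, TopologyNode):
    """A bounded curve segment whose inferiors are its two vertices."""

    def __init__(self, v0, v1):
        LineCurve.__init__(self, v0, v1)
        TopologyNode.__init__(self)
        self.inferiors = [v0, v1]
        for v in (v0, v1):
            v.superiors.append(self)
```

A vertex shared by two edges ends up with both edges in its list of superiors, which is how the connection between the edges is recorded.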

The generic predicate In(p) for the edge can be computed by parametrising
the curve in association with the bounding vertices. A typical choice is
to define the parameter at the first vertex as t = 0 and t = 1 at the
second. The predicate On(p) is true when the point p is either of the
endpoints since these points define the boundary of the edge.

When representing higher order curves it is possible for an edge to
intersect itself somewhere other than a vertex. This event violates the
requirements for a consistent topology and a vertex must be introduced at
the point of intersection. The numerical computation of such events is
difficult and errors can result in an inconsistent topology. Issues of
this sort have impeded progress in CSG modeling for curved surfaces.

8.4 The 1Chain

The class 1Chain has many applications within the IUE. It is the basic
structure for defining composite curves, e.g. a polygonal chain of edges.
The 1Chain is a sequence of edges with additional direction information
associated with each edge. The direction values, ±, define the sense of
traversal of an edge along the curve. If the traversal is from Vertex_0 to
Vertex_1 of the edge then the direction is +, otherwise -. The 1Chain is
also used to represent closed boundaries of surface regions or faces. A
closed 1Chain, or 1Cycle, has one superior, i.e., the face which it
bounds. The inferiors of a 1Chain are the edges in the chain.
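The role of the direction values can be illustrated with a small Python sketch. Here an edge is reduced to a pair of vertex labels and the class name `OneChain` and its methods are hypothetical; the sketch only shows how the ± signs determine the traversal and how closure can be detected.

```python
class OneChain:
    """A sequence of (edge, sign) pairs; an edge is a pair (v0, v1)."""

    def __init__(self, directed_edges):
        self.directed_edges = list(directed_edges)

    @staticmethod
    def _ends(edge, sign):
        v0, v1 = edge
        return (v0, v1) if sign > 0 else (v1, v0)

    def vertices(self):
        """Vertices in traversal order; consecutive edges must meet."""
        path = []
        for edge, sign in self.directed_edges:
            start, end = self._ends(edge, sign)
            if path and path[-1] != start:
                raise ValueError("consecutive edges do not share a vertex")
            if not path:
                path.append(start)
            path.append(end)
        return path

    def is_closed(self):
        """A closed 1Chain (1Cycle) returns to its starting vertex."""
        path = self.vertices()
        return len(path) > 1 and path[0] == path[-1]
```

Storing the sign with each edge lets the same edge be traversed in opposite senses by the two chains that share it.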


Figure  10:  A  bounded  surface  or  face.  An  example 
is  shown  of  a  multiply  connected  face  with  interior 
holes. 

Many of the topological properties required to define image features are
provided by vertices, edges and 1Chains. The usual process of joining and
recursively growing sequences of line segments can be handled by
manipulating the 1Chain structure. Also, boundary corner events such as
“T” and “Y” junctions are defined where image edge chains meet at a
vertex.

Note that it may be necessary to maintain a number of topological
descriptions for a pixel chain. The first level is the original sequence
of pixel locations. Next, we may fit straight lines or splines to the
pointset using a number of levels of fitting tolerance. Each fit will
produce a different 1Chain which should be associated with the other
1Chains as well as the original discrete set. This composite structure is
analogous to a pyramid, since a larger error tolerance leads to fewer
curve segments. As yet, we don’t have an adequate understanding of the
semantics of this type of grouping.

8.5  The  Face 

The class Face is a bounded surface region. The boundary may be multiply
connected as shown in Figure 10. The face multiply inherits from classes
SequenceNode and TopologyNode. The face inferiors are the 1Chains defining
the face boundary. This definition of the face allows multiply connected
regions with one outside 1Chain and any number of interior 1Chains as
shown in Figure 10. The superior of the face is the 2Chain which is
defined in the next section. One important application of the face
structure is to define the boundaries of a region segmentation. In this
case the surface is a discrete point set, as are the edge curves forming
the boundaries. In the current definition, there is only one level of
interior 1Chains. However, the case of “islands” within the holes of a
face is sometimes encountered in image region segmentation. This structure
can be accommodated by arranging the 1Chains in a tree structure. Each
level of the tree accounts for another layer of “inside” containment.
However, it should be noted that this layered structure is not consistent
with the usual notions of connectivity. In the case of image
segmentations, the concept of “figure” and “ground” regions arises, where
the ground region is taken to be a continuous surface. The figure regions
occlude the ground region, but the ground is taken to be one connected
surface even though only isolated islands may be visible. With the
containment tree structure for the inner face boundaries, it is
straightforward to produce the set of equivalent standard topology faces.
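One way the conversion from a containment tree to standard faces could work is sketched below in Python. The tree encoding (nested dictionaries with `chain` and `children` keys) and the function name are assumptions; the rule used is that every boundary at an even depth is the outside 1Chain of a face and its direct children are the holes of that face.

```python
def standard_faces(tree):
    """Flatten a containment tree into (outer chain, [hole chains]) pairs.

    tree: {"chain": label, "children": [subtrees]}, with the outermost
    boundary at the root (depth 0).
    """
    faces = []

    def visit(node, depth):
        if depth % 2 == 0:
            holes = [child["chain"] for child in node["children"]]
            faces.append((node["chain"], holes))
        for child in node["children"]:
            visit(child, depth + 1)

    visit(tree, 0)
    return faces


# A region with a hole that contains an island, which itself has a hole.
region = {"chain": "outer", "children": [
    {"chain": "hole1", "children": [
        {"chain": "island", "children": [
            {"chain": "hole2", "children": []}]}]}]}
```

Each resulting pair has one outside boundary and a single level of interior boundaries, which matches the current definition of the face.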
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8.6  The  2Chain 

In  general,  the  face  is  used  as  the  primitive  element 
in  any  composite  surface.  As  we  pointed  out  earlier, 
it  is  always  necessary  to  form  composite  arrays  of 
surface  patches  in  order  to  cover  a  closed  boundary 
of  space.  The  essential  structure  needed  to  define 
a  composite  surface  is  the  2Chain.  The  2Chain  is  a 
sequence of faces in which adjacent faces in the
sequence  meet  at  a  common  edge*.  If  the  2Chain 
is  closed,  it  is  called  a  2Cycle  and  defines  a  volume 
of  space. 
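The 2Chain and 2Cycle conditions can be checked on a small polyhedral example. The sketch below is not the IUE implementation: faces are represented as vertex tuples, and "closed" is taken to mean that every edge is shared by exactly two faces, which is an assumption made for illustration.

```python
from collections import Counter


def face_edges(vertices):
    """Edges of a polygonal face given as an ordered tuple of vertex ids."""
    n = len(vertices)
    return {frozenset((vertices[i], vertices[(i + 1) % n])) for i in range(n)}


def is_2chain(faces):
    """True if each face shares at least one edge with the next face in the sequence."""
    edge_sets = [face_edges(f) for f in faces]
    return all(a & b for a, b in zip(edge_sets, edge_sets[1:]))


def is_2cycle(faces):
    """True if the chain is closed (assumed: every edge is used by exactly two faces)."""
    counts = Counter(edge for f in faces for edge in face_edges(f))
    return is_2chain(faces) and all(c == 2 for c in counts.values())


# A tetrahedron is a closed 2Chain; dropping one face leaves an open chain.
tetrahedron = [(0, 1, 2), (0, 1, 3), (0, 2, 3), (1, 2, 3)]
closed = is_2cycle(tetrahedron)
open_chain = is_2cycle(tetrahedron[:3])
```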

8.7  The  Block 

If a volume of space has interior cavities, then a number of 2Chains are
used to represent the region boundary in analogy to the multiply
connected face region. The class Block is introduced to contain the
required 2Chain pointers as inferiors. The block is the highest
topological structure and has no superiors^. Composite objects, such as
articulated structures or different buildings in a site model, are
usually constructed from separate blocks with each block embedded in a
transform network. That is, each block has its own local coordinate
system and a set of transforms to the other blocks or to a central
coordinate system.


*Note that a single sequence cannot represent all adjacencies on a
surface. However, if two faces are next to each other in the sequence,
they share a common edge.

^Note that we could naturally extend the representation to include
4-dimensional structures by joining blocks at faces to form 3Chains.


Figure 11: The nine possible topological adjacency schemes. The
combinations result from a primary structure selected from
{ Face, Edge, Vertex } and an adjacent structure selected from the same
set.

9 Relation to Other Topological Schemes

The IUE topology can be compared with other topological structures using
the concepts derived by Kevin Weiler [5]. Weiler classifies various
topological schemes in terms of the types of adjacency relations
supported by the structure. There are three basic structures: vertex,
edge and face. Consequently, there are nine combinations of primary and
adjacent structural schemes. These combinations are illustrated in
Figure 11. In the case of curved surfaces, it is impossible to recover
the correct topological description of an object from the indicated
adjacency information, and it is usually necessary to provide ancillary
structures to complete the sufficiency of the topological structure. The
IUE structure provides a number of these adjacency relationships at the
same time and provides a complete topological specification.

The IUE structure is also reasonably efficient in terms of storage. For
example, the IUE structure takes (4 + (7/2)m)N_f pointers, where m is
the number of edges per face and N_f is the total number of faces*. By
contrast, a winged-edge structure*

*A similar design to the IUE structure [3] represents the two endpoints
of an edge as a 0Chain, but this approach considerably increases the
pointer count.

*The  winged-edge  structure  has  an  edge  as  the  primary 
structure  and  edges  as  the  adjacent  structures. 
Figure 12: A partial class hierarchy for constructive solid geometry,
CSG. Only a small sampling of possible solid primitives is indicated.


uses 4mN_f pointers. As the complexity of the face increases, as would
occur in image regions, the IUE structure takes less space than
winged-edge. The crossover is at about 10 edges per face. The IUE
topology is similar to the usual notions of
vertex-edge-loop-face-shell-block [1, 2]. However, the use of nChains
introduces the machinery of chain algebra, which considerably clarifies
the removal of “bridges” in topological configurations. The rigor
introduced by the axioms of chain algebra assists the construction of
algorithms for boolean operations.
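The storage comparison can be tabulated directly from the two pointer counts quoted above. This is a small arithmetic sketch with hypothetical function names; it uses only the formulas as printed. Taken literally, the two counts are equal at m = 8 and the IUE count is smaller beyond that, which is in the same range as the crossover of about 10 edges per face cited in the text.

```python
from fractions import Fraction


def iue_pointers(m, n_faces):
    """Pointer count (4 + (7/2) m) N_f quoted for the IUE structure."""
    return (4 + Fraction(7, 2) * m) * n_faces


def winged_edge_pointers(m, n_faces):
    """Pointer count 4 m N_f quoted for the winged-edge structure."""
    return 4 * m * n_faces


# Pointers per face for a few face complexities (N_f = 1).
for m in (3, 4, 8, 12, 20):
    print(m, float(iue_pointers(m, 1)), winged_edge_pointers(m, 1))
```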


10 Constructive Solid Geometry

The other major solid representation is in terms of solid primitives
which can be considered blocks from the boundary point of view. These
primitives are combined by attachment and boolean intersection
operations to form composite objects. The resulting composite object is
described by a tree where the nodes of the tree represent various
primitives and partial constructions and the arcs of the tree represent
boolean or attachment operations. This description is referred to as the
Constructive Solid Geometry, or CSG, representation. A partial volume
representation hierarchy is shown in Figure 12. For example, the
superquadric is a flexible surface primitive whose shape is controlled
by providing variable exponent values on the quadric terms.

The generalized cylinder is another major representational approach,
where the primitives are defined by an axis, which can be a general
space curve, and a sweeping rule, which defines the variation of the
object cross section along the axis. For example, a cone is a
generalized cylinder with a circular cross section and a linear sweeping
rule along a straight-line axis. The complete description of the object
is maintained in a tree structure, called the CSG tree, where the leaf
nodes are primitives and the root is the final object. Boolean
operations are carried out between primitives at each interior node of
the tree. This representation requires that Boolean operations be
defined for each primitive, which allows quite complex objects to be
defined by a short description. For example, a cylinder with a hole can
be defined as the subtraction of one cylinder from another. Usually each
primitive is associated with a bounding box (rectangular prism), which
facilitates efficient checking for the possibility of intersection. The
Boolean operations required for a CSG definition are Union,
Intersection, and Difference. Union is used to join two spatial objects
into a single spatial object with no boundary at the join. Intersect(SO)
is used to find the common point sets between two objects. For example,
the intersection of a cube and a cylinder is a (possibly) shorter
cylinder. Difference(SO) forms the intersection of an object with the
complement of the second object, SO. In this case the difference of a
cube and a cylinder is a cube with a hole.
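The three Boolean operations can be illustrated with point-membership tests. The sketch below is an assumption-laden toy, not the IUE class design: `Solid`, `cube`, and `cylinder_z` are hypothetical names, and each solid is represented only by a predicate that says whether a point lies inside it.

```python
import math
from dataclasses import dataclass
from typing import Callable, Tuple

Point = Tuple[float, float, float]


@dataclass(frozen=True)
class Solid:
    """A solid described only by its point-membership predicate."""
    contains: Callable[[Point], bool]

    def union(self, other: "Solid") -> "Solid":
        return Solid(lambda p: self.contains(p) or other.contains(p))

    def intersect(self, other: "Solid") -> "Solid":
        return Solid(lambda p: self.contains(p) and other.contains(p))

    def difference(self, other: "Solid") -> "Solid":
        # Intersection with the complement of the second object.
        return Solid(lambda p: self.contains(p) and not other.contains(p))


def cube(half: float) -> Solid:
    """Axis-aligned cube centered at the origin with the given half-width."""
    return Solid(lambda p: all(abs(c) <= half for c in p))


def cylinder_z(radius: float, half_height: float) -> Solid:
    """Cylinder along the z axis, centered at the origin."""
    return Solid(lambda p: math.hypot(p[0], p[1]) <= radius and abs(p[2]) <= half_height)


# The two examples from the text: a cube with a hole, and a shorter cylinder.
drilled_cube = cube(1.0).difference(cylinder_z(0.5, 2.0))
short_cylinder = cube(1.0).intersect(cylinder_z(0.5, 2.0))
```

A practical CSG system would also carry boundary information and bounding boxes at each node; the predicate form is used here only because it makes the set semantics of the operations explicit.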
1 Introduction

Computational sensors combine computation and signal acquisition to
improve performance and provide new capabilities that were not
previously possible.

They may attach analog or digital VLSI processing circuits to each
sensing element, exploit unique optical design or geometrical
arrangement of elements, or use the physics of the underlying material
for computation. Typically, a computational sensor implements a
distributed computing model of the sensory data, including the case
where the data are sensed or preprocessed elsewhere.

Recognizing the importance and potential of computational sensors, Oscar
Firschein, DARPA SISTO, asked us to organize a workshop to bring
together developers and users of computational sensors. The workshop was
to define the state of the art, discuss the issues, and identify
promising approaches and applications for this new technology. The
workshop was held at The University of Pennsylvania on May 11-12, 1992.
Approximately 40 people attended from academia, government, and
industry. The workshop hosted several key presentations and followed
them with group discussion and summary sessions. This workshop report
presents a summary of the state of the art in computational sensors and
recommendations for future research programs.

In Section 2 we discuss opportunities for computational sensors. Some
computational sensor examples are reviewed in Section 3. Technologies,
issues, and limitations are considered in Section 4. Section 5 discusses
algorithms for computational sensors. Recommendations for future
programs are given in the concluding section. The appendix includes a
bibliography of computational sensing created with input from the
workshop participants.

2  Opportunities 

Traditionally, sensory information processing proceeds in three steps:
transducing (detection), readout (or digitization), and processing
(interpretation). Micro-electronics technologies will spawn a new
generation of sensors which combine transducing and processing on a
single chip: a computational sensor.

In machine vision, the basic approach has been to use a TV camera for
sensing, to digitize the image data into a frame buffer and then to
process the data with a digital computer. Apart from being expensive,
large, heavy, and power-hungry, this sense-digitize-and-then-process
paradigm has fundamental performance disadvantages. A high bandwidth is
required to transfer data from the sensor to the processor. The parallel
nature of operands captured in a 2D image plane is not exploited. Also,
the high latencies that image transfer times introduce limit the
usefulness of this method for high-speed, real-time applications.
Combining processing on silicon wafers together with detectors will
eliminate these limitations, and has the potential to produce a
low-cost, low-power visual sensor with high throughput and low latency.

The potential for integrating the transducing and processing of signals
has been recognized for some time, but in the past, research and
development in this area was driven mostly by curiosity or special use.
Today, however, the advancement of VLSI and related technologies
provides opportunities for us to harness this potential in new, broad,
practical applications in image understanding, robotics, and
human-computer interfaces. Most importantly, VLSI technologies have
become available and accessible to the sensor application community,
where we have recently observed a growing body of research in
computational sensors.

Several computational sensors have been fabricated and demonstrated to
perform effectively. Analog vision chips have been demonstrated which
can detect a motion field, or continuously compute the size and
orientation of an object. Three-dimensional range sensing has been
performed at a rate of 1000 frames per second using a chip containing an
array of cells each capable of detecting and calculating the timing of
an intensity profile. Sensor chips that mimic the human fovea and
peripheral vision have been fabricated and used for pattern recognition.
Tiny lenses can be etched on silicon to focus light efficiently on a
photosensitive area, or even to perform a geometrical transformation of
images. Resistive networks and associated circuits on a chip can solve
optimization problems for shape interpolation.

Computational sensors are not limited to vision use, but have
applications in mechanical, chemical, medical and other sensors.
Development of micromechanical pressure sensors and accelerometers has
been underway for some time. An air-bag sensor for automobiles could
become one of the first successful, mass-produced, low-cost
computational sensors. It contains a miniature accelerometer and
processing circuits in a chip. Processing could also be combined with
micro-chemical sensors to detect water contamination, air pollution, and
smells, while micro-medical sensors could measure blood chemistry, flow,
and pressure.

Potential  applications/markets  of  computational 
sensors  are  abundant: 

•  robot  perception 

•  industrial  inspection 

•  navigation  and  automobile 

•  space 

•  sensor  based  appliances 

•  medicine  (e.g.  patient  monitoring) 

•  security  and  surveillance 

• entertainment and media

• toys

Development of a computational sensor does not simply mean combining
sensing capability with processing algorithms. It requires new thinking.
Most of the current vision algorithms, for example, are strongly
influenced by the fact that image data is provided in a stream and
processed by instructions. Also, the concept of frame rate (i.e.,
considering a certain number of discrete frames per second) is dominant
in dealing with time-varying events. However, a computational sensor can
take advantage of the inherent two-dimensional nature of the sensory
data arrangement, the continuous time-domain signal, and the physics of
the medium (e.g., silicon) itself for processing. This type of new
thinking often results in a completely different, more efficient,
orders-of-magnitude faster “algorithm”. Many of the successful examples
mentioned above and in Section 3 are the results of such new algorithms.

Finally, computational sensors can create a fundamental change in the
approach to the sensor system as a whole. When a sensor is bulky,
expensive and slow, it is not feasible, either economically or
technically, to place many of them within a system. The sensor system is
forced to be centralized. If computational sensors can provide cheaper,
smaller, and faster sensing units, we can place a large number of
sensors throughout a system, such as covering the whole surface of a
submersible vehicle. A new opportunity exists to make sensor systems
more distributed, reliable, and responsive.

3 Computational Sensors: Some Examples

This section reviews computational sensor architectures that have
emerged in recent years:

1. The focal plane computational sensor: Processing is done on a focal
plane, i.e. the sensing and processing elements are tightly coupled;

2. The spatio-geometrical computational sensor: Computation takes place
via the inherent geometrical structure and/or optical properties of
the sensor;

3. The VLSI computational module: Sensor and processing element are not
tightly coupled, but processing is done on a tightly coupled module.
Many existing systems would fall into several of the above categories.
Representative examples of each category are presented here.

Although most examples we give are of visual information processing,
these considerations and techniques extend directly to measurement over
the whole spectrum of electromagnetic radiation. In general, any other
“imaging sensors”, such as mechanical (e.g. tactile) or magnetic
sensors, could also benefit from lessons learned when considering and
designing computational sensors for vision applications.

3.1  The  focal  plane  architecture 

The focal plane architecture tightly couples processing and sensing
hardware: each sensing site has a dedicated processing element. The
sensor and the processing element (PE) are located in close physical
proximity, thus reducing data transfer time to PEs. Each PE operates on
the signal of its sensor. However, depending on the algorithm, each PE
may need the signals of neighboring sensors or PEs. This concept
corresponds to the SIMD paradigm of parallel computer architectures. In
computational sensors, the operands are readily distributed over an
array of PEs as they are being sensed.

Cell  Parallelism 

Gruss and Kanade [26] [27] [40] at Carnegie Mellon have developed a
computational sensor for range detection based on light-stripe
triangulation. The sensor consists of an array of cells, each cell
having both a light detector and a dedicated analog-circuit PE. The
light stripe is swept continuously across the scene to be measured. The
PE in each cell monitors the output of its associated photoreceptor,
recording a time-stamp when the incident intensity peaks. The processing
circuitry uses peak detection to identify the stripe and an analog
sample-and-hold to record time-stamp data. Each time-stamp fixes the
position of the stripe plane as it illuminates the line-of-sight of that
cell. The geometry of the projected light stripe is known as a function
of time, as is the line-of-sight geometry of all cells. Thus, the 3-D
location of the imaged object points (“range pixels”) can be determined
through triangulation. The cells operate in a completely parallel manner
to acquire a frame of 3-D range data, so the spatial resolution of the
range image is determined solely by the size of the array. In the
current CMOS implementation, an array of 28 x 32 cells has been
fabricated on a 7.9mm x 9.2mm die.
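The time-stamp triangulation for a single cell can be sketched in a two-dimensional (x, z) slice. The geometry below is hypothetical and chosen only for illustration: the camera sits at the origin, the stripe projector sits at a baseline offset along x, and the stripe angle varies linearly with time. None of these parameters come from the chip described above, and the digital peak search stands in for the analog peak detector.

```python
import math


def peak_time(samples, dt):
    """Time stamp of the intensity peak, as a cell's peak detector would record it."""
    peak_index = max(range(len(samples)), key=samples.__getitem__)
    return peak_index * dt


def range_from_timestamp(t, alpha, baseline, phi0, omega):
    """Intersect a cell's line of sight with the stripe plane at time t.

    Line of sight: x = z * tan(alpha), through the camera at the origin.
    Stripe plane:  x = baseline + z * tan(phi0 + omega * t), through the projector.
    """
    phi = phi0 + omega * t
    z = baseline / (math.tan(alpha) - math.tan(phi))
    return z * math.tan(alpha), z


# Hypothetical setup: projector 0.5 m to the side, stripe swept at 2 rad/s.
baseline, phi0, omega, dt = 0.5, -1.0, 2.0, 1e-4
true_x, true_z = 0.1, 1.0
alpha = math.atan2(true_x, true_z)
t_hit = (math.atan2(true_x - baseline, true_z) - phi0) / omega

# Simulated photoreceptor output: a narrow intensity pulse as the stripe passes.
samples = [math.exp(-((i * dt - t_hit) / 0.01) ** 2) for i in range(10000)]
x, z = range_from_timestamp(peak_time(samples, dt), alpha, baseline, phi0, omega)
```

Because every cell records its own time stamp, the same computation applies independently to each cell of the array.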

Keast and Sodini [41] at MIT have designed and fabricated a focal plane
processor for image acquisition, smoothing, and segmentation. The
processor is based on clocked analog CCD/CMOS technology. The light
signal is acquired as an accumulated charge. The neighboring PEs share
their operands in order to smooth data. In one iteration, each PE sends
one quarter of its charge to each of its four neighbors. The charge
meets halfway between the pixels and mixes in a single potential well.
After mixing, the charge is split in half and returned to the original
PE, approximating Gaussian smoothing. However, the segmenting circuit
will prevent this mixing if the absolute difference between the
neighboring pixels is greater than a given threshold. A 40 x 40 array
with a cell size of about 150 x 150 microns is currently being
fabricated.
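One iteration of this charge-sharing scheme can be simulated digitally. The sketch below follows the description above, with two assumptions that are not stated in the source: a quarter-packet sent toward a missing neighbor at the array border simply stays with its pixel, and the threshold comparison uses the pixel values at the start of the iteration.

```python
def smooth_step(charge, threshold):
    """One charge-sharing iteration on a 2-D grid given as a list of rows."""
    rows, cols = len(charge), len(charge[0])
    out = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            total = 0.0
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                nr, nc = r + dr, c + dc
                packet = charge[r][c] / 4.0  # quarter of the charge per direction
                in_grid = 0 <= nr < rows and 0 <= nc < cols
                if in_grid and abs(charge[r][c] - charge[nr][nc]) <= threshold:
                    # The two quarter-packets mix in a shared well and each
                    # pixel takes back half of the mixture.
                    packet = (packet + charge[nr][nc] / 4.0) / 2.0
                total += packet
            out[r][c] = total
    return out


# A step edge: a small threshold blocks mixing across the edge, a large one does not.
image = [[0.0, 0.0, 10.0, 10.0]]
segmented = smooth_step(image, threshold=5.0)
blurred = smooth_step(image, threshold=20.0)
```

In this model total charge is conserved, since every mixed pair returns exactly what it received.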

Use  of  Media  Physics  (Resistive  Grid) 

Some algorithms can exploit the physics of the VLSI layers to achieve
“processing” in a computational sensor. Carver Mead at Caltech has
developed a set of subthreshold CMOS circuits for implementing a variety
of vision circuits. The best known design is the “silicon retina”, a
device which computes the spatial and temporal derivatives of an image
projected onto its phototransistor array. The photoreceptor consists of
a phototransistor feeding current into a node of a 48 by 48 element
hexagonal resistive grid with uniform resistance values R. The
photoreceptor is linked to the grid by a conductance of value G. An
amplifier senses the voltage between the receptor output and the network
potential. The circuit computes the Laplacian of an image, while
temporal derivatives are obtained by adding a capacitor to each node.

Another example which exploits resistive grids to achieve signal
processing is the blob position and orientation circuit developed by
Standley, Horn, and Wyatt at MIT [83] [84]. Light detectors are placed
at the nodes of a rectangular grid made of polysilicon resistors. The
photo-current is injected into these nodes and the current flowing out
of the perimeter of the grid is monitored. The injected photocurrent and
the grid perimeter current are related through Green’s theorem; based on
sensed perimeter current, information to compute the first and second
moments of the blob is extracted at 5000 frames/sec. An array of 29 x 29
cells has been fabricated on a 9.2mm x 7.9mm die.
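The quantities the circuit reports can be illustrated with a direct digital computation. Note that this sketch does not model the resistive grid or the perimeter currents: it computes the first and second moments by summing over pixels, which is a swapped-in method used only to show how position and orientation follow from those moments. The function name `blob_pose` is hypothetical.

```python
import math


def blob_pose(image):
    """Centroid and principal-axis orientation of a blob from its image moments."""
    m00 = m10 = m01 = 0.0
    for y, row in enumerate(image):
        for x, value in enumerate(row):
            m00 += value
            m10 += x * value
            m01 += y * value
    cx, cy = m10 / m00, m01 / m00  # first moments give the position

    mu20 = mu02 = mu11 = 0.0
    for y, row in enumerate(image):
        for x, value in enumerate(row):
            mu20 += (x - cx) ** 2 * value
            mu02 += (y - cy) ** 2 * value
            mu11 += (x - cx) * (y - cy) * value
    # Second central moments give the orientation of the principal axis.
    theta = 0.5 * math.atan2(2.0 * mu11, mu20 - mu02)
    return cx, cy, theta


# A thin blob along the image diagonal, in pixel coordinates (x right, y down).
diagonal = [[1.0 if x == y else 0.0 for x in range(5)] for y in range(5)]
cx, cy, theta = blob_pose(diagonal)
```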

3.2 Spatio-Geometric and Optical Computational Sensors

Some computational sensors are based on the “computation” performed by
virtue of the special geometry or optical material of the sensor array.

Log-Polar  Sensor 

The University of Pennsylvania’s log-polar sensor, developed by Kreider
and Van der Spiegel [47] [48] [73] [77] in collaboration with Sandini of
the University of Genova and researchers at IMEC in Belgium, has a
radially varying spatial resolution. A high-resolution center is
surrounded by a lower-resolution periphery in a design resembling a
human retina. A sensor that has a high spatial resolution area, like a
fovea in a human retina, is often termed a foveating sensor. The image
is first mapped from log-polar to the Cartesian plane. There is evidence
that in biological systems this type of mapping takes place from eye to
brain. The authors have shown that transformations involving
perspective, such as optical flow and rotation, are simplified with such
a mapping. This sensor must be mechanically foveated for a specific
region of interest, and current research concentrates on applying this
chip to robotics.
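One way to see why rotation is simplified by a log-polar mapping is to express a point in (log r, theta) coordinates. The sketch below is a general property of the mapping and makes no claim about the sensor's actual pixel layout; the function name is hypothetical. A rotation about the center becomes a pure shift in theta, and a scaling about the center becomes a pure shift in log r.

```python
import math


def to_log_polar(x, y):
    """Map a Cartesian point (relative to the sensor center) to (log r, theta)."""
    return math.log(math.hypot(x, y)), math.atan2(y, x)


def rotate_and_scale(x, y, angle, scale):
    """Rotate a point about the origin by `angle` and scale it by `scale`."""
    c, s = math.cos(angle), math.sin(angle)
    return scale * (x * c - y * s), scale * (x * s + y * c)


rho0, theta0 = to_log_polar(3.0, 4.0)
rho1, theta1 = to_log_polar(*rotate_and_scale(3.0, 4.0, angle=0.3, scale=2.0))

# In log-polar coordinates the transformation is a translation.
shift_rho = rho1 - rho0        # close to log(2)
shift_theta = theta1 - theta0  # close to 0.3
```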

Bederson, Wallace, and Schwartz [7] at New York University and Vision
Application, Inc. designed a log-polar sensor as well. The VLSI sensor
itself is in the process of being fabricated. An additional interesting
part of their system is a miniature pan-tilt actuator called the
Spherical Pointing Motor (SPM). The SPM is capable of carrying and
orienting the sensor. It is an accurate, fast, small, and inexpensive
device with low power requirements and is suitable for active vision
applications.

Another foveating sensor has been designed by Kosonocky, Wilder and
Misra at Rutgers University. The objective was to design a sensor whose
foveal region(s) will be able to expand, contract and roam in the
field-of-view. The chip is, in essence, a 512x512 square array with the
ability to “merge” its pixels into regions, and output only one value
for each such rectangular “super pixel”. The largest super pixel is an
8x8 region. There are three modes of operation. In Variable Resolution
Mode, the resolution of the entire chip can be selected from highest to
lowest, or anywhere in between. The Multiple Region of Interest mode
provides multiple active windows, possibly with different resolutions,
while reading data out from the rest of the array is inhibited. The
third mode is a combination of the first two modes. This third mode
would resemble the sampling of a human retina if so programmed. The
design permits multiple foveae within the retina. The authors
demonstrated significant speed-up in data acquisition for a variety of
tasks from industrial inspection to target tracking.

Hexagonal  Tessellation 

Hexagonal sampling tessellates the frequency plane more efficiently than
rectangular sampling.* Poussart and Trembley [91] at Laval designed a
200 x 200 array with a hexagonal grid. This chip facilitates parallel
access to the data in a particular local neighborhood. For rapid
convolution, this local neighborhood is subsampled along three principal
axes of the grid, thus reducing the data needed for convolution in the
local neighborhood of each pixel. Their MAR (Multi-port Array
Photo-Receptor system) performs zero-crossing detection at seven spatial
frequencies in 16 milliseconds. Edge detection is computed in real time.

Binary  Optics 

By etching desired geometrical shapes directly into the surface of an
optical material, a designer can produce optical elements with
properties that were previously impossible to achieve. This method,
called binary optics, can perform simple optical processing before the
light is detected.

As VLSI microlithographic techniques have advanced, inexpensive
fabrication of binary optical devices has become possible [93].
Veldkamp of Lincoln Lab at MIT has developed a micro lens array in which
each lens is only 200 microns in diameter. One application of such an
array would be to focus light onto tiny photodetectors, thus saving
silicon area for processing hardware. Some of the first applications of
the idea are already on the market: Hitachi FP-C10 Hi-8 video coders use
a micro-lens array CCD, and the Sony XC-75 video camera doubles the
sensitivity to f8 at 2000 lux using its HyperHAD CCD structure, which
uses micro lenses. In addition, binary optics devices have been applied
to automatic target recognition and space applications.

*Nature prefers hexagonal sampling, which is actually found in the
mammalian retina.

McHugh of Hughes Danbury Optical Systems experimented with binary
optical techniques and found that they can generate virtually any
transformation of an optical wave front. The first application that used
this new capability was a binary optical component that optically mapped
the log-polar plane to the Cartesian plane. This device, in effect,
samples images at log-polar resolution and optically transforms them for
sensing on a Cartesian grid. This way an optical log-polar foveating
sensor is produced, while the mapping to the Cartesian plane has become
“free of charge”.

Color  and  Polarization 

Wolff at Johns Hopkins University uses liquid crystal polarizers whose
polarization angles are electronically controlled [100]. It has been
reported that by eliminating mechanical rotation of filters, switching
time between different polarization angles is reduced, and accuracy of
results is improved. Wolff hopes to build polarization cameras with
polarizers in each element of the CCD array for acquisition of polarized
images in real time. For specularity detection, material classification
and object recognition, color and polarization carry independent and
complementary information: polarization for specularity, and color for
diffuse surfaces and light sources. Sensors for real-time combination of
both color and polarization images will add rich information to vision
systems.

3.3 Computational Modules for Sensory Information Processing

While not strictly a computational “sensor”, there is a class of
computational modules for sensory information processing which exploits
VLSI technologies in a manner similar to computational sensors.

These computational modules are useful when there is not enough space on
a single chip to accommodate complex PEs, or the data to be processed
comes from other modules.

Smoothing and Optimization by Resistive Networks

At Caltech, several regularization techniques have been implemented
on-chip. For example, consider the problem of fitting a 2D surface to a
set of sparse, noisy depth measurements by imposing a “smoothness”
constraint. This method produces quadratically varying functions. This
can be solved using simple linear resistive networks by virtue of the
fact that the electrical power dissipated in linear networks is
quadratic in the current or voltage [71].

Mapping 2D motion algorithms onto analog chips has turned out to be
surprisingly difficult. A robust motion detection circuit implemented in
analog VLSI has yet to be demonstrated, but an early effort was made by
Tanner at Caltech [88] [89]. He successfully built and tested an 8x8
pixel chip that outputs a single uniform velocity averaged over the
entire image. His chip reports values of x and y velocity which minimize
the least-squares error in the image brightness constraint equation.
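The least-squares problem solved by such a chip can be written out digitally. The sketch below assumes the usual form of the brightness constraint equation, Ix*u + Iy*v + It = 0, and solves the resulting 2 x 2 normal equations in closed form. It is a numerical illustration with hypothetical names, not a model of the analog circuit.

```python
def global_velocity(gradients):
    """Single (u, v) minimizing the sum of (Ix*u + Iy*v + It)^2 over all pixels.

    `gradients` is an iterable of (Ix, Iy, It) triples, one per pixel.
    """
    sxx = sxy = syy = sxt = syt = 0.0
    for ix, iy, it in gradients:
        sxx += ix * ix
        sxy += ix * iy
        syy += iy * iy
        sxt += ix * it
        syt += iy * it
    det = sxx * syy - sxy * sxy
    if det == 0.0:
        raise ValueError("gradients do not constrain both velocity components")
    # Normal equations: [sxx sxy; sxy syy] [u; v] = [-sxt; -syt]
    u = (-sxt * syy + syt * sxy) / det
    v = (-syt * sxx + sxt * sxy) / det
    return u, v


# Synthetic gradients generated from a known velocity (1.5, -0.5).
spatial = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (2.0, -1.0)]
data = [(ix, iy, -(1.5 * ix - 0.5 * iy)) for ix, iy in spatial]
u, v = global_velocity(data)
```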

Bair and Koch have successfully built an analog VLSI chip that computes
zero crossings of the difference of Gaussians. It takes the difference
between two copies of an image, supplied by a 1-D array of 64
photoreceptors, each smoothed by a separate linear first-order resistive
network, and reports the zero-crossings in this difference [6]. This
implementation has the particular advantage of exploiting the smoothing
operation naturally performed by resistive networks, and therefore
avoids the burden of additional circuitry. The network resistance and
the confidence of the photoreceptor input are independently adjustable
for each network. Also, an adjustable threshold on the slope of
zero-crossings can be set to cause the chip to ignore weak edges due to
noise.
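A digital stand-in for this computation is sketched below. Each resistive network is modeled by its node equations, (1 + 2k) v[i] - k v[i-1] - k v[i+1] = input[i], where k = 1/(R*G) is the ratio of lateral coupling to input conductance; this model, the tridiagonal solver, and all parameter values are assumptions for illustration and are not taken from the chip.

```python
def resistive_smooth(signal, k):
    """Steady-state node voltages of a 1-D resistive network (tridiagonal solve)."""
    n = len(signal)
    diag = [1.0 + 2.0 * k] * n
    diag[0] = diag[-1] = 1.0 + k  # end nodes have only one neighbor
    c_prime = [0.0] * n
    d_prime = [0.0] * n
    c_prime[0] = -k / diag[0]
    d_prime[0] = signal[0] / diag[0]
    for i in range(1, n):  # forward elimination
        denom = diag[i] + k * c_prime[i - 1]
        c_prime[i] = -k / denom
        d_prime[i] = (signal[i] + k * d_prime[i - 1]) / denom
    out = [0.0] * n
    out[-1] = d_prime[-1]
    for i in range(n - 2, -1, -1):  # back substitution
        out[i] = d_prime[i] - c_prime[i] * out[i + 1]
    return out


def zero_crossings(signal, k_narrow, k_wide, min_slope):
    """Indices i where the difference of the two smoothed copies changes sign
    between i and i + 1 with at least the given slope."""
    narrow = resistive_smooth(signal, k_narrow)
    wide = resistive_smooth(signal, k_wide)
    diff = [a - b for a, b in zip(narrow, wide)]
    return [i for i in range(len(diff) - 1)
            if diff[i] * diff[i + 1] < 0 and abs(diff[i + 1] - diff[i]) >= min_slope]


# A step edge in the middle of a 64-element array.
step = [0.0] * 32 + [1.0] * 32
edges = zero_crossings(step, k_narrow=1.0, k_wide=4.0, min_slope=0.05)
```

Raising `min_slope` plays the role of the adjustable threshold that suppresses weak edges.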

Binary line processes which model discontinuities in intensity within
the stochastic framework of Markov Random Fields provide a method to
detect discontinuities in motion, intensity, and depth. This is achieved
by selectively imposing the smoothness assumption. Harris and Koch have
invented the “resistive fuse”, which is the first hardware circuit that
explicitly implements line processes in a controlled fashion [31]. Like
a normal house fuse, a resistive fuse operates as a linear resistor for
small voltage drops and as an open circuit for large voltage drops. A
20x20 rectangular grid network of fuses has been demonstrated for
smoothing and segmenting test images which are scanned onto the chip.
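The fuse behavior described above can be written as an idealized current-voltage rule. The sharp cutoff and the parameter names below are simplifying assumptions; a physical circuit would have a smoother transition between the two regimes.

```python
def fuse_current(voltage, resistance, v_break):
    """Idealized resistive fuse: linear resistor below v_break, open circuit above."""
    if abs(voltage) <= v_break:
        return voltage / resistance
    return 0.0


# Small differences between neighboring nodes are smoothed (current flows),
# while large differences are treated as discontinuities (no current).
small = fuse_current(0.1, resistance=1000.0, v_break=0.5)
large = fuse_current(1.0, resistance=1000.0, v_break=0.5)
```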

Pyramid 

Van der Wal and Burt at the David Sarnoff Research
Center developed a VLSI pyramid chip, PYR [94].
Combined with an external framestore, the PYR chip
is capable of computing Gaussian and Laplacian
pyramid transforms simultaneously. These transforms
consist of Gaussian filtering and consecutive
subsampling and, for the Laplacian, image subtraction.
The chip has a separable 5 by 5 filter and four
1024-sample-long delay lines. Each filter tap has
a preassigned set of possible values. Coefficient
values from this set can be changed under software
control. PYR has special features such as double
precision, double sample density, image border
extension, and automatic timing control. At 15 MHz
a single chip can compute Gaussian and Laplacian
pyramids at 44 frames/second for 512 by 480 images.
PYR is implemented in digital VLSI using the
CMOS standard cell library from VLSI Technology,
Inc. Digitized image samples pass through the chip
sequentially, in raster scan order.
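The transforms named above can be sketched in a few lines. This is an illustration only and makes several assumptions that are not taken from the PYR design: the 5-tap binomial kernel [1, 4, 6, 4, 1]/16 is one common choice of taps, border extension is done by replicating edge samples, and each Laplacian level is formed by subtracting the filtered image from the unfiltered one before subsampling. The names KERNEL, blur, and pyramids are hypothetical.

```python
KERNEL = [1 / 16, 4 / 16, 6 / 16, 4 / 16, 1 / 16]  # assumed binomial taps


def blur_rows(img):
    """Filter every row with the 5-tap kernel, replicating border samples."""
    out = []
    for row in img:
        n = len(row)
        new_row = []
        for x in range(n):
            acc = 0.0
            for k, w in enumerate(KERNEL):
                xx = min(max(x + k - 2, 0), n - 1)  # border extension
                acc += w * row[xx]
            new_row.append(acc)
        out.append(new_row)
    return out


def transpose(img):
    return [list(col) for col in zip(*img)]


def blur(img):
    """Separable 5 by 5 filter: rows first, then columns."""
    return transpose(blur_rows(transpose(blur_rows(img))))


def pyramids(img, levels):
    """Return (gaussian, laplacian) lists of images.

    gaussian[k + 1] is the filtered and 2:1 subsampled gaussian[k];
    laplacian[k] is gaussian[k] minus its filtered version.
    """
    gaussian, laplacian = [img], []
    for _ in range(levels):
        current = gaussian[-1]
        smooth = blur(current)
        laplacian.append(
            [[a - b for a, b in zip(r, s)] for r, s in zip(current, smooth)]
        )
        gaussian.append([row[::2] for row in smooth[::2]])
    return gaussian, laplacian


image = [[float((x + y) % 4) for x in range(8)] for y in range(8)]
g, l = pyramids(image, 2)
print([len(level) for level in g])
```

As a rough consistency check on the quoted frame rate, and assuming one sample is processed per clock: a full pyramid holds about 4/3 as many samples as its base image, so 512 x 480 x 44 x 4/3 is about 14.4 million samples per second, which fits within a 15 MHz clock.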

4  Issues 

Successful development of a computational sensor
relies on careful consideration of several issues,
including:

•  choice  of  the  circuitry;  digital  vs.  analog  elec¬ 
tronics,  choice  of  sensors  with  respect  to  spec¬ 
tral  bandwidth  (color)  and  polarizers, 

•  choice  of  an  algorithm, 

•  state-of-the-art  VLSI, 

•  prototyping  infrastructure;  design  tools  and 
fabrication  facilities, 

•  applications, 

•  education,  workshops/networking,  literature. 

All of these issues are discussed in the following
sections. 


4.1  Analog  vs.  Digital 

Both  digital  and  analog  circuits  can  be  implemented 
using  VLSI  technology.  The  analog  approach  can 
be conceptually divided into continuous-time (unclocked)
and discrete-time (clocked) processing.
The  choice  of  technology  depends  on  the  particu¬ 
lar  application,  but  several  general  remarks  are  in 
order.  Compared  to  digital,  the  traditional  disad¬ 
vantage  of  analog  electronics  is  its  susceptibility  to 
noise,  yielding  low  precision.  The  source  of  this 
noise  can  be  on-chip  switching  electronics  which 
require  special  considerations  for  hybrid  designs. 
Also,  analog  electronics  do  not  provide  efficient 
long-term  storage;  typical  storage  times  are  about 
one  second.  On  the  other  hand,  digital  processing 
requires  A/D  and  D/A  conversion,  which  usually 
imposes  limitations  on  total  circuit  speed.  Analog 
electronics are characterized by:

•  high  speed, 

•  low  latency, 

•  low  precision  (typically  6  to  8  bits), 

•  short  data  storage  time  (typically  1  second), 

•  sensitivity  to  on-chip  digital  switching;  and 

•  a  long  design  and  testing  process. 
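To give a feel for what 6 to 8 bits of precision means, the sketch below treats an analog value as if it could only be resolved to one of 2^bits evenly spaced levels over a unit range. This is an assumed, simplified model of precision (real analog noise is not uniform quantization), and the names quantize and worst_case_error are hypothetical.

```python
def quantize(x, bits):
    """Round x in [0, 1] to the nearest of 2**bits evenly spaced levels."""
    levels = 2 ** bits - 1
    return round(x * levels) / levels


def worst_case_error(bits, samples=10001):
    """Largest rounding error observed over a uniform sweep of [0, 1]."""
    worst = 0.0
    for i in range(samples):
        x = i / (samples - 1)
        worst = max(worst, abs(quantize(x, bits) - x))
    return worst


for bits in (6, 8):
    # The error is bounded by half a level spacing, 0.5 / (2**bits - 1).
    print(bits, worst_case_error(bits), 0.5 / (2 ** bits - 1))
```

Under this model the worst-case error is bounded by half a level spacing, that is, somewhat under 1% of full scale at 6 bits and about 0.2% at 8 bits.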

In  general,  analog  hardware  takes  less  chip  area 
than  digital  mechanisms  of  the  same  functionality. 
Most  participants  at  the  workshop  were  experts  in 
analog circuitry, which seems to be preferred; however,
many recognized the importance of digital
electronics  for  computational  sensing. 

Analog  VLSI  offers  two  interesting  advantages 
for  computational  sensor  design.  First,  the  physical 
properties  of  the  solid-state  layers  and  devices  can 
sometimes  be  exploited  to  yield  elegant,  new  solu¬ 
tions.  One  such  example  is  to  exploit  the  physics  of 
a  resistive  sheet  (or  dense  grid)  to  compute  desired 
quantities. 

The  second  interesting  advantage  of  analog  VLSI 
is  charge-domain  processing,  best  exemplified  by 
CCD  technology,  which  offers  an  area-efficient 
mechanism  for  transferring  data.  In  addition,  cre¬ 
ative  processing  schemes  can  be  developed  to  pro¬ 
cess  the  data  in  charge-domain  as  it  is  transferred. 
CCD  technology  has  already  provided  several  useful 
examples of integrated sensing and signal processing.

4.2 Algorithms for Computational Sensors

While  the  VLSI  computational  sensor  offers  excit¬ 
ing  opportunities,  one  must  be  careful  in  deciding 
which  algorithms  or  applications  will  benefit  from 
such  an  implementation.  At  the  present  state  of  tech¬ 
nology,  successful  design  of  working  VLSI  circuits, 
especially  analog  ones,  is  a  lengthy  process. 

Algorithms  must  be  carefully  selected  or  invented 
to  match  the  architecture  to  the  circuitry  for  max¬ 
imum  performance  -  there  are  definite  limitations 
on  circuitry  and  architectures.  Circuitry  has  lim¬ 
ited  precision  and  storage.  Until  technology  allows 
much  denser  circuits  (or  3D  structures)  for  example, 
there  is  not  enough  room  to  fabricate  a  complex  PE 
at  each  photo  site. 

Simple cell-parallel algorithms that detect local
cues or integrate local information over time or multiple
channels (e.g., spectrum) at each cell are ideal.

When  a  complex  PE  is  required,  processing  and 
sensing  can  take  place  on  separate,  but  tightly  cou¬ 
pled  (preferably  on-chip)  modules.  The  cost  of 
transferring  data  must  be  minimized  in  order  to  jus¬ 
tify  the  use  of  VLSI  over  conventional  computer 
systems.  CCD  row-parallel  transfer  is  one  way  to 
perform  the  transfer  at  a  reasonable  speed.  Also, 
some  algorithms  do  not  directly  exhibit  parallelism 
in the focal plane: they often require significant local
data storage at each PE. In stereo algorithms, for
example,  optical  signals  are  to  be  combined  from 
two  different  focal  planes.  In  this  case,  data  are 
read  out  and  processed  on  a  separate  computational 
module. 

There  are  optimizations  and  other  techniques  that 
map naturally to physical processes in silicon, such
as  relaxation  processes  implemented  on  resistive 
grids.  The  advantage  of  these  physics-based  pro¬ 
cessors  over  computer  implementation  is  that  they 
minimize  a  multi-dimensional  energy  function  by 
reaching  a  stable  state  of  a  continuous-time  system, 
potentially  reducing  round-off  error  and  numerical 
instability  from  which  an  iterative  solution  by  a  dig¬ 
ital  computer  may  suffer. 
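The idea of settling to the minimum of an energy function can be made concrete with a small simulation. The sketch below is a digital stand-in for the continuous-time process, under stated assumptions: the energy is taken to be E(v) = 1/2 * sum (v_i - d_i)^2 + lam/2 * sum (v_{i+1} - v_i)^2, the network follows dv/dt = -dE/dv, and the dynamics are approximated with small forward-Euler steps. The names energy, settle, lam, dt, and steps are hypothetical.

```python
def energy(v, d, lam):
    """Assumed energy: data fidelity term plus smoothness term."""
    data = sum((a - b) ** 2 for a, b in zip(v, d))
    smooth = sum((v[i + 1] - v[i]) ** 2 for i in range(len(v) - 1))
    return 0.5 * data + 0.5 * lam * smooth


def settle(d, lam=2.0, dt=0.05, steps=2000):
    """Follow dv/dt = -dE/dv with forward-Euler steps of size dt.

    Returns the final node values and the energy after every step.
    """
    v = list(d)
    n = len(v)
    history = [energy(v, d, lam)]
    for _ in range(steps):
        grad = []
        for i in range(n):
            g = v[i] - d[i]                      # pull toward the input
            if i > 0:
                g += lam * (v[i] - v[i - 1])     # current to left neighbor
            if i < n - 1:
                g += lam * (v[i] - v[i + 1])     # current to right neighbor
            grad.append(g)
        v = [x - dt * g for x, g in zip(v, grad)]
        history.append(energy(v, d, lam))
    return v, history


inputs = [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]
v, history = settle(inputs)
print(history[0], history[-1])
```

In this sketch the energy never increases from one step to the next, provided dt is small enough. A physical resistive grid reaches the same stable state without explicit iteration, which is the advantage the paragraph above describes.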

In  summary,  the  following  are  some  general  char¬ 
acteristics  of  algorithms  which  are  good  candidates 
for computational sensor implementation:

• Algorithms that are simple and robust to noise,
and are based on sensor or cue integration,

• Algorithms that exploit a significant level
of parallelism without requiring significant
storage capacity, wafer real estate, or interprocessor
data transfer,

• Algorithms that map naturally to physical processes
encountered in semiconductors,

•  Algorithms  that  could  exploit  the  intercommu¬ 
nication  and  propagation  afforded  by  charge- 
transfer,  surface  acoustic  waves,  and  optical 
properties. 

4.3 VLSI Technology

CMOS,  Bipolar,  and  BiCMOS  are  the  most  avail¬ 
able  VLSI  technologies.  CMOS  is  characterized  by 
very  dense  packaging,  low  power  consumption,  and 
high  input  impedance.  Good  switching  properties 
make it well suited for digital, switching, and hybrid
circuits. It is a widely accessible and relatively
inexpensive technology. CCD’s are implemented in
MOS  technology. 

Bipolar  technology  is  characterized  by  low  noise 
and  fast  circuitry,  but  consumes  more  power  and 
takes  more  substrate  real  estate.  It  is  not  as  accessi¬ 
ble  to  the  wider  research  community  as  it  probably 
should  be. 

BiCMOS  combines  the  advantages  of  both 
CMOS and Bipolar technologies.

Semiconductor materials other than silicon are also
available. GaAs compounds yield very high speed
circuitry  and  are  well  suited  to  electro-optical  ap¬ 
plications.  GaAs  technology  is  less  available,  how¬ 
ever,  and  is  considerably  more  expensive. 

The  trend  in  VLSI  is  toward  smaller  device  ge¬ 
ometries.  This  produces  both  smaller  and  faster 
digital  circuits  and  hence  more  functionality  per  unit 
area.  This  scaling,  however,  is  not  as  beneficial  to 
analog  circuitry  as  to  digital.  Most  active  devices 
are  designed  at  a  given  size  and  scaling  and  would 
not  preserve  desired  functional  features  after  a  scale 
change.  Analog  MOS  circuits  benefit  more  from 
improvements  in  fabrication  process  quality.  Fac¬ 
tors  such  as  oxide  quality  and  thickness,  or  tighter 
control  of  threshold  voltages  would  greatly  benefit 
analog circuit performance.

Great  interest  has  been  shown  in  3D  VLSI.  One 
possibility  is  optical  signal  communication  between 
stacked  chips.  This  could  be  accomplished  with 
the availability of silicon-compatible semiconductor
emitters and IR detectors [90]. This technique
would  also  require  and  exploit  integrated  optics 
capability  such  as  binary  optics.  Alternatively, 
a  conducting  feedthrough  could  be  developed  for 
making  distributed  point-to-point  electrical  connec¬ 
tions  [70]. 

Micro  fiber-optics  could  be  used  to  route  data  in 
parallel  from  module  to  module.  The  optical  ap¬ 
proach  has  the  advantage  of  possible  optical  pro¬ 
cessing  during  the  data  transmission  itself,  but  has 
the  disadvantage  of  high  power  consumption  and 
heat  dissipation.  This  technology  has  not  been 
developed  far  enough  to  become  accessible  to  the 
wider  research  community. 

4.4 Applications

As  VLSI  technology  advances  and  becomes  acces¬ 
sible  to  a  wider  research  community,  a  number  of 
ideas  that  combine  sensing  and  processing  on  a  chip 
are  emerging.  Many  attempts,  however,  are  too 
quick  to  postulate  miraculous  chips  and  systems 
which  have  little  chance  of  ever  working. 

Several  successful  examples  of  computational 
sensors  have  been  driven  by  applications,  and  the 
workshop participants have agreed that this will remain
true for most successful developments. A truly
successful “marriage” of sensing and computation
can  be  done  only  by  careful  analysis  of  application 
requirements  in  conjunction  with  implementation 
technologies. 

While  a  wide  variety  of  applications  are  conceiv¬ 
able,  the  following  are  potential  applications  that 
have  been  suggested  during  the  workshop: 

• A high resolution camera (2000 x 2000 and
up).

•  Face  recognition  for  credit  purchase,  security, 
and  human-computer  interfaces. 

•  An  inexpensive  anti-collision  stereo  sensor  for 
automobiles. 

•  Motion  detection  and  tracking  for  automobiles, 
security,  and  human-computer  interfaces. 

•  Automatic  local  brightness  adjustment  of  im¬ 
ages. 

•  Tactile  sensors  for  material  handling. 


•  Insect  robots  for  the  toy  industry. 

•  High-speed  industrial  inspection,  chip  reticle 
alignment. 

• Document understanding and optical character
recognition. 

•  A  light-weight  amacronic  sensor/display  de¬ 
vice  for  virtual  reality. 

•  Image  compression  for  home  video  appliances. 

•  Medical  sensors/implants. 

•  Automatic  target  recognition  -  signal  prepro¬ 
cessing  for  specialized  sensors  (gain,  bias,  fil¬ 
tering)  and  multi-sensor  integration.  An  ex¬ 
ample  is  a  computational  sensor  to  perform  the 
functions  of  detection,  inscan  calibration,  and 
output  multiplexing  of  FLIR. 

•  Space  robotics  for  orbital  replacement,  satellite 
retrieval,  and  planetary  exploration. 

•  Remotely  and  automatically  piloted  vehicles  - 
sensors  to  make  UGV,  AAV,  AUV  low  cost. 

4.5 Prototyping Infrastructure

Design Tools

An  issue  which  received  unanimous  agreement 
among workshop participants is the lack of analog
VLSI design tools equivalent to those for digital
design.  These  tools  include  design  aids  from  lay¬ 
out  to  testing,  including  extraction,  verification  and 
simulation.  Analog  circuits  are  more  sensitive  to 
parasitics  than  digital  circuits.  Accurate  techniques 
for  including  these  parasitics  in  the  extracted  files 
would  reduce  the  number  of  design  iterations  due  to 
unexpected  circuit  behavior. 

Analog  modeling  and  simulation  capabilities  are 
still  inadequate.  Much  of  the  attention  in  modeling 
is  directed  at  the  effects  of  extremely  short  chan¬ 
nel  lengths  on  MOS  transistor  operation.  Analog 
design  rarely  uses  minimum  size  transistors,  but  is 
more  critically  dependent  upon  operating  under  a 
different  bias  condition:  subthreshold  and  saturation 
regions.  The  proper  modeling  of  bias-dependent  ca¬ 
pacitances  is  critical  for  modeling  circuit  dynamics 
and  stability.  There  is  little  or  no  support  for  simulat¬ 
ing  charge-domain  devices  like  CCD’s.  Statistical 
modeling is an important predictive element of analog
design, providing assurance that the resulting
circuits  will  meet  the  prescribed  design  constraints. 
Without  it,  a  circuit  may  be  functional  and  within 
specifications  for  a  given  process  model,  but  actual 
process  variation  may  result  in  an  out-of-spec  or 
inoperable  circuit. 

It  has  been  noted  that  a  data  book  for  standard 
analog cells would be very useful. While it will be
more difficult than in the digital domain, it is necessary
to  develop  a  library  of  standard  building  blocks  of 
compatible  electronic  and  sensor  components  with 
which  one  can  design  a  new  computational  sensor. 

Fabrication  Facilities 

The  MOSIS  Service  is  a  prototyping  service  offer¬ 
ing  fast-turnaround  standard  cell  and  full-custom 
VLSI  circuit  development  at  very  low  cost.  The 
MOSIS Service, begun in 1980, provides fabrication
services to government contractors, agencies,
and  university  classes  under  the  sponsorship  of 
the  Defense  Advanced  Research  Projects  Agency 
(DARPA)  with  assistance  from  the  National  Sci¬ 
ence  Foundation  (NSF).  MOSIS  has  developed  a 
methodology  that  allows  the  merging  of  many  dif¬ 
ferent  projects  from  various  organizations  onto  a 
single  wafer.  Instead  of  paying  for  the  cost  of  mask¬ 
making,  fabrication,  and  packaging  for  a  complete 
run (currently between $50,000 and $80,000), MOSIS
users pay only for the fraction of the silicon
that  they  use,  which  can  cost  as  little  as  $400.  Ini¬ 
tially,  the  MOSIS  user-base  was  primarily  university 
and  government  users.  MOSIS’  success  in  serving 
this  group  of  users  led,  in  recent  years,  to  a  natu¬ 
ral  expansion  into  the  industrial  sector,  with  rapidly 
growing  use  of  MOSIS  by  commercial  companies. 
MOSIS  foundries  have  also  taken  advantage  of  the 
frequent  prototype  runs  for  their  own  needs  as  well 
as  those  of  their  clients.  MOSIS  is  located  at  the 
Information  Sciences  Institute  of  the  University  of 
Southern  California  (USC/ISI)  in  Marina  del  Rey, 
California. 

The  MOSIS  program  has  been  a  successful  mech¬ 
anism  for  promoting  VLSI  applications.  MO¬ 
SIS’  ease  of  access,  quick  turnaround,  and  cost- 
effectiveness  have  afforded  designers  opportuni¬ 
ties  for  frequent  prototype  iterations  that  otherwise 
might  not  even  have  been  considered.  With  MOSIS’ 
low cost for “tiny-chip” fabrication, silicon can be
used as a rapid prototyping vehicle. Small functional
building blocks can be easily fabricated and
tested  before  too  much  time  is  invested  in  building 
and  integrating  a  full  system.  Furthermore,  many 
ideas  and  needed  intuition  can  be  gained  through 
“playing” with these actual working chips. Successful
designers of existing functional computational
sensors  have  reported  that  silicon  prototyping,  com¬ 
bined  with  higher  level  algorithm  simulation,  has 
proven  to  be  a  useful  system-building  approach  in 
computational  sensors. 

MOSIS offers two monthly runs of a standard
2 µm, double-layer-metal CMOS process. One of
these runs usually includes a second layer of polysilicon.
Typically these designs are fabricated, bonded,
and returned in about two months. In addition to
these standard runs, a 1.2 µm CMOS run goes out
about once every month and there are more infrequent
runs at 0.8 µm. Every other month includes
a low-noise 2 µm analog CMOS run which has options
for second poly, an NPN bipolar transistor in
the n-well, and a buried channel CCD.

MOSIS’s capability, however, is limited for the
research and development of computational sensors.
Quality bipolar and depletion-mode MOS devices
are  unavailable.  MOSIS  is  beginning  to  offer  GaAs 
(instead  of  the  more  usual  Silicon)  process  runs  on 
a  regular  basis. 

At  this  point,  MOSIS  does  not  provide  a  capa¬ 
bility  for  optical  electronics  fabrication.  University 
researchers  must  rely  on  teaming  with  industries 
which  have  the  fabrication  capability  in  this  area.  It 
is  noteworthy  that  both  the  European  research  com¬ 
munity  and  the  Japanese  micro-sensor  project  will 
have common facilities including capabilities for
optical  electronics  fabrication. 

4.6  Education,  Workshops/Networking 
and  Literature 

Understanding  semiconductor  and  device  physics 
as well as techniques for marketing custom-made
integrated  circuits  are  essential  prerequisites  to  de¬ 
veloping  a  successful  computational  sensor.  For 
the  complete  success  of  a  computational  sensor,  av¬ 
enues  of  communication  between  VLSI  designers, 
computer  vision  researchers,  and  product  develop¬ 
ers  must  be  developed.  These  groups  would  ex¬ 
change  information  about  the  opportunities  and  dif¬ 
ficulties  in  each  others’  fields.  Vision  (and  other 
sensor) researchers must be made aware of what is
available  in  VLSI  technology,  and  VLSI  designers 
must  understand  the  problems  of  machine  vision. 
This  workshop  was  very  productive.  It  was  recom¬ 
mended  that  follow-on  workshops  or  conferences 
be  held. 

It  was  proposed  that  universities  and  industries 
team up to allow students to obtain more hands-on
experience.  This  is  an  old  idea  that  still  has  diffi¬ 
culty  working  in  practice.  Namely,  most  students 
and  university  professors  are  more  likely  to  under¬ 
take  theoretical  research  than  to  work  on  the  “real 
thing”.  This  is  primarily  due  to  the  fact  that  deal¬ 
ing  with  hardware  tends  to  extend  time  in  graduate 
school  for  students,  and  reduce  the  publishing  rate  of 
professors.  This  problem  received  some  attention, 
and  reviews  of  academic  standards  were  suggested. 
It  was  suggested  that  more  credit  should  be  given  to 
efforts  which  produce  working  prototype  devices  or 
systems. 

The  body  of  experience  and  knowledge  of  com¬ 
putational  sensors  is  currently  scattered  over  a  large 
number  of  disciplines  and  corresponding  publica¬ 
tions.  Publications  range  from  journals  on  elec¬ 
tronic  circuits  and  signal  processing  to  publications 
on  neural  networks  and  vision  research.  To  ef¬ 
fectively  communicate  knowledge  about  computa¬ 
tional  sensors,  it  was  suggested  that  a  new  journal 
be  created. 

Another type of cooperation is to distribute working
prototype sensors among the user community.
An excellent example is the log-polar camera prototype
that the University of Pennsylvania has offered to
share  with  interested  researchers.  This  type  of  co¬ 
operation  is  of  mutual  benefit  to  the  sensor  designers 
as  well  as  to  application  developers.  Designers  of 
the  computational  sensor  receive  much  needed  feed¬ 
back  about  the  actual  need  and  practical  value  of  the 
sensor,  while  application  researchers  can  investigate 
new  areas  previously  limited  by  the  absence  of  these 
specialized  devices. 

5  Recommendations 

In  light  of  the  previous  analysis,  the  workshop  has 
recommended  the  following: 

1.  Create  a  research  and  development  program 
in  computational  sensors.  The  program  must 
have  the  following  characteristics  (Figure  1): 


•  Interdisciplinary  -  the  program  must  in¬ 
clude  sensing,  algorithms,  VLSI,  mate¬ 
rial,  and  applications; 

• Multi-modal - the program must deal with
not  only  the  image  or  visual  modality,  but 
also  with  other  sensing  modalities  includ¬ 
ing  tactile,  acoustic,  pressure,  accelera¬ 
tion,  chemical,  and  so  on; 

•  Prototyping-oriented  -  individual  projects 
under  this  program  must  be  oriented  to¬ 
ward  producing  working  prototype  de¬ 
vices  or  systems; 

•  Applications  -  individual  projects  must 
identify  potential  applications  and  possi¬ 
ble  avenues  of  technology  transfer  to  real 
world  applications. 

2.  Improve  the  infrastructure  for  research  and  de¬ 
velopment  of  computational  sensors: 

•  Fabrication  facilities  -  MOSIS  (or  similar 
facilities)  must  be  expanded  to  include 
technologies  for  optical  and  mechanical 
sensor  development; 

•  Tools  -  Tools  for  designing  and  testing 
computational  sensors  can  be  far  more 
complicated than they are for standard
VLSI  design.  Standardization,  and  li¬ 
brary  and  tool  development  are  essential; 

•  Education  -  Hands-on  experience  must  be 
provided  to  graduate  students; 

• Networking and workshops - Researchers
in computational sensors are, by the nature of the
field, scattered across multiple disciplines, and
mechanisms such as workshops and consortiums must
be developed to bring them together.


The following bibliography contains papers
collected during and after the workshop
through the contributions of participants.


NSF/DARPA Workshop on Machine Learning and Vision:

A  Summary 


Ryszard S. Michalski
Peter W. Pachowicz
Center for Artificial Intelligence
George Mason University

Azriel Rosenfeld
Yiannis Aloimonos
Center for Automation Research
University of Maryland

The Workshop was sponsored jointly by the National
Science Foundation and the Defense Advanced Research
Projects Agency under Grant No. IRI-9208947.

This report gives a brief account of the
NSF/DARPA Workshop on Machine Learning
and Vision, organized by George Mason
University in collaboration with the University
of Maryland, October 15-17, 1991 in Harpers
Ferry, WV. The purpose of the workshop was to
bring together researchers in vision and learning
to discuss the possibilities of cross-fertilizing the
two fields, and implementing learning
capabilities in vision systems.

The workshop was attended by about 40
participants representing universities, industrial
and governmental laboratories, and several
sponsoring agencies. The workshop started with
two general presentations, one on machine
vision (A. Rosenfeld), and the second on
machine learning (R.S. Michalski). Subsequent
discussions were conducted in three sessions: 1)
Learning in object recognition (organized by J.
Shavlik and T. Poggio), 2) Learning in
navigation (organized by T. Dean and T.
Kanade), and 3) Learning in sensory-motor
control (organized by R. Bajcsy and T.
Mitchell).

The session on object recognition discussed
issues related to types of tasks and important
subtasks in object recognition. Two basic types
were distinguished: learning shape descriptions
and learning surface (texture) descriptions. The
shape learning subtasks were classified
according to their difficulty: isolated object
recognition, recognition of stifle objects in a
scene, and recognition of objects that fulfill a
functional goal (e.g., an object that could be
used as a chair). The following issues were
considered as important for learning in vision:
relationship between 2D and 3D vision, number
of training examples needed, use of prior
knowledge, discovery of good representations,
attribute selection, variability of the
environment, and occlusion.

The session on learning in navigation classified
the problems that require learning along such
dimensions as: constrained vs. unconstrained
navigation, static vs. dynamic navigation, and
structured vs. unstructured environment.
Navigation tasks incorporating learning
functions were analyzed according to shallow vs.
deep inference, resource consideration,
availability of supplementary knowledge, and
complexity of behavior. Issues for machine
learning include (i) learning in a constrained
spatio-temporal context, (ii) building
representations to facilitate planning in a
spatio-temporal context, (iii) memory management
during learning, (iv) combining sensor
information, and (v) optimal feedback control.

The session on learning in sensory-motor
control identified several bottleneck problems.
The major goal of machine learning is viewed as
automatic combining of specific vision and
action modules in a task-independent way. This
includes (i) learning efficient visual search, (ii)
learning invariances that facilitate object
identification under different imaging
transformations and occlusions, (iii) learning
module configuration and coordination for
sensory-motor tasks, and (iv) learning
calibration between sensing and action.

In  summary,  researchers  agreed  that  many 
crucial  elements  of  machine  vision  cannot  be 
considered  in  isolation  from  machine  learning. 
However,  to  be  successful  in  the  integration  of 
learning  and  vision,  researchers  should  pick  a 
particular  vision  problem,  apply  acceptable 
restrictions  on  the  problem,  simplify  the  data, 
find  solutions  by  applying  learning  technology, 
and  then  improve  the  solutions  by  gradually 
relaxing  restrictions  on  the  problem.  It  would 
be  desirable  to  sponsor  several  long-term 
projects  focused  on  both  industrial  and  military 
applications.  The  creation  of  sharable  testbeds  is 
recommended  for  evaluation  of  results. 

Reference 
March 4, 1993


Abstract 

A  DARPA  workshop  on  automatic  sensor  interpretation 
was  held  to  bring  together  government  sponsors  and  peo¬ 
ple  carrying  out  research  and  development  in  fields  re¬ 
lated  to  sensor  interpretation.  We  indicate  here  some  of 
the  speakers  and  their  topics.  As  a  result  of  the  Work¬ 
shop,  a  BAA  was  issued  for  a  university  research  initia¬ 
tive  in  theory  and  strategies  for  automatic  target  recog¬ 
nition  (ATR). 

1  Introduction 

The DARPA Automatic Sensor Interpretation Workshop
was held at George Mason University September
30 - October 2, 1992. The purpose was to bring together
government sponsors and specialists in various
disciplines  related  to  sensor  interpretation.  The  work¬ 
shop  was  motivated  by  an  automatic  target  recognition 
inter-office effort within DARPA involving the Advanced
Systems Technology Office (ASTO), the Software and
Intelligent Systems Technology Office (SISTO), the Defense
Sciences Office (DSO), and the Microtechnology Office
(MTO).

As  a  result  of  the  workshop,  BAA  93-07,  “Univer¬ 
sity  Research  Initiative  into  the  Theory  and  Strate¬ 
gies  for  Automatic  Target  Recognition,”  was  issued  on 
November  4,  1992.  This  BAA  solicited  proposals  from 
U.S.  graduate  education  institutions  to  research  the  the¬ 
ory  and  strategies  for  ATR.  Of  particular  interest  were 
those  approaches  that  investigated  the  fundamental  ba¬ 
sis  for  detecting  and  recognizing  dim  or  obscured,  quasi- 
resolved  targets  located  in  severe  ground  clutter,  and 
which  would  validate  their  results  using  real  data.  Eval¬ 
uation  of  the  proposals  was  completed  in  February  1993, 
and  the  contracts  should  be  finalized  by  the  middle  of 
1993. 

2 Summary of the Workshop

The best way to capture the flavor of the workshop is to
present the titles and authors of some of the key
presentations.

2.1 Government Presentations

• “Energizing DARPA Application Programs with
University Participation,” Dr. Duane Adams,
Deputy Director, DARPA

• “An Overview of the DARPA Surveillance and Targeting
Programs,” Dr. Larry Stotts, Asst. Director,
Advanced Systems Technology Office, DARPA

• “Issues and Approaches in Automatic Target Recognition,”
Mr. Edward Zelnio, Air Force Wright Aeronautical
Laboratory

• “Night Vision ATR Performance Evaluation,”
Dr. Clarence Walters, Night Vision and Electro-Optic
Directorate

• “High Performance Sensor Processing,” Dr. Dominick
Giglio, Sensors and Processing, ASTO,
DARPA

• “Image Understanding,” Mr. Oscar Firschein, Image
Understanding, Software and Intelligent Systems,
DARPA

• “Spatial Reasoning for Tactical Fusion,”
Mr. Richard Anthony, Signals Warfare Directorate

• “Wavelet Techniques for Detection/Recognition,”
Lt. Col. Jim Crowley, Applied Science, Defense Sciences
Office, DARPA

• “Neural and Gabor Techniques for ATR,” Major
Steve Rogers, Air Force Institute of Technology

• “Neural Nets for ATR,” Dr. Barbara Yoon,
Microtechnology Office, DARPA

2.2 Technical Overviews

Some of the technical overviews were:

• “State of the Art Array Processing Algorithms
for Simultaneous Detection and Estimation,” Prof.
Harry Van Trees and Prof. Yariv Ephraim, George
Mason University

• “Optimal Filtering for Multiband Detection,”
Prof. Irving Reed, University of Southern California

•  “Image  Processing  and  Analysis,  an  Overview,” 
Prof.  Azriel  Rosenfeld,  University  of  Maryland 

•  “Gabor,  Wavelet,  Morphological,  and  Distortion- 
Invariant  Filter  Sets,”  Prof.  David  Casasent, 
Carnegie  Mellon  University 

•  “The  Challenges  of  Automatic  Sensor  Interpreta¬ 
tion,”  Dr.  Dan  Dudgeon,  MIT  Lincoln  Labs 

2.3  Panel  Discussions 

The  panel  discussions  were  of  particular  interest: 

•  “Issues  and  Challenges  in  ATR,”  Prof.  Clay¬ 
ton  Stewart,  George  Mason  University  (modera¬ 
tor),  Mr.  Trent  DePersia,  ASTO-DARPA;  Major 
Steve  Rogers,  Air  Force  Institute  of  Technology; 
Capt.  Steve  Suddarth,  Air  Force  Office  of  Scientific 
Research,  and  Mr.  Edward  Zelnio,  Air  Force  Wright 
Laboratory. 

•  “Issues  and  Challenges  in  Advanced  Sensor  Process¬ 
ing,”  Dr.  Larry  B.  Stotts,  ASTO-DARPA  (moder¬ 
ator),  Dr.  Jurgen  Gobien,  SAIC,  Dr.  Les  Novak, 
MIT  Lincoln  Labs;  Dr.  Serpil  Ayashli,  MIT  Lincoln 
Labs;  Dr.  Dino  Sofianos,  SAIC,  and  Dr.  Jack  Ced- 
erquist,  ERIM. 


3  Conclusions 

This workshop brought together people in various
disciplines related to advanced ATR. As a result of the
workshop it was realized that new ideas from the university
community are required in ATR to deal with the new
sophisticated threats such as fleetingly mobile targets
in “deep hide,” decoys and deception, and short
exposure times. In addition, it is necessary for ATR systems
to function in day or night, or adverse weather. It is
hoped that the research contracts resulting from BAA
93-07, “University Research Initiative into the Theory
and Strategies for Automatic Target Recognition,” can
provide some of these new ideas.


2.4  Technical  Papers 

The technical papers, too numerous to list, were in the
areas of SAR, IR, and laser sensors; multisensor fusion;
use of wavelets and Gabor techniques; neural networks;
and image understanding. Of particular interest to the
IU community were the following papers in the Image
Understanding Session:

•  “Computational  Sensors,”  Prof.  Takeo  Kanade, 
Carnegie  Mellon  University 

•  “Active  Vision,  Task-oriented  Vision,”  Prof.  Chris 
Brown,  University  of  Rochester 

•  “Understanding SAR images,” Prof. Rama
Chellappa, University of Maryland

•  “Fusion  of  Multispectral  and  Panchromatic  Imagery 
for  Automated  Cartography,”  Prof.  Dave  McKeown, 
Carnegie  Mellon  University 
Young Investigator Reports


Steven Abrams, Columbia University
Planning  Viewpoints  in  an  Active  Environment 


Bruce  Draper,  University  of  Massachusetts 
Learning  in  Vision 


Noah  S.  Friedland,  University  of  Maryland 
An  Integrated  Approach  to  Object  Recognition 


Robert  Mandelbaum,  University  of  Pennsylvania 
Multi-Sensor Fusion for a Multi-Agent Robotic System


Ray  Rimey,  University  of  Rochester 
Studying Control of Selective Perception with T-World and TEA


Jefferey  Shufelt,  Carnegie  Mellon  University 
Incorporating  Vanishing  Point  Geometry  into  a  Building  Extraction  System 


William M. Wells III, Massachusetts Institute of Technology
Statistical Object Recognition with the Expectation-Maximization Algorithm


Mourad  Zerroug,  University  of  Southern  California 
From  Monocular  Intensity  Images  to  Volumetric  Descriptions 


Section VIII

Cartography/Photogrammetry 
Camera  Calibration 
P.O. Box 8, Schenectady, NY, 12301.
Abstract

In this paper, a method of determining the essential
matrix for uncalibrated cameras is given, based on
line matches in three images. The three cameras may
have different unknown calibrations, and the essential
matrices corresponding to each of the three pairs of
cameras may be determined. Determination of the
essential matrix for uncalibrated cameras is important,
forming the basis for many algorithms such as
computation of invariants, image rectification, camera
calibration and scene reconstruction.

In the case where a fourth view is available, and all
four cameras are assumed to have the same unknown
calibration, the method of [Faugeras-Maybank-92a,
Faugeras-Maybank-92b] may be used to calibrate the
camera. The scene may then be reconstructed
exactly (up to a scaled Euclidean transformation). This
extends previous results of Weng, Huang and Ahuja
([Weng-92]) who gave a method for scene
reconstruction from 13 line correspondences using a calibrated
camera. The present paper shows that the camera
may be calibrated at the same time that the scene
geometry is determined.


1  Introduction 

A traditional approach to analysis of perspective
images in the field of Computer Vision has been to
attempt to measure and model the camera that took
the image. A large body of literature has grown up
seeking to calibrate the camera and determine its
parameters as a preliminary step to image understanding.
The papers [Beyer-92] and [Beardsley-92] represent
two of the latest approaches to camera calibration.
A recent view ([Faugeras-92]) is that camera
calibration is not desirable or necessary in many
image understanding situations. Many authors have
been led to consider uncalibrated cameras. The study
of projective invariants ([Mundy-Zisserman-92]) is an
example of a growing field based on the philosophy of
avoiding camera calibration. In fact, study of
uncalibrated cameras is intimately linked with the study of
projective invariants, for a result of [Faugeras-92] and
[Hartley-Gupta-92] shows that under most conditions
a scene can be determined up to a projective transform
of projective 3-space $\mathcal{P}^3$ by a pair of images taken by
uncalibrated cameras.

Central to the study of pairs of images is the
essential matrix, introduced in [Higgins-81] for calibrated
cameras, but easily extended to uncalibrated cameras.
The essential matrix encodes the epipolar
correspondences between two images. It has been shown to be a
key tool in scene reconstruction from two uncalibrated
views ([Faugeras-92, Hartley-Gupta-92]) as well as for
the computation of invariants ([Hartley-93a]). The
task of image rectification, which seeks to line up
epipolar lines in a pair of images as a preliminary
step to finding image correspondences, can be
accomplished using the uncalibrated essential matrix
([Hartley-93c]) where previous methods have relied
on camera modelling. It is particularly interesting
that the essential matrix may be used for the
calibration of a camera, and consequent scene reconstruction,
given four or more views ([Faugeras-Maybank-92a,
Faugeras-Maybank-92b]). This result provides a
strong argument for not assuming camera calibration
a priori, and underlines the central role of the essential
matrix.

A recent paper by Weng, Huang and Ahuja
([Weng-92]) gave an algorithm for reconstructing a
scene from a set of at least 13 line correspondences
in three images. They assumed a calibrated camera
in their algorithm. It is the purpose of the present
paper to extend their result to uncalibrated cameras,
showing that the essential matrices can be computed
from three uncalibrated views of a set of lines.
It is not assumed that the three cameras all have
the same calibration. In fact, the essential matrices
corresponding to each of the three image pairs
may be computed. In the case where four views with
the same camera are available, however, the result of
[Faugeras-Maybank-92a, Faugeras-Maybank-92b] may
be applied to obtain the complete calibration of the
four cameras and reconstruct the scene up to a scaled
Euclidean transformation. Thus, in this case it is
shown that the assumption of calibrated cameras is
unnecessary, for the cameras may be calibrated at the
same time that the scene is reconstructed.

One unfortunate aspect of the algorithm [Weng-92]
is that 13 line correspondences in three images are
necessary, compared with eight point correspondences
(and with some effort only six, [Hartley-93b]).
Although nothing can be done with two views or fewer
(see [Weng-92]), a counting argument shows that as
few as nine lines in three views may be sufficient,
although it is extremely unlikely that a linear or closed
form algorithm can be found in this case. It is shown
in section 4 of this paper that if four of the lines are
known to lie in a plane, then a linear solution exists
with only nine lines.

2  Preliminaries 

2.1  Notation 

Consider a set of points $\{x_i\}$ as seen in two images.
The set of points $\{x_i\}$ will be visible at image
locations $\{u_i\}$ and $\{u'_i\}$ in the two images. In normal
circumstances, the correspondence $\{u_i\} \leftrightarrow \{u'_i\}$ will
be known, but the location of the original points $\{x_i\}$
will be unknown. Normally, unprimed quantities will
be used to denote data associated with the first image,
whereas primed quantities will denote data associated
with the second image.

Since all vectors are represented in homogeneous
coordinates, their values may be multiplied by any
arbitrary non-zero factor. The notation $\approx$ is used to
indicate equality of vectors or matrices up to
multiplication by a scale factor.

Given a vector, $t = (t_x, t_y, t_z)^T$, it is convenient to
introduce the skew-symmetric matrix

$$[t]_\times = \begin{pmatrix} 0 & -t_z & t_y \\ t_z & 0 & -t_x \\ -t_y & t_x & 0 \end{pmatrix} \qquad (1)$$

This definition is motivated by the fact that for any
vector $v$, we have $[t]_\times v = t \times v$ and $v^T [t]_\times = (v \times t)^T$.
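The two identities above can be checked numerically. The following is a minimal sketch in plain Python; the helper names (`skew`, `cross`, `mat_vec`, `vec_mat`) and the sample vectors are illustrative choices and are not taken from the paper.

```python
def skew(t):
    """Return the 3x3 skew-symmetric matrix [t]_x as nested lists."""
    tx, ty, tz = t
    return [[0, -tz, ty],
            [tz, 0, -tx],
            [-ty, tx, 0]]

def cross(a, b):
    """Ordinary 3-D cross product a x b."""
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]

def mat_vec(m, v):
    """Matrix times column vector."""
    return [sum(m[i][j] * v[j] for j in range(3)) for i in range(3)]

def vec_mat(v, m):
    """Row vector times matrix."""
    return [sum(v[i] * m[i][j] for i in range(3)) for j in range(3)]

# Arbitrary sample vectors (assumed values, for illustration only).
t = [1, 2, 3]
v = [4, 5, 6]

left = mat_vec(skew(t), v)   # [t]_x v, should equal t x v
right = vec_mat(v, skew(t))  # v^T [t]_x, should equal v x t
```

Writing the cross product as a matrix product is what later allows the essential matrix to be expressed as a single matrix expression.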

The notation $A^*$ represents the adjoint of a matrix
$A$, that is, the matrix of cofactors. If $A$ is an invertible
matrix, then $A^* \approx (A^T)^{-1}$.

2.2  Camera  Models 

Nothing will be assumed about the calibration of
the two cameras that create the two images. The
camera model will be expressed in terms of a general
projective transformation from three-dimensional
real projective space, $\mathcal{P}^3$, known as object space, to
the two-dimensional real projective space $\mathcal{P}^2$, known as
image space. The transformation may be expressed in
homogeneous coordinates by a 3 x 4 matrix $P$ known
as a camera matrix and the correspondence between
points in object space and image space is given by
$u_i \approx P x_i$.

For convenience it will be assumed throughout this
paper that the camera placements are not at infinity,
that is, that the projections are not parallel projections.
In this case, a camera matrix may be written
in the form

$$P = (M \mid -Mt)$$

where $M$ is a 3 x 3 non-singular matrix and $t$ is a
column vector $t = (t_x, t_y, t_z)^T$ representing the location
of the camera in object space (see [Hartley-93a]).

2.3  The  Essential  Matrix 

For sets of points viewed from two cameras,
Longuet-Higgins [Higgins-81] introduced a matrix that
has subsequently become known as the essential matrix.
In Longuet-Higgins's treatment, the two cameras
were assumed to be calibrated, meaning that the
internal camera parameters were known. It is not hard
to show (for instance see [Hartley-92]) that most of the
results also apply to uncalibrated cameras of the type
considered in this paper.

The  following  basic  theorem  is  proven  in 
[Higgins-81]. 

Theorem 1. (Longuet-Higgins) Given a set of image
correspondences $\{u_i\} \leftrightarrow \{u'_i\}$ there exists a 3 x 3
real matrix $Q$ such that

$$u_i'^T Q u_i = 0$$

for all $i$.

The matrix $Q$ is called the essential matrix. Next,
we consider the question of determining the essential
matrix given the two camera transformation matrices.
The following result was proven in [Hartley-92].

Proposition 2. The essential matrix corresponding to
a pair of camera matrices $P = (M \mid -Mt)$ and
$P' = (M' \mid -M't')$ is given by

$$Q \approx M'^* M^T [M(t' - t)]_\times .$$

For  a  proof  of  Proposition  2  see  [Hartley-92]. 
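As an illustration of Proposition 2, the sketch below builds $Q$ from two camera matrices of the form $(M \mid -Mt)$ and checks the epipolar constraint of Theorem 1 on a few projected points. It is a minimal check in plain Python with integer arithmetic; the camera values, the point coordinates and all helper names are assumptions made for the example, not data from the paper.

```python
def cofactor(m):
    """Matrix of cofactors of a 3x3 matrix (the adjoint M* used in the text)."""
    return [[m[(i + 1) % 3][(j + 1) % 3] * m[(i + 2) % 3][(j + 2) % 3]
             - m[(i + 1) % 3][(j + 2) % 3] * m[(i + 2) % 3][(j + 1) % 3]
             for j in range(3)] for i in range(3)]

def transpose(m):
    return [list(col) for col in zip(*m)]

def mat_mul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def mat_vec(m, v):
    return [sum(m[i][j] * v[j] for j in range(3)) for i in range(3)]

def skew(t):
    return [[0, -t[2], t[1]], [t[2], 0, -t[0]], [-t[1], t[0], 0]]

def essential(M, t, Mp, tp):
    """Q = M'* M^T [M (t' - t)]_x for cameras (M | -Mt) and (M' | -M't')."""
    baseline = mat_vec(M, [tp[i] - t[i] for i in range(3)])
    return mat_mul(mat_mul(cofactor(Mp), transpose(M)), skew(baseline))

def project(M, t, x):
    """Image of the finite point (x, 1) under the camera (M | -Mt)."""
    return mat_vec(M, [x[i] - t[i] for i in range(3)])

# Two arbitrary non-singular "calibrations" and camera positions (assumed).
M, t = [[2, 0, 1], [0, 3, 1], [0, 0, 1]], [0, 0, 0]
Mp, tp = [[1, 1, 0], [0, 2, 1], [1, 0, 3]], [1, 2, -1]
Q = essential(M, t, Mp, tp)

# u'^T Q u should vanish for every scene point.
points = [[1, 0, 4], [2, -1, 5], [0, 3, 7], [-2, 1, 6]]
residuals = []
for x in points:
    u, up = project(M, t, x), project(Mp, tp, x)
    Qu = mat_vec(Q, u)
    residuals.append(sum(up[i] * Qu[i] for i in range(3)))
```

Because everything is computed in exact integer arithmetic, the residuals are exactly zero rather than merely small.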

2.4  Computing  Lines  in  Space 

Lines in the image plane are represented as 3-vectors.
For instance, a vector $\mathbf{l} = (l, m, n)^T$ represents
the line in the plane given by the equation
$lu + mv + nw = 0$. Similarly, planes in 3-dimensional
space are represented in homogeneous coordinates as
a 4-dimensional vector $\pi = (p, q, r, s)^T$.

The  relationship  between  lines  in  the  image  space 
and  the  corresponding  plane  in  object  space  is  given 
by  the  following  lemma. 

Lemma 3. Let $\lambda$ be a line in $\mathcal{P}^3$ and let the image of $\lambda$
as taken by a camera with transformation matrix $P$ be
$\mathbf{l}$. The locus of points in $\mathcal{P}^3$ that are mapped onto the
image line $\mathbf{l}$ is a plane, $\pi$, passing through the camera
centre and containing the line $\lambda$. It is given by the
formula $\pi = P^T \mathbf{l}$.

Proof. A point $x$ lies on $\pi$ if and only if it is mapped to
a point on the line $\mathbf{l}$ by the action of the transformation
matrix. This means that $Px$ lies on the line $\mathbf{l}$, and so

$$\mathbf{l}^T P x = 0 . \qquad (2)$$

On the other hand, a point $x$ lies on the plane $\pi$ if and
only if $\pi^T x = 0$. Comparing this with (2) leads to the
conclusion that $\pi^T = \mathbf{l}^T P$ or $\pi = P^T \mathbf{l}$ as required.

□
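Lemma 3 can be illustrated with a small numerical check: project two points of a space line, form the image line through their images, back-project it with $\pi = P^T \mathbf{l}$, and confirm that the plane contains both points and the camera centre. This is only a sketch; the camera, the points and the function names are assumed values chosen for the example.

```python
def cross(a, b):
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def camera_matrix(M, t):
    """Build P = (M | -Mt) as a 3x4 nested list."""
    Mt = [sum(M[i][j] * t[j] for j in range(3)) for i in range(3)]
    return [M[i] + [-Mt[i]] for i in range(3)]

def project(P, X):
    """Image of the homogeneous 4-vector X."""
    return [sum(P[i][j] * X[j] for j in range(4)) for i in range(3)]

def back_project_line(P, l):
    """Plane pi = P^T l through the camera centre and the image line l."""
    return [sum(P[i][j] * l[i] for i in range(3)) for j in range(4)]

# Assumed camera (M | -Mt) and two points spanning a space line.
M = [[2, 0, 1], [0, 3, 1], [0, 0, 1]]
t = [1, 2, 3]
P = camera_matrix(M, t)
X1, X2 = [1, 0, 4, 1], [2, -1, 5, 1]

# Image line through the two projected points, then its back-projection.
l = cross(project(P, X1), project(P, X2))
pi = back_project_line(P, l)
centre = t + [1]  # homogeneous coordinates of the camera centre
```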

2.5  Degrees  of  Freedom 

In this section, we compute how many views of a
set of lines are necessary to determine the positions
of the lines in space. Suppose that $n$ unknown lines
are visible in $k$ views with unknown camera matrices.
Suppose that the images of the lines in each of the $k$
views are known. Each line in each view gives rise to
two equations. In particular, suppose $\lambda$ is a line in
$\mathcal{P}^3$ and $\mathbf{l}$ is the image of that line as seen by a camera
with camera matrix $P$. Let $x$ be a point on $\lambda$, then as
shown in (2) $\mathbf{l}^T P x = 0$. Since the line $\lambda$ can be
specified by two points, two independent equations arise.
The total number of equations is therefore equal to
$2nk$.

On the other hand, each line in $\mathcal{P}^3$ has four degrees
of freedom, so up to projectivity, $n$ lines have a total
of $4n - 15$ degrees of freedom, as long as $n \ge 5$.*
Furthermore, each camera matrix has 11 degrees of
freedom. In summary:

# D.O.F. = $4n - 15 + 11k$,

# equations = $2nk$.

To solve for the line locations, we need

$$2nk \ge 4n + 11k - 15 . \qquad (3)$$

In  particular  for  6  lines  at  least  9  views  are  necessary. 
On  the  other  hand,  for  just  3  views,  at  least  9  lines 
are  necessary. 
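The counting argument is easy to tabulate. The sketch below evaluates inequality (3) directly; the function names `min_views` and `min_lines` are hypothetical and introduced only for this illustration.

```python
def min_views(n_lines):
    """Smallest k with 2nk >= 4n + 11k - 15, or None if no k works.

    Assumes n_lines >= 5, where the 4n - 15 count of the text applies.
    """
    # Rearranged: k * (2n - 11) >= 4n - 15, solvable only when 2n > 11.
    if 2 * n_lines <= 11:
        return None
    k = 1
    while 2 * n_lines * k < 4 * n_lines + 11 * k - 15:
        k += 1
    return k

def min_lines(n_views):
    """Smallest n >= 5 satisfying the bound for k views, or None."""
    # Rearranged: n * (2k - 4) >= 11k - 15, solvable only when k > 2.
    if n_views <= 2:
        return None
    n = 5
    while 2 * n * n_views < 4 * n + 11 * n_views - 15:
        n += 1
    return n

views_for_6_lines = min_views(6)   # the text states 9
lines_for_3_views = min_lines(3)   # the text states 9
lines_for_2_views = min_lines(2)   # no number of lines suffices
```

The `None` result for two views agrees with the remark that nothing can be recovered from line correspondences in two views alone.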

Once the lines are known, the camera matrices may
be computed using (2), and the essential matrices of
each pair may be computed using Proposition 2.

The bounds given by (3) are minimum requirements
for the computation of the essential matrices of all the
views. The necessity for at least 9 lines in 3 views just
demonstrated should be compared with section 3 in
which a linear method is given for computing $Q$ from
13 lines in 3 views. Also, compare with section 4 in
which a linear method is given for computing $Q$ under
the assumption that four of the lines are coplanar.

3  Determination of the Essential Matrix from Line Correspondences

This section will investigate the computation of the
essential matrix of an uncalibrated camera from a set
of line correspondences in three views. As discussed
in [Weng-92], no information whatever about camera
placements may be derived from any number of
line-to-line correspondences in two views. In [Weng-92]
the motion and structure problem from line
correspondences is considered. An assumption made in that
paper is that the camera is calibrated, so that a pixel
in each image corresponds to a uniquely specified ray
in space relative to the location and placement of the
camera. It will be shown in this section that this
assumption is not necessary and that in fact the same
approach can be adapted to apply to the computation
of the essential matrix for uncalibrated cameras.

It will be assumed that three different views are
taken of a set of fixed lines in space. That is, it is
assumed that the cameras are moving and the lines
are fixed, which is opposite to the assumption made
in [Weng-92]. It will not even be assumed that the
images are taken with the same camera. Thus the
three cameras are uncalibrated and possibly different.
The notation used in this section will be similar to that
used in [Weng-92]. Since we are now considering three
cameras, the different cameras will be distinguished
using subscripts rather than primes. Consequently,
the three cameras will be represented by matrices

$$(M_0 \mid 0), \quad (M_1 \mid -M_1 t_1) \quad\text{and}\quad (M_2 \mid -M_2 t_2)$$

where $t_1$ and $t_2$ are the positions of the cameras with
respect to the position of the zero-th camera, and $M_i$ is
a non-singular matrix for each $i$. For convenience, the
coordinate system has been chosen so that the origin
is at the position of the zero-th camera, and so $t_0 = 0$.

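Now, consider a line in space passing through a
point $x$ and with direction given by a vector $\ell$. Let $N_i$
be the normal to the plane passing through the center
of the $i$-th camera and the line. Then, $N_i$ is given by
the expression

$$N_i = (x - t_i) \times \ell = x \times \ell - t_i \times \ell .$$

Then for $i = 1, 2$,

$$\begin{aligned}
N_0 \times N_i &= (x \times \ell) \times (x \times \ell - t_i \times \ell) \\
&= -(x \times \ell) \times (t_i \times \ell) \\
&= -\bigl( ((x \times \ell) \cdot \ell)\, t_i - ((x \times \ell) \cdot t_i)\, \ell \bigr) \\
&= (N_0 \cdot t_i)\, \ell
\end{aligned} \qquad (4)$$

However, for $i = 1, 2$,

$$\begin{aligned}
N_i \cdot t_i &= ((x - t_i) \times \ell) \cdot t_i \\
&= (x \times \ell) \cdot t_i - (t_i \times \ell) \cdot t_i \\
&= N_0 \cdot t_i
\end{aligned}$$

Combined with the result of (4) this yields the expression

$$N_0 \times N_i = (N_i \cdot t_i)\, \ell \qquad (5)$$

for $i = 1, 2$. From this it follows, as in [Weng-92], that

$$(N_2 \cdot t_2)\, N_0 \times N_1 = (N_1 \cdot t_1)\, N_0 \times N_2 \qquad (6)$$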
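Identities (5) and (6) hold for any line and any camera positions, which can be confirmed on a small integer example. The sketch below is illustrative only; the point, direction and camera positions are assumed values and the variable names are not from the paper.

```python
def cross(a, b):
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def sub(a, b):
    return [p - q for p, q in zip(a, b)]

def scale(s, a):
    return [s * p for p in a]

# A line through x with direction ell, and three camera positions (t_0 = 0).
x = [1, 2, 3]
ell = [2, -1, 1]
t = {0: [0, 0, 0], 1: [4, 0, 1], 2: [1, 5, -2]}

# N_i = (x - t_i) x ell is the normal of the plane through camera i and the line.
N = {i: cross(sub(x, t[i]), ell) for i in t}

# Equation (5): N_0 x N_i = (N_i . t_i) ell for i = 1, 2.
lhs5 = {i: cross(N[0], N[i]) for i in (1, 2)}
rhs5 = {i: scale(dot(N[i], t[i]), ell) for i in (1, 2)}

# Equation (6): (N_2 . t_2) N_0 x N_1 = (N_1 . t_1) N_0 x N_2.
lhs6 = scale(dot(N[2], t[2]), cross(N[0], N[1]))
rhs6 = scale(dot(N[1], t[1]), cross(N[0], N[2]))
```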

Now, let $n_i$ be the representation in homogeneous
coordinates of the image of the line $\ell$ in the $i$-th view.
According to Lemma 3, $N_i$ is the normal to the plane
$(M_i \mid -M_i t_i)^T n_i$. Consequently,

$$N_i = M_i^T n_i .$$

Applying this to (6) leads to

$$(n_2^T M_2 t_2)(M_0^T n_0 \times M_1^T n_1) = (n_1^T M_1 t_1)(M_0^T n_0 \times M_2^T n_2) \qquad (7)$$

We now state without proof a simple formula
concerning cross products:

Lemma 4. If $M$ is any 3x3 matrix, and $u$ and $v$ are
column vectors, then

$$(Mu) \times (Mv) = M^*(u \times v) . \qquad (8)$$

* As shown in [Hartley-93a], four lines have two degrees of freedom.
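Lemma 4 is stated without proof, so a quick numerical check may be reassuring. The following sketch computes the matrix of cofactors explicitly and compares both sides of (8); the matrix and vectors are arbitrary assumed values and the helper names are illustrative.

```python
def cofactor(m):
    """Matrix of cofactors of a 3x3 matrix, using cyclic index arithmetic."""
    return [[m[(i + 1) % 3][(j + 1) % 3] * m[(i + 2) % 3][(j + 2) % 3]
             - m[(i + 1) % 3][(j + 2) % 3] * m[(i + 2) % 3][(j + 1) % 3]
             for j in range(3)] for i in range(3)]

def cross(a, b):
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]

def mat_vec(m, v):
    return [sum(m[i][j] * v[j] for j in range(3)) for i in range(3)]

# Arbitrary integer test data (assumed values).
M = [[2, 0, 1], [1, 3, 0], [0, 1, 4]]
u = [1, 2, 3]
v = [-1, 0, 2]

lhs = cross(mat_vec(M, u), mat_vec(M, v))   # (Mu) x (Mv)
rhs = mat_vec(cofactor(M), cross(u, v))     # M* (u x v)
```

The identity holds exactly when $M^*$ is the matrix of cofactors; with $(M^T)^{-1}$ in its place it holds only up to the factor $\det M$.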
Applying (8) to each of the two cross products in (7)
leads to

$$M_0^{-1} (n_2^T M_2 t_2)(n_0 \times M_0^* M_1^T n_1) = M_0^{-1} (n_1^T M_1 t_1)(n_0 \times M_0^* M_2^T n_2) . \qquad (9)$$

Now, cancelling $M_0^{-1}$ from each side and combining
the two cross products into one gives

$$n_0 \times \bigl( (n_2^T M_2 t_2)\, M_0^* M_1^T n_1 - (n_1^T M_1 t_1)\, M_0^* M_2^T n_2 \bigr) = 0 . \qquad (10)$$

As in [Weng-92], we write

$$B = (n_2^T M_2 t_2)\, M_0^* M_1^T n_1 - (n_1^T M_1 t_1)\, M_0^* M_2^T n_2$$

then $n_0 \times B = 0$. Now, writing

$$M_0^* M_1^T = \begin{pmatrix} r_1^T \\ r_2^T \\ r_3^T \end{pmatrix}, \quad M_1 t_1 = t, \quad M_0^* M_2^T = \begin{pmatrix} s_1^T \\ s_2^T \\ s_3^T \end{pmatrix}, \quad M_2 t_2 = u \qquad (12)$$

vector $B$ can be written in the form

$$B = \begin{pmatrix} n_1^T (r_1 u^T - t s_1^T) n_2 \\ n_1^T (r_2 u^T - t s_2^T) n_2 \\ n_1^T (r_3 u^T - t s_3^T) n_2 \end{pmatrix} = \begin{pmatrix} n_1^T E n_2 \\ n_1^T F n_2 \\ n_1^T G n_2 \end{pmatrix} \qquad (13)$$

where $E$, $F$ and $G$ are defined by this formula.
Therefore, we have the basic equation

$$n_0 \times \begin{pmatrix} n_1^T E n_2 \\ n_1^T F n_2 \\ n_1^T G n_2 \end{pmatrix} = 0 . \qquad (14)$$

This is essentially the same as equation (2.13) in
[Weng-92], derived here, however, for the case of
uncalibrated cameras. As remarked in [Weng-92], for
each line $\ell$, equation (14) gives rise to two linear
equations in the entries of $E$, $F$ and $G$. Given 13 lines it
is possible to solve for $E$, $F$ and $G$, up to a common
scale factor.

We now define a matrix $Q_{01}$ by

$$Q_{01} = (t \times r_1,\; t \times r_2,\; t \times r_3) .$$

This may be written as $Q_{01} = [t]_\times (r_1, r_2, r_3)$. Then,
we see that

$$Q_{01}^T = -\begin{pmatrix} r_1^T \\ r_2^T \\ r_3^T \end{pmatrix} [t]_\times$$

and in view of the definitions of $r_i$ and $t$ given in (12),
we have

$$Q_{01}^T = -M_0^* M_1^T [M_1 t_1]_\times \approx M_0^* M_1^T [M_1 t_1]_\times$$

from which it follows, using Proposition 2, that $Q_{01}$
is the essential matrix corresponding to the (ordered)
pair of transformation matrices $(M_0 \mid 0)$ and
$(M_1 \mid -M_1 t_1)$.

From the definition of $E = r_1 u^T - t s_1^T$ it follows
that $E^T (t \times r_1) = 0$. If $E$ has rank 2, then $(t \times r_1)$
can be determined up to an unknown scale factor. In
the same way, if $F$ and $G$ have rank 2, then $(t \times r_2)$
and $(t \times r_3)$ can be similarly determined. Since these
three vectors are the columns of the essential matrix
$Q_{01}$, it means that $Q_{01}$ can be determined up to
individually scaling its columns. How to handle the case
where $E$, $F$ or $G$ does not have rank 2 is discussed in
[Weng-92].

Now, by interchanging the roles of the first and
second cameras in this analysis, it is possible to
determine the matrix $Q_{10}$ up to individual scalings of its
columns. However, since $Q_{01} = Q_{10}^T$ the matrix $Q_{01}$
can be determined up to scale.
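The step that recovers a column of $Q_{01}$ from $E$ can be illustrated numerically. Assuming $E = r_1 u^T - t s_1^T$ has rank 2, the null vector of $E^T$ may be obtained as the cross product of two independent columns of $E$, and it should be parallel to $t \times r_1$. The vectors below are assumed values chosen for the example, and the cross-product shortcut stands in for whatever null-space routine an implementation would actually use.

```python
def cross(a, b):
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]

def outer(a, b):
    """Outer product a b^T as a 3x3 nested list."""
    return [[p * q for q in b] for p in a]

def column(m, j):
    return [m[i][j] for i in range(3)]

# Assumed sample values for r_1, t, u and s_1.
r1, t = [1, 2, 0], [0, 1, 3]
u, s1 = [1, 0, 2], [0, 1, 1]

# E = r_1 u^T - t s_1^T
ru, ts = outer(r1, u), outer(t, s1)
E = [[ru[i][j] - ts[i][j] for j in range(3)] for i in range(3)]

# A vector orthogonal to two independent columns of E spans the null space of E^T.
null_vec = cross(column(E, 0), column(E, 1))

# E^T applied to the null vector: one dot product per column of E.
residual = [sum(column(E, j)[i] * null_vec[i] for i in range(3)) for j in range(3)]
target = cross(t, r1)  # the corresponding column of Q_01, up to scale
```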


4  Computation  from  9  lines 

If the lines are known to satisfy certain geometric
constraints, then it is possible to compute the essential
matrix using fewer lines in three views. The general
idea is that if the projective geometry of some plane in
the image can be fixed, then the determination of the
epipolar geometry is simplified. This observation was
applied to the determination of $Q$ from point
correspondences in [Zisserman-Hartley-93]. Instead of
considering the configuration of 9 lines of which four are
coplanar, we consider four points in a plane and five
lines not in the plane. From four lines in a plane it is
easy to identify four points as the intersections of pairs
of lines. Thus, let $x_1, \ldots, x_4$ be four points lying in a
plane $\pi$ in $\mathcal{P}^3$. Let the images of these points as seen
in three images be $u_i$, $u'_i$ and $u''_i$. We suppose for
convenience that the images have been subjected to
appropriate projective transforms so that $u_i = u'_i = u''_i$
for all $i$. Then, a necessary and sufficient condition
for any further point $x$ to lie in the plane $\pi$ is that $x$
projects to the same point in all three images.

This observation may be viewed in a different way.
We may assume that the image planes of the three
images are all identical with the plane $\pi$ itself, since
by an appropriate choice of projective coordinates in
each of the image planes, it may be ensured that the
projective mapping from plane $\pi$ to each of the image
planes is the identity coordinate map. The projective
mapping associated with each camera maps a point $x$
in space to the image point $u$ in which the line through
$x$ and the camera centre pierces the image plane.
Coordinates for $\mathcal{P}^3$ may be chosen so that the plane $\pi$ is
the plane at infinity and the first camera is placed at
the point $(0, 0, 0, 1)^T$. Let the other two cameras be
placed at the points $(a^T, 1)^T$ and $(b^T, 1)^T$. The three
camera transformation matrices are then $P = (I \mid 0)$,
$P' = (I \mid -a)$ and $P'' = (I \mid -b)$. If we can compute
the vectors $a$ and $b$, then the essential matrices can
be computed using Proposition 2.

Now consider a line $\lambda$ in $\mathcal{P}^3$ which does not lie in
the image plane. Let the projections of $\lambda$ with respect
to the three cameras be $\ell$, $\ell'$ and $\ell''$. Since $\lambda$ does not
lie in the image plane, its three images will be distinct
lines. However, lines $\ell$, $\ell'$ and $\ell''$ must all meet at a
common point, namely the point at which $\lambda$ meets the
image plane.

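Given $\ell$, $\ell'$ and $\ell''$ the line $\lambda$ may be retrieved as
the intersection of the three planes defined by each line
and its corresponding camera centre. Each such plane
may be computed explicitly. In particular from (2)
the three planes are equal to

$$P^T \ell = \begin{pmatrix} \ell \\ 0 \end{pmatrix}, \quad P'^T \ell' = \begin{pmatrix} \ell' \\ -\ell'^T a \end{pmatrix} \quad\text{and}\quad P''^T \ell'' = \begin{pmatrix} \ell'' \\ -\ell''^T b \end{pmatrix} .$$

The fact that these three planes meet in a common line
implies that the 4 x 3 matrix

$$A = \begin{pmatrix} \ell & \ell' & \ell'' \\ 0 & -\ell'^T a & -\ell''^T b \end{pmatrix}$$

has rank 2. Hence, there must be a linear dependency
between the columns of $A$.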

As remarked above, the lines $\ell$, $\ell'$ and $\ell''$ are
concurrent, so there is a relationship $\alpha \ell + \beta \ell' + \gamma \ell'' = 0$.
This gives a linear dependency between the columns of
$A$ restricted to its first three rows. Since $\ell$, $\ell'$ and $\ell''$
are known, the weights $\alpha$, $\beta$ and $\gamma$ may be computed
explicitly. Since $A$ has rank 2, this dependency must
also apply to the last row, which means that

$$\beta\, \ell'^T a + \gamma\, \ell''^T b = 0 .$$

This is a single linear equation in the coordinates of
the two vectors $a$ and $b$. Given five such equations,
arising from five lines not lying in the plane $\pi$, it is
possible to solve for $a$ and $b$ up to an unknown (but
insignificant) scale factor.


The above discussion was concerned with the case
in which the plane $\pi$ was defined by four points. Any
other planar object which uniquely defines a projective
basis for the plane may be used just as well, for
example four coplanar lines (as already noted). This
shows that four coplanar lines plus five lines not in
the plane are sufficient (in 3 views) to determine the
essential matrices.

5  Conclusion 

The  two  algorithms  given  above  can  be  used  to  de¬ 
termine  the  essential  matrices  for  the  purposes  of  in¬ 
variant  computation,  scene  reconstruction,  image  rec¬ 
tification  or  some  other  purpose. 

Most interesting would be the case in which we
have four views with the same camera. Then
the cameras can be calibrated and the entire
scene reconstructed up to scaled Euclidean
transform from line correspondences in three views.
In order to implement this method, an efficient
implementation of the calibration algorithm
of Faugeras and Maybank ([Faugeras-Maybank-92a,
Faugeras-Maybank-92b]) would be required. At the
present time, no such implementation is available,
so the calibration method described in this paper
also remains unimplemented. This paper, therefore,
represents a contribution to the theory of
calibration and scene reconstruction. It seems likely,
however, that an efficient implementation of the
algorithms of this paper and [Faugeras-Maybank-92a,
Faugeras-Maybank-92b] will become available in the
future.


Summary of the algorithm  The algorithm for
determining the essential matrices from four coplanar
points and five lines in three images is as follows. We
start with coordinates $u_i$, $u'_i$ and $u''_i$, the images of
the points in the three images, and also $\ell_i$, $\ell'_i$ and $\ell''_i$,
the images of the lines. The steps of the algorithm are
as follows.

1. Determine two-dimensional projective
transformations represented by non-singular 3x3
matrices $K'$ and $K''$ such that for each $i = 1, \ldots, 4$
we have $u_i = K' u'_i = K'' u''_i$.

2. Replace each line $\ell'_i$ by the transformed line
$K'^* \ell'_i$, and each $\ell''_i$ by $K''^* \ell''_i$.

3. For each $i = 1, \ldots, 5$ find coefficients $\alpha_i$, $\beta_i$ and
$\gamma_i$ such that $\alpha_i \ell_i + \beta_i \ell'_i + \gamma_i \ell''_i = 0$.

4. Solve the set of five linear equations
$\beta_i \ell_i'^T a + \gamma_i \ell_i''^T b = 0$ to find the vectors $a$ and $b$, up to
an indeterminate scale.

5. The three essential matrices are $K'^T [a]_\times$,
$K''^T [b]_\times$ and $K''^T [b - a]_\times K'$.
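Steps 3 and 4 of the summary can be sketched on synthetic data. The example below assumes that steps 1 and 2 have already been carried out, so that the three cameras are $(I \mid 0)$, $(I \mid -a)$ and $(I \mid -b)$; it generates the images of five space lines, finds the coefficients $\beta_i$, $\gamma_i$, and solves the resulting $5 \times 6$ homogeneous system exactly using rational arithmetic. The camera centres, the line end-points and the helper `null_space` are illustrative assumptions, not part of the paper.

```python
from fractions import Fraction

def cross(a, b):
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def null_space(rows, ncols):
    """Exact null-space basis of a small matrix via Gauss-Jordan elimination."""
    m = [[Fraction(v) for v in row] for row in rows]
    pivots, r = [], 0
    for c in range(ncols):
        p = next((i for i in range(r, len(m)) if m[i][c] != 0), None)
        if p is None:
            continue
        m[r], m[p] = m[p], m[r]
        m[r] = [v / m[r][c] for v in m[r]]
        for i in range(len(m)):
            if i != r and m[i][c] != 0:
                f = m[i][c]
                m[i] = [v - f * w for v, w in zip(m[i], m[r])]
        pivots.append(c)
        r += 1
        if r == len(m):
            break
    basis = []
    for fc in (c for c in range(ncols) if c not in pivots):
        vec = [Fraction(0)] * ncols
        vec[fc] = Fraction(1)
        for row, pc in zip(m, pivots):
            vec[pc] = -row[fc]
        basis.append(vec)
    return basis

def image_line(x1, x2, centre):
    """Image of the space line through x1 and x2 for the camera (I | -centre)."""
    u1 = [p - c for p, c in zip(x1, centre)]
    u2 = [p - c for p, c in zip(x2, centre)]
    return cross(u1, u2)

# Ground-truth camera centres and five space lines (assumed synthetic data).
a_true, b_true = [1, 2, 0], [0, 1, 3]
space_lines = [([1, 0, 1], [0, 1, 2]), ([2, 1, 0], [1, 1, 1]),
               ([0, 0, 1], [1, 3, 0]), ([3, 0, 2], [0, 2, 0]),
               ([1, 1, 1], [2, 0, 3])]

rows = []
for x1, x2 in space_lines:
    l0 = image_line(x1, x2, [0, 0, 0])
    l1 = image_line(x1, x2, a_true)
    l2 = image_line(x1, x2, b_true)
    # Step 3: (alpha, beta, gamma) spans the null space of the 3x3 matrix [l0 l1 l2].
    alpha, beta, gamma = null_space([[l0[i], l1[i], l2[i]] for i in range(3)], 3)[0]
    # Step 4: one linear equation  beta * l1^T a + gamma * l2^T b = 0.
    rows.append([beta * v for v in l1] + [gamma * v for v in l2])

solutions = null_space(rows, 6)
truth = a_true + b_true
# Fix the free scale by matching the last coordinate of the known answer.
factor = truth[5] / solutions[0][5]
recovered = [factor * v for v in solutions[0]]
```

For this particular data the null space is one-dimensional, so $a$ and $b$ are recovered up to the common scale factor. With noisy measurements an implementation would more plausibly use a least-squares null vector instead of exact elimination.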
Abstract

Recent  work  has  shown  that  a  number  of  vision 
problems  become  simpler  when  one  models 
projection  from  3-D  to  2-D  as  a  non-rigid  lin¬ 
ear  transformation.  This  projection  model  has 
been  used  fruitfully  to  solve  problems  in  mo¬ 
tion  understanding,  search  based  object  recog¬ 
nition,  the  indexing  of  models  in  a  data  base, 
and  in  alignment  types  of  recognition.  These 
results  have  been  largely  restricted,  however, 
to  models  and  scenes  that  consist  only  of  3-D 
points.  In  this  paper  we  show  what  happens 
when  one  attempts  to  extend  these  results  to 
the  somewhat  more  complicated  domain  of  ori¬ 
ented  points.  In  particular,  we  show  how  to 
perform  indexing  with  these  features,  how  to 
derive  structure  from  motion  sequences,  and 
how  to  determine  the  image  of  a  model  from 
a  small  number  of  correspondences.  However, 
we  show  that  all  of  these  problems  become  fun¬ 
damentally  more  complex  with  oriented  point 
features.  More  space  is  required  for  index¬ 
ing,  more  images  are  required  to  derive  struc¬ 
ture,  and  new  views  cannot  be  synthesized  lin¬ 
early  from  old  views.  Moreover,  these  are  not 
just idiosyncrasies of our approach, they are
inherent properties of the problems that we
address.*

1  Introduction 

Recent  work  has  shown  that  a  number  of  vision  prob¬ 
lems  become  simpler  when  one  models  projection  from 
3-D  to  2-D  as  a  non-rigid  linear  transformation.  These 
results  have  been  largely  restricted,  however,  to  mod¬ 
els  and  scenes  that  consist  only  of  3-D  points.  In  this 
paper  we  show  what  happens  when  one  attempts  to  ex¬ 
tend  these  results  to  the  somewhat  more  complicated 
domain  of  oriented  points.  In  particular,  we  show  how 

*  Research  described  in  this  paper  was  conducted  at  the 
MIT  Artificial  Intelligence  Laboratory.  Support  for  this  re¬ 
search  was  provided  in  part  by  the  University  Research  Initia¬ 
tive  under  Office  of  Naval  Research  contract  N00014-86-K- 
0685,  and  in  part  by  the  Advanced  Research  Projects  Agency 
under Army contract DACA76-85-C-0010 and under Office
of  Naval  Research  contract  N00014-85-K-0124. 


to  perform  indexing  with  these  features,  how  to  derive 
structure  from  motion  sequences,  and  how  to  determine 
the  image  of  a  model  from  a  small  number  of  correspon¬ 
dences.  However,  we  show  that  all  of  these  problems  be¬ 
come  fundamentally  more  complex  with  oriented  point 
features. 

In this work, one assumes that a 3-D object is
transformed by an arbitrary affine transformation, followed
by a scaled orthographic projection. Applying an affine
transformation to a set of 3-D points is equivalent to
applying an arbitrary 3x3 matrix to the points, and then
translating them. We will call this a 3-D to 2-D linear
transformation. Standard models of projection assume
that a 3-D object is displaced in a scene with a rigid,
Euclidean motion, and then projected to a 2-D image,
using either perspective or scaled orthographic
projection. Since a 3-D affine transformation is a non-rigid
generalization of 3-D rigid transformations, the linear
projection model includes more standard projections as
a subset.

We focus on three pieces of past work that dealt with the linear
projection of 3-D point features into 2-D image features. First, Ullman
and Basri[15] show that any novel 2-D view of a rigid 3-D structure is a
linear combination of a small number of basis views of the structure.
This means that one can fully represent 3-D structure implicitly, with a
few 2-D views, and use these to predict the appearance of the full
structure based on the location of a few of its points. Second,
Jacobs[5] considers the problem of using an ordered group of 2-D image
points to index into lookup tables where one represents groups of 3-D
model points. This indexing step finds geometrically consistent
matchings between the image and the model. It is shown that one can
optimally perform this indexing by representing each model group with a
pair of lines in two orthogonal index spaces, using an image group to
compute a key into each index space and then intersecting the result of
two table lookups. (Lamdan and Wolfson[9] had previously used linear
projections to devise a more limited 3-D to 2-D indexing method.) Third,
Koenderink and van Doorn[7] show that given two 2-D views of a set of
3-D points, one could infer the 3-D affine structure of those points.
That is, one learns the 3-D structure up to an arbitrary affine
transformation, which is all that one can determine given our projection
model (see Shashua[12] for related results). There has been other work
that capitalizes on the linear projection model, but which is less
directly related to our current results. This includes Roberts'[11]
early method of determining object poses for recognition, and Cass'[3]
and Breuel's[2] recent work in object recognition.

All the work mentioned above is restricted to scenes consisting of 3-D
points. (We use “scenes” or “models” interchangeably to refer to 3-D
configurations of features that might be viewed from arbitrary
directions.) In this paper, we consider scenes that also include 3-D
orientations. By orientation, we mean that one or more 3-D vectors are
associated with some or all of the 3-D points, where only the directions
and not the magnitudes of the vectors are known. For example, if we form
a model from distinguished points on a curve along with their tangent
vectors, we get an orientation vector associated with each point. If we
use the vertices of a polyhedron as our model, the directions of the
lines that form the vertices associate two or three direction vectors
with each point. Oriented points, then, are of practical significance.
They also provide perhaps the simplest generalization of past work. A
different generalization from point features is described by Basri and
Ullman[1], who consider the images produced by solid objects with smooth
surfaces.

We will show that when we use oriented points the problems of image
construction by linear combination, indexing, and affine structure from
motion all become fundamentally more difficult. We show that novel views
of an object are not linear combinations of past views, except in a
trivial sense, although we do show how new views can be reconstructed
nonlinearly based on old views. We show that indexing cannot be done by
representing model groups using a pair of 1-D lines. To perform
indexing, we must represent each group of model features by a 2-D
surface in an index space. We show how to build this representation, but
our results prove that representing groups of oriented point features in
an index table inherently requires much more space than is needed to
represent groups of simple point features. And we show that
correspondences between features in four views of oriented 3-D points
are needed to determine their affine structure, whereas only two views
were needed to derive the structure of simple point features. We also
show how to derive the 3-D structure of oriented point features by
solving a simple set of linear equations.

In addition to presenting these concrete results, a second goal of this
paper is to demonstrate the value of approaching such problems by first
characterizing the set of images that a model can produce. We begin by
providing a simple analytic mapping from a 3-D model to a geometric
structure that represents all the images that the model can produce,
when it is viewed using all possible linear transformations. Given this
mapping, we show how past results concerning simple point features can
be rederived and extended.

2  Descriptions  of  a  Model’s  Images 

We take a geometric approach to the problem of representing a model’s
images. We describe models using manifolds in image space. For our
purposes a manifold may be thought of as just a simple n-dimensional
surface in a higher dimensional space. An image space is just a
particular way of representing an image. If we describe an image using
some parameters then each parameter is a dimension of image space, and
each set of values of these parameters is a point in image space
corresponding to one or more images that are described by this set of
parameters. For example, if our image consists of a set of n 2-D points,
then we can describe each image by the Cartesian coordinates of these
points, $(x_1, y_1, x_2, y_2, \ldots, x_n, y_n)$. These coordinates
describe a 2n-dimensional image space. Suppose then that our models
consist of sets of 3-D points. As we apply all possible transformations
to these models, we produce a large set of images that correspond to a
manifold in our 2n-dimensional image space. We will therefore talk about
a group of model features producing or corresponding to a manifold in
image space. We will generally assume that these groups are ordered,
avoiding the problem of finding correspondences, although canonical
orderings can in some cases be defined (see Clemens and Jacobs[4]).
Representing the manifolds of a group of features is just a method of
representing its potential images, and implicitly, of representing its
3-D structure.^

We can use this geometric approach to solve a number of problems in
which we reason about the 3-D structures that are consistent with one or
more 2-D images. In this view, the problem of determining which scenes
are compatible with a set of images becomes the problem of determining
which manifolds in image space could contain the points that correspond
to these images. We begin by reviewing results presented in Jacobs[5]
for models consisting of 3-D point features. This will provide an
example of our approach, and present results on which we will build.

We first note that there are several equivalent ways of formulating our
projection model. We have already described it as applying an arbitrary
3-D affine transformation to a 3-D model, followed by scaled
orthographic projection. For point features, a 3-D affine transformation
is modeled by applying an arbitrary 3x3 matrix to the points, then
adding an arbitrary 3-D translation vector. This is the method used by
Koenderink and van Doorn[7]. This formulation is equivalent to assuming
that our 3-D scene is projected in parallel, in an arbitrary direction,
onto a 2-D plane, if we then allow the resulting 2-D image to be
transformed by an arbitrary 2-D affine transformation. A 2-D affine
transformation of an image is equivalent to taking a new photograph of
the image, assuming scaled orthographic projection. This formulation is
used in [5], and we will use it here. Both projection methods are
equivalent to applying an arbitrary 3x2 matrix to point features, and
then arbitrarily translating the resulting 2-D points.
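The equivalence stated above can be checked numerically. The following is a minimal sketch, not code from the paper: the helper names and the sample matrix, translation, and point are illustrative assumptions, and the matrix is stored as two rows of three entries (the text's "3x2" matrix in the other orientation). Any scale factor of the scaled orthographic projection is assumed to be absorbed into the matrix.

```python
def mat_vec(m, p):
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(a * x for a, x in zip(row, p)) for row in m]


def affine_then_orthographic(a3, t3, p):
    """Apply a 3-D affine map (3x3 matrix a3, translation t3), then drop z."""
    x, y, _ = [c + t for c, t in zip(mat_vec(a3, p), t3)]
    return [x, y]


def linear_projection(m23, t2, p):
    """Apply a 2-row, 3-column matrix and a 2-D translation directly."""
    return [c + t for c, t in zip(mat_vec(m23, p), t2)]


# Hypothetical sample data, chosen only for illustration.
A3 = [[2.0, 0.5, -1.0], [0.0, 1.5, 0.3], [0.7, -0.2, 1.0]]
T3 = [1.0, -2.0, 5.0]
POINT = [0.4, 1.2, -0.8]

# The first two rows of A3 and the first two entries of T3 give the same image.
via_affine = affine_then_orthographic(A3, T3, POINT)
via_linear = linear_projection(A3[:2], T3[:2], POINT)
```

The third row of the 3x3 matrix and the third translation component never reach the image, which is why six matrix entries and a 2-D translation are enough.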

^It is possible to define image spaces in which models do not correspond
to manifolds, but such representations seem far-fetched, and we will not
consider them here.

There are two parts to describing the images that models can produce,
given these transformations. First, what is the image space? Second,
what manifolds do models correspond to in this image space? If we think
about image formation as resulting from parallel projection followed by
a 2-D affine transformation, then it is
natural to represent images in a way that is invariant under 2-D affine
transformation. This allows us to separate the information in the image
that depends on the model from the information that depends solely on
the viewing transformation. We therefore represent images of point
features as follows. We use the first three points of the image to
define an affine basis. That is, if we denote the image points
$(q_1, q_2, \ldots, q_n)$, let:

$$o = q_1 \qquad u = q_2 - q_1 \qquad v = q_3 - q_1$$

Then we may fully describe the locations of the remaining points using
affine coordinates derived with respect to this basis. For example, we
describe $q_4$ with the parameters $(\alpha_4, \beta_4)$, where:

$$q_4 = o + \alpha_4 u + \beta_4 v$$

It is important to what follows that the affine coordinates of a point
are left unchanged by any affine transform. That is, for any 2x2 matrix,
$A$, and any 2-D translation, $t$, if $q_4 = o + \alpha_4 u + \beta_4 v$
then $A q_4 + t = (A o + t) + \alpha_4 A u + \beta_4 A v$.
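As an illustration of this invariance, here is a minimal sketch that computes the affine coordinates of a fourth image point and recomputes them after an arbitrary 2-D affine map. The function names and sample values are hypothetical, and the basis points are assumed not to be collinear.

```python
def affine_coords(q1, q2, q3, q):
    """Return (alpha, beta) such that q = q1 + alpha*(q2-q1) + beta*(q3-q1)."""
    ux, uy = q2[0] - q1[0], q2[1] - q1[1]
    vx, vy = q3[0] - q1[0], q3[1] - q1[1]
    dx, dy = q[0] - q1[0], q[1] - q1[1]
    det = ux * vy - uy * vx  # zero only when the basis points are collinear
    alpha = (dx * vy - dy * vx) / det
    beta = (ux * dy - uy * dx) / det
    return alpha, beta


def affine_map(a, t, q):
    """Apply the 2-D affine map q -> a*q + t."""
    return (a[0][0] * q[0] + a[0][1] * q[1] + t[0],
            a[1][0] * q[0] + a[1][1] * q[1] + t[1])


# Hypothetical image of four points, and an arbitrary invertible 2x2 matrix.
IMAGE = [(0.0, 0.0), (2.0, 0.5), (0.5, 3.0), (1.7, 2.2)]
A = [[1.3, -0.4], [0.6, 2.0]]
T = (5.0, -1.0)

before = affine_coords(*IMAGE)
after = affine_coords(*[affine_map(A, T, q) for q in IMAGE])
```

Both calls return the same pair up to rounding, since the map sends the basis vectors and the fourth point's offset through the same matrix.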

An image is fully described by the parameters
$(o, u, v, (\alpha_4, \beta_4), \ldots, (\alpha_n, \beta_n))$. Due to
the model of projection we use, we may ignore the first three of these
parameters. To see this, we note that, except in degenerate cases, there
exists an affine transform that will map any three image points to any
other three image points. Therefore if a scene can produce the image
$(o, u, v, (\alpha_4, \beta_4), \ldots, (\alpha_n, \beta_n))$, it can
also produce the image
$(o', u', v', (\alpha_4, \beta_4), \ldots, (\alpha_n, \beta_n))$ for any
choice of $(o', u', v')$, by combining the affine transform that maps
$(o, u, v)$ to $(o', u', v')$ with the affine transform that was part of
the projection that produced the original image. Meanwhile, this affine
transformation will not affect the remaining parameters of the image.
Therefore, the parameters $(o, u, v)$ provide no information about
whether a scene could produce an image.

The remaining image parameters form what we will call an affine space.
An image with n ordered points is mapped into a point in a
$2(n-3)$-dimensional affine space by finding the affine coordinates of
the image points, using the first three as a basis. We divide the affine
space into two orthogonal subspaces, an $\alpha$-space and a
$\beta$-space. The $\alpha$-space is the set of $\alpha$ coordinates of
the image’s affine coordinates, and the $\beta$-space is similarly
defined. The affine space is then equal to the cross product of the
$\alpha$-space and the $\beta$-space, and each image corresponds to a
point in each of these two spaces. The previous paragraph states that
the images that a scene can produce are fully described by the locus of
points these images map to in affine space.

We now extend this image space to provide an affine invariant
representation of oriented point features. To simplify this
representation, we assume that each model contains at least three
oriented points. We then continue to use three image points to define an
affine basis, and describe the points’ orientation vectors using this
basis. Our image consists of points with associated direction vectors.
Without loss of generality we may locate these vectors at the origin
(see figure 1). We describe any additional image points using their
affine coordinates, and we describe each orientation vector by its
affine slope. The affine slope of a vector at the origin is just
$\alpha/\beta$, where $(\alpha, \beta)$ is any point in the direction of
the vector.

Figure 1: The three points shown are used as an affine basis, and the
slopes of the vectors are found in this coordinate system.

It is easily seen from the properties of affine transformations that the
affine slope of a vector is well defined and is invariant under affine
transformations. This representation of vectors is equivalent to an
affine invariant representation derived by Van Gool et al. using
different methods. We use affine slope to define a new image space that
combines point and vector information. We describe an image with the
affine coordinates of any points beyond the first three,
$(\alpha_4, \beta_4, \ldots, \alpha_n, \beta_n)$, and with the affine
slopes of all vectors, which we will call
$(\theta_0, \ldots, \theta_m)$. We call the space defined by these
parameters affine slope space. As before, the problem of determining a
scene’s manifold becomes one of determining the set of affine invariant
values it may produce when viewed from all directions. We may again
ignore the actual position of the first three image points, except in
using them to determine the affine invariant values of the remaining
points and orientation vectors.

We have defined a simple image space for images of point features, and a
slightly more complex space for oriented points. We now describe a
mapping from any 3-D scene to its corresponding geometric shape in these
image spaces. First, we assume that scenes of simple point features
contain at least five points: $p_1, p_2, p_3, p_4, p_5$. Then we can
show that the set of all affine coordinates that such a model can
produce is described by a series of equations:

$$(\alpha_j, \beta_j) = (a_j, b_j) + r_j\,\bigl((\alpha_4, \beta_4) - (a_4, b_4)\bigr) \qquad (1)$$

We have one such equation for each model point beyond the first four.
The values of $r_j$, $a_j$ and $b_j$ are just measurable properties of
the 3-D scene that do not depend on viewpoint at all. $r_j$ is the ratio
of the height of $p_j$ above the plane formed by the first three scene
points to the height of $p_4$ above this plane. And $(a_j, b_j)$ are the
affine coordinates of the projection of $p_j$ down into this plane,
using the first three scene points as an affine basis. Jacobs[5] derives
equation 1.
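Equation 1 can be checked on a synthetic scene. The sketch below is an illustration under stated assumptions, not the derivation in Jacobs[5]: heights are measured along the normal of the base plane (only their ratio matters), and the scene, the projection matrix `M`, and the translation `T` are hypothetical sample values.

```python
def sub(p, q):
    return [a - b for a, b in zip(p, q)]


def dot(p, q):
    return sum(a * b for a, b in zip(p, q))


def cross(p, q):
    return [p[1] * q[2] - p[2] * q[1],
            p[2] * q[0] - p[0] * q[2],
            p[0] * q[1] - p[1] * q[0]]


def scene_invariants(p1, p2, p3, p):
    """Return (a, b, h) with p = p1 + a*e1 + b*e2 + h*n, n normal to the base plane."""
    e1, e2 = sub(p2, p1), sub(p3, p1)
    n = cross(e1, e2)
    d = sub(p, p1)
    h = dot(d, n) / dot(n, n)
    # Solve for the in-plane part with the Gram matrix of (e1, e2).
    g11, g12, g22 = dot(e1, e1), dot(e1, e2), dot(e2, e2)
    r1, r2 = dot(d, e1), dot(d, e2)
    det = g11 * g22 - g12 * g12
    return (r1 * g22 - r2 * g12) / det, (g11 * r2 - g12 * r1) / det, h


def project(m, t, p):
    """Linear projection: a 2-row, 3-column matrix plus a 2-D translation."""
    return (dot(m[0], p) + t[0], dot(m[1], p) + t[1])


def affine_coords(q1, q2, q3, q):
    ux, uy = q2[0] - q1[0], q2[1] - q1[1]
    vx, vy = q3[0] - q1[0], q3[1] - q1[1]
    dx, dy = q[0] - q1[0], q[1] - q1[1]
    det = ux * vy - uy * vx
    return (dx * vy - dy * vx) / det, (ux * dy - uy * dx) / det


# Hypothetical scene of five points and an arbitrary viewing transformation.
SCENE = [[0, 0, 0], [1, 0, 0], [0, 1, 0], [0.3, 0.2, 1.0], [0.8, -0.5, 2.5]]
M = [[1.1, 0.2, 0.7], [-0.3, 0.9, 0.4]]
T = (2.0, 3.0)

image = [project(M, T, p) for p in SCENE]
a4, b4, h4 = scene_invariants(*SCENE[:3], SCENE[3])
a5, b5, h5 = scene_invariants(*SCENE[:3], SCENE[4])
r5 = h5 / h4
alpha4, beta4 = affine_coords(*image[:3], image[3])
alpha5, beta5 = affine_coords(*image[:3], image[4])
# Right-hand side of equation (1) for j = 5.
predicted = (a5 + r5 * (alpha4 - a4), b5 + r5 * (beta4 - b4))
```

Changing `M` or `T` changes `alpha4` and `beta4`, but `predicted` continues to match the measured `(alpha5, beta5)`, since `a5`, `b5` and `r5` depend only on the scene.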

This set of equations describes all images that the scene points may
produce. For any image, this equation will hold. And for any values
described by the equation, there is a corresponding image that the scene
may produce. There is one important degenerate case, in which these
equations do not hold. If the scene points are coplanar, then their
affine coordinates are invariant. In that case the above equation
becomes degenerate; such a scene can produce only one set of affine
coordinates.

Figure 2: We may represent all of a model’s possible images with a pair
of lines in two orthogonal spaces. Each image corresponds to a pair of
points, one in each of these spaces.

Interpreted geometrically, the set of equations (1) describes a 2-D
plane in the affine image space, since $r_j$, $a_j$ and $b_j$ are
constant for a specific scene. This provides a concrete example of our
overall approach. We develop a space such that all images correspond to
points in that space and all scenes correspond to 2-D planes, or in the
degenerate case to points, in that space. The problem of matching groups
of features in one or more images to groups of features in possible
scenes becomes the problem of determining which planes pass through one
or more points in this space.

Taking the $\alpha$ and $\beta$ components of these equations,

$$\alpha_j = a_j + r_j(\alpha_4 - a_4)$$

$$\beta_j = b_j + r_j(\beta_4 - b_4)$$

we have equations that describe a line in $\alpha$-space. We may derive
a similar set of equations in $\beta$-space. These equations are
independent. That is, for any set of $\alpha$ coordinates that a scene
may produce in an image, it may still produce any feasible set of
$\beta$ coordinates. Notice that for any line in $\alpha$-space, there
is some scene whose images are described by that line. It is not true
that there is a scene corresponding to any pair of lines in
$\alpha$-space and $\beta$-space because the parameters $r_j$ are the
same in the equations for the two lines. This means that the two lines
are constrained to have the same directional vector, but they are not
further constrained. So we have further simplified our representation to
one in which models correspond to 1-D lines and images to pairs of
points. This is shown schematically in figure 2.

Now we determine the manifolds that correspond to scenes of oriented
point features in affine slope space. We begin by introducing some
special 3-D points related to our scene. With every orientation vector,
$v_i$, we associate some point, $r_i$, that is in the direction $v_i$
from the origin. We denote the points of the scene by $p_j$. We may then
describe the images that would be produced by the points
$(p_1, p_2, p_3, r_0, \ldots, r_m, p_4, \ldots, p_n)$ with two lines in
$\alpha$ and $\beta$ space, which we call $\alpha^*$ and $\beta^*$.
However, in practice we cannot know the images of the points $r_i$, only
the direction of the vectors to them. We proceed by determining the
images that are produced by these points, and then determining the
affine slope of the vector to a point when only the direction of this
vector is known. The vectors that we can extract from the image will be
in the direction towards the location where the images of these points
would be, however. This tells us that the affine slope of $v_i$’s image,
called $\theta_i$, will equal the $\alpha$ coordinate of $r_i$ divided
by the $\beta$ coordinate of $r_i$. That is,
$\theta_i = \alpha_{i+4} / \beta_{i+4}$. So for any two points on
$\alpha^*$ and $\beta^*$, with coordinates
$(\alpha_4, \alpha_5, \ldots, \alpha_{n+m+1})$ and
$(\beta_4, \beta_5, \ldots, \beta_{n+m+1})$, the scene can produce an
image which is described by the affine invariant parameters:

$$\left(\frac{\alpha_4}{\beta_4}, \ldots, \frac{\alpha_{m+4}}{\beta_{m+4}},\; \alpha_{m+5}, \beta_{m+5}, \ldots, \alpha_{n+m+1}, \beta_{n+m+1}\right)$$

This is the mixture of affine coordinates and affine slopes that we call
affine slope space.

We will now derive a set of equations that describe a scene’s manifold
in this space. We begin by showing how the possible values for
$\theta_i$ that a scene produces can be expressed as a function of
$\theta_0$, $\theta_1$, and characteristics of the scene. First some
notation. We can describe $\alpha^*$ with a parameterized equation of
the form:

$$\alpha^* = a + k v^*$$

where $a$ is any point in $\alpha$ space on the line $\alpha^*$, and has
coordinates that we denote by $(a_0, a_1, \ldots, a_{m+n-3})$. And $v^*$
is a vector in $\alpha$ space that expresses the direction of
$\alpha^*$, and has coordinates $(v_0^*, \ldots, v_{m+n-3}^*)$, and $k$
is a variable. As $k$ varies, we get the points on the line $\alpha^*$.
Similarly, we let:

$$\beta^* = b + c v^*$$

Note that $v^*$ is the same in both equations, because the two lines
must have the same directional vector, as we mentioned earlier.

We want to find the range of values for $(\theta_0, \ldots, \theta_m)$,
where, for a particular choice of $k$ and $c$,

$$\theta_i = \frac{\alpha_{i+4}}{\beta_{i+4}} = \frac{a_i + k v_i^*}{b_i + c v_i^*}$$

That is, any possible set of affine slopes the scene may produce is
found by finding a set of $\alpha$ values that the constructed $r_i$
points can produce, and then dividing these by a set of $\beta$ values
that could be produced. This equation expresses these possible $\theta$
values as a function of the parameters $k$ and $c$, which may vary
arbitrarily.

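We ignore the degenerate cases where $\alpha_4$ and $\beta_4$ are
constant over $\alpha^*$ or $\beta^*$. These are the cases in which the
first scene vector is coplanar with the three scene points, and so its
affine slope does not vary with viewing direction. Then we may choose
for $a$ and $b$ those points on $\alpha^*$ and $\beta^*$ for which
$a_0 = 0$ and $b_0 = 0$. This gives us the equation:

$$\theta_0 = \frac{a_0 + k v_0^*}{b_0 + c v_0^*} = \frac{k}{c}$$

This implies

$$c = \frac{k}{\theta_0}$$

We can use this to get:

$$\theta_1 = \frac{a_1 + k v_1^*}{b_1 + c v_1^*}$$

$$\theta_1 (b_1 \theta_0 + k v_1^*) = a_1 \theta_0 + k v_1^* \theta_0$$

$$k v_1^* (\theta_1 - \theta_0) = a_1 \theta_0 - b_1 \theta_0 \theta_1$$

$$k = \frac{\theta_0 (a_1 - \theta_1 b_1)}{v_1^* (\theta_1 - \theta_0)}$$

So we can express $k$ and $c$ in terms of the first two affine slopes
that we detect in the image and of properties of the scene that
determine the lines $\alpha^*$ and $\beta^*$. Implicitly, we have used
$\theta_0$ and $\theta_1$ to solve for the viewpoint. This allows us to
express each remaining image parameter, both affine slopes and the
$\alpha$ and $\beta$ coordinates that describe other image points, as a
function of these first two affine slopes and properties of the scene.
We find:

$$\theta_i = \frac{a_i + v_i^* k}{b_i + v_i^* c}
= \frac{a_i + \dfrac{v_i^* \theta_0 (a_1 - \theta_1 b_1)}{v_1^* (\theta_1 - \theta_0)}}{b_i + \dfrac{v_i^* (a_1 - \theta_1 b_1)}{v_1^* (\theta_1 - \theta_0)}}$$

$$= \frac{a_i v_1^* (\theta_1 - \theta_0) + v_i^* \theta_0 (a_1 - \theta_1 b_1)}{b_i v_1^* (\theta_1 - \theta_0) + v_i^* (a_1 - \theta_1 b_1)}$$

$$= \frac{-b_1 v_i^* \theta_0 \theta_1 + (a_1 v_i^* - a_i v_1^*) \theta_0 + a_i v_1^* \theta_1}{-b_i v_1^* \theta_0 + (b_i v_1^* - b_1 v_i^*) \theta_1 + a_1 v_i^*}$$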
Note that this derivation fails in a few degenerate cases. In
particular, it will fail if any of the scene points or vectors is
coplanar with the first three scene points, in which case that affine
value is constant. A physically unrealizable solution to the above
equation occurs whenever $\theta_0 = \theta_1 = \theta_i$, in which case
our derivation involved dividing by 0. This reflects the fact that as
our viewpoint becomes closer to the plane of the first three scene
points, the affine slopes all converge to the same value. They never
quite get there, since if we view the scene from a point in that plane,
the first three image points will be collinear, and all the affine
coordinates become undefined. As our viewpoint approaches this plane and
rotates about it, all the affine slopes can approach any common value.

Figure 3: A hyperboloid of one sheet.

So we have an analytic form describing all images of the model. We now
show in the case of 3 points and 3 vectors that this form describes a
2-D hyperboloid in a 3-D image space.

We introduce the following abbreviations:

$$c_1 = a_1 v_2^* \qquad c_2 = a_2 v_1^* \qquad c_3 = b_1 v_2^* \qquad c_4 = b_2 v_1^*$$

$$x = \theta_0 \qquad y = \theta_1 \qquad z = \theta_2$$

We note that $c_1, c_2, c_3, c_4$ are properties of the scene, and that
scenes may be chosen to produce any set of these values. So, the set of
manifolds that can be produced is precisely described by:

$$z = \frac{-c_3 xy + (c_1 - c_2)x + c_2 y}{-c_4 x + (c_4 - c_3)y + c_1}$$

$$-c_4 xz + (c_4 - c_3)yz + c_1 z + c_3 xy - (c_1 - c_2)x - c_2 y = 0 \qquad (2)$$

Adopting the notation of Korn and Korn (pp. 74-76)[8] we find:

$$I = 0 \qquad D = -2 c_3 c_4 (c_4 - c_3) \qquad A = (c_1 c_4 - c_2 c_3)^2 > 0$$
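These invariants can be recomputed from the coefficients of equation (2). The sketch below assumes a particular normalization, in which the off-diagonal entries of the symmetric matrix hold the full coefficients of equation (2) rather than half of them; that choice reproduces the values of $D$ and $A$ quoted above, and a halved convention would rescale them by positive constants without changing their signs. The sample values of $c_1, \ldots, c_4$ are hypothetical.

```python
from fractions import Fraction


def det(m):
    """Determinant by Laplace expansion along the first row (fine for 4x4)."""
    if len(m) == 1:
        return m[0][0]
    total = 0
    for j, entry in enumerate(m[0]):
        minor = [row[:j] + row[j + 1:] for row in m[1:]]
        total += (-1) ** j * entry * det(minor)
    return total


def quadric_matrix(c1, c2, c3, c4):
    """Symmetric 4x4 matrix of equation (2) in the variables (x, y, z, 1).

    Off-diagonal entries are the full coefficients (assumed normalization).
    """
    xy, xz, yz = c3, -c4, c4 - c3
    x, y, z = -(c1 - c2), -c2, c1
    return [[0, xy, xz, x],
            [xy, 0, yz, y],
            [xz, yz, 0, z],
            [x, y, z, 0]]


# Hypothetical scene constants, kept exact with Fraction.
C = [Fraction(3), Fraction(-2), Fraction(5, 2), Fraction(7)]
Q = quadric_matrix(*C)

I_inv = Q[0][0] + Q[1][1] + Q[2][2]          # trace of the quadratic part
D_inv = det([row[:3] for row in Q[:3]])      # determinant of the quadratic part
A_inv = det(Q)                               # determinant of the full matrix
```

For these sample values the three quantities come out as 0, $-2 c_3 c_4 (c_4 - c_3)$ and $(c_1 c_4 - c_2 c_3)^2$, matching the expressions in the text.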

This tells us that when we look at three dimensions of affine slope
space, we find that a model’s manifold is a hyperboloid of one sheet
(see figure 3). That is, the set of affine slopes that three oriented
points produce forms a hyperboloid in the space of possible affine
slopes. We also see that we can find a model corresponding to any
hyperboloid that fits equation 2. For our purposes, we do not need to
consider the degenerate cases in detail.

To  recap  this  section,  we  have  shown  how  to  determine 
the  manifolds  in  image  space  corresponding  to  scenes  of 
both  plain  and  oriented  point  features.  This  allows  us  to 
consider  the  problem  of  matching  models  and  images  by 
considering  an  equivalent  problem  of  matching  points  in 
image  space  to  lines  or  hyperbolic  shapes.  We  now  show 
how  to  use  this  formulation  of  matching  to  solve  real 
problems. 

3  Affine  Structure  from  Motion 

Koenderink and van Doorn[7] and Shashua[12] have noted that two views of
an object can be used to predict additional views, and have applied this
result to motion understanding. Koenderink and van Doorn show that the
affine structure of an object made of 3-D points can be derived from two
views. Affine structure is that part of the object’s geometry that
remains unchanged when we apply an arbitrary affine transformation to
the object’s points. For example, given two views of five corresponding
points, they compute a 3-D affine invariant representation of the fifth
point with respect to the first four. This is similar to the affine
coordinates that we have used above, which are an affine invariant
description of a 2-D object. Then, given the location of the first four
points in a third image, the location of the fifth point may be
determined. This result is particularly significant because it is known
(Ullman) that three views of an object are needed to determine the
object’s rigid structure when images are formed with scaled orthographic
projection.

We can rederive this theorem from our previous results. Determining the
affine structure of a model is equivalent to determining the manifold
that it corresponds to in image space. Recall that the projection model
that we use is equivalent to applying a 3-D affine transformation to a
model, followed by orthographic projection. This means that if two
objects have the same 3-D affine structure, then, and only then, can
they produce exactly the same set of images under our projection model.
There is a one-to-one mapping between affine structure and a model’s
manifold in an affine-invariant image space.

We have shown that a 3-D scene of points corresponds to a pair of lines
in $\alpha$ and $\beta$ space that have the same direction. Given two
images of a model seen from different directions, we can determine two
sets of affine coordinates that the model may produce. That is, we have
two points in $\alpha$ space and two points in $\beta$ space that must
be included in the model’s manifold. We can use these pairs of points to
determine the two lines that must correspond to the model in $\alpha$
and $\beta$ space. In fact, since the two lines have the same
directional vector, it is sufficient to know the affine coordinates of
one image of a scene, and just the $\alpha$ or the $\beta$ values of a
second image to determine the manifold of the scene that produced the
two images. This implicitly tells us the affine structure of the scene,
and everything that we could know about which images it can produce.
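The argument above suggests a simple way to predict a third view of point features. The following is a minimal sketch, assuming exact correspondences and two views whose $\alpha$ coordinates differ; the synthetic scene (triples of $a_j$, $b_j$, $r_j$ for $p_4$, $p_5$, $p_6$) and the view parameters are hypothetical, and the function names are not from the paper.

```python
def view(scene, s, t):
    """Affine coordinates of one view of a synthetic scene.

    Each scene entry is (a_j, b_j, r_j); s and t are the offsets
    alpha_4 - a_4 and beta_4 - b_4 that select a point on each line.
    """
    alphas = [a + r * s for a, _, r in scene]
    betas = [b + r * t for _, b, r in scene]
    return alphas, betas


def predict_third_view(alphas1, betas1, alphas2, alpha4_new, beta4_new):
    """Predict all affine coordinates of a new view from its fourth point.

    Uses one full view plus only the alpha values of a second view.
    """
    # Direction shared by the lines in alpha space and beta space.
    direction = [x2 - x1 for x1, x2 in zip(alphas1, alphas2)]
    k = (alpha4_new - alphas1[0]) / direction[0]
    c = (beta4_new - betas1[0]) / direction[0]
    alphas = [x + k * d for x, d in zip(alphas1, direction)]
    betas = [y + c * d for y, d in zip(betas1, direction)]
    return alphas, betas


# Hypothetical scene: (a_j, b_j, r_j) for p4, p5, p6, with r_4 = 1.
SCENE = [(0.3, 0.2, 1.0), (0.8, -0.5, 2.5), (-0.4, 1.1, -0.7)]

alphas1, betas1 = view(SCENE, 0.5, -0.2)
alphas2, _ = view(SCENE, -1.0, 0.9)           # beta values of view 2 unused
true_alphas, true_betas = view(SCENE, 0.25, 0.6)
pred_alphas, pred_betas = predict_third_view(
    alphas1, betas1, alphas2, true_alphas[0], true_betas[0])
```

Only the fourth point of the third view is supplied; the coordinates of the remaining points follow from the two lines.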

But we may go beyond this, and consider what happens when we try to
extend Koenderink and van Doorn’s result to oriented point features. We
find that four views are needed to determine the affine structure of
oriented points. We consider a model with three points and three
orientation vectors. We know that for any hyperboloid of the following
form:

$$-c_4 xz + (c_4 - c_3)yz + c_1 z + c_3 xy - (c_1 - c_2)x - c_2 y = 0$$

there is a model whose images are described by this hyperboloid, where
$x, y, z$ are the affine slopes of the three image vectors, and
$c_1, c_2, c_3, c_4$ are parameters of the model, which may take on any
values. Determining the affine structure of the model is equivalent to
finding the values of $c_1, c_2, c_3, c_4$. If we do not know these
values, we do not know the set of images that the model can produce and
so we cannot know the model’s affine structure.

Every image of the model gives us a set of values for $x$, $y$, and $z$,
while $c_1$, $c_2$, $c_3$, and $c_4$ remain to be determined. This gives
us one linear equation in four variables. We need four independent
equations to solve for these variables, and hence we need at least four
views of the object to find its affine structure. Given three views of
the scene, there will still be an infinite number of different
hyperboloids that might produce those three images, but that would each
go on to produce a different set of images.
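The linear equation contributed by each view can be written as one row of coefficients multiplying $(c_1, c_2, c_3, c_4)$. The sketch below only builds these rows for four synthetic views and checks that the true scene constants satisfy every one of them; it does not solve the system. The scene parameters and viewpoints are hypothetical, and the scene uses $a_0 = b_0 = 0$ as in Section 2.

```python
def slopes(a, b, v, k, c):
    """Affine slopes (x, y, z) of three orientation vectors for one view."""
    return [(ai + k * vi) / (bi + c * vi) for ai, bi, vi in zip(a, b, v)]


def view_row(x, y, z):
    """Coefficients of (c1, c2, c3, c4) in the hyperboloid equation for one view."""
    return [z - x, x - y, x * y - y * z, y * z - x * z]


# Hypothetical scene with a_0 = b_0 = 0.
A = [0.0, 1.5, -0.7]
B = [0.0, 0.4, 2.0]
V = [1.0, 0.5, -1.2]
# Scene constants c1 = a1*v2, c2 = a2*v1, c3 = b1*v2, c4 = b2*v1.
C = [A[1] * V[2], A[2] * V[1], B[1] * V[2], B[2] * V[1]]

# Four hypothetical viewpoints, each given by a pair (k, c).
VIEWS = [slopes(A, B, V, k, c)
         for k, c in [(0.8, 1.0), (-0.3, 0.6), (1.4, -0.5), (2.0, 0.9)]]
ROWS = [view_row(*s) for s in VIEWS]
RESIDUALS = [sum(r * ci for r, ci in zip(row, C)) for row in ROWS]
```

Each entry of `RESIDUALS` is zero up to rounding, confirming that every view constrains the same four scene constants through a single linear relation.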

This  result  is  easily  extended  to  four  or  more  oriented 
points.  The  affine  structure  of  additional  points  can 
be  found  using  the  method  for  point  features  described 
above.  Additional  orientation  vectors  each  provide  four 
new  unknowns,  and  one  new  equation  for  each  image. 

However,  it  only  takes  three  views  to  determine  the 
rigid  structure  of  four  or  more  oriented  points.  To  com¬ 
pute  this,  we  can  first  use  the  locations  of  the  points  in 
three  views  to  determine  their  3-D  location,  as  shown  by 
Ullman.  This  tells  us  the  3-D  location  of  each  oriented 
point  and  each  viewing  direction,  but  not  the  3-D  direc¬ 
tion  of  the  orientation  vectors.  A  view  of  an  orientation 
vector  at  a  known  3-D  location  restricts  that  vector  to  lie 
in  a  plane.  For  each  orientation  vector,  two  views  tell  us 
two  different  planes  that  include  the  vector.  As  long  as 
our  viewpoint  are  not  identical,  these  planes  intersect 
in  a  line,  which  tells  us  the  direction  of  the  orientation 
vector. 
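
The plane-intersection step can be sketched in a few lines. This is a minimal illustration, assuming orthographic projection, viewing directions that are already known as unit vectors, and an observed orientation given as a 3-D vector lying in the image plane; the function names are hypothetical. Each view constrains the orientation vector to the plane spanned by the viewing direction and the observed image direction, and the cross product of the two plane normals gives the vector's direction up to sign.

```python
import math

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(a):
    n = math.sqrt(dot(a, a))
    if n < 1e-12:
        # Happens when the two constraint planes coincide, or when the
        # vector points straight along a viewing direction.
        raise ValueError("degenerate configuration")
    return tuple(x / n for x in a)

def observe(direction, view):
    """Orthographic image of a 3-D direction seen along unit vector `view`,
    kept in 3-D coordinates: the component perpendicular to `view`."""
    s = dot(direction, view)
    return tuple(d - s * v for d, v in zip(direction, view))

def constraint_plane_normal(image_dir, view):
    """Normal of the plane containing the viewing direction and the image direction."""
    return normalize(cross(view, image_dir))

def recover_direction(image1, view1, image2, view2):
    """Direction (up to sign) of the line where the two constraint planes meet."""
    n1 = constraint_plane_normal(image1, view1)
    n2 = constraint_plane_normal(image2, view2)
    return normalize(cross(n1, n2))

# Hypothetical example: a known direction observed from two viewpoints.
true_dir = normalize((1.0, 2.0, 3.0))
v1 = (0.0, 0.0, 1.0)
v2 = normalize((1.0, 0.0, 1.0))
recovered = recover_direction(observe(true_dir, v1), v1, observe(true_dir, v2), v2)
```

Note that the sketch raises an error when the orientation vector lies in the plane spanned by the two viewing directions, since the two constraint planes then coincide.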

It  might  seem  paradoxical  that  from  three  views  we 
can  determine  the  rigid  structure  of  oriented  points, 
while  we  need  four  views  to  determine  their  affine  struc¬ 
ture.  But  keep  in  mind  that  a  view  of  an  object  provides 
us  with  less  information  about  the  object  if  we  assume 
the  view  was  created  with  a  linear  transformation  than 
if  we  assume  scaled  orthographic  projection. 

This  result  shows  us  how  to  solve  for  the  affine  struc¬ 
ture  of  an  object  of  oriented  points  by  just  solving  a  set 
of  linear  equations,  provided  that  we  have  four  views  of 
the  object.  But  the  need  for  four  views  is  a  significant 
limitation. Koenderink and van Doorn suggested that
affine structure is an intermediate representation that
we can compute using less information than is required
to  determine  rigid  structure.  However,  we  see  that  this 
is  not  always  true. 

4  Indexing 

Indexing  is  a  general  term  often  used  to  describe  any 
object  recognition  process  that  selects  a  small  number 
of  candidate  objects,  or  object  parts,  from  a  large  data 
base of different possibilities on the basis of information
available  in  the  image.  In  one  approach  to  indexing,  we 
preprocess  the  models,  making  entries  in  a  lookup  table 
that  each  point  to  a  portion  of  a  model.  Then  at  run 
time,  we  use  some  set  of  features  in  the  image  to  com¬ 
pute  an  index  into  the  lookup  table.  Indexing  the  table 
at  that  point,  we  find  all  portions  of  all  known  models 
that  could  have  produced  that  particular  group  of  image 
features.  The  indexing  process  enforces  geometric  con¬ 
sistency  between  image  features  and  the  model  features 
to  which  we  match  them. 

The  central  problem  of  indexing  with  a  single  table 
lookup  is  to  find  the  best  way  of  representing  models  as 
manifolds  in  image  space.  For  if  we  look  in  an  index  table 
just  once  for  an  ordered  group  of  image  features,  this 
means  that  the  index  table  is  an  image  space;  each  point 
in  the  table  corresponds  to  one  or  more  images.  If  we 
are  to  avoid  missing  correct  matches,  then  for  each  group 
of  model  features  we  must  make  entries  in  the  table  at 
every  point  that  corresponds  to  a  possible  image  of  those 
model  features.  This  means  that  the  table  represents 
each  group  of  model  features’  manifold  in  image  space 
in  some  discrete  form. 

So  to  perform  indexing  we  would  like  to  have  an  an¬ 
alytically  understood  mapping  from  the  set  of  possible 
groups  of  model  features  to  their  manifolds  in  some  im¬ 
age  space.  Then  we  may  divide  image  space  into  discrete 
buckets,  and  determine  which  buckets  are  intersected  by 
each  model  group’s  manifold.  We  then  place  pointers  to 
the model group in each of these buckets. The amount of
space required by this approach is approximately $Nd^n$,
where  we  represent  N  model  groups  in  the  lookup  table, 
we  divide  each  dimension  of  the  table  into  d  parts,  and 
where  each  manifold  is  n-dimensional.  This  space  can 
be  considerable,  and  therefore  it  is  important  to  find  an 
indexing  method  that  uses  the  lowest-dimensional  man¬ 
ifolds  possible. 
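
To make the growth rate concrete, the estimate of N times d to the power n can be evaluated for a few settings. The numbers below are purely illustrative assumptions (they are not taken from any system discussed here), and `table_entries` is a hypothetical helper name.

```python
def table_entries(num_groups, parts_per_dim, manifold_dim):
    """Approximate number of table entries: N * d**n."""
    return num_groups * parts_per_dim ** manifold_dim

# Illustrative values only: 1000 model groups, 100 parts per dimension.
entries_1d = table_entries(1000, 100, 1)   # 1-D manifolds
entries_2d = table_entries(1000, 100, 2)   # 2-D manifolds
print(entries_1d, entries_2d)
```

Under these assumed values, moving from 1-D to 2-D manifolds multiplies the table size by d, which is why the dimensionality of the manifolds dominates the space cost.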

Our  past  results  show  that  we  can  represent  a  model 
of  point  features  using  two  1-D  lines  in  two  orthogonal 
image  spaces.  This  tells  us  that  we  can  perform  indexing 
using α and β spaces that we have divided into discrete
buckets. For each model group, we then make entries in
those buckets intersected by the line in each space that
corresponds to that model group. Then at run time,
we compute the α and β values that describe an image,
and  use  them  to  index  separately  into  the  two  lookup 
tables,  intersecting  the  results.  The  space  required  for 
this  grows  at  only  a  linear  rate  as  we  discretize  the  spaces 
more  finely.  This  indexing  system  is  implemented  and 
described in [5]. There we also show, based on results
in Clemens and Jacobs[4], that this is the most space-efficient
possible way of representing groups of 3-D model
points in a lookup table.
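
The two-table scheme can be sketched as follows. This is a simplified illustration, not the implementation described in [5]: it assumes 2-D α and β spaces, a fixed extent for each axis, and it finds the buckets crossed by a line by dense sampling rather than analytically. The class and function names, and the two example models, are hypothetical.

```python
from collections import defaultdict

LO, HI = -2.0, 2.0  # assumed extent of each space along every axis

def bucket(point, d):
    """Bucket indices of a point when each axis is cut into d parts,
    or None if the point falls outside the assumed extent."""
    idx = []
    for c in point:
        if not (LO <= c < HI):
            return None
        idx.append(int((c - LO) / (HI - LO) * d))
    return tuple(idx)

def line_buckets(origin, direction, d, t_max=4.0, steps=4000):
    """Buckets met by a line, found by dense sampling (a simplification)."""
    found = set()
    for s in range(steps + 1):
        t = -t_max + 2.0 * t_max * s / steps
        b = bucket([o + t * u for o, u in zip(origin, direction)], d)
        if b is not None:
            found.add(b)
    return found

class TwoTableIndex:
    def __init__(self, d):
        self.d = d
        self.alpha_table = defaultdict(set)
        self.beta_table = defaultdict(set)

    def add_model(self, name, alpha_origin, beta_origin, direction):
        # Assumption: both lines share one direction per model group.
        for b in line_buckets(alpha_origin, direction, self.d):
            self.alpha_table[b].add(name)
        for b in line_buckets(beta_origin, direction, self.d):
            self.beta_table[b].add(name)

    def lookup(self, alpha_point, beta_point):
        """Index each table separately, then intersect the results."""
        ba, bb = bucket(alpha_point, self.d), bucket(beta_point, self.d)
        if ba is None or bb is None:
            return set()
        return self.alpha_table.get(ba, set()) & self.beta_table.get(bb, set())

index = TwoTableIndex(d=40)
index.add_model("A", (0.0, 0.0), (1.0, 0.0), (1.0, 1.0))
index.add_model("B", (0.0, -1.0), (0.0, 1.0), (1.0, -1.0))
```

Each model touches on the order of d buckets per table, so the storage grows linearly as the discretization becomes finer. A query that matches one model in the α table and a different model in the β table returns nothing, which is how the intersection enforces consistency.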


In  this  paper  we  have  shown  an  analytic  mapping  from 
groups  of  oriented  point  features  to  manifolds  in  an  affine 
slope  space.  These  manifolds  are  all  2-D.  We  could  there¬ 
fore  perform  indexing  by  discretely  representing  affine 
slope  space.  We  can  see  that  by  characterizing  the  geo¬ 
metric  structures  that  represent  the  images  that  a  model 
can  produce  we  are  solving  the  primary  theoretical  prob¬ 
lem  that  arises  in  indexing.  It  can  require  a  good  deal 
of  space,  however,  to  discretely  represent  a  2-D  man¬ 
ifold  in  a  lookup  table.  For  example,  Thompson  and 
Mundy[13], Lamdan and Wolfson[9], and Clemens and
Jacobs[4] all build such lookup tables by uniformly sampling
the set of images that a model group can produce,
instead  of  analytically  computing  these  images.  Thomp¬ 
son  and  Mundy,  in  fact,  do  this  for  groups  consisting  of 
pairs  of  vertices.  In  all  three  systems  thousands  of  ta¬ 
ble  entries  are  needed  to  represent  each  group  of  model 
features.  We  might  want  to  represent  many  different  ob¬ 
jects  in  a  lookup  table,  and  a  model  of  each  object  may 
give  rise  to  many  different  subgroups  of  features  that  we 
would  want  to  individually  recognize.  And  in  these  sys¬ 
tems,  the  lookup  tables  are  discretized  fairly  coarsely. 
A finer discretization might be desirable, but this
would  require  even  more  space.  So  it  is  of  considerable 
practical  significance  to  determine  whether  we  can  rep¬ 
resent  model  groups  with  lower-dimensional  manifolds. 

Unfortunately,  we  can  prove  that  this  is  not  possible. 
First  we  note  that  if  we  represent  oriented  point  features 
with  a  single  manifold  in  a  single  image  space,  these  man¬ 
ifolds  must  be  at  least  2-D.  The  proof  of  this  is  given  in 
Jacobs[6],  and  is  essentially  the  same  as  the  proof  given 
for  simple  point  features  in  Clemens  and  Jacobs[4].  Now 
in  the  case  of  point  features  we  were  able  to  decompose 
a  single  affine  space  into  two  orthogonal  subspaces  such 
that  each  model  was  represented  with  two  1-D  mani¬ 
folds.  Our  only  hope  of  reducing  the  space  required  to 
index  oriented  point  features  is  that  we  can  similarly  de¬ 
compose  the  image  space  that  represents  oriented  point 
features  so  that  all  the  2-D  manifolds  are  decomposed 
into  pairs  of  1-D  manifolds.  We  show  this  cannot  be 
done. 

Our  proof  will  assume  that  each  model  contains  at 
least  three  points,  and  three  or  more  vectors.  We  as¬ 
sume  that  any  configuration  of  points  and  vectors  is 
a  possible  model  group.  We  also  assume  a  continuous 
mapping  from  images  to  our  image  space,  and  to  any 
image  subspaces.  This  is  essential  in  any  practical  in¬ 
dexing  scheme,  because  otherwise  a  small  perturbation 
in  an  image,  due  to  error,  could  result  in  large  changes 
in  that  image’s  representation  in  image  space.  We  show 
that  if  there  is  a  decomposition  of  the  image  space  that 
decomposes  all  the  manifolds  in  it,  then  the  kinds  of  in¬ 
tersections  that  can  occur  between  manifolds  must  be 
limited, and that the class of manifolds produced by oriented
point models does not meet these restrictions.

We  will  suppose  the  opposite  of  our  proposition,  that 
image  space  may  be  decomposed  into  two  subspaces, 
such  that  each  2-D  manifold  in  affine  slope  space  that 
corresponds to a model is the cross-product of two 1-D
curves, one in each of the subspaces. Then when two manifolds
intersect in image space, we can determine the
places where they intersect by taking the cross product
of  the  intersections  of  their  1-D  manifolds  in  the  two 
image  subspaces.  Suppose  that  two  models’  manifolds 
intersect  in  image  space  in  a  1-D  curve.  Then  our  de¬ 
composition  of  image  space  must  represent  this  curve  as 
the  cross  product  of  a  1-D  curve  in  one  image  space, 
and  a  point  in  the  second  image  space.  This  means  that 
in  one  of  the  two  image  subspaces,  the  two  1-D  curves 
that  represent  the  two  models  must  overlap,  so  that  their 
intersection  is  also  a  curve  and  not  just  a  point. 

This observation allows us to formulate a plan for deriving
a contradiction. We pick a model group, M, with
manifold H. We then choose a point P on H (that is, P
corresponds to an image of M). We define p and p' as the
points that correspond to P in the first and second image
subspaces respectively. We will construct five new, special
models, $M_1, M_2, M_3, M_4, M_5$. Each of these models’
manifolds will intersect H in a 1-D curve in image space.
We call these curves $K_1, K_2, K_3, K_4, K_5$. Each of these
curves will contain P, by construction. Each curve maps
to a curve in one image subspace, and to a point in the
other. Since there are five curves and only two subspaces,
at least three of them must map to curves in the same
subspace, so we may assume without loss of generality
that $K_1$, $K_2$, and $K_3$ map to the curves $k_1$, $k_2$, and
$k_3$ in the first image subspace, and to the points $r_1$, $r_2$,
and $r_3$ in the second image subspace. Then, in order
for the curves $K_1$, $K_2$, and $K_3$ to all include the point
P, it must be that $r_1 = r_2 = r_3 = p'$, and that $k_1$, $k_2$,
and $k_3$ all intersect at the point p in the first image subspace.
We will call the curve that represents M in the
first image subspace k. $k_1$, $k_2$, and $k_3$ must all lie on k
because they come from the intersection of M and other
models. It is possible that two of these curves intersect
only at p if they end at p, and they occupy portions of
k on opposite sides of p. But with three curves, at least
two (suppose they are $k_1$ and $k_2$) must intersect over
some 1-D portion of k. Since $K_1$ and $K_2$ both map to p' in
the other image subspace, this will tell us that $K_1$ and $K_2$
intersect over some 1-D portion of image space. We will
then derive a contradiction by showing that in fact all of
the curves $K_1, K_2, K_3, K_4, K_5$ intersect each other only
at a single point, P. So, to summarize the steps needed
to complete this proof, we will construct the point P
and the models M, $M_1, M_2, M_3, M_4, M_5$ so that each additional
model’s manifold intersects M’s in a 1-D curve
that includes P. We will then show that these curves
intersect each other only at P, that is, that M and any
two of the other models have only one common image.

For  these  constructions,  we  will  choose  our  models  to 
be  identical  and  planar,  except  for  their  first  three  orien¬ 
tation  vectors.  Therefore,  in  considering  the  intersection 
of  these  models’  manifolds,  we  need  only  consider  their 
intersection in the coordinates $(\theta_0, \theta_1, \theta_2)$, since their remaining
coordinates will always be constant, and will be
the same for each model. Therefore, when we speak of
the coordinates of a point in image space, we will only
consider these three coordinates. And to describe the
values for $(\theta_0, \theta_1, \theta_2)$ that a model can produce, we need
only give the values for $c_1, c_2, c_3, c_4$ that will describe the
model’s hyperboloid in $(\theta_0, \theta_1, \theta_2)$ space.

It  is  easy  to  see  from  equation  2  that,  in  general,  any 
two of these hyperboloids will intersect in a set of 1-D
curves, and any three hyperboloids will intersect only
at points, and in the line $\theta_0 = \theta_1 = \theta_2$, as noted above.
Therefore,  any  general  set  of  six  hyperboloids  chosen  to 
intersect  at  a  common  point  will  fulfill  our  needed  con¬ 
struction. 

We  can  also  prove  this  result  another  way,  which  will 
perhaps  strengthen  the  reader’s  intuitions  about  these 
hyperboloids. Let H be a 3-D hyperboloid, and P be an
arbitrary  point  on  it.  We  derive  a  contradiction  after  as¬ 
suming  that  we  can  decompose  H  into  two  1-D  curves  in 
two  image  spaces.  Suppose  that  P  is  represented  again 
by  the  two  points  p  and  p'  in  the  two  image  spaces. 
Choose  any  other  point  Q  on  H.  Referring  to  equation 
2  we  see  that  knowing  two  points  of  a  hyperboloid  gives 
us  two  linear  equations  in  the  four  unknowns  that  de¬ 
scribe  the  hyperboloid.  Therefore,  we  may  readily  find 
a  second  hyperboloid,  H',  that  also  includes  P  and  Q, 
but  that  does  not  coincide  with  H.  As  noted  above,  in 
general  H  and  H'  intersect  in  a  1-D  curve,  which  must 
correspond  to  a  curve  in  one  image  space,  and  to  either  p 
or  p'  in  the  other.  In  particular,  this  means  that  Q  must 
correspond  to  either  p  or  p'.  Since  Q  is  an  arbitrary 
point,  all  points  on  H  must  correspond  to  either  p  or  p'. 
This  contradicts  our  assumption  that  H  is  represented 
by  the  cross-product  of  two  curves. 

Notice  that  these  proofs  do  not  depend  on  our  choice 
of  affine  slope  space  to  represent  images  of  oriented 
points.  These  proofs  make  use  only  of  the  topology  of 
the  intersection  of  manifolds.  This  topology  will  be  pre¬ 
served  by  any  one-to-one  continuous  mapping,  and  hence 
will  be  present  in  any  continuous  representation  of  im¬ 
ages. 

We  have  therefore  shown  that  our  representation  of 
groups  of  oriented  points  as  2-D  manifolds  is  optimal  in 
terms  of  its  dimensionality.  We  cannot  represent  images 
of  such  models  using  lower-dimensional  manifolds.  This 
places  an  unexpected  lower  bound  on  the  cost  of  index¬ 
ing  groups  of  oriented  point  features.  It  shows  that  it 
requires  considerably  more  space  to  index  them  than  to 
index  simple  point  features.  Again  we  see  that  adding 
orientations  to  our  models  makes  a  basic  vision  task  fun¬ 
damentally  more  difficult. 

5  Recognition  by  Linear  Combinations 

Ullman  and  Basri[15]  show  that  any  image  of  a  model  of 
3-D  points  can  be  expressed  as  a  linear  combination  of 
a  small  set  of  basis  images  of  the  object.  That  is,  given 
a few views of an object, $i_1, \ldots, i_n$, and any new view, $i_j$,
we can find coefficients $a_1, \ldots, a_n$ so that:

$$ i_j = \sum_{k=1}^{n} a_k i_k $$

where  we  multiply  and  sum  images  by  just  multiplying 
and  summing  the  cartesian  coordinates  of  each  image 
point  separately.  This  idea  is  refined  independently  by 
Basri  and  by  Poggio[10]  into  the  following  form. 

Suppose we have a model, m, with n 3-D points. $i_1$
and $i_2$ are two images of m. We describe each image with
cartesian coordinates, and assume there is no translation
in the projection. Let $x_1$ be an n-dimensional vector
containing all of $i_1$’s x coordinates, and let $y_1$ be an
n-dimensional vector of its y coordinates. Similarly, define
$x_2$ and $y_2$ for $i_2$. Take any new image of m, $i_j$, and
define $x_j$ and $y_j$. Then Basri and Poggio show that
there exist $a_0, a_1, a_2$ and $b_0, b_1, b_2$ such that:

$$ x_j = a_0 x_1 + a_1 y_1 + a_2 x_2 $$

$$ y_j = b_0 x_1 + b_1 y_1 + b_2 x_2 $$

This tells us that the x and y coordinates of a new image
are a linear combination of one and a half views of the
object, that is, of the x and y coordinates of one view,
and either the x or the y coordinates of a second view.
Another way to think of this result is that $x_1, y_1, x_2$
span a 3-D linear subspace in $\mathbb{R}^n$ that includes all sets
of x or y coordinates that the object could later produce.
We  omit  a  proof  of  this,  but  note  that  the  proof  is  based 
on  a  linear  transformation. 
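
The one and a half views result can be checked numerically. The sketch below is an illustration under stated assumptions: the model is a set of random 3-D points, each image coordinate vector is produced by one row of a random linear transformation with no translation, and all names are hypothetical. It solves for the three coefficients using only the first three points, then confirms that the same coefficients predict the new view's coordinates for every remaining point.

```python
import random

def solve3(a, b):
    """Solve a 3x3 linear system by Gaussian elimination with partial pivoting."""
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, 3):
            f = m[r][col] / m[col][col]
            for c in range(col, 4):
                m[r][c] -= f * m[col][c]
    x = [0.0, 0.0, 0.0]
    for r in (2, 1, 0):
        s = m[r][3] - sum(m[r][c] * x[c] for c in range(r + 1, 3))
        x[r] = s / m[r][r]
    return x

def project(points, row):
    """Coordinates obtained by applying one row of a linear
    transformation (no translation) to every 3-D point."""
    return [sum(r * c for r, c in zip(row, p)) for p in points]

def combination_coefficients(x1, y1, x2, target):
    """Coefficients expressing `target` as a combination of x1, y1, x2,
    solved from the first three points only."""
    a = [[x1[i], y1[i], x2[i]] for i in range(3)]
    return solve3(a, target[:3])

def residual(coeffs, x1, y1, x2, target):
    """Largest error over all points when the combination predicts `target`."""
    return max(abs(t - (coeffs[0] * u + coeffs[1] * v + coeffs[2] * w))
               for u, v, w, t in zip(x1, y1, x2, target))

rng = random.Random(1)

def rand_vec():
    return [rng.uniform(-1.0, 1.0) for _ in range(3)]

points = [rand_vec() for _ in range(8)]      # hypothetical 3-D model
x1 = project(points, rand_vec())             # x coordinates of view 1
y1 = project(points, rand_vec())             # y coordinates of view 1
x2 = project(points, rand_vec())             # x coordinates of view 2
xj = project(points, rand_vec())             # x coordinates of a new view
yj = project(points, rand_vec())             # y coordinates of a new view

a = combination_coefficients(x1, y1, x2, xj)
b = combination_coefficients(x1, y1, x2, yj)
err_x = residual(a, x1, y1, x2, xj)
err_y = residual(b, x1, y1, x2, yj)
```

The check works because each coordinate vector is a row of the transformation applied to the same points, and three generic rows span all possible rows. A practical system would use a least-squares fit over all points rather than an exact solve on three.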

We  now  show  how  a  similar  result  is  evident  from  our 
work.  When  one  considers  the  affine  coordinates  of  an 
image,  we  have  shown  that  the  set  of  images  a  model  pro¬ 
duces  occupies  a  linear  subspace,  a  2-D  plane,  of  a  larger 
affine  space.  Our  approach  also  implies  a  result  similar 
to  the  one  and  a  half  views  result  described  above.  Given 
the α coordinates of any two views of a model, we may
determine the line in α space that describes all the α
coordinates the model might produce. In fact, any point
on this line is a linear combination of the original two
points used to determine the line. And since the directions
of the α and β lines are the same, if we are given
the α coordinates of two images of a model, and the β
coordinates of one image, we may also determine the line
in β space that describes the model.

We  now  use  our  results  to  show  that  the  linear  com¬ 
binations  work  cannot  be  extended  to  oriented  points. 
To  do  this  it  will  be  sufficient  to  consider  the  case  where 
each  model  consists  of  three  points  and  three  vectors.  If 
the  linear  combinations  result  fails  in  this  case,  then  it 
fails  in  general.  In  this  case,  we  may  represent  a  model’s 
images  with  a  2-D  hyperboloid  in  a  3-D  space.  It  might 
seem  obvious  from  this  that  the  linear  combinations  idea 
will  not  apply.  Given  a  2-D  hyperboloid  in  a  3-D  space, 
it  is  easy  to  pick  four  points  on  the  hyperboloid  that 
span  the  entire  3-D  space.  This  means  that  in  general, 
any  four  images  of  any  model  can  be  linearly  combined 
to  produce  any  possible  image,  and  the  linear  combi¬ 
nations  idea  is  true  only  in  the  trivial  sense  that  with 
enough  images  we  may  express  any  other  image  as  a  lin¬ 
ear  combination  of  those  images. 

However,  things  are  not  this  simple.  Linear  combina¬ 
tions  might  be  true  of  one  representation  of  images,  but 
not  true  of  another.  For  example,  with  point  features 
the  cartesian  coordinates  of  one  image  are  linear  combi¬ 
nations  of  other  images  of  the  same  scene,  but  this  might 
not  be  true  of  polar  coordinates.  So  we  must  prove  that 
the  set  of  all  images  of  a  scene  are  not  a  linear  combi¬ 
nation  of  a  small  number  of  images,  regardless  of  our 
choice  of  representation  for  an  image.  Since  we  know 
that  the  three  basis  points  of  the  image  convey  no  in¬ 
formation  about  the  scene,  the  real  question  is  whether 
some  alternate  representation  of  affine  slope  might  map 
each model’s images into a linear subspace. So we ask
whether there is a continuous, one-to-one mapping from
affine slope space, that is the space defined by $(\theta_0, \theta_1, \theta_2)$,
into another space, such that this mapping takes every
hyperboloid in affine slope space into a linear subspace.
From elementary topology we know that any continuous
one-to-one mapping will map our 3-D affine slope space
into a space that is also 3-D, and that it will map every
2-D hyperboloid into a 2-D surface. So the question is
whether these hyperboloids might map to 2-D planes in
a 3-D space.

To  answer  this,  we  must  look  at  the  particular  set  of 
hyperboloids that correspond to possible models. We assume
that an appropriate mapping exists for linear combinations,
and derive a contradiction. First, we recall
that the line $\theta_0 = \theta_1 = \theta_2$ is part of the equation for each
hyperboloid corresponding to a possible model. Call this
line L. L is a degenerate case; the actual set of images a
model produces does not include L, but it includes images
that are arbitrarily close to L. Suppose we apply a
continuous one-to-one mapping, call it f, that takes one
of these hyperboloids, H, to a 2-D plane, f(H). Then
f(L) is a 1-D curve such that for any point on the curve,
there is a point on f(H) arbitrarily close to that curve
point. This can only happen if f(L) lies on f(H). That
is, we can omit f(L) from a model’s manifold without
problems, but if this manifold is linear, then the requirement
that our representation be continuous tells us that
f(L) must in fact lie in this 2-D plane.

Since L is part of every scene’s hyperboloid, this
means that f(L) must be a 1-D curve at which all scenes’
manifolds intersect, in our new space. If all scenes’ manifolds
are 2-D planes in this new space, they can only
intersect in a line. So f(L) must be a line at which
all scenes’ planes intersect. But this means that no
scenes’ planes can intersect anywhere else in our new
space. However, we have already shown that in general
all the hyperboloids that represent scenes intersect at
other places than the line L. f must preserve these intersections,
so a contradiction is derived. That is, we have
shown that in any space, the manifolds corresponding to
two different scenes of oriented points must intersect in
two distinct curves, which cannot happen if the manifolds
are linear. This tells us that it is never possible
to  represent  the  images  produced  by  a  scene  of  oriented 
points  using  linear  combinations,  except  in  the  trivial 
sense. 

The  implications  of  this  result,  however,  depend  on 
what  one  thinks  is  important  about  the  linear  combina¬ 
tions  result.  If  it  is  the  linearity  of  the  images,  then  our 
result  concerning  oriented  points  is  a  setback.  It  does 
seem  that  part  of  the  impact  of  the  linear  combinations 
work  is  that  the  linearity  of  a  scene’s  images  was  unex¬ 
pected  and  striking.  And  it  is  certainly  true  that  lin¬ 
ear  spaces  can  lead  to  simpler  reasoning  than  non-linear 
ones.  However,  a  significant  part  of  the  importance  of 
the  linear  combinations  work  is  that  it  provides  a  simple 
way  of  characterizing  a  scene’s  possible  images  in  terms 
of  a  small  number  of  images,  without  explicitly  deriv¬ 
ing  3-D  information  about  the  scene.  And  we  may  still 
do  that  with  oriented  points,  as  we  have  shown  in  our 
discussion of affine structure from motion. Our computations
are no longer linear, but we may still derive a
simple  set  of  equations  from  a  few  images  of  oriented 
points  that  characterize  all  other  images  that  could  be 
produced  by  the  scene,  without  explicitly  determining 
this  scene’s  3-D  structure. 

6  Conclusions 

In  this  paper  we  have  shown  how  to  characterize  the 
sets  of  images  that  groups  of  3-D  oriented  point  features 
can  produce  as  simple  geometric  objects,  when  a  lin¬ 
ear  transformation  is  used  as  a  projection  method.  This 
means  that  problems  involving  matching  2-D  images  to 
3-D  scenes  can  be  rephrased  as  the  problem  of  matching 
points  in  a  high-dimensional  space  to  2-D  hyperbolic  sur¬ 
faces  in  that  space.  This  provides  us  with  a  formulation 
of  the  matching  problem  that  can  be  easily  visualized, 
and  that  is  easy  to  reason  about.  Many  problems  can 
be handled very simply, and involve familiar geometric
structures  residing  in  a  3-D  euclidean  space.  At  the  same 
time,  we  have  demonstrated  that  when  we  focus  on  the 
topology  of  this  simple  space,  we  can  derive  results  that 
will  apply  to  any  reasonable  representation  of  a  model’s 
images. 

We  use  these  results  to  place  some  fundamental  limits 
on problems in motion understanding and object recognition.
We show that although affine structure can be
derived simply from a motion sequence, this derivation
requires four images of oriented points. This compares
unfavorably with the two images needed to derive
the  affine  structure  of  simple  point  features,  or  the  three 
images  needed  to  derive  the  rigid  structure  of  simple  or 
oriented  point  features.  Our  result  therefore  undermines 
one  of  the  motivations  for  attempting  to  derive  affine 
structure  instead  of  rigid  structure.  We  also  show  that 
indexing oriented points requires us to represent a 2-D
surface discretely. Some indexing systems have been
built  that  implicitly  represent  2-D  surfaces,  and  that 
build  lookup  tables  through  sampling,  but  these  tend 
to  require  considerable  space.  So  it  is  disappointing  to 
find  that  for  oriented  points  we  cannot  find  a  way  to 
perform  indexing  by  representing  1-D  curves,  as  we  did 
for  simple  points.  Finally,  we  show  that  new  images  of 
oriented  points  cannot  be  constructed  from  a  linear  com¬ 
bination  of  old  images,  except  in  a  trivial  sense.  At  the 
same  time,  we  do  show  how  to  construct  new  images 
from  old  by  solving  a  simple  set  of  equations.  In  fact, 
it  is  not  clear  how  much  of  the  real  value  of  the  linear 
combinations  result  is  lost,  since  for  oriented  points  we 
have  a  simple  method  of  solving  the  same  basic  problem 
that  the  linear  combinations  method  solved  for  simple 
points.  For  each  of  these  problems,  we  have  shown  that 
when  we  add  orientation  vectors  to  points,  a  problem 
becomes  more  difficult  in  an  important  way,  while  at  the 
same time we provide positive solutions to these problems.

Abstract

A  method  for  localization  and  positioning  in 
an  indoor  environment  is  presented.  Localiza¬ 
tion  is  the  act  of  recognizing  the  environment, 
and  positioning  is  the  act  of  computing  the  ex¬ 
act  coordinates  of  a  robot  in  the  environment. 

The  method  is  based  on  representing  the  scene 
as  a  set  of  2D  views  and  predicting  the  ap¬ 
pearances  of  novel  views  by  linear  combina¬ 
tions  of  the  model  views.  The  method  accu¬ 
rately  approximates  the  appearance  of  scenes 
under  weak  perspective  projection.  Analysis 
of  this  projection  as  well  as  experimental  re¬ 
sults  demonstrate  that  in  many  cases  this  ap¬ 
proximation  is  sufficient  to  accurately  describe 
the  scene.  When  weak  perspective  approxima¬ 
tion  is  invalid,  either  a  larger  number  of  models 
can be acquired or an iterative solution to account
for the perspective distortions can be employed.
The same principal method is applied
for  both  the  localization  and  positioning  prob¬ 
lems,  and  a  simple  algorithm  for  repositioning, 
the  task  of  returning  to  a  previously  visited  po¬ 
sition  defined  by  a  single  view,  is  derived  from 
this  method. 

1  Introduction 

Basic  tasks  in  autonomous  robot  navigation  are  localiza¬ 
tion  and  positioning.  Localization  is  the  act  of  recogniz¬ 
ing  the  environment,  that  is,  assigning  consistent  labels 
to  different  locations,  and  positioning  is  the  act  of  com¬ 
puting  the  coordinates  of  the  robot  in  the  environment. 
Positioning  is  a  task  complementary  to  localization,  in 
the  sense  that  position  (e.g.,  “1.5  meters  northwest  of 
table  T”)  is  often  specified  in  a  place-specific  coordinate 
system  (“in  room  911”).  In  this  paper  we  suggest  a 
method  of  both  localization  and  positioning  using  vision 
alone. A variant of the positioning problem, referred
to as repositioning, involving the return to a previously
visited place, is also discussed.

*R. B. was supported in part by the Advanced Research
Projects Agency of the Department of Defense under Office
of Naval Research contract N00014-91-J-4038. E. R. was supported
in part by the Defense Advanced Research Projects
Agency (ARPA Order No. 8459) and the U.S. Army Engineer
Topographic Laboratories under Contract DACA76-92-C-0009.

Ehud Rivlin
Computer Vision Laboratory
Center for Automation Research
University of Maryland
College Park, MD 20742

Previous  studies  have  examined  the  problems  of  lo¬ 
calization  and  positioning  under  a  variety  of  conditions, 
defined  by  the  kind  of  sensors  employed,  the  nature  of 
the  environment,  and  the  representations  used.  We  can 
distinguish  between  active  and  passive  sensing,  indoor 
and  outdoor  navigation  tasks,  and  metric  and  topolog¬ 
ical  representations.  The  metric  approach  attempts  to 
utilize  a  detailed  geometric  description  of  the  environ¬ 
ment,  while  the  topological  approach  uses  a  more  qual¬ 
itative  description  including  a  graph  with  nodes  repre¬ 
senting  places  and  arcs  representing  sequences  of  actions 
that  would  result  in  moving  the  robot  from  one  node  to 
another. 

In  the  paper  we  consider  a  robot  that  uses  a  passive 
sensor,  vision,  in  an  indoor  environment.  The  paper  ad¬ 
dresses  both  the  localization  and  the  positioning  prob¬ 
lems.  Solutions  to  these  problems  are  presented  based 
on  object  recognition  techniques.  The  method,  based  on 
the  linear  combinations  scheme  [18],  represents  scenes 
by sets of their 2D images. Localization is achieved by
comparing  the  observed  image  to  linear  combinations  of 
model  views,  and  the  position  of  the  robot  is  computed 
by  analyzing  the  coefficients  of  the  linear  combination 
that  aligns  the  model  to  the  image. 

The  rest  of  the  paper  is  organized  as  follows.  The 
next  section  describes  the  localization  and  positioning 
problems  and  surveys  previous  solutions.  The  method 
of  localization  and  positioning  using  linear  combinations 
of  model  views  is  described  in  Section  3.  The  method  as¬ 
sumes  weak  perspective  projection.  An  iterative  scheme 
to  account  for  perspective  distortions  is  presented  in  Sec¬ 
tion  4.  An  analysis  of  the  error  resulting  from  the  pro¬ 
jection  assumption  is  presented  in  Section  5.  Constraints 
imposed  on  the  motion  of  the  robot  as  a  result  of  special 
properties  of  indoor  environments  can  be  used  to  reduce 
the  complexity  of  the  method  presented  here.  Experi¬ 
mental  results  follow. 

2  The  Problem 

Localization  and  positioning  from  visual  input  are  de¬ 
fined  in  the  following  way:  Given  a  familiar  environ¬ 
ment,  identify  the  observed  environment,  and  then  find 
your position in that environment. Localization resembles
the task of object recognition, with objects replaced
by scenes. Once localization is accomplished, positioning
can be performed.

One problem a system for localization and positioning
should address is the variability of images due to
viewpoint  changes.  The  inexactness  of  practical  systems 
makes  it  difficult  for  a  robot  to  return  to  a  specified  po¬ 
sition  on  subsequent  visits.  The  visual  data  available 
to  the  robot  between  visits  varies  in  accordance  with 
the  viewing  position  of  the  robot.  A  localization  system 
should  be  able  to  recognize  scenes  from  different  posi¬ 
tions  and  orientations. 

Another  problem  is  that  of  changes  in  the  scene.  At 
subsequent  visits  the  same  place  may  look  different  due 
to  changes  in  the  arrangement  of  the  objects,  the  intro¬ 
duction  of  new  objects,  and  the  removal  of  others.  In 
general,  some  objects  tend  to  be  more  static  than  oth¬ 
ers.  While  chairs  and  books  are  often  moved,  tables, 
closets,  and  pictures  tend  to  change  their  position  much 
less,  and  walls  are  almost  guaranteed  to  be  static.  Static 
cues  naturally  are  more  reliable  than  mobile  ones.  Con¬ 
fining  the  system  to  static  cues,  however,  may  in  some 
cases  result  in  failure  to  recognize  the  scene  due  to  in¬ 
sufficient  cues.  The  system  should  therefore  attempt  to 
rely  on  static  cues,  but  should  not  ignore  the  dynamic 
cues. 

Solutions  to  the  problem  of  localization  from  visual 
data  require  a  large  memory  and  heavy  computation. 
Existing  systems  often  try  to  reduce  this  cost  by  us¬ 
ing  sparse  representations  and  by  exploiting  contextual 
information.  Sparse  representations  are  introduced  in 
[10,  15].  Mataric  [10]  represents  scenes  as  sequences 
of  landmarks  (such  as  walls,  doors,  etc.)  extracted  by 
tracing  the  boundaries  of  the  scene  using  a  sonar  and  a 
compass.  Metric  information  of  and  between  the  land¬ 
marks  is  not  stored.  Sarachik  [15]  recognizes  a  room  by 
its  dimensions,  which  are  measured  by  identifying  and 
locating  the  top  corners  of  the  room  using  stereo  data 
(obtained  from  four  cameras).  In  both  cases  the  repre¬ 
sentation  is  very  sparse,  and  the  scene  is  therefore  often 
ambiguous. 

Richer  representations  are  used  in  [3,  5]  where  higher 
success  rates  are  reported.  Braunegg  [3]  represents  the 
scene  by  an  occupancy  table,  a  2D  bit  array  which  con¬ 
tains  a  1  at  every  location  occupied  by  some  object.  The 
table  is  constructed  by  taking  stereo  pictures  covering 
360° from the middle of the room and projecting the obtained
3D data onto the floor. The method suffers from
loss  of  information  due  to  the  projection  onto  the  floor. 
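A minimal sketch of an occupancy table of this kind is given below. The grid size, cell size, and the `occupancy_table` helper are assumptions made for illustration and are not Braunegg's implementation; the example only shows why projecting onto the floor discards height information.

```python
def occupancy_table(points_3d, cell_size, width, depth):
    """Project 3D points (x, y, height) onto a floor grid of 0/1 cells."""
    table = [[0] * width for _ in range(depth)]
    for x, y, _height in points_3d:
        col, row = int(x // cell_size), int(y // cell_size)
        if 0 <= row < depth and 0 <= col < width:
            table[row][col] = 1  # the height of the point is discarded here
    return table

# Hypothetical data: a table top (height 0.7) and a lamp (height 1.8) above
# the same floor cell, plus one more object elsewhere in the room.
points = [(1.2, 0.4, 0.7), (1.3, 0.45, 1.8), (3.6, 2.1, 0.5)]
table = occupancy_table(points, cell_size=1.0, width=4, depth=3)
```

The two objects at different heights fall into a single occupied cell, so they cannot be told apart in the table.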

Engelson et al. [5] represent the scene by a set of invariant
“signatures”. A signature is usually composed
of  low-resolution  gray-level  or  range  data  obtained  by 
blurring  an  image.  A  set  of  signatures  taken  from  differ¬ 
ent  viewpoints  are  stored.  A  scene  is  recognized  if  the 
robot  encounters  a  signature  similar  to  one  of  the  stored 
signatures. 
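The idea of matching low-resolution signatures can be sketched as follows. Block averaging as the blurring step, the mean absolute difference as the similarity measure, and the threshold value are all assumptions chosen for illustration; they are not the signatures actually used in [5].

```python
def signature(image, block):
    """Average non-overlapping block x block tiles of a gray-level image."""
    rows, cols = len(image) // block, len(image[0]) // block
    return [[sum(image[r * block + i][c * block + j]
                 for i in range(block) for j in range(block)) / block ** 2
             for c in range(cols)] for r in range(rows)]

def distance(sig_a, sig_b):
    """Mean absolute gray-level difference between two signatures."""
    diffs = [abs(a - b) for ra, rb in zip(sig_a, sig_b) for a, b in zip(ra, rb)]
    return sum(diffs) / len(diffs)

def recognise(image, stored, block, threshold):
    """Name of the closest stored signature, or None if none is close enough."""
    sig = signature(image, block)
    name, best = min(((n, distance(sig, s)) for n, s in stored.items()),
                     key=lambda pair: pair[1])
    return name if best <= threshold else None

# Hypothetical 4x4 gray-level images of two rooms
room_a = [[10, 10, 200, 200], [10, 10, 200, 200], [50, 50, 90, 90], [50, 50, 90, 90]]
room_b = [[200, 200, 10, 10], [200, 200, 10, 10], [90, 90, 50, 50], [90, 90, 50, 50]]
stored = {"room_a": signature(room_a, 2), "room_b": signature(room_b, 2)}

# A slightly noisy view of room_a and an unfamiliar, uniformly dark view
observed = [[12, 8, 198, 202], [10, 10, 200, 200], [52, 48, 90, 90], [50, 50, 88, 92]]
unknown = [[0] * 4 for _ in range(4)]
```

Here `recognise(observed, stored, 2, 5.0)` selects the stored view of `room_a`, while the unfamiliar view matches neither signature within the threshold.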

Systems  that  use  the  full  information  provided  by  the 
image  (e.g.,  [6,  12])  usually  rely  on  contextual  informa¬ 
tion  to  avoid  scanning  all  the  models  in  the  memory  and 
to  reduce  the  computational  cost  of  comparing  a  model 
to  the  image.  The  system  follows  a  predetermined  path, 
so that the identity of each visited location is known in
advance, and localization becomes a verification problem.
Path continuity in many cases is essential, and the
so-called “drop-off” problem is not addressed. The emphasis
in these systems is on positioning, which is used
to  keep  the  robot  on  the  path.  It  is  typical  for  these 
systems  (e.g.,  [1,  6,  12])  to  use  a  full  3D  model  of  the 
environment. 

Onoguchi et al. [12], among others, represent the environment
by a set of landmarks selected from pairs of
stereo  images  by  a  human  operator.  These  landmarks 
are  transformed  by  an  image  processing  program  which 
is  designed  so  as  to  identify  the  specific  landmark  using 
specific  extraction  instructions  (such  as  what  features  to 
look for and at what locations). Localization is achieved
by  applying  the  extraction  procedure  specified  for  the 
next  landmark.  Once  a  landmark  is  identified,  the  posi¬ 
tion  of  the  robot  relative  to  that  landmark  is  determined 
by  comparing  the  dimensions  of  the  observed  landmark 
with  those  of  the  stored  model. 

The  method  presented  in  this  paper  represents  the 
environment  using  a  set  of  views  given  as  edge  maps. 
Localization  and  positioning  are  achieved  by  comparing 
images  of  the  environment  to  linear  combinations  of  the 
model  views.  The  method  uses  rich  visual  information 
to  represent  the  scene.  The  system  is  flexible.  In  many 
cases it is capable of recognizing its location from one
image only (360° coverage is not required). When one
image is not sufficient, additional images can be acquired
to  solve  the  localization  problem.  Context  can  be  used 
to  determine  the  order  of  comparison  of  the  models  to 
the  observed  image  and  to  increase  the  confidence  of  a 
given  match,  but  context  is  not  essential:  the  system  can 
also,  by  performing  more  extensive  computations,  solve 
the “drop-off” problem.

3  The  Method 

The problems of localization and object recognition are
similar in many ways. Both problems require the matching
of visual images to stored models, either of the environment
or of the observed objects. Both problems
face  similar  difficulties,  such  as  varying  illumination  con¬ 
ditions  and  changes  in  appearance  due  to  viewpoint 
changes.  Similar  methodologies  therefore  can  be  used 
for  both  problems. 

A  particular  application  of  an  object  recognition 
scheme,  the  Linear  Combinations  (LC)  scheme  [18],  to 
the problems of localization and positioning is discussed
below. The environment is represented in this scheme by
a small set of views obtained from different viewpoints
and  by  the  correspondence  between  the  views.  A  novel 
view  is  recognized  by  comparing  it  to  linear  combinations 
of  the  stored  views.  Positioning  is  achieved  by  recovering 
the  position  of  the  camera  relative  to  its  position  in  the 
model  views  from  the  coefficients  of  the  aligning  linear 
combination.  In  the  rest  of  this  section  we  review  the  lin¬ 
ear  combinations  approach  and  describe  its  application 
to  both  localization  and  positioning. 

3.1  Localization 

The  problem  of  localization  is  defined  as  follows:  given 
$P$, a 2D image of a place, and $\mathcal{M}$, a set of stored models,
find a model $M^* \in \mathcal{M}$ such that $P$ matches $M^*$. A
common  approach  to  handling  the  problem  of  recogni¬ 
tion  from  different  viewpoints  is  by  comparing  the  stored 
models  to  the  observed  environment  after  the  viewpoint 
is  recovered  and  compensated  for.  This  approach,  called 
alignment,  is  used  in  a  number  of  studies  of  object  recog¬ 
nition  [2,  7,  8,  9,  16,  17].  We  apply  the  alignment 
approach  to  the  problem  of  localization.  The  system 
described  below  uses  the  “Linear  Combinations”  (LC) 
scheme, which was suggested by Ullman and Basri [18].

We  begin  with  a  brief  review  of  the  LC  scheme.  LC 
is  defined  as  follows.  Given  an  image,  we  construct  two 
view  vectors  from  the  feature  points  in  the  image,  one 
contains the x-coordinates of the points, and the other
contains the y-coordinates of the points. An object (in
our case, the environment) is modeled by a set of such
views,  where  the  points  in  these  views  are  ordered  in  cor¬ 
respondence.  The  appearance  of  a  novel  view  of  the  ob¬ 
ject  is  predicted  by  applying  linear  combinations  to  the 
stored  views.  The  predicted  appearance  is  then  com¬ 
pared  with  the  actual  image,  and  the  object  is  recog¬ 
nized  if  the  two  match.  The  advantage  of  this  method 
is  twofold.  First,  viewer-centered  representations  are 
used  rather  than  object-centered  ones,  namely,  models 
are  composed  of  2D  views  of  the  observed  scene;  second, 
novel  appearances  are  predicted  in  a  simple  and  accurate 
way  (under  weak  perspective  projection). 

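Formally, given $P$, a 2D image of a scene, and $\mathcal{M}$,
a set of stored models, the objective is to find a model
$M^* \in \mathcal{M}$ such that $P = \sum_{j=1}^{k} \alpha_j M^*_j$ for some constants
$\alpha_j \in \mathcal{R}$, where $M^*_1, \ldots, M^*_k$ are the views that make up
$M^*$. It has been shown that this scheme accurately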
predicts  the  appearance  of  rigid  objects  under  weak  per¬ 
spective  projection  (orthographic  projection  and  scale). 
The  limitations  of  this  projection  model  are  discussed 
later  in  this  paper. 

More concretely, let $p_i = (x_i, y_i, z_i)$, $1 \le i \le n$, be a set
of $n$ object points. Under weak perspective projection,
the positions $p'_i = (x'_i, y'_i)$ of these points in the image are
given by

$$\begin{aligned}
x'_i &= s r_{11} x_i + s r_{12} y_i + s r_{13} z_i + t_x \\
y'_i &= s r_{21} x_i + s r_{22} y_i + s r_{23} z_i + t_y
\end{aligned} \tag{1}$$

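where $r_{ij}$ are the components of a 3 × 3 rotation matrix,
and $s$ is a scale factor. Rewriting this in vector equation
form we obtain

$$\begin{aligned}
\mathbf{x}' &= s r_{11}\mathbf{x} + s r_{12}\mathbf{y} + s r_{13}\mathbf{z} + t_x\mathbf{1} \\
\mathbf{y}' &= s r_{21}\mathbf{x} + s r_{22}\mathbf{y} + s r_{23}\mathbf{z} + t_y\mathbf{1}
\end{aligned} \tag{2}$$

where $\mathbf{x}, \mathbf{y}, \mathbf{z}, \mathbf{x}', \mathbf{y}' \in \mathcal{R}^n$ are the vectors of $x_i$, $y_i$, $z_i$,
$x'_i$ and $y'_i$ coordinates respectively, and $\mathbf{1} = (1, 1, \ldots, 1)$.
Consequently,

$$\mathbf{x}', \mathbf{y}' \in \mathrm{span}\{\mathbf{x}, \mathbf{y}, \mathbf{z}, \mathbf{1}\} \tag{3}$$

or, in other words, $\mathbf{x}'$ and $\mathbf{y}'$ belong to a four-dimensional
linear subspace of $\mathcal{R}^n$. (Notice that $\mathbf{z}'$, the vector of
depth coordinates of the projected points, also belongs
to this subspace. This fact is used in Section 4 below.)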
A  four-dimensional  space  is  spanned  by  any  four  lin¬ 
early  independent  vectors  of  the  space.  Two  views  of 
the scene supply four such vectors [13, 18]. Denote by
$\mathbf{x}_1, \mathbf{y}_1$ and $\mathbf{x}_2, \mathbf{y}_2$ the location vectors of the $n$ points in
the two images; then there exist coefficients $a_1, a_2, a_3, a_4$
and $b_1, b_2, b_3, b_4$ such that

$$\begin{aligned}
\mathbf{x}' &= a_1\mathbf{x}_1 + a_2\mathbf{y}_1 + a_3\mathbf{x}_2 + a_4\mathbf{1} \\
\mathbf{y}' &= b_1\mathbf{x}_1 + b_2\mathbf{y}_1 + b_3\mathbf{x}_2 + b_4\mathbf{1}
\end{aligned} \tag{4}$$

(Note that the vector $\mathbf{y}_2$ already depends on the other
four vectors.) Since $R$ is a rotation matrix, the coefficients
satisfy the following two quadratic constraints:

$$a_1^2 + a_2^2 + a_3^2 - b_1^2 - b_2^2 - b_3^2 = 2(b_1 b_3 - a_1 a_3) r_{11} + 2(b_2 b_3 - a_2 a_3) r_{12} \tag{5}$$

and

$$a_1 b_1 + a_2 b_2 + a_3 b_3 + (a_1 b_3 + a_3 b_1) r_{11} + (a_2 b_3 + a_3 b_2) r_{12} = 0 \tag{6}$$

To  derive  these  constraints  the  transformation  between 
the  two  model  views  should  be  recovered.  This  can 
be  done  under  weak  perspective  using  a  third  image. 
Alternatively,  the  constraints  can  be  ignored,  in  which 
case  the  system  would  confuse  rigid  transformations  with 
affine  ones.  This  usually  does  not  prevent  successful  lo¬ 
calization  since  generally  scenes  are  fairly  different  from 
one  another. 

The  LC  scheme  for  the  problem  of  localization  is  as 
follows.  The  environment  is  modeled  by  a  set  of  im¬ 
ages  with  correspondence  between  the  images.  For  ex¬ 
ample,  a  spot  can  be  modeled  by  two  of  its  correspond¬ 
ing  views.  The  corresponding  quadratic  constraints  may 
also  be  stored.  Localization  is  achieved  by  recovering  the 
linear  combination  that  aligns  the  model  to  the  observed 
image.  The  coefficients  are  determined  using  four  model 
points  and  their  corresponding  image  points  by  solving 
a  linear  set  of  equations.  Three  points  are  sufficient  to 
determine the coefficients if the quadratic constraints are
also  considered.  Additional  points  may  be  used  to  reduce 
the  effect  of  noise. 
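The recovery of the coefficients from four correspondences can be sketched as below. The scene points, the view parameters, and the helper names (`rot`, `view`, `solve`, `predict`) are hypothetical; the sketch ignores the quadratic constraints and assumes noise-free weak perspective views, so it should be read as an illustration of Eq. (4) rather than as the system's implementation.

```python
import math

def rot(ax, ay):
    """Rotation matrix: rotate by ay about the y-axis, then by ax about the x-axis."""
    cx, sx, cy, sy = math.cos(ax), math.sin(ax), math.cos(ay), math.sin(ay)
    Rx = [[1.0, 0.0, 0.0], [0.0, cx, -sx], [0.0, sx, cx]]
    Ry = [[cy, 0.0, sy], [0.0, 1.0, 0.0], [-sy, 0.0, cy]]
    return [[sum(Rx[i][k] * Ry[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def view(points, R, s, tx, ty):
    """Weak perspective view of 3D points: returns the x and y image vectors."""
    xs = [s * sum(R[0][k] * p[k] for k in range(3)) + tx for p in points]
    ys = [s * sum(R[1][k] * p[k] for k in range(3)) + ty for p in points]
    return xs, ys

def solve(A, b):
    """Gaussian elimination with partial pivoting for a small square system."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x

# Hypothetical scene points and three weak perspective views of them
points = [(1.0, 2.0, 5.0), (-2.0, 0.5, 6.0), (0.5, -1.0, 4.0),
          (3.0, 1.0, 7.0), (-1.0, -2.0, 5.5), (2.0, -1.5, 6.5)]
x1, y1 = view(points, rot(0.0, 0.0), 1.0, 0.0, 0.0)     # first model view
x2, _ = view(points, rot(0.1, 0.4), 0.9, 1.0, 0.5)      # second model view
xn, yn = view(points, rot(-0.05, 0.2), 1.1, -0.5, 0.3)  # novel image

# Coefficients a and b of Eq. (4) from the first four correspondences
A = [[x1[i], y1[i], x2[i], 1.0] for i in range(4)]
a = solve(A, xn[:4])
b = solve(A, yn[:4])

def predict(c, i):
    """Linear combination of the model views at point i."""
    return c[0] * x1[i] + c[1] * y1[i] + c[2] * x2[i] + c[3]

# The remaining points were not used in the fit; they verify the alignment
errors = [abs(predict(a, i) - xn[i]) + abs(predict(b, i) - yn[i])
          for i in range(4, 6)]
```

With noisy data more than four correspondences would be used and the square solve replaced by a least-squares fit.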

The  LC  scheme  uses  viewer-centered  models,  that  is, 
representations  that  are  composed  of  images.  It  has  a 
number  of  advantages  over  methods  that  build  full  three- 
dimensional  models  to  represent  the  scene.  First,  by  us¬ 
ing  viewer-centered  models  that  cover  relatively  small 
transformations  we  avoid  the  need  to  handle  occlusions 
in  the  scene.  If  from  some  viewpoints  the  scene  appears 
different  because  of  occlusions  we  utilize  a  new  model 
for  these  viewpoints.  Second,  viewer-centered  models 
are  easier  to  build  and  to  maintain  than  object-centered 
ones. The models contain only images and correspondences.
By limiting the transformation between the
model  images  one  can  find  the  correspondence  using  mo¬ 
tion  methods.  If  large  portions  of  the  environment  are 
changed  between  visits  a  new  model  can  be  constructed 
by  simply  replacing  old  images  with  new  ones. 

One  problem  with  using  the  LC  scheme  for  localization 
is  due  to  the  weak  perspective  approximation.  In  con¬ 
trast  with  the  problem  of  object  recognition,  where  we 
can  generally  assume  that  objects  are  small  relative  to 
their  distance  from  the  camera,  in  localization  the  envi¬ 
ronment  surrounds  the  robot  and  perspective  distortions 
cannot be neglected. The limitations of weak perspective
modeling are discussed both mathematically and empirically
in the next two sections. It is shown that in many
practical  cases  weak  perspective  is  sufficient  to  enable  ac¬ 
curate  localization.  The  main  reason  is  that  the  problem 
of  localization  does  not  require  accurate  measurements 
in the entire image; it only requires identifying a sufficient
number of spots to guarantee accurate naming. If
these  spots  are  relatively  close  to  the  center  of  the  im¬ 
age,  or  if  the  depth  differences  they  create  are  relatively 
small  (as  in  the  case  of  looking  at  a  wall  when  the  line  of 
sight  is  nearly  perpendicular  to  the  wall),  the  perspec¬ 
tive  distortions  are  relatively  small,  and  the  system  can 
identify  the  scene  with  high  accuracy.  Also,  views  re¬ 
lated  by  a  translation  parallel  to  the  image  plane  form  a 
linear  space  even  when  perspective  distortions  are  large. 

By  using  weak  perspective  we  avoid  stability  problems 
that  frequently  occur  in  perspective  computations.  We 
can  therefore  compute  the  alignment  coefficients  by  look¬ 
ing  at  a  relatively  narrow  field  of  view.  The  entire  scheme 
can  be  viewed  as  an  accumulative  process.  Rather  than 
acquiring images of the entire scene and comparing them
all  to  a  full  scene  model  (as  in  [3])  we  recognize  the  scene 
image  by  image,  spot  by  spot,  until  we  accumulate  suffi¬ 
cient  convincing  information  that  indicates  the  identity 
of  the  place. 

When  perspective  distortions  are  relatively  large  and 
weak  perspective  is  insufficient  to  model  the  environ¬ 
ment,  two  approaches  can  be  used.  One  possibility  is 
to  construct  a  larger  number  of  models  so  as  to  keep 
the  possible  changes  between  the  familiar  and  the  novel 
views  small.  Alternatively,  an  iterative  computation  can 
be  applied  to  compensate  for  these  distortions.  Such  an 
iterative  method  is  described  in  Section  4. 

3.2  Positioning 

Positioning  is  the  problem  of  recovering  the  exact  po¬ 
sition  of  the  robot.  This  position  can  be  specified  in 
a  fixed  coordinate  system  associated  with  the  environ¬ 
ment  (i.e.,  room  coordinates),  or  it  can  be  associated 
with  some  model,  in  which  case  location  is  expressed 
with  respect  to  the  position  from  which  the  model  views 
were  acquired.  In  this  section  we  discuss  an  application 
of  the  LC  scheme  to  the  positioning  problem. 

The idea is the following. We assume a model composed
of two images, $P_1$ and $P_2$; their relative position is
given.  Given  a  novel  image  P' ,  we  first  align  the  model 
with  the  image  (i.e.,  localization).  By  considering  the  co¬ 
efficients  of  the  linear  combination  the  robot’s  position 
relative  to  the  model  images  is  recovered.  To  recover  the 
absolute  position  of  the  robot  in  the  room  the  absolute 
positions  of  the  model  views  should  also  be  provided. 

Assuming $P_2$ is obtained from $P_1$ by a rotation $R$,
translation $t = (t_x, t_y)$, and scaling $s$, the coordinates of
a point in $P'$, $(x', y')$, can be written as linear combinations
of the corresponding model points in the following
way:

$$\begin{aligned}
x' &= a_1 x_1 + a_2 y_1 + a_3 x_2 + a_4 \\
y' &= b_1 x_1 + b_2 y_1 + b_3 x_2 + b_4
\end{aligned} \tag{7}$$

Substituting for $x_2$ we derive the position of the
robot in the image relative to its position in the model
views:

$$\begin{aligned}
\Delta x &= a_3 t_x + a_4 \\
\Delta y &= b_3 t_x + b_4 \\
\Delta z &= f\left(\frac{1}{s_n} - \frac{1}{s}\right)
\end{aligned} \tag{8}$$

where

$$s_n^2 = a_1^2 + a_2^2 + a_3^2 s^2 + 2 a_3 s (a_1 r_{11} + a_2 r_{12}) \tag{9}$$

Note that $\Delta z$ is derived from the change in scale of the
object. The rotation of the camera can also be derived
(see details in [14]).
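As a numerical check of the translation and scale terms, the sketch below builds the coefficients for a known transformation and recovers the shift and scale from them. It assumes that $P_1$ is the reference view ($x_1 = X$, $y_1 = Y$), that both $P_2$ and the novel view are rotated about the vertical axis, and that the scale satisfies $s_n^2 = a_1^2 + a_2^2 + a_3^2 s^2 + 2 a_3 s (a_1 r_{11} + a_2 r_{12})$; all parameter values and the `positioning` helper are hypothetical.

```python
import math

def positioning(a, s, r11, r12, t_x):
    """Return (delta_x, s_n) from the x-coefficients (a1, a2, a3, a4)."""
    a1, a2, a3, a4 = a
    delta_x = a3 * t_x + a4  # constant term after substituting for x2
    s_n_sq = a1 ** 2 + a2 ** 2 + (a3 * s) ** 2 + 2 * a3 * s * (a1 * r11 + a2 * r12)
    return delta_x, math.sqrt(s_n_sq)

# Hypothetical set-up: rotation angle, scale and shift of each view w.r.t. P1
alpha, s, t_x = 0.4, 0.9, 1.5            # P1 -> P2
beta, s_true, t_true = 0.25, 1.2, -0.7   # P1 -> novel view P'
r11, r12, r13 = math.cos(alpha), 0.0, math.sin(alpha)
u11, u12, u13 = math.cos(beta), 0.0, math.sin(beta)

# Coefficients obtained by eliminating Z via x2 = s(r11 X + r12 Y + r13 Z) + t_x
a3 = s_true * u13 / (s * r13)
a1 = s_true * u11 - a3 * s * r11
a2 = s_true * u12 - a3 * s * r12
a4 = t_true - a3 * t_x

delta_x, s_n = positioning((a1, a2, a3, a4), s, r11, r12, t_x)
```

The recovered `delta_x` and `s_n` agree with the shift and scale that generated the coefficients; the depth change is not computed here.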

As  was  already  mentioned,  the  position  of  the  robot 
is  computed  here  relative  to  the  position  of  the  camera 
when the first model image, $P_1$, was acquired. $\Delta x$ and
$\Delta z$ represent the motion of the robot from $P_1$ to $P'$, and
the rest of the parameters represent its 3D rotation and
elevation. To obtain the relative position the transformation
parameters between the model views, $P_1$ and $P_2$,
are required.

3.3  Repositioning 

An  interesting  variant  of  the  positioning  problem,  re¬ 
ferred  to  as  repositioning,  is  defined  as  follows.  Given 
an  image,  called  the  target  image,  position  yourself  in 
the location from which this image was observed.¹ One
way  to  solve  this  problem  is  to  extract  the  exact  position 
from  which  the  target  image  was  obtained  and  direct  the 
robot  to  that  position.  In  this  section  we  are  interested 
in  a  more  qualitative  approach.  Under  this  approach 
position  is  not  computed.  Instead,  the  robot  observes 
the  environment  and  extracts  only  the  direction  to  the 
target  location.  Unlike  the  exact  approach,  the  method 
presented  here  does  not  require  the  recovery  of  the  trans¬ 
formation  between  the  model  views. 

We assume we are given a model of the environment
together with a target image. The robot is allowed
to  take  new  images  as  it  is  moving  towards  the  target. 
We  assume  a  horizontally  moving  platform.  (In  other 
words,  we  assume  three  degrees  of  freedom  rather  than 
six;  the  robot  is  allowed  to  rotate  around  the  vertical 
axis  and  translate  horizontally.)  Below  we  give  a  simple 
computation  that  determines  a  path  which  terminates  in 
the  target  location.  At  each  time  step  the  robot  acquires 
a  new  image  and  aligns  it  with  the  model.  By  compar¬ 
ing  the  alignment  coefficients  with  the  coefficients  for  the 
target  image  the  robot  determines  its  next  step.  The  al¬ 
gorithm  is  divided  into  two  stages.  In  the  first  stage  the 
robot  fixates  on  one  identifiable  point  and  moves  along 
a  circular  path  around  the  fixation  point  until  the  line  of 
sight  to  this  point  coincides  with  the  line  of  sight  to  the 
corresponding  point  in  the  target  image.  In  the  second 
stage  the  robot  advances  forward  or  retreats  backward 
until  it  reaches  the  target  location. 

Given a model composed of two images, $P_1$ and $P_2$,
$P_2$ is obtained from $P_1$ by a rotation about the $y$-axis
by an angle $\alpha$, horizontal translation $t_x$, and scale factor
$s$. Given a target image $P_t$, $P_t$ is obtained from $P_1$ by a
similar rotation by an angle $\theta$, translation $t_t$, and scale
$s_t$. Using Eq. (4) the position of a target point $(x_t, y_t)$
can be expressed as

$$\begin{aligned}
x_t &= a_1 x_1 + a_3 x_2 + a_4 \\
y_t &= b_2 y_1
\end{aligned} \tag{10}$$

(The rest of the coefficients are zero since the platform
moves horizontally.) In fact, the coefficients are given by

$$\begin{aligned}
a_1 &= \frac{s_t \sin(\alpha - \theta)}{\sin\alpha} &\qquad a_3 &= \frac{s_t \sin\theta}{s \sin\alpha} \\
a_4 &= t_t - t_x \frac{s_t \sin\theta}{s \sin\alpha} &\qquad b_2 &= s_t
\end{aligned} \tag{11}$$

(The derivation is given in [14].)

¹ This problem can be considered as a variant of the homing
problem. A discussion of the general homing problem
with a “signature-based” solution can be found in [11].

At every time step the robot acquires an image and
aligns it with the above model. Assume that image $P_p$
is obtained as a result of a rotation by an angle $\phi$, translation
$t_p$, and scale $s_p$. The position of a point $(x_p, y_p)$
is expressed by

$$\begin{aligned}
x_p &= c_1 x_1 + c_3 x_2 + c_4 \\
y_p &= d_2 y_1
\end{aligned} \tag{12}$$

where the coefficients are given by

$$\begin{aligned}
c_1 &= \frac{s_p \sin(\alpha - \phi)}{\sin\alpha} &\qquad c_3 &= \frac{s_p \sin\phi}{s \sin\alpha} \\
c_4 &= t_p - t_x \frac{s_p \sin\phi}{s \sin\alpha} &\qquad d_2 &= s_p
\end{aligned} \tag{13}$$

The step performed by the robot is determined by

$$\delta = \frac{c_1}{c_3} - \frac{a_1}{a_3} \tag{14}$$

That is,

$$\delta = s \sin\alpha\,(\cot\phi - \cot\theta) \tag{15}$$

$\delta$ is a monotonic function of the angular difference between
$P_p$ and $P_t$ (the derivation is given in [14]). The
robot should now move so as to reduce the absolute value
of $\delta$. The direction of motion depends on the sign of $\delta$.
The robot can deduce the direction by moving slightly
to the side and checking if this motion results in an increase
or decrease of $\delta$. The motion is defined as follows.
The robot moves to the right (or to the left, depending
on which direction reduces $|\delta|$) by a step $\Delta x$.
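The first stage can be checked numerically with a small sketch. It assumes the coefficient forms $c_1 = s_p \sin(\alpha - \phi)/\sin\alpha$, $c_3 = s_p \sin\phi/(s \sin\alpha)$, $c_4 = t_p - t_x c_3$ (and the same forms with $s_t$, $\theta$, $t_t$ for the target), nonzero rotation angles, and purely illustrative parameter values; `lc_coefficients` and `delta` are hypothetical helper names.

```python
import math

def lc_coefficients(scale, angle, shift, s, alpha, t_x):
    """Coefficients (c1, c3, c4, d2) of a view rotated by `angle` about the vertical axis."""
    c1 = scale * math.sin(alpha - angle) / math.sin(alpha)
    c3 = scale * math.sin(angle) / (s * math.sin(alpha))
    c4 = shift - t_x * c3
    return c1, c3, c4, scale  # the last entry is the y-coefficient

def delta(current, target):
    """Step measure: c1/c3 - a1/a3."""
    return current[0] / current[1] - target[0] / target[1]

# Hypothetical model (P1 -> P2) and target view parameters
s, alpha, t_x = 0.9, 0.5, 1.0
theta, s_t, t_t = 0.3, 1.2, 0.4
target = lc_coefficients(s_t, theta, t_t, s, alpha, t_x)

# Check the target expression on one scene point (X, Z)
X, Z = 2.0, 5.0
x1 = X
x2 = s * (X * math.cos(alpha) + Z * math.sin(alpha)) + t_x
x_t = s_t * (X * math.cos(theta) + Z * math.sin(theta)) + t_t
x_t_from_lc = target[0] * x1 + target[1] * x2 + target[2]

# The measure shrinks in magnitude as the current angle phi approaches theta
phis = (0.10, 0.15, 0.20, 0.25, 0.30)
deltas = [delta(lc_coefficients(0.8, phi, 0.0, s, alpha, t_x), target)
          for phi in phis]
closed_form = s * math.sin(alpha) * (1 / math.tan(phis[0]) - 1 / math.tan(theta))
```

The measure does not depend on the current scale or translation, which is why the robot can steer on it without recovering the transformation between the model views.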

A new image $P_n$ is now acquired, and the fixated point
is located in this image. Denote its new position by
$x_n$. Since the motion is parallel to the image plane the
depth values of the point in the two views, $P_p$ and $P_n$,
are identical. We now want to rotate the camera so as
to return the fixated point to its original position. The
angle of rotation, $\beta$, can be deduced from the equation

$$x_p = x_n \cos\beta + z \sin\beta \tag{16}$$

This equation has two solutions. We choose the one that
counters the translation (namely, if translation is to the
right, the camera should rotate to the left), and that
keeps the angle of rotation small. In the next time step
the new picture $P_n$ replaces $P_p$ and the procedure is
repeated until $\delta$ vanishes. The resulting path is circular
around the point of focus.

Once the robot arrives at a position for which $\delta = 0$
(namely, its line of sight coincides with that of the target
image, and $\phi = \theta$) it should now advance forward or
retreat backward to adjust its position along the line
of sight. Several measures can be used to determine the
direction of motion; one example is the term $c_1/a_1$ which
satisfies

$$\frac{c_1}{a_1} = \frac{s_p}{s_t} \tag{17}$$

when the two lines of sight coincide. The objective at
this stage is to bring this measure to 1.
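The limiting value of this measure can be checked directly. The short derivation below assumes the first-stage coefficient forms $c_1 = s_p \sin(\alpha - \phi)/\sin\alpha$ and $a_1 = s_t \sin(\alpha - \theta)/\sin\alpha$, with $\alpha \neq \theta$.

```latex
\[
\frac{c_1}{a_1}
  = \frac{s_p \sin(\alpha - \phi)/\sin\alpha}{s_t \sin(\alpha - \theta)/\sin\alpha}
  = \frac{s_p}{s_t}\,\frac{\sin(\alpha - \phi)}{\sin(\alpha - \theta)}
  \quad\Longrightarrow\quad
  \left.\frac{c_1}{a_1}\right|_{\phi = \theta} = \frac{s_p}{s_t}
\]
```

Once the lines of sight coincide the measure depends only on the two scales, so moving along the line of sight until it reaches 1 equalizes the scale of the current view with that of the target.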

4  Handling  Perspective  Distortions 

The  linear  combination  scheme  presented  above  accu¬ 
rately  handles  changes  in  viewpoint  assuming  the  images 
are  obtained  under  weak  perspective  projection.  Error 
analysis  and  experimental  results  demonstrate  that  in 
many  practical  cases  this  assumption  is  valid.  In  cases 
where  perspective  distortions  are  too  large  to  be  handled 
by  a  weak  perspective  approximation,  matching  between 
the  model  and  the  image  can  be  facilitated  in  two  ways. 
One  possibility  is  to  avoid  cases  of  large  perspective  dis¬ 
tortion  by  augmenting  the  library  of  stored  models  with 
additional  models.  In  a  relatively  dense  library  there 
usually  exists  a  model  that  is  related  to  the  image  by 
a  sufficiently  small  transformation  avoiding  such  distor¬ 
tions.  The  second  alternative  is  to  improve  the  match 
between  the  model  and  the  image  using  an  iterative  pro¬ 
cess.  In  this  section  we  consider  the  second  option. 

The  suggested  iterative  process  is  based  on  a  Taylor 
expansion  of  the  perspective  coordinates.  As  described 
below,  this  expansion  results  in  a  polynomial  consisting 
of  terms  each  of  which  can  be  approximated  by  linear 
combinations  of  views.  The  first  term  of  this  series  rep¬ 
resents  the  orthographic  approximation.  The  process  re¬ 
sembles  a  method  of  matching  3D  points  with  2D  points 
described recently by DeMenthon and Davis [4]. In this
case,  however,  the  method  is  applied  to  2D  models  rather 
than  3D  ones.  In  our  application  the  3D  coordinates  of 
the  model  points  are  not  provided;  instead  they  are  ap¬ 
proximated  from  the  model  views. 

An image point $(x, y) = (fX/Z, fY/Z)$ is the projection
of some object point $(X, Y, Z)$ in the image, where
$f$ denotes the focal length. Consider the following Taylor
expansion of $F(Z) = 1/Z$ around some depth value
$Z_0$:

$$F(Z) = \sum_{k=0}^{\infty} \frac{F^{(k)}(Z_0)}{k!} (Z - Z_0)^k \tag{18}$$

The Taylor series describing the position of a point $x =
fX/Z$ therefore is given by

$$x = \frac{fX}{Z_0}\left[1 + \sum_{k=1}^{\infty} \frac{(-1)^k}{(k-1)!}\left(\frac{Z - Z_0}{Z_0}\right)^k\right] \tag{19}$$

Notice that the zero term contains the orthographic approximation
for $x$. Denote by $\Delta^{(k)}$ the $k$th term of the
series:

$$\Delta^{(k)} = \frac{fX\,(-1)^k}{Z_0\,(k-1)!}\left(\frac{Z - Z_0}{Z_0}\right)^k \tag{20}$$

A recursive definition of the above series is given below.

Initialization:

$$x^{(0)} = \Delta^{(0)} = \frac{fX}{Z_0}$$

Iterative step:

$$\begin{aligned}
\Delta^{(k)} &= -\frac{Z - Z_0}{(k-1) Z_0}\,\Delta^{(k-1)} \\
x^{(k)} &= x^{(k-1)} + \Delta^{(k)}
\end{aligned}$$

where $x^{(k)}$ represents the $k$th order approximation for $x$,
and $\Delta^{(k)}$ represents the highest order term in $x^{(k)}$.

According to the orthographic approximation both $X$
and $Z$ can be expressed as linear combinations of the
model views (Eq. (4)). We therefore apply the above
procedure, approximating $X$ and $Z$ at every step using
the linear combination that best aligns the model points
with the image points. The general idea is therefore the
following. First, we estimate $x^{(0)}$ and $\Delta^{(0)}$ by solving the
orthographic case. Then at each step of the iteration we
improve the estimate by seeking the linear combination
that best estimates the factor

$$-\frac{Z - Z_0}{(k-1) Z_0} \approx \frac{x - x^{(k-1)}}{\Delta^{(k-1)}} \tag{21}$$

Denote by $\mathbf{x} \in \mathcal{R}^n$ the vector of image point coordinates,
and denote by

$$P = [\mathbf{x}_1, \mathbf{y}_1, \mathbf{x}_2, \mathbf{1}] \tag{22}$$

an $n \times 4$ matrix containing the position of the points in
the two model images. Denote by $P^+ = (P^T P)^{-1} P^T$ the
pseudo-inverse of $P$ (we assume $P$ is overdetermined).
Also denote by $\mathbf{a}^{(k)}$ the coefficients computed for the $k$th
step. $P\mathbf{a}^{(k)}$ represents the linear combination computed
at that step to approximate the $X$ or the $Z$ values. Since
at every step $Z_0$, $f$, and $k$ are constant they can be
merged into the linear combination. Denote by $\mathbf{x}^{(k)}$ and
$\mathbf{\Delta}^{(k)}$ the vectors of computed values of $x$ and $\Delta$ at the
$k$th step. An iterative procedure to align a model to the
image is described below.

Initialization:

Solve the orthographic approximation, namely

$$\begin{aligned}
\mathbf{a}^{(0)} &= P^+ \mathbf{x} \\
\mathbf{x}^{(0)} &= \mathbf{\Delta}^{(0)} = P\mathbf{a}^{(0)}
\end{aligned}$$

Iterative step:

$$\begin{aligned}
\mathbf{q}^{(k)} &= (\mathbf{x} - \mathbf{x}^{(k-1)}) \div \mathbf{\Delta}^{(k-1)} \\
\mathbf{a}^{(k)} &= P^+ \mathbf{q}^{(k)} \\
\mathbf{\Delta}^{(k)} &= (P\mathbf{a}^{(k)}) \otimes \mathbf{\Delta}^{(k-1)} \\
\mathbf{x}^{(k)} &= \mathbf{x}^{(k-1)} + \mathbf{\Delta}^{(k)}
\end{aligned}$$

where the vector operations $\otimes$ and $\div$ are defined as

$$\begin{aligned}
\mathbf{u} \otimes \mathbf{v} &= (u_1 v_1, \ldots, u_n v_n) \\
\mathbf{u} \div \mathbf{v} &= \left(\frac{u_1}{v_1}, \ldots, \frac{u_n}{v_n}\right)
\end{aligned}$$

    x \ ε      5       10      15      20
     25       1.2     1.4     1.6     1.8
     50       1.1     1.2     1.3     1.4
     75       1.07    1.13    1.2     1.27
    100       1.05    1.1     1.15    1.2

Table 1: Allowed depth ratios, as a function of x
(half the width of the field considered) and the error
allowed (ε, in pixels).
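The iterative alignment procedure of Section 4 can be sketched in a few lines. The sketch below follows the initialization and iterative step literally, with two additions that are assumptions of this example and not part of the method as stated: the loop stops early once the residual is below `tol`, and entries whose previous term is smaller than `eps` are skipped to avoid dividing by zero. The pseudo-inverse is applied through the normal equations, and all helper names and data are hypothetical.

```python
def solve(A, b):
    """Gaussian elimination with partial pivoting (small dense systems)."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x

def least_squares(P, q):
    """a = P^+ q, computed by solving (P^T P) a = P^T q."""
    m = len(P[0])
    PtP = [[sum(row[i] * row[j] for row in P) for j in range(m)] for i in range(m)]
    Ptq = [sum(row[i] * qi for row, qi in zip(P, q)) for i in range(m)]
    return solve(PtP, Ptq)

def apply_lc(P, a):
    """The linear combination P a."""
    return [sum(p * c for p, c in zip(row, a)) for row in P]

def iterative_alignment(P, x, steps=3, tol=1e-9, eps=1e-12):
    """Align the model matrix P = [x1, y1, x2, 1] to the image vector x."""
    x_k = apply_lc(P, least_squares(P, x))   # x^(0): orthographic solution
    d_k = x_k[:]                             # Delta^(0) = x^(0)
    for _ in range(steps):
        if max(abs(xi - xk) for xi, xk in zip(x, x_k)) < tol:
            break                            # early exit (assumption of this sketch)
        q = [(xi - xk) / dk if abs(dk) > eps else 0.0
             for xi, xk, dk in zip(x, x_k, d_k)]          # element-wise division
        a = least_squares(P, q)
        d_k = [pa * dk for pa, dk in zip(apply_lc(P, a), d_k)]  # element-wise product
        x_k = [xk + dk for xk, dk in zip(x_k, d_k)]
    return x_k

# Hypothetical model matrix (six points) and an image that is an exact
# linear combination of the model views, i.e. a weak perspective view
P = [[1.0, 2.0, 3.0, 1.0], [-2.0, 1.0, 0.0, 1.0], [0.0, -1.0, 2.0, 1.0],
     [3.0, 1.0, -1.0, 1.0], [-1.0, -2.0, 1.0, 1.0], [2.0, 0.0, 4.0, 1.0]]
x = apply_lc(P, [0.5, -1.0, 2.0, 3.0])
aligned = iterative_alignment(P, x)
```

For an image that already lies in the span of the model views the orthographic solution is exact and no correction steps are taken; for images with perspective distortion the later steps add the correction terms, and how much they help depends on the data.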


5  Projection  Model — Error  Analysis 

In  this  section  we  estimate  the  error  obtained  by  using 
the linear combination method. The method assumes a
weak  perspective  projection  model.  We  compare  this  as¬ 
sumption  with  the  more  accurate  perspective  projection 
model. 

A point $(X, Y, Z)$ is projected under the perspective
model to the point $(x, y) = (fX/Z, fY/Z)$ in the image,
where $f$ denotes the focal length. Under the weak
perspective model the same point is approximated by
$(\hat{x}, \hat{y}) = (sX, sY)$ where $s$ is a scaling factor. The best
estimate for $s$, the scaling factor, is given by $s = f/Z_0$,
where $Z_0$ is the average depth of the observed environment.
Denote the error by

$$E = |\hat{x} - x| \tag{23}$$

The error is expressed by

$$E = f\,|X| \left|\frac{1}{Z_0} - \frac{1}{Z}\right| \tag{24}$$

Changing to image coordinates we obtain

$$E = |x| \left|\frac{Z}{Z_0} - 1\right| \tag{25}$$

The error is small when the measured feature is close
to the optical axis, or when our estimate for the depth, $Z_0$,
is close to the real depth, $Z$. This supports the basic
intuition that for images with low depth variance and
for fixated regions (regions near the center of the image),
the obtained perspective distortions are relatively
small, and the system can therefore identify the scene
with high accuracy. Table 1 shows a number of examples
where the allowed depth variance, $Z/Z_0$, is computed as
a function of $x$ and the tolerated error, $\epsilon$. For example,
a 10 pixel error tolerated in a field of view of up to
±50 pixels is equivalent to allowing depth variations of
20%. From this discussion it is apparent that when a
model is aligned to the image the results of this alignment
should be judged differently at different points of
the image. The farther away a point is from the center
the more discrepancy should be tolerated between
the prediction and the actual image. A five pixel error
at position $x = 50$ is equivalent to a 10 pixel error at
position $x = 100$.
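The entries of Table 1 follow from solving the image-coordinate error expression $E = |x|\,|Z/Z_0 - 1|$ for the depth ratio. The sketch below assumes points lying farther than the average depth ($Z > Z_0$); the function names are hypothetical.

```python
def allowed_depth_ratio(x, eps):
    """Largest Z/Z0 for which the error at image position x stays within eps pixels."""
    return 1.0 + eps / x

def weak_perspective_error(x, depth_ratio):
    """E = |x| * |Z/Z0 - 1|, in pixels."""
    return abs(x) * abs(depth_ratio - 1.0)

# Rows: half field width x; columns: tolerated error eps (both in pixels)
rows = {x: [round(allowed_depth_ratio(x, eps), 2) for eps in (5, 10, 15, 20)]
        for x in (25, 50, 75, 100)}
```

For instance, `allowed_depth_ratio(50, 10)` corresponds to the 20% depth variation quoted in the text, and a fixed depth ratio of 1.1 produces twice the pixel error at $x = 100$ that it produces at $x = 50$.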

So far we have considered the discrepancies between
the weak perspective and the perspective projections of
points. The accuracy of the LC scheme depends on the
validity of the weak perspective projection both in the
model views and for the incoming image. In the rest of
this section we develop an error term for the LC scheme
assuming that both the model views and the incoming
image are obtained by perspective projection.

The  error  obtained  by  using  the  LC  scheme  is  given 
by 

E = |x - a x_1 - b y_1 - c x_2 - d|    (26)

Since  the  scheme  accurately  predicts  the  appearances  of 
points  under  weak  perspective  projection,  it  satisfies 


\hat{x} = a \hat{x}_1 + b \hat{y}_1 + c \hat{x}_2 + d    (27)


where accented letters represent orthographic approximations.
Assume that in the two model pictures the
depth ratios are roughly equal:

Z_0^M / Z^M = Z_{01} / Z_1 \approx Z_{02} / Z_2    (28)


(This  condition  is  satisfied,  for  example,  when  between 
the  two  model  images  the  camera  only  translates  along 
the  image  plane.)  Using  the  fact  that 


x = fX/Z = (fX/Z_0)(Z_0/Z) = \hat{x} Z_0/Z    (29)

we obtain (see [14])

E \le |x| |(Z_0^M/Z^M)(Z/Z_0) - 1| + |d| |Z_0^M/Z^M - 1|    (30)


The  error  therefore  depends  on  two  terms.  The  first 
gets  smaller  as  the  image  points  get  closer  to  the  center  of 
the  frame  and  as  the  difference  between  the  depth  ratios 
of  the  model  and  the  image  gets  smaller.  The  second 
gets  smaller  as  the  translation  component  gets  smaller 
and  as  the  model  gets  close  to  orthographic. 

Following this analysis, weak perspective can be used
as  a  projection  model  when  the  depth  variations  in  the 
scene  are  relatively  low  and  when  the  system  concen¬ 
trates  on  the  center  part  of  the  image.  We  conclude 
that,  by  fixating  on  distinguished  parts  of  the  environ¬ 
ment,  the  linear  combinations  scheme  can  be  used  for 
localization  and  positioning. 


6 Experiments

The LC method was implemented and applied to images
taken in an indoor environment. Images of two offices,
A and B, that have similar structures were taken using
a Panasonic camera with a focal length of 700 pixels.
Semi-static objects, such as heavy furniture and pictures,
were used to distinguish between the offices. Figure 1
shows two model views of office A. The views were taken
at a distance of about 4m from the wall. Correspondences
were picked manually. The results of aligning the
model views to images of the two offices are presented
in Figure 2. The left image contains an overlay of a
predicted image (the thick white lines), constructed by
linearly combining the two views, and an actual image
of office A. A good match between the two was achieved.
The right image contains an overlay of a predicted image
constructed from a model of office B and an image of
office A. Because the offices share a similar structure the
static cues (the wall corners) were perfectly aligned. The
semi-static cues, however, did not match any features in
the image.

Figure 3 shows the matching of the model of office A
with an image of the same office obtained by a relatively
large motion forward (about 2m) and to the side (about
1.5m). Although the distances are relatively short most
perspective distortions are negligible, and a good match
between the model and the image is obtained.

Another set of images was taken in a corridor. Here,
because of the deep structure of the corridor, perspective
distortions are noticeable. Nevertheless, the alignment
results still demonstrate an accurate match in large portions
of the image. Figure 4 shows two model images of
a corridor. Figure 5 (left) shows an overlay of a linear
combination of the model views with an image of
a corridor. It can be seen that the parts that are relatively
distant align perfectly. Figure 5 (right) shows the
matching of the corridor model with an image obtained
by a relatively large motion (about half of the corridor
length). Because of perspective distortions the relatively
near features no longer align (e.g., the near door edges).
The relatively far edges, however, still match.

The next experiment shows the application of the iterative
process presented in Section 4 in cases where large
perspective distortions were noticeable. Figure 6 shows
the results of matching an office model to an image of the
same office. In this case, since the image was taken from
a relatively close distance, perspective distortions cannot
be neglected. The effects of perspective distortions
can be noticed on the top right corner of the board, and
on the edges of the hanger on the top right. Perspective
effects were reduced by using the iterative process. The
results of applying this procedure after one and three
iterations are shown in Figure 7.

The experimental results demonstrate that the LC
method achieves accurate localization in many cases, and
that when the method fails because of large perspective
distortions an iterative computation can be used to improve
the quality of the match.

7 Conclusions


A  method  of  localization  and  positioning  in  an  indoor 
environment  was  presented.  The  method  is  based  on  rep¬ 
resenting  the  scene  as  a  set  of  2D  views  and  predicting 
the  appearance  of  novel  views  by  linear  combinations  of 
the  model  views.  The  method  accurately  approximates 
the  appearances  of  scenes  under  weak  perspective  projec¬ 
tion.  Analysis  of  this  projection  as  well  as  experimental 
results  demonstrate  that  in  many  cases  this  approxima¬ 
tion  is  sufficient  to  accurately  describe  the  scene.  When 
the  weak  perspective  approximation  is  invalid,  either  a 
larger number of models can be acquired or an iterative
solution  can  be  employed  to  account  for  the  perspective 
distortions. 

The method presented in this paper has several advantages
over existing methods. It uses relatively rich
representations; the representations are 2D rather than
3D, and localization can be done from a single 2D view
only. The same basic method is used in both the localization
and positioning problems, and a simple algorithm
for repositioning is derived from this method. Future
work includes handling the problem of acquisition and
maintenance of models, developing efficient and robust
algorithms for solving the correspondence problem, and
building maps using visual input.

Figure 1: Two model views of office A.

Figure 2: Matching a model of office A to an image of office A (left), and matching a model of office B to the same
image (right).

Figure 3: Matching a model of office A to an image of the same office obtained by a relatively large motion forward
and to the right.

Figure 4: Two model views of a corridor.

Figure 5: Matching the corridor model with two images of the corridor. The right image was obtained by a relatively
large motion forward (about half of the corridor length) and to the right.

References 

[1] N. Ayache and O. D. Faugeras. Maintaining representations
of the environment of a mobile robot.
IEEE Trans. on Rob. and Auto., 5, 804-819, 1989.

[2] R. Basri and S. Ullman. The alignment of objects
with smooth surfaces. Proc. 2nd Int. Conf. on Computer
Vision, Tarpon Springs, FL, 482-488, 1988.

[3] D. J. Braunegg. Marvel: A system for recognizing
world locations with stereo vision. AI-TR-1229,
MIT, 1990.

[4] D. F. DeMenthon and L. S. Davis. Model-based object
pose in 25 lines of code. Proc. 2nd European
Conf. on Comp. Vis., Genova, Italy, 1992.

[5] S. P. Engelson and D. V. McDermott. Image signatures
for place recognition and map construction.
Proc. SPIE Symposium on Intel. Rob. Sys., Boston,
MA, 1991.

[6] C. Fennema, A. Hanson, E. Riseman, R. J. Beveridge,
and R. Kumar. Model-directed mobile robot
navigation. IEEE Trans. on Sys., Man and Cybern.,
20, 1352-1369, 1990.

[7] M. A. Fischler and R. C. Bolles. Random sample
consensus: a paradigm for model fitting with application
to image analysis and automated cartography.
Com. of the ACM, 24, 381-395, 1981.

[8] D. P. Huttenlocher and S. Ullman. Object recognition
using alignment. Proc. 1st Int. Conf. on Comp.
Vis., London, UK, 102-111, 1987.

[9] D. G. Lowe. Three-dimensional object recognition
from single two-dimensional images. Rob. Res. TR
202, Courant Inst. of Math. Sci., 1985.




Figure  6:  The  result  of  matching  the  model  to  an  image  obtained  by  a  relatively  large  motion.  Perspective  distortions 
can  be  seen  in  the  table,  the  board,  and  the  hanger  at  the  upper  right. 


Figure  7:  The  results  of  applying  the  iterative  process  to  reduce  perspective  distortions  after  one  (left)  and  three 
(right)  iterations. 


[10] M. J. Mataric. Environment learning using a distributed
representation. Proc. Int. Conf. on Rob.
and Auto., Cincinnati, OH, 1990.

[11] R. N. Nelson. Visual homing using an associative
memory. DARPA Image Understanding Workshop,
245-262, 1989.

[12] K. Onoguchi, M. Watanabe, Y. Okamoto, Y. Kuno,
and H. Asada. A visual navigation system using a
multi information local map. Proc. Int. Conf. on
Robotics and Automation, Cincinnati, OH, pp. 767-774,
1990.

[13] T. Poggio. 3D object recognition: on a result by
Basri and Ullman. TR 9005-03, IRST, Povo, Italy,
1990.

[14] E. Rivlin and R. Basri. Localization and positioning
using combinations of model views. TR-S926, Univ.
of Maryland and AI-TR-1376, MIT, 1992.

[15] K. B. Sarachik. Visual navigation: constructing and
utilizing simple maps of an indoor environment.
AI-TR-1113, MIT, 1989.

[16] D. W. Thompson and J. L. Mundy. Three dimensional
model matching from an unconstrained viewpoint.
Proc. Int. Conf. on Rob. and Auto., Raleigh,
NC, 208-220, 1987.

[17] S. Ullman. Aligning pictorial descriptions: an approach
to object recognition. Cognition, 32, 193-254,
1989.

[18] S. Ullman and R. Basri. Recognition by linear combinations
of models. IEEE Trans. on Pattern Analysis
and Machine Intelligence, 13, 992-1006, 1991.




Dynamic Camera Self-Calibration from Controlled Motion Sequences

Lisa  Dron 

M.I.T.  Artificial  Intelligence  Laboratory 
545  Technology  Square,  Cambridge,  MA  02139 


Abstract 

In order to recover camera motion and 3-d structure from
a  sequence  of  images  we  must  first  relate  points  in  the 
image  plane  to  directions  in  space.  This  paper  describes  a 
least-squares  algorithm  for  computing  camera  calibration 
from  a  series  of  motion  sequences  for  which  the  transla¬ 
tional  direction  of  the  camera  is  known.  The  method  does 
not  require  special  calibration  objects  or  scene  structure. 
It  only  requires  the  ability  to  move  the  camera  in  a  given 
direction  and  to  track  features  in  the  image  as  the  cam¬ 
era  moves.  This  method  differs  from  other  recently  de¬ 
veloped  approaches  in  at  least  two  respects.  First,  since  it 
is  a  linear  least-squares  method,  it  can  include  informa¬ 
tion  from  many  sequences  to  produce  a  robust  estimate 
of  the  calibration  matrix,  which  can  be  updated  dynami¬ 
cally  as  new  measurements  are  taken.  Second,  it  uses  the 
most  general  possible  linear  model  for  calibration,  taking 
into  account  misalignment  of  the  image  sensor  and  op¬ 
tical  axis,  as  well  as  rotation  of  the  camera  with  respect 
to  the  external  coordinate  system.  Experimental  results 
from  applying  the  algorithm  to  a  set  of  real  motion  se¬ 
quences  with  noisy  correspondence  data  are  given  and 
analyzed. 

1  Introduction 

Before  camera  motion  and  3-d  structure  can  be  deduced 
from  a  set  of  images,  it  is  necessary  to  know  the  re¬ 
lation  between  points  in  the  image  plane  and  direc¬ 
tions  in  3-space.  Methods  for  computing  camera  cali¬ 
bration  differ  mainly  in  the  complexity  of  the  camera 
model  and  the  type  of  laboratory  setup  required.  The 
accuracy  of  the  calibration  and  the  complexity  of  the 
model  needed  are  determined  by  the  application.  For 
3-d  metrology,  it  is  necessary  to  use  as  complete  a  model 
as  possible  which  includes  the  nonlinear  distortions  due 
to imperfections and misalignments in the optical system.
The basic method for determining the linear perspective
parameters and nonlinear radial lens distortion
using a carefully designed calibration pattern was developed
by Tsai [Tsai, 1986], and later refined by Lenz and
Tsai [Lenz and Tsai, 19M]. Since then the method has
been improved several times by expanding the number of
distortion parameters in the model and using nonlinear
optimization techniques [Weng et al., 1992].

For  computing  camera  motion  and  judging  relative 
distances  as  needed  for  navigation,  the  precision  of  the 


full  nonlinear  model  is  not  usually  necessary.  Since  cam¬ 
era  motion  is  estimated  by  measurements  taken  over  the 
entire  image,  the  effects  of  distortion  are  not  as  impor¬ 
tant  unless  the  lens  distortion  is  particularly  severe.  The 
previously  cited  methods  require  tedious  measurements 
and  carefully  controlled  setups.  It  is  certainly  desirable 
to  avoid  the  use  of  calibration  patterns  if  this  is  possible; 
and  as  a  practical  matter,  it  may  be  necessary  to  do  so  if 
the  camera  is  placed  in  an  environment  where  the  scene 
cannot  be  controlled. 

Recently,  methods  for  computing  the  linear  pinhole- 
model  parameters  of  perspective  projection  by  track¬ 
ing  features  in  the  image,  without  using  spe¬ 
cial  patterns,  have  been  developed  by  Faugeras  et 
al.  [Faugeras  et  al.,  1992]  and  Hartley  [Hartley,  1992]. 
In  the  second  method,  the  effective  focal  lengths  and 
relative  positions  of  two  different  cameras  are  computed 
from  two  images  using  an  initial  guess  for  the  locations 
of their principal points. The method developed by
Faugeras  et  al.  applies  theories  from  algebraic  projective 
geometry  to  compute  all  of  the  pinhole  model  parame¬ 
ters  from  three  image  sequences. 

The algorithm for camera self-calibration presented in
this  paper  is  similar  to  these  methods  in  that  it  requires 
the  ability  to  track  features  between  images  as  the  cam¬ 
era  moves  and  does  not  need  special  calibration  patterns. 
It  differs  from  the  previous  methods,  however,  in  that 
it  is  a  linear  least  squares  algorithm  and  can  include 
information  from  many  motion  sequences  to  produce  a 
robust  estimate  which  can  be  updated  dynamically  as 
new  measurements  are  taken.  Furthermore  it  uses  the 
most  general  linear  camera  model  possible,  taking  into 
account  misalignment  of  the  image  sensor  and  optical 
axis,  as  well  as  the  rotation  of  the  camera  with  respect 
to  the  motion  stage  coordinate  system. 

Unlike the other methods, it does require that the
translational direction of the camera be known for each
sequence. It is not necessary to know the rotation,
although the estimation is simplified if it is made to be
zero. The benefit gained in return for having to control
the camera motion is a less stringent requirement on
the accuracy of the point correspondences than needed
otherwise. For example, Faugeras et al. report that errors
of 1 pixel in the location of the correspondences can
significantly affect their results. Such precision is not
achievable from block-matching techniques which can be
implemented in hardware. In many situations, it is much
easier to move the camera in a known direction than to
attain such high accuracy in the correspondences.

2  Basic  Equations  and  Definitions 

The  least-squares  algorithm  for  computing  the  camera 
calibration  from  a  set  of  controlled  motion  sequences  is 
presented  in  Section  3.  In  this  section  I  define  the  nota¬ 
tion  which  will  be  used  later  and  rederive  the  fundamen¬ 
tal  equations  describing  the  apparent  motion  of  points 
in  the  image  plane  given  the  motion  of  the  camera  with 
respect to an external coordinate system.

2.1  The  camera  calibration  matrix 

Let w = (x, y, z)^T represent a point in the world with
respect to the origin of the camera coordinate system.
We are interested in relating the projection of w onto
the plane (0, 0, f)^T to the coordinates of a point on a
digitized pixel grid. Let \tilde{w} = (x/z, y/z, 1)^T represent
the 2-dimensional homogeneous coordinates of w. Two
world points w_i and w_j are projectively equivalent on
(0, 0, f)^T if and only if \tilde{w}_i = \tilde{w}_j.

Image plane coordinates, m = (m_x, m_y, 1)^T, are defined
on a rectangular coordinate system such that
(m_x, m_y) represents the coordinates of a point in the image
plane with the centers of each pixel located at integer
values of m_x and m_y. If we ignore the nonlinear radial
and tangential distortions caused by imperfections in the
lens and assume perfect perspective projection, the image
plane coordinates are related to 2-d homogeneous
world coordinates by a linear transformation matrix K_c

m = K_c \tilde{w}    (1)

Under certain conditions K_c takes on a special form.

If the axes of the pixel grid are exactly orthogonal, and
if the image plane is exactly perpendicular to the optical
axis, K_c is upper triangular and has the form

K_c = \begin{pmatrix} f/s_x & 0 & x_0 \\ 0 & f/s_y & y_0 \\ 0 & 0 & 1 \end{pmatrix}    (2)

where f is the effective focal length, s_x and s_y are the
spacings between pixel centers along the orthogonal x-
and y-axes, and (x_0, y_0) is the location of the principal
point, in image plane coordinates, where the optical axis
pierces the image plane.

In practice, the conditions for K_c to have the form
of (2) will not be met exactly. Synchronization error or
clock skew in the frame capture hardware can cause the
grid axes to be slightly non-orthogonal. Since commercial
CCD cameras are not assembled with strict alignment
tolerances, it is also unlikely that the image sensor
will be exactly perpendicular to the optical axis. Taking
these effects into consideration, a more general form for
K_c can be written as

K_c = A_s \begin{pmatrix} f/s_x & -f/(s_x \tan\theta) & x_0 \\ 0 & f/(s_y \sin\theta) & y_0 \\ 0 & 0 & 1 \end{pmatrix}    (3)

where A_s is the affine transformation between image
plane coordinates on the rotated and unrotated image
sensor, and \theta is the angle between the grid axes.
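
To make the role of each parameter concrete, the matrix part of the general form (3) can be assembled and applied to a world point. This is a minimal sketch, assuming the affine factor that multiplies the matrix in (3) is the identity (a sensor exactly perpendicular to the optical axis); the function names and the sample numbers are illustrative only.

```python
import math


def calibration_matrix(f, sx, sy, x0, y0, theta=math.pi / 2):
    """3x3 matrix of the general form (3), without the affine factor.

    f       effective focal length
    sx, sy  pixel spacings along the grid axes (same units as f)
    x0, y0  principal point in image plane coordinates
    theta   angle between the grid axes (pi/2 for an orthogonal grid)
    """
    return [
        [f / sx, -f / (sx * math.tan(theta)), x0],
        [0.0, f / (sy * math.sin(theta)), y0],
        [0.0, 0.0, 1.0],
    ]


def project(K, w):
    """Map a world point w = (x, y, z) to image coordinates m = K (x/z, y/z, 1)."""
    x, y, z = w
    w_h = (x / z, y / z, 1.0)
    return tuple(sum(K[r][c] * w_h[c] for c in range(3)) for r in range(3))


if __name__ == "__main__":
    K = calibration_matrix(700.0, 1.0, 1.0, 320.0, 240.0)
    print(project(K, (1.0, 0.0, 2.0)))
```

With theta equal to pi/2 the skew entry is zero up to floating-point rounding and the matrix reduces to the upper triangular form (2).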


2.2  Relative  motion  and  the  coplanarity 
constraint 

Suppose we have two images taken by the same camera
at two different positions, which we will refer to as right
and left. Let {w_{ri}, w_{li}} denote world coordinates with
respect to the right and left coordinate systems of a set
of fixed points in the environment, w_{ri} = (x_r, y_r, z_r)_i^T
and w_{li} = (x_l, y_l, z_l)_i^T. Assuming rigid motion of the
camera with respect to the stationary environment, w_{ri}
and w_{li} are related by

w_{ri} = R_c w_{li} + t_c    (4)

where R_c denotes an orthonormal rotation matrix, and
t_c is the translation vector that connects the origins of
the two systems.

In terms of homogeneous coordinates, \tilde{w}_r =
(x_r/z_r, y_r/z_r, 1)^T and \tilde{w}_l = (x_l/z_l, y_l/z_l, 1)^T, equation (4)
can be rewritten as

z_r \tilde{w}_{ri} = z_l R_c \tilde{w}_{li} + t_c    (5)

and hence,

z_r K_c^{-1} m_{ri} = z_l R_c K_c^{-1} m_{li} + t_c    (6)

A necessary condition for the vectors \tilde{w}_{ri} and \tilde{w}_{li} to
intersect at the world point i is that they be coplanar
with t_c, or equivalently, that the triple product of the
ray directions vanish

R_c \tilde{w}_{li} \cdot (t_c \times \tilde{w}_{ri}) = 0    (7)

z_r and z_l drop out of the above expression since they are
merely scale factors.

Equation (7) is known as the coplanarity constraint
and is the basis of all methods for computing relative
camera motion given a set of matched vectors {\tilde{w}_{ri}, \tilde{w}_{li}}.
A similar constraint can be written in terms of image
coordinates by writing (6) as

z_r m_{ri} = z_l K_c R_c K_c^{-1} m_{li} + K_c t_c = z_l U_c m_{li} + v_c    (8)

where

U_c = K_c R_c K_c^{-1}    (9)

v_c = K_c t_c    (10)

Equation (10) is the key equation which will be used in
the least-squares algorithm of Section 3. As with (5), we
obtain

U_c m_{li} \cdot (v_c \times m_{ri}) = 0    (11)
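
The coplanarity constraint is easy to verify on synthetic data. The sketch below assumes the rigid-motion relation of (4), with the right-camera point obtained by rotating the left-camera point and adding the translation, and checks that the triple product of (7) vanishes for a true correspondence and not for a perturbed one. All helper names and numbers are illustrative.

```python
import math


def matvec(A, v):
    return [sum(A[r][c] * v[c] for c in range(3)) for r in range(3)]


def cross(a, b):
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def rotation_z(angle):
    """Rotation by `angle` radians about the z-axis."""
    c, s = math.cos(angle), math.sin(angle)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


def homogeneous(w):
    """2-d homogeneous coordinates (x/z, y/z, 1) of a world point."""
    return [w[0] / w[2], w[1] / w[2], 1.0]


def triple_product(R, t, wl_h, wr_h):
    """Left side of the coplanarity constraint: R wl . (t x wr)."""
    return dot(matvec(R, wl_h), cross(t, wr_h))


# Synthetic motion and one world point seen from both positions.
R = rotation_z(0.1)
t = [0.3, -0.2, 0.5]
w_left = [1.0, 2.0, 5.0]
w_right = [a + b for a, b in zip(matvec(R, w_left), t)]

residual = triple_product(R, t, homogeneous(w_left), homogeneous(w_right))
print(residual)  # close to zero for a true correspondence
```

Because the depths only scale the two rays, the residual is computed from the homogeneous coordinates alone, which is why the depths drop out of the constraint.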

2.3  Motion  stage  orientation 

In order to generate controlled motion sequences, the
camera must be mounted on some device which is able
to  move  in  a  precise  manner.  In  practice,  it  is  not  feasible 
to  exactly  align  the  camera  coordinate  system  with  that 
of  the  motion  stage,  and  hence  we  must  also  find  the 
transformation  which  relates  the  motion  of  the  stage  to 
that  of  the  camera. 

Let (R_o, t_o) denote the rotation and translation of the
camera coordinate system with respect to the motion
stage coordinate system such that if w_c = (x_c, y_c, z_c)^T
are the coordinates of a world point in the camera system
and w_m = (x_m, y_m, z_m)^T are the coordinates of the same
point with respect to the motion stage then

w_c = R_o w_m + t_o    (12)

Suppose the stage executes a motion described by a
rotation and translation (R_m, t_m) of its coordinate center.
Let w_{rm} and w_{lm} denote the coordinates of the
same world point with respect to the stage before and
after the motion and let w_{rc} and w_{lc} denote the same
points in the right and left camera systems. We have
then

w_{rm} = R_m w_{lm} + t_m    (13)

and, combining equations (12) and (13), we obtain after
some algebraic manipulation

w_{rc} = R_o R_m R_o^T w_{lc} + R_o t_m + (I - R_o R_m R_o^T) t_o    (14)
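
The algebraic manipulation leading to (14) is short enough to spell out. The following is a sketch of one way to carry it out, assuming that (13) follows the same convention as (4), that is, that the right stage coordinates equal the rotated left stage coordinates plus the translation, and that R_o is orthonormal so that its inverse is its transpose. The intermediate steps are a reconstruction and are not taken from the original text.

```latex
\begin{align*}
w_{rc} &= R_o w_{rm} + t_o
        && \text{by (12) applied to the right view} \\
       &= R_o (R_m w_{lm} + t_m) + t_o
        && \text{by (13)} \\
       &= R_o R_m R_o^{T} (w_{lc} - t_o) + R_o t_m + t_o
        && \text{since } w_{lm} = R_o^{T}(w_{lc} - t_o) \text{ by (12)} \\
       &= R_o R_m R_o^{T} w_{lc} + R_o t_m + (I - R_o R_m R_o^{T})\, t_o .
\end{align*}
```

Reading off the coefficient of the left camera coordinates and the constant term gives the effective camera rotation and translation in terms of the stage motion.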


Comparing (14) with (4), we see that

R_c = R_o R_m R_o^T    (15)

t_c = R_o t_m + (I - R_o R_m R_o^T) t_o    (16)

and we can use (9) and (10) to determine that

U_c = K_c R_c K_c^{-1} = K_c R_o R_m R_o^T K_c^{-1}    (17)

and

v_c = K_c t_c = K_c R_o t_m + K_c (I - R_o R_m R_o^T) t_o    (18)

Replacing K_c R_o by K_m = K_c R_o, we obtain finally

U_c = K_m R_m K_m^{-1}    (19)

v_c = K_m t_m + K_m (I - R_m) R_o^T t_o    (20)


which, except for the term multiplying t_o, gives the same
set of equations as (9) and (10) in terms of the rotation
and translation of the motion stage, R_m and t_m, and
the matrix K_m.

There are two ways in which the t_o term in the above
expression can be effectively eliminated. The first is by
making R_m = I, or at least R_m \approx I, which is usually
possible since we control R_m. The second is by making
||t_m|| \gg ||t_o||. In either case, we arrive at an equation
of the form (10) to be solved for K_m instead of
K_c. Once K_m is found, other procedures, which will not
be discussed in this paper, may be applied if desired to
estimate R_o and t_o and complete the camera-motion stage
calibration.

3  Computing  the  Calibration  Matrix 

The algorithm for computing the calibration matrix K_m
is based on solving a set of equations of the form

v_{ck} = K_m t_{mk}    (21)

given M motion sequences (pairs of images) for which
the translation directions t_{mk}, k = 1, ..., M, are known.
Equation (21) is the same as (20) in which it is assumed
that t_o can be ignored, either by making R_m = I or
||t_m|| \gg ||t_o||.


As explained in Section 4, it is numerically difficult to
obtain a good estimate directly from the image data for
K_m = K_c R_o, with K_c of the form (3), since this matrix
is very poorly conditioned. Instead, we start with an
estimate, K_e, and look for the best update matrix
K_u such that

K_m = K_e K_u    (22)

The estimate K_e can be derived either from the manufacturer's
data on the camera and associated frame capture
hardware, or by iteratively solving (22) and replacing
K_e with the value computed for K_m on the previous
iteration. We thus define

U_{ek} = K_e^{-1} U_{ck} K_e    (23)

v_{ek} = K_e^{-1} v_{ck}    (24)

p = K_e^{-1} m    (25)

such that from (19) and (20)

U_{ek} = K_u R_{mk} K_u^{-1}    (26)

v_{ek} = K_u t_{mk}    (27)

and

p = K_u \tilde{w}    (28)


The algorithm for computing K_u consists of first determining
the uncalibrated translation directions \hat{v}_{ek} for
each sequence, k = 1, ..., M. Since only the direction
of translation is recovered from the relative motion
equations, the equations (27) cannot be solved directly.
The next step is therefore to compute the scale factors
\alpha_k = ||v_{ek}|| / ||t_{mk}|| such that \alpha_k \hat{v}_{ek} = K_u \hat{t}_{mk}. After
determining the \alpha_k we can then solve for K_u.

Each of these steps will be described in more detail
in the remainder of this section. To simplify the notation,
since only unit vectors are used in the following
derivations, we will define

t = \hat{t}_m    (29)

v = \hat{v}_e    (30)

3.1  Computing  the  uncalibrated  translation 
direction 

For each sequence, k = 1, ..., M, we first determine a
set of correspondence points between the two images.
The fundamental equation for computing relative camera
motion is the coplanarity constraint which in terms
of U_e, v, p_{li}, and p_{ri} is given by

U_e p_{li} \cdot (v \times p_{ri}) = 0    (31)

These equations can be written in matrix form by replacing
the cross product by an equivalent matrix multiplication.
Let V_\times denote the skew-symmetric matrix

V_\times = \begin{pmatrix} 0 & -v_z & v_y \\ v_z & 0 & -v_x \\ -v_y & v_x & 0 \end{pmatrix}    (32)

where (v_x, v_y, v_z) are the components of v. Clearly
V_\times p_{ri} = v \times p_{ri} and (31) can be written as

p_{li}^T U_e^T V_\times p_{ri} = 0    (33)

The above formulation is identical to that derived
by Longuet-Higgins [Longuet-Higgins, 1981] who proposed
a linear method to compute the product matrix
Q = U_e^T V_\times to within an arbitrary scale factor from 8
correspondence points. Once Q is known, v is identified
as the eigenvector of Q^T Q = I - v v^T with the
smallest eigenvalue.
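
The step from the product matrix Q to the translation direction can be illustrated numerically. The sketch below is not the procedure used in the paper; it assumes Q is the transpose of an orthonormal matrix times the skew-symmetric matrix of a unit vector v, known only up to scale, and recovers v (up to sign) as the eigenvector of Q^T Q with the smallest eigenvalue. For simplicity the eigenvector is found by power iteration on trace(S) I - S, a choice made here only to keep the example free of external libraries.

```python
import math


def skew(v):
    """Skew-symmetric matrix V such that V a = v x a."""
    vx, vy, vz = v
    return [[0.0, -vz, vy], [vz, 0.0, -vx], [-vy, vx, 0.0]]


def transpose(A):
    return [[A[c][r] for c in range(3)] for r in range(3)]


def matmul(A, B):
    return [[sum(A[r][k] * B[k][c] for k in range(3)) for c in range(3)]
            for r in range(3)]


def matvec(A, v):
    return [sum(A[r][c] * v[c] for c in range(3)) for r in range(3)]


def smallest_eigenvector(S, iterations=200):
    """Unit eigenvector of a symmetric positive semidefinite 3x3 matrix S
    belonging to its smallest eigenvalue.

    Power iteration on B = trace(S) I - S, whose dominant eigenvector is the
    one sought. The fixed start vector fails only if it happens to be
    orthogonal to the answer.
    """
    tr = S[0][0] + S[1][1] + S[2][2]
    B = [[(tr if r == c else 0.0) - S[r][c] for c in range(3)] for r in range(3)]
    x = [1.0, 0.7, 0.3]
    for _ in range(iterations):
        y = matvec(B, x)
        n = math.sqrt(sum(c * c for c in y))
        x = [c / n for c in y]
    return x


def translation_from_q(Q):
    """Translation direction (up to sign) from Q, known up to scale."""
    return smallest_eigenvector(matmul(transpose(Q), Q))


def rotation_z(angle):
    c, s = math.cos(angle), math.sin(angle)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


# Synthetic example: a unit translation direction and an arbitrary scale.
v_true = [1.0 / 3.0, 2.0 / 3.0, 2.0 / 3.0]
Q = [[2.5 * q for q in row]
     for row in matmul(transpose(rotation_z(0.3)), skew(v_true))]
v_est = translation_from_q(Q)
```

The unknown scale of Q multiplies every eigenvalue of Q^T Q by the same factor, so it does not change which eigenvector is selected.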

The Longuet-Higgins method, however, is numerically
unstable. Since it does not explicitly enforce dependencies
between the elements of Q and does not take into
account any error in the data, it requires knowing the
locations of the correspondence points with high accuracy.
If there is error, as there usually is, the coplanarity
condition will not hold exactly for all pairs of corresponding
points. Equation (31) should instead be written as

U_e p_{li} \cdot (v \times p_{ri}) = \epsilon_i    (34)

and solved by a least-squares procedure which minimizes
\sum_i \epsilon_i^2.

If the motion sequences are controlled such that R_m =
I, then by (19) and (26) U_e = I, and equation (31)
becomes

p_{li} \cdot (v \times p_{ri}) = v \cdot (p_{ri} \times p_{li}) = \epsilon_i    (35)

Given N correspondences and defining y_i = p_{ri} \times p_{li},
the sum of squared errors can be written as

\sum_{i=1}^{N} \epsilon_i^2 = \sum_{i=1}^{N} (v \cdot y_i)^2 = v^T Y v,    Y = \sum_{i=1}^{N} y_i y_i^T

The unit vector v which minimizes the sum of squared
errors is the eigenvector of Y with the smallest eigenvalue.

The method of requiring R_m = I so that v can be
computed directly as an eigenvector of Y is the simplest
procedure, but is not always practical. In many applications,
it may be possible to translate the camera accurately
in a known direction, but not to ensure R_m = I
with high precision. An example is a camera mounted
on a vehicle which can move reliably from point to point,
but has limited ability to align itself exactly in the direction
of the previous position. In general it is best to
estimate both U_e and v together. If in fact U_e = I,
nothing is lost.

Weng et al. [Weng et al., 1989] proposed an improved
version of the Longuet-Higgins method which does minimize
the sum of squared errors by using more than 8 correspondences
and solving a 9x9 eigenvalue-eigenvector
problem instead of computing the exact solution to the
homogeneous linear equations. This method, however,
still does not enforce dependencies between elements
of Q, and, since it involves computing the eigenvector
corresponding to the smallest eigenvalue of a relatively
large matrix, it has inherent numerical difficulties. When
the correspondence data are noisy, the results from this
method have been found to be less reliable than those
from other procedures.

The method which, in testing the calibration algorithm,
was found to yield the best results is to minimize

\sum_{i=1}^{N} \epsilon_i^2 = \sum_{i=1}^{N} (U_e p_{li} \cdot (v \times p_{ri}))^2    (39)

explicitly assuming U_e = K_u R_m K_u^{-1} is an orthonormal
rotation matrix. Horn [Horn, 1991] developed an
iterative algorithm to compute relative motion by directly
solving the nonlinear minimization problem while
enforcing the orthonormality of U_e. A revised version of
Horn's algorithm, which was modified for more efficient
implementation in hardware [Dron, 1992], was used for
the results of Section 5.

If R_m \ne I and the initial estimate K_e is far from K_m,
the assumption that U_e is orthonormal will not be very
good. Nonetheless, as long as the relative motion algorithm
can determine a well-defined solution to (39) for a
sufficient number of motion sequences, the least-squares
algorithm for computing the update K_u to the calibration
matrix will find a solution that brings the new estimate
closer to the correct K_m. As K_e \to K_m, K_u \to I,
and the assumption of orthonormality for U_e will improve.
A detailed analysis has not been performed to
prove that an iterative procedure defined in this manner
starting from an arbitrary K_e will always converge to
the correct K_m. However, in the tests which have been
conducted, the procedure has in fact converged to the
same solution starting from very different initial values
for K_e.

3.2 Estimating the \alpha_k

Having computed the uncalibrated translation directions,
v_k, we now have from (27) a set of equations

\alpha_k v_k = K_u t_k    (40)

for  k  =  1,  • . . ,  M,  where  a*  is  an  unknown  scale  factor. 

To compute a least-squares estimate for the factors $\alpha_k$ we choose three non-coplanar vectors, $(t_1, t_2, t_3)$, out of the $M$ known translations to serve as a basis. For best numerical stability, it is desirable that $(t_1, t_2, t_3)$ be mutually orthogonal; however, the algorithm only requires that they span $\mathbb{R}^3$. Let $T$ be the matrix whose columns are the basis vectors

$$T = (t_1 \;\; t_2 \;\; t_3) \qquad (41)$$

and let $V$ be the matrix whose columns are the corresponding computed displacement vectors

$$V = (v_1 \;\; v_2 \;\; v_3) \qquad (42)$$

From (40)

$$V \begin{pmatrix} \alpha_1 & 0 & 0 \\ 0 & \alpha_2 & 0 \\ 0 & 0 & \alpha_3 \end{pmatrix} = K_u T \qquad (43)$$

Since $T$ spans $\mathbb{R}^3$, we can express the other translation vectors, $t_k$, $k = 4, \ldots, M$, as linear combinations of $(t_1, t_2, t_3)$

$$t_k = c_{k1} t_1 + c_{k2} t_2 + c_{k3} t_3 = T c_k \qquad (44)$$

with $c_k = (c_{k1}, c_{k2}, c_{k3})^T$. Hence

$$\alpha_k v_k = K_u t_k = K_u T c_k \qquad (45)$$

Let $C_k$ be the diagonal matrix $\mathrm{diag}(c_{k1}, c_{k2}, c_{k3})$ formed from the elements of $c_k$ and let $a = (\alpha_1, \alpha_2, \alpha_3)^T$; then from (43) and (45)

$$\alpha_k v_k = V C_k a \qquad (46)$$

Since $v_k$ is a unit vector,

$$\alpha_k = v_k^T V C_k a \qquad (47)$$

and by substituting (47) into (46) we obtain

$$v_k v_k^T V C_k a = V C_k a \qquad (48)$$

or,

$$(I - v_k v_k^T) V C_k a = 0 \qquad (49)$$

The fact that the above expression is homogeneous reflects the fact that $K_u$ is known only to within an arbitrary scale factor, and hence that we can only solve for the relative proportions of the $\alpha_k$. To fix the scale, we impose the condition $\|a\| = 1$.

Since the computed estimates of $v_k$ will necessarily contain some error, we should write (49) as

$$(I - v_k v_k^T) V C_k a = e_k \qquad (50)$$

where $e_k$ is the error vector whose squared magnitude is

$$e_k^T e_k = a^T C_k V^T (I - v_k v_k^T) V C_k a \qquad (51)$$

The total error for all of the sequences is therefore minimized by taking $a$ to be the eigenvector corresponding to the smallest eigenvalue of

$$W = \sum_{k=1}^{M} C_k V^T (I - v_k v_k^T) V C_k \qquad (52)$$

In order to obtain a nontrivial, unique solution for $a$, we must have at least four motion sequences, including the basis, $T$, for which no three translations, $t_k$, are coplanar. If we have more than four sequences, the condition for obtaining a nontrivial, unique solution is that we can construct, through linear combinations of the $t_k$, at least four vectors, no three of which are coplanar. It should be noted that this condition is equivalent to that for being able to form a projective basis for $\mathbb{P}^2$ [Semple and Kneebone, 1952].
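The construction of $W$ in (52) can be checked on synthetic, noise-free data: the true scale vector $a$ should then lie in the null space of $W$. The following is a minimal sketch under stated assumptions, not the author's implementation. The update matrix `K_u`, the extra translation directions, and all helper names are hypothetical, and the basis translations are taken to be the coordinate axes so that $T = I$ and $c_k = t_k$.

```python
import math

def matmul(A, B):
    """Product of two matrices stored as lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def matvec(A, x):
    """Matrix-vector product."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]

def unit(x):
    """Return (x / |x|, |x|)."""
    n = math.sqrt(sum(c * c for c in x))
    return [c / n for c in x], n

def build_W(V, coeffs, dirs):
    """W = sum_k C_k V^T (I - v_k v_k^T) V C_k, as in equation (52)."""
    Vt = [list(col) for col in zip(*V)]
    W = [[0.0] * 3 for _ in range(3)]
    for c, v in zip(coeffs, dirs):
        C = [[c[i] if i == j else 0.0 for j in range(3)] for i in range(3)]
        P = [[(i == j) - v[i] * v[j] for j in range(3)] for i in range(3)]
        term = matmul(matmul(matmul(matmul(C, Vt), P), V), C)
        W = [[W[i][j] + term[i][j] for j in range(3)] for i in range(3)]
    return W

def null_space_residual():
    """Largest component of W a for the true a on noise-free data."""
    # Hypothetical update matrix, used only to synthesize the data.
    K_u = [[1.10, 0.02, 0.30], [0.00, 0.90, -0.20], [0.01, 0.00, 1.00]]
    # Basis translations are the coordinate axes (assumption: T = I, c_k = t_k).
    ts = [[1, 0, 0], [0, 1, 0], [0, 0, 1],
          unit([1, 1, 1])[0], unit([1, -1, 2])[0]]
    dirs, alphas = zip(*(unit(matvec(K_u, t)) for t in ts))
    V = [list(row) for row in zip(*dirs[:3])]   # columns are v_1, v_2, v_3
    a, _ = unit(alphas[:3])                      # true a, scaled so |a| = 1
    return max(abs(x) for x in matvec(build_W(V, ts, dirs), a))
```

With noisy directions the residual is no longer zero, and $a$ would instead be taken as the eigenvector of $W$ with the smallest eigenvalue, which this sketch does not compute.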

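3.3 Computing $K_u$

Given the least-squares estimate for $a$, we can compute $\alpha_k$ for each sequence from equation (47). We now have from (40)

$$K_u t_k = \alpha_k v_k \qquad (53)$$

in which the only unknown is the matrix $K_u$. Again, we consider the error in the estimated $v_k$ and write

$$e_k = K_u t_k - \alpha_k v_k \qquad (54)$$

As before, we seek to minimize the total error

$$\sum_{k=1}^{M} e_k^T e_k = \sum_{k=1}^{M} (K_u t_k - \alpha_k v_k)^T (K_u t_k - \alpha_k v_k) \qquad (55)$$

The matrix $K_u$ which minimizes the sum of squared error magnitudes satisfies

$$\frac{\partial}{\partial K_u} \sum_{k=1}^{M} e_k^T e_k = 2 \sum_{k=1}^{M} (K_u t_k - \alpha_k v_k)\, t_k^T = 0 \qquad (56)$$

from which we find that

$$K_u = \left( \sum_{k=1}^{M} \alpha_k v_k t_k^T \right) \left( \sum_{k=1}^{M} t_k t_k^T \right)^{-1} \qquad (57)$$

This expression can be simplified by substituting equation (47) for $\alpha_k$, and defining

$$f_k = C_k V^T v_k, \qquad D = \sum_{k=1}^{M} t_k t_k^T \qquad (58)$$

so that

$$K_u = \left( \sum_{k=1}^{M} (a^T f_k)\, v_k t_k^T \right) D^{-1} \qquad (59)$$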
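The closed-form solution for $K_u$ can be illustrated with a short sketch that accumulates the two sums and applies the inverse of $D$. This is an illustrative example under assumptions rather than the paper's code: the function names are hypothetical, the scale factors $\alpha_k$ are passed in directly instead of being computed as $a^T f_k$, and a plain adjugate formula is used for the 3x3 inverse.

```python
def inv3(M):
    """Inverse of a 3x3 matrix via the adjugate (cofactor) formula."""
    (a, b, c), (d, e, f), (g, h, i) = M
    det = a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)
    if abs(det) < 1e-12:
        raise ValueError("matrix is singular: translations must span R^3")
    return [[(e * i - f * h) / det, (c * h - b * i) / det, (b * f - c * e) / det],
            [(f * g - d * i) / det, (a * i - c * g) / det, (c * d - a * f) / det],
            [(d * h - e * g) / det, (b * g - a * h) / det, (a * e - b * d) / det]]

def solve_Ku(ts, vs, alphas):
    """Least-squares K_u = (sum_k alpha_k v_k t_k^T) (sum_k t_k t_k^T)^{-1}."""
    N = [[0.0] * 3 for _ in range(3)]   # sum of alpha_k v_k t_k^T
    D = [[0.0] * 3 for _ in range(3)]   # sum of t_k t_k^T
    for t, v, alpha in zip(ts, vs, alphas):
        for i in range(3):
            for j in range(3):
                N[i][j] += alpha * v[i] * t[j]
                D[i][j] += t[i] * t[j]
    D_inv = inv3(D)
    return [[sum(N[i][k] * D_inv[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]
```

On noise-free data, where $\alpha_k v_k = K_u t_k$ holds exactly, the numerator sum equals $K_u D$, so the function returns the generating matrix up to rounding.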


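3.4 Including new measurements

A feature of linear least-squares algorithms is that it is possible to improve a given estimate for $K_u$ by performing subsequent measurements using a recursive update procedure, without having to recompute everything from the beginning.

Suppose we have $n - 1$ measurements and have computed

$$W^{(n-1)} = \sum_{k=1}^{n-1} C_k V^T (I - v_k v_k^T) V C_k \qquad (60)$$

We subsequently obtain the $n$th measurement and compute

$$W^{(n)} = W^{(n-1)} + C_n V^T (I - v_n v_n^T) V C_n \qquad (61)$$

to obtain a new estimate $a^{(n)}$. Now define

$$F_1^{(n)} = \sum_{i=1}^{n} f_{i1}\, v_i t_i^T \qquad (62)$$

$$F_2^{(n)} = \sum_{i=1}^{n} f_{i2}\, v_i t_i^T \qquad (63)$$

$$F_3^{(n)} = \sum_{i=1}^{n} f_{i3}\, v_i t_i^T \qquad (64)$$

where $(f_{i1}, f_{i2}, f_{i3})$ are the components of $f_i$. The matrices $F_1^{(n)}$, $F_2^{(n)}$ and $F_3^{(n)}$ are updated in the obvious way, and we can use a matrix identity to compute

$$\left(D^{(n)}\right)^{-1} = \left(D^{(n-1)}\right)^{-1} - \frac{\left(D^{(n-1)}\right)^{-1} t_n t_n^T \left(D^{(n-1)}\right)^{-1}}{1 + t_n^T \left(D^{(n-1)}\right)^{-1} t_n} \qquad (65)$$

$K_u^{(n)}$ can then be computed directly from

$$K_u^{(n)} = \left( \alpha_1^{(n)} F_1^{(n)} + \alpha_2^{(n)} F_2^{(n)} + \alpha_3^{(n)} F_3^{(n)} \right) \left(D^{(n)}\right)^{-1} \qquad (66)$$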
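The matrix identity used to update the inverse of $D$ is the rank-one (Sherman-Morrison) update. A minimal sketch follows, assuming $D$ is symmetric so that $t^T D^{-1}$ is the transpose of $D^{-1} t$; the function name is hypothetical.

```python
def sherman_morrison_update(D_inv, t):
    """Return the inverse of (D + t t^T) given D_inv = D^{-1} (D symmetric)."""
    # u = D^{-1} t; by symmetry, t^T D^{-1} = u^T.
    u = [sum(D_inv[i][j] * t[j] for j in range(3)) for i in range(3)]
    denom = 1.0 + sum(t[i] * u[i] for i in range(3))
    return [[D_inv[i][j] - u[i] * u[j] / denom for j in range(3)]
            for i in range(3)]
```

Each new translation $t_n$ then costs only a matrix-vector product and an outer product, rather than a fresh 3x3 inversion.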
4 Numerical Stability and Parameter Estimation

Before applying the algorithm to real data, we should first address the issues of numerical stability and analyze the effects of error on the estimation of the different components of the calibration matrix.

4.1 Numerical stability

The structure of $K_e$, given by equation (2), guarantees numerical problems. Even though we compute $K_m$, which in general differs from the form of (2) by the affine and rotational transformations, the weakness of the underlying structure remains.

Typical commercial CCD image arrays are several hundred pixels wide in both horizontal and vertical dimensions. The effective focal length of the lens, which is comparable to the sensor dimensions, is therefore several hundred times longer than the spacing between the rows of sensing elements, and as a result, $f/s_x$, $f/s_y$, $x_0$, and $y_0$ are all $\gg 1$. Suppose $K_m$ is of the form (2). If we use image plane coordinates directly then

$$K_m t = \begin{pmatrix} t_x f/s_x + t_z x_0 \\ t_y f/s_y + t_z y_0 \\ t_z \end{pmatrix} \qquad (67)$$

and the z, or third, component of the uncalibrated translation directions will therefore always be much smaller than the other two. Aside from the fact that it is difficult numerically just to estimate such small z components in the computed vectors $v_k$, the matrix $V$, which is central to the algorithm, will be nearly singular. Consequently, it is not feasible to estimate $K_m$ in one step directly from the image data. As previously stated, we must instead start with an initial estimate of the calibration matrix and compute the best update matrix $K_u$ such that

K„,  =  K.K„  (68) 

If the initial estimate is not close to the actual $K_m$, several update iterations may be required before the estimate stabilizes. Although the least-squares algorithm for computing $K_u$ is linear, the procedure for estimating the directions $v_k$ by approximating $U$ as an orthonormal matrix is not.
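The conditioning problem described in this section can be made concrete with a small numerical sketch. It assumes a calibration matrix of the upper triangular form used in (67), with $f/s_x = 567$ and $x_0 = 378$ as quoted later in the text; the values $f/s_y = 484$ and $y_0 = 242$ are derived here from the sensor data in Section 5 and are an assumption of this example, as is the function name.

```python
import math

def uncalibrated_direction(K, t):
    """Unit vector along K t: the translation direction seen in raw pixel units."""
    y = [sum(K[i][j] * t[j] for j in range(3)) for i in range(3)]
    n = math.sqrt(sum(c * c for c in y))
    return [c / n for c in y]

# Assumed calibration matrix in pixel units (see the lead-in for sources).
K_pixels = [[567.0, 0.0, 378.0],
            [0.0, 484.0, 242.0],
            [0.0, 0.0, 1.0]]

# A translation with equal x, y and z components.
t_diag = [1.0 / math.sqrt(3.0)] * 3
v_diag = uncalibrated_direction(K_pixels, t_diag)
```

Even though the translation has equal components, the z component of `v_diag` comes out below $10^{-3}$, which is why the vectors $v_k$ computed directly in pixel coordinates make $V$ nearly singular.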

4.2 Parameter Sensitivity

To analyze the effect of error on estimating the different components of $K_m$, it is easiest to look at how errors in $K_m$ affect the computed vectors $v_k$. First consider the effects of error in the scale factors and the principal point, assuming that the affine transformation and $R_o$ are both the identity and that $K_m$ has the form of equation (2). Let

$$K_m = \begin{pmatrix} e_x f/s_x & 0 & e_x (x_0 + \delta_x) \\ 0 & e_y f/s_y & e_y (y_0 + \delta_y) \\ 0 & 0 & 1 \end{pmatrix} = \begin{pmatrix} e_x & 0 & e_x \delta_x \\ 0 & e_y & e_y \delta_y \\ 0 & 0 & 1 \end{pmatrix} K_m^0 = K_{err} K_m^0 \qquad (69)$$

From (40)

$$\alpha_k v_k = K_m t_k = K_{err} K_m^0 t_k = \alpha_k^0 K_{err} v_k^0 \qquad (70)$$

where $K_m^0$, $\alpha_k^0$ and $v_k^0$ represent the error-free values of $K_m$, $\alpha_k$ and $v_k$, respectively.

Clearly, errors in scaling, represented by $e_x \neq 1$ and $e_y \neq 1$, will significantly affect the computed $v_k$. Suppose then that $e_x = e_y = 1$ and consider now the effect of error in $x_0$, $y_0$ represented by $\delta_x$, $\delta_y$. We have

$$\alpha_k v_k = \alpha_k^0 \begin{pmatrix} v_{kx}^0 + \delta_x v_{kz}^0 \\ v_{ky}^0 + \delta_y v_{kz}^0 \\ v_{kz}^0 \end{pmatrix} \qquad (71)$$

By the same argument as before, $v_{kz}^0$ will always be much smaller than the other components of $v_k^0$. From (71) we can see that errors in the principal point will have little effect on computing $v_k$; and they will have no effect unless the z component of $t_k$ is significantly different from zero.

These arguments can be turned around to conclude that we should be able to estimate the diagonal scale factors reasonably well, but that much greater precision in computing $v_k$ is required in order to obtain an accurate estimate of the principal point.

Now consider the effects of the affine transformation and the offset rotation $R_o$. A simple example illustrates the problems caused by these matrices for estimating the other parameters. Suppose the affine transformation is the identity and $R_o$ represents a small rotation of $\theta$ about the y-axis, so that we can express $K_m$ as

$$K_m = \begin{pmatrix} f/s_x - x_0 \theta & 0 & (f/s_x)\theta + x_0 \\ -y_0 \theta & f/s_y & y_0 \\ -\theta & 0 & 1 \end{pmatrix} \qquad (72)$$

Since $\theta$ is small, the elements in the lower triangle of $K_m$ are also small relative to the other elements of $K_m$, and hence will be difficult to estimate accurately. They can, however, significantly affect the estimates of the principal point and diagonal scale factors, if they are ignored. With $f/s_x = 567$ and $x_0 = 378$, which are the values supplied by the manufacturer of the camera and lens used in the experiments, a rotation of 1° would cause a 10 pixel difference in the estimate of $x_0$ and a difference of 7 in the estimate of $f/s_x$.

In conclusion, we see that it is difficult to estimate any of the individual parameters that compose $K_m$ with great accuracy. The goal of the method presented in this paper, however, is not to recover the individual parameters but rather $K_m$ itself. We do expect that the matrix found by the algorithm will be approximately upper triangular of the form (2), and the fact that it is can be used as a partial confirmation that the result is reasonable. The real test of correctness, however, will be how well the computed motion given $K_m$ fits the known motion of the camera and how closely subsequent motion sequences, not included in the computation of $K_m$, can be predicted.
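The figures quoted in Section 4.2 for a 1° offset rotation can be reproduced with a few lines of arithmetic. The sketch below multiplies an upper triangular calibration matrix by an exact rotation about the y-axis, rather than using the small-angle form. The values $f/s_x = 567$ and $x_0 = 378$ come from the text, while $f/s_y = 484$ and $y_0 = 242$ are assumptions derived from the sensor data sheet values given in Section 5. The function name is hypothetical.

```python
import math

def with_offset_rotation(f_sx, f_sy, x0, y0, theta_deg):
    """Return K R, where R is a rotation by theta about the y-axis."""
    th = math.radians(theta_deg)
    c, s = math.cos(th), math.sin(th)
    K = [[f_sx, 0.0, x0], [0.0, f_sy, y0], [0.0, 0.0, 1.0]]
    R = [[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]]
    return [[sum(K[i][k] * R[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

K_rot = with_offset_rotation(567.0, 484.0, 378.0, 242.0, 1.0)
shift_x0 = K_rot[0][2] - 378.0    # apparent change in the principal point
shift_fsx = K_rot[0][0] - 567.0   # apparent change in the scale factor
```

The shifts come out near +9.8 and -6.7, in line with the differences of about 10 pixels and 7 quoted above, while the lower triangular entries that cause them stay small.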


5  Experimental  Results 

The algorithm was tested on a Cohu digital camera rigidly mounted on a movable carriage which could be translated along a fixed rail. The carriage assembly could be rotated on both the vertical and horizontal axes so that the camera could be oriented in any direction as the assembly was moved along the rail. Positional accuracy was better than 0.1° on each axis.

Figure 1: Motion sequence #3, $t_m$ = (0.866, 0, -0.5). Top: original images. Bottom: edge maps with numbered matched points.

Digitized image data was read directly from the camera over an Sbus interface to a Sparc workstation. The advantage of the digital camera is that each pixel corresponds exactly to a sensor location. Since there is no frame grabber in the path to resample and resize the data, the calibration matrix should be very close to that given by the manufacturer's data on the image sensor. According to the data sheets, the sensor is 6.4mm x 4.8mm with 756 (horizontal) and 484 (vertical) pixels. A 4.8mm lens was used to give an approximately 80° field of view. This information was used to generate the initial calibration matrix given in Table 3.
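As a rough illustration of how data sheet values of this kind translate into a calibration matrix, the sketch below builds an upper triangular matrix from the focal length, sensor size and pixel counts. It assumes the principal point lies at the center of the array and that the matrix has the form used in (69). It is not claimed to reproduce the entries of Table 3, and the function names are hypothetical.

```python
import math

def initial_calibration(focal_mm, sensor_w_mm, sensor_h_mm, n_cols, n_rows):
    """Upper triangular calibration matrix in pixel units."""
    s_x = sensor_w_mm / n_cols    # horizontal pixel spacing (mm)
    s_y = sensor_h_mm / n_rows    # vertical pixel spacing (mm)
    x0 = n_cols / 2.0             # assumed principal point: array center
    y0 = n_rows / 2.0
    return [[focal_mm / s_x, 0.0, x0],
            [0.0, focal_mm / s_y, y0],
            [0.0, 0.0, 1.0]]

def diagonal_fov_deg(focal_mm, sensor_w_mm, sensor_h_mm):
    """Field of view measured across the sensor diagonal, in degrees."""
    half_diag = 0.5 * math.hypot(sensor_w_mm, sensor_h_mm)
    return 2.0 * math.degrees(math.atan(half_diag / focal_mm))

K0 = initial_calibration(4.8, 6.4, 4.8, 756, 484)
fov = diagonal_fov_deg(4.8, 6.4, 4.8)
```

With the numbers above this gives $f/s_x = 567$ and $x_0 = 378$, matching the values quoted in Section 4.2, and a diagonal field of view of roughly 80°.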

Twelve pairs of images were taken for the relative motions given in Table 1. Not all of these were purely translational motion. Sequences #6 and #10 combined a y-axis rotation of 5°, and sequence #9 included a 5° rotation about z. Point correspondences were found using an improved version of the matching procedure described in [Dron, 1992]. With this method, the binary edge map of one of the images is divided into blocks which are individually shifted across the edge map from the other image in search of the offset which gives the best alignment. Only the most reliable matches are kept. For each pair of images in these sequences there were generally 20 to 30 correspondence points retained. Figure 1 shows one of the image pairs (sequence #3) along with the binary edge maps upon which the correspondence points are marked. One can see from these images that there is a fair amount of nonlinear distortion in the periphery produced by imperfections in the small focal-length lens.

The improvements made to the matching procedure in [Dron, 1992] were to reject any match which did not have a well-localized peak and to use knowledge of the translation direction to shape the search window. Since the procedure compares 24x24 blocks, there is an inherent uncertainty of ~3-4 pixels caused by defining the location of the correspondence to be at the centers of the blocks. However, no adjustments were made to the locations given by the matching procedure.

Table 1: Camera motions used to generate test image pairs. [table entries not recoverable]

Table 2: Computed translation directions for test image pairs from calibration matrix of Table 4. Values are listed in the order they survive; [illegible] marks an unreadable entry, and rows with fewer than twelve values have lost entries.

Sequence: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12
$v_x$: 0.9993, [illegible], [illegible], 0.7007, 0.9081, 0.4921, 0.0137, -0.0029, -0.0107, 0.6977, 0.5564, 0.2804
$v_y$: [illegible], 0.9994, 0.0100, -0.7120, 0.0500, 0.0163, 0.8564, 0.4754, 0.0131, -0.7071, -0.6760
$v_z$: [illegible], 0.0350, 0.9999, -0.0467, -0.4159, [illegible], 0.5161, 0.8798, 0.9999, -0.1150
$\Delta_k$: 0.0004, 0.0003, 0.0000, 0.0028, 0.0001, 0.0002, 0.0002, 0.0001, 0.0033, 0.0019, 0.0033

Uncalibrated translation directions were computed for each sequence from the point correspondences found by the matching procedure using the relative motion algorithm described in [Dron, 1992], which is a modified version of Horn's algorithm [Horn, 1991]. This algorithm produces a least-squares estimate of the translation direction assuming the matrix $U$ from (26) is orthonormal. However, since it is a nonlinear optimization problem, it can get stuck in local minima and produce erroneous results. Two steps were taken to ensure that the uncalibrated translation directions computed from the algorithm were reliable. First, the motion was estimated twice. After the first pass, the errors in the coplanarity constraint (31) were computed, and every correspondence pair whose error was greater than one standard deviation above the mean squared error was removed from the list. The translation direction used in the calibration algorithm was the one computed from the remaining correspondence points with the least error.
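The rejection rule used after the first pass can be sketched as a simple threshold on the per-correspondence errors. This is one possible reading of the description, not the paper's code: it assumes the errors are already squared coplanarity residuals and that the population standard deviation is intended. The function name is hypothetical.

```python
import statistics

def keep_inliers(squared_errors):
    """Indices of correspondences whose error is at most mean + 1 std."""
    mean = statistics.mean(squared_errors)
    std = statistics.pstdev(squared_errors)   # population std (assumption)
    threshold = mean + std
    return [i for i, e in enumerate(squared_errors) if e <= threshold]

# Example: the last correspondence is a gross mismatch and is dropped.
kept = keep_inliers([0.10, 0.20, 0.15, 0.12, 5.00])
```

The motion would then be re-estimated from the retained correspondences only.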

The second step taken was to judge the reliability of the result from the condition number of the translation matrix. At each iteration of the relative motion algorithm, a matrix of the form of $Y$ in equation (38) is computed, given the matrix $U$. The estimated translation $v$ is the eigenvector corresponding to the smallest eigenvalue of this matrix in the final iteration. If the condition number of this matrix is small, the estimate of $v$ is very sensitive to error, and the eigenvector reported for the least eigenvalue may in fact be perpendicular to the correct translation direction. Hence any sequence for which the condition number was < 1000 was categorically rejected and was not used to compute the calibration matrix.

The results of applying the least-squares algorithm for camera calibration to the data from these sequences are given in Tables 4-7. Note that both $K_m$ and $K_m^{-1}$ as given in the tables have been normalized so that their (3,3) element is 1. Hence they are not exact inverses of each other.

The results of two experiments are shown. In the first, the calibration matrix given in Table 3, which was derived from the manufacturer's data, was used as an initial estimate. An update matrix was computed from sequences 1-8, and the result is given in Table 4. Estimated translation directions for all 12 motion sequences were then computed by the relative motion algorithm using this new matrix, and the results are given in Table 2. The prediction error, computed from $\Delta_k = (1 - v_k^T t_k)/2$, based on the actual motion $t_k$ is also given at the bottom of the table. As expected, the error is smallest for vectors 1-8 which were used to compute the new calibration matrix. However, the errors for the four sequences, 9-12, which were not used are also reasonably small.
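The prediction error $\Delta_k = (1 - v_k^T t_k)/2$ ranges from 0 for identical directions to 1 for opposite ones, and since $(1 - \cos\phi)/2 = \sin^2(\phi/2)$ it converts directly to the angle $\phi$ between the two vectors. A minimal sketch follows; the normalization of the inputs and the function names are additions made for this example.

```python
import math

def prediction_error(v, t):
    """Delta = (1 - v.t) / 2 for the unit vectors along v and t."""
    nv = math.sqrt(sum(c * c for c in v))
    nt = math.sqrt(sum(c * c for c in t))
    dot = sum(a * b for a, b in zip(v, t)) / (nv * nt)
    return (1.0 - dot) / 2.0

def error_to_angle_deg(delta):
    """Angle between the two directions, using Delta = sin^2(angle / 2)."""
    return math.degrees(2.0 * math.asin(math.sqrt(delta)))
```

For instance, an error of 0.0033, the largest value that survives in Table 2, corresponds to an angular difference of about 6.6°.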

The second experiment consisted of deriving the calibration matrix using the iterative method described in Section 4 without any prior knowledge of the camera system. Tables 5-7 show the results after iterations 1, 3, and 6, starting with the identity matrix as the initial estimate. By the 6th iteration the result is very close to that of Table 4, which was computed starting from the manufacturer's data.

6  Conclusions  and  Future  Work 

The least-squares method derived in this paper for computing camera calibration has been shown to work well for the real image sequences given in the last section. These results are very encouraging; however, additional tests should be conducted on different scenes and different cameras to confirm the general applicability of the method.

An eventual goal is to implement the automatic calibration procedure in hardware as part of a larger navigation system. This was a primary motivation for developing a method which could tolerate substantial error in the correspondences. Before doing so, however, a more extensive analysis should be performed to determine how much error in the data can be tolerated as well as to determine more precisely how nonlinear distortion in the lens affects the results.
Table 3: Initial calibration matrix derived from manufacturer's data on the image sensor. [matrix entries not recoverable]

Table 4: Calibration matrix computed from sequences 1-8 starting with the matrix of Table 3 as the initial estimate. [matrix entries not recoverable]

Table 5: Calibration matrix computed from sequences 1-8 after 1st iteration starting from the identity matrix as the initial estimate.

$K_m$:
[first row illegible]
-16.52  1120.0  97.4
0.3  0.2  1.0

$K_m^{-1}$:
[first row illegible]
0.000025  0.000769  -0.084876
-0.000191  -0.000168  1.000000

Table 6: Calibration matrix computed from sequences 1-8 after 3rd iteration starting from the identity matrix as the initial estimate.

$K_m$:
[first row illegible]
-111.0  632.5  217.7
-0.2  -0.1  1.0

$K_m^{-1}$:
0.001103  -0.000019  -0.341880
0.000097  0.001717  -0.404238
0.000281  0.000101  1.000000

Table 7: Calibration matrix computed from sequences 1-8 after 6th iteration starting from the identity matrix as the initial estimate.

$K_m$:
[first row illegible]
17.6  646.7  240.6
0.0  0.0  1.0

$K_m^{-1}$:
0.001146  -0.000004  -0.415092
-0.000022  0.001533  -0.360741
-0.000024  -0.000037  1.000000
Fast and Robust 3D Recognition by Alignment


T. D. Alter and W. Eric L. Grimson
MIT AI Lab & Dept. of EECS
Cambridge, Massachusetts 02139


Abstract

Alignment is a prevalent approach for recognizing three-dimensional objects in two-dimensional images. Current implementations handle errors that are inherent in images in ad hoc ways. These errors, however, can propagate and magnify through the alignment computations, such that the ad hoc approaches may not work. This paper gives a technique for tightly bounding the propagated error, which can be used to make the recognition robust while still being efficient.

Previous analyses of alignment have indicated that the approach is sensitive to false positives, even in moderately-cluttered scenes. But these analyses applied only to point features, whereas almost all alignment systems rely on extended features, such as line segments, for verifying the presence of a model in the image. We derive a new formula for the “selectivity” of a line feature. Then, using the technique for computing error bounds, it is demonstrated experimentally that the use of line segments significantly reduces the expected false positive rate. The extent of the improvement is that an alignment system that correctly handles propagated error is expected to remain reliable even in substantially-cluttered scenes.

1  Introduction 

Object recognition involves determining which of a set of known objects are in an image and where they are. Many systems use rigid objects, modeled by geometric features such as lines and points. Given a set of such object models, the task of model-based recognition is to find correspondences between model features and their projections in the image. To find large sets of correspondences, many approaches begin with minimal sets, i.e. sets with sufficient matches to transform a model to the image, and try to extend them. Backtracking search starts from each minimal set and repeatedly uses the current matches to constrain the search for an additional match, backtracking whenever an inconsistency is found (e.g., [6, 13, 14]). Transform clustering uses every minimal set to compute a model-to-image transformation, then counts how often each transformation is repeated (e.g., [4, 17, 7]). Alignment methods use each minimal set to transform all the model features into the image, then look near each predicted model feature for a matching image feature (e.g., [8, 3, 5, 15, 16]).

Here, we focus on alignment, analyzing the effects of uncertainty in the image features. To be robust to such uncertainty, an alignment system needs to know, for each predicted model feature, where in the image to search for a matching image feature. If the regions searched are smaller or larger than necessary, correct matches may be missed (false negatives) and incorrect matches may be accepted (false positives). Assuming “weak-perspective” (i.e. scaled orthographic) projection and a “bounded error” model of uncertainty in the image points, we give a method for computing the correct search regions for point and line features of a 3D model, given a set of 3 corresponding model and image points, since 3 is minimal [15].

Even if the correct search regions are known, there may still be false positives, i.e., image features appearing at random in the search regions, causing incorrect identifications of models in the image. Consequently the reliability of an alignment system depends largely on its false positive rate. We use the search regions to compute the probability of a false positive and then use the probability of a false positive to compute limits on the amount of scene clutter that alignment can handle. We also examine how much of the model must be matched to keep the probability of a false positive low.

The approach we consider uses points for generating minimal sets (hypotheses) and then verifies the hypotheses using extended features as well as points (as in [15]). Specifically, Section 2 describes a theory for analyzing the false-positive sensitivity of an alignment system that uses points or line features for verification. The main result is a formula for the “selectivity” of a line feature. Section 3 computes the propagated uncertainty regions for point features, which give the regions for line features (Section 2.1). We show that the regions for points can be computed quickly and accurately by fitting circles to their sampled boundaries, assuming a fast solution for the image position of a predicted model point given 3 corresponding model and image points (as in [1]). Section 4 uses the formulas for selectivity and false positives (Section 2) and the technique for computing propagated error bounds (Section 3) to estimate the improvement gained by using line features for verification, and to judge how sensitive alignment is to false positives.

Previous work has shown that the propagated uncertainty regions for 3D point features can be computed exactly for flat models [16]. For 3D solid models, [11] uses numerically computed bounds on the parameters of the model-to-image transformation to estimate conservative bounds on the propagated regions. Analyses of false positive rates have been provided for recognition involving 2D models and 2D data for both point and line features [9, 10]. Using point features only, similar analyses have been applied to recognition involving 3D models and 2D data for flat models [12], and also for solid models using the bounds of [11].

2  Analyzing  the  False-Positive 
Sensitivity  of  Alignment 

2.1  Propagated  uncertainty  regions 

A match of three model and image points determines the image position of an unmatched model point up to a finite number of solutions [8, 15]. Any uncertainty in the locations of the matched image points propagates to uncertainty in the predicted position of an unmatched model point. Errors in the sensed or nominal locations of the image points are assumed to be bounded by circles of radius $\epsilon$. Then, as the three image points move independently around their $\epsilon$-circles, the fourth model point traces out a region of possible image locations. Any image point within $\epsilon$ of this region is a possible match for the model point.

For flat objects, the propagated uncertainty region for a model point is a disc centered at the nominal location, whose radius depends on the affine coordinates of the nominal location with respect to the basis triple [16]. For solid models, we assume that the uncertainty region can be bounded accurately with an “uncertainty circle” centered at the nominal point. Section 3 demonstrates that this assumption generally is valid.

We can use this result to bound the uncertainty in predicted line segments. We assume that the orientation of an image line segment is constrained such that its endpoints lie within $\epsilon$ circles. Then, for each model line segment we calculate the uncertainty circles for its endpoints. Next, we find candidates for each model segment by gathering all image line segments that lie entirely within the uncertainty region formed by the uncertainty circles and their common outer tangents (Fig. 1) and whose extensions intersect both of the uncertainty circles.
2.2 Selectivity of line features

A system's false positive rate depends on the uncertainty regions' selectivity [12], i.e. the probability that a region contains a spurious feature at random. For points, the selectivity is the uncertainty region's area divided by the area of the image. For lines, we need the fraction of positions and orientations of a spurious segment in the image that put the segment in the region.

Non-overlapping uncertainty circles

There is a set of translations that places a line segment of known length and orientation within the image, also given by its configuration space [18]. The line selectivity is the fraction of these translations that place the segment within the line uncertainty region, and can be obtained by shrinking both the region and the image along the segment's orientation and by its length. For shrinking the region along the segment's orientation, the shaded regions in Fig. 2 show two cases, distinguished by the image segment's orientation relative to that of the common outer tangent, $\theta_1$. An image segment's orientation is bounded by the orientation of the crossed common tangent, $\theta_2$. From Fig. 3,

$$\theta_1 = \sin^{-1} \frac{R - r}{L}, \qquad \theta_2 = \sin^{-1} \frac{R + r}{L},$$

provided $L > R - r$ for $\theta_1$ and $L > R + r$ for $\theta_2$. If the circles don't overlap, then $L > R + r$.

The set of translations is further constrained by the segment's length, obtained by shrinking the shaded region in Fig. 2 by this length. To compute the selectivity, we need the area of the shrunken region. While this area can be computed exactly [2], we use a more convenient rectangular box as an overestimate (Fig. 4). For comparison, Fig. 2 shows the box surrounding each corresponding line uncertainty region. From Fig. 4, after shrinking the rectangle along the base by $\ell$, the region's area is

If $\theta \in [0, \theta_1]$ and $\ell < R + r + L \cos\theta$,

$$A \approx (R + r + L \cos\theta - \ell)\, 2r,$$

If $\theta \in [\theta_1, \theta_2]$ and $\ell < R + r + L \cos\theta$,

$$A \approx (R + r + L \cos\theta - \ell)(R + r - L \sin\theta).$$

Note that $R + r - L \sin\theta > 0$, since $\theta < \theta_2 = \sin^{-1} \frac{R + r}{L}$.

For the image, the area of the region of translations for the same image segment is (Fig. 5)

$$A_I = (w - \ell \cos\theta)(h - \ell \sin\theta)$$

The selectivity of a random line segment of known length and orientation is $A / A_I$.

In  general,  there  will  be  several  line  segments 
with  different  lengths  and  orientations  that  fall 
within a line uncertainty region. To account for
orientation, we assume that random line segments
are equally likely to fall at any angle. We then
integrate $A$ and $A_I$ over their respective ranges of
allowable orientations to get volumes of allowable
positions of a random line segment (with known
length). Integrating the two area formulas over an
arbitrary range $[\omega_1, \omega_2]$ gives:

$$v_1(\omega_1, \omega_2) = (R + r - \ell)\,2r\,(\omega_2 - \omega_1) + 2rL(\sin\omega_2 - \sin\omega_1)$$

$$v_2(\omega_1, \omega_2) = (R + r - \ell)(R + r)(\omega_2 - \omega_1) + (R + r - \ell)L(\cos\omega_2 - \cos\omega_1) + (R + r)L(\sin\omega_2 - \sin\omega_1) - \tfrac{1}{2}L^2(\sin^2\omega_2 - \sin^2\omega_1)$$

From the area formulas, the range of $\theta$ is divided
into two intervals at $\theta = \theta_1$, and the length of the
image segment constrains the range of orientations
such that $\cos\theta > \frac{\ell - R - r}{L}$, or

$$\theta < \phi = \cos^{-1}\frac{\ell - R - r}{L}.$$

Note that $\phi$ exists iff $R + r - L < \ell < R + r + L$. The
first inequality holds if the circles do not overlap,
and the second must be true for the image segment
to fit in the uncertainty region (Fig. 1). From this,
the volume $V$ that corresponds to the two area
formulas is given by

$$V = 2 \begin{cases} v_1(0, \phi) & \text{if } \phi < \theta_1, \\ v_1(0, \theta_1) + v_2(\theta_1, \phi) & \text{if } \theta_1 < \phi < \theta_2, \\ v_1(0, \theta_1) + v_2(\theta_1, \theta_2) & \text{if } \theta_2 < \phi, \end{cases}$$

where $\ell < R + r + L$. Integrating $A_I$ from $\theta = -\pi/2$
to $\theta = \pi/2$ gives

$$V_I = \pi w h - 2\ell(w + h) + \ell^2.$$

The selectivity equals $\mu = V / V_I$.

In summary, a model line segment's selectivity
can be computed as follows. Let $r$ and $R$, $r < R$,
be the radii of the uncertainty circles for the
endpoints of the line segment, computed using the
technique of Section 3 or, for planar models, using
the known analytic solution. Let $L$ be the distance
between the centers of the two circles, and let $\ell$ be
the expected length of the corresponding image
line segment. Then compute the selectivity $\mu$ from
the above formulas for $V$, $V_I$, $v_1$, $v_2$, $\theta_1$, $\theta_2$, and $\phi$.
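
The summary above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, `width` and `height` are taken to be the image dimensions w and h, the selectivity is taken to be the ratio of the model volume V to the image volume V_I, and only the non-overlapping case (L > R + r) is handled.

```python
import math


def v1(w1, w2, R, r, L, ell):
    """Integral over [w1, w2] of (R + r + L*cos(t) - ell) * 2r."""
    return ((R + r - ell) * 2 * r * (w2 - w1)
            + 2 * r * L * (math.sin(w2) - math.sin(w1)))


def v2(w1, w2, R, r, L, ell):
    """Integral over [w1, w2] of (R + r + L*cos(t) - ell) * (R + r - L*sin(t))."""
    return ((R + r - ell) * (R + r) * (w2 - w1)
            + (R + r - ell) * L * (math.cos(w2) - math.cos(w1))
            + (R + r) * L * (math.sin(w2) - math.sin(w1))
            - 0.5 * L ** 2 * (math.sin(w2) ** 2 - math.sin(w1) ** 2))


def line_selectivity(R, r, L, ell, width, height):
    """Ratio V / V_I for a segment of length ell (assumed definition)."""
    if not (0 < r <= R and L > R + r):
        raise ValueError("sketch expects 0 < r <= R and non-overlapping circles")
    if not (0 < ell < R + r + L):
        raise ValueError("a segment of this length cannot fit in the region")
    theta1 = math.asin((R - r) / L)      # common outer tangent
    theta2 = math.asin((R + r) / L)      # crossed common tangent
    phi = math.acos((ell - R - r) / L)   # orientation limit set by the length
    if phi < theta1:
        half = v1(0.0, phi, R, r, L, ell)
    elif phi < theta2:
        half = v1(0.0, theta1, R, r, L, ell) + v2(theta1, phi, R, r, L, ell)
    else:
        half = v1(0.0, theta1, R, r, L, ell) + v2(theta1, theta2, R, r, L, ell)
    V = 2.0 * half                       # factor 2 covers negative orientations
    V_I = math.pi * width * height - 2.0 * ell * (width + height) + ell ** 2
    return V / V_I
```

For example, `line_selectivity(10.0, 5.0, 100.0, 80.0, 640.0, 480.0)` evaluates the third case of V, since the length limit on orientation is then looser than the crossed-tangent limit.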

This method assumes that the image line segment's
length is known. Because in real images
all lengths are not equally likely, we cannot integrate
out the length dependence. Instead, if we
can estimate the percentage $\alpha$ of the model that
is occluded, then we can account for occlusion by assuming
that each model segment is occluded by $\alpha$,
giving $\ell = (1 - \alpha)L$ [10].

Overlapping  uncertainty  circles 

Occasionally, the uncertainty circles may overlap,
either by intersection ($R - r < L < R + r$)
or inclusion ($L < R - r$). For these situations,
[2]  derives  similar  volume  formulas,  which  are  a 
little  more  complex  and  handle  the  few  additional 
cases  that  arise  over  the  range  of  0. 

2.3  Likelihood  of  false  positives 

The false positive rate of an alignment system can
be computed from the expected selectivity of a
feature ([10, 11, 12], also Section 4). To compute
the false positive rate, let $\bar{\mu}$ be the expected selectivity,
let $s$ be the number of unmatched image
features, let $m$ be the number of unmatched model
features, and let $m'$ be the number of model point
features that are used for generating hypotheses.
If the $s$ image features occur independently and
at random, the probability of at least one image
feature appearing in a propagated region with selectivity
$\bar{\mu}$ is

$$p = 1 - (1 - \bar{\mu})^s.$$

The probability of at least $k$ of the $m$ propagated
regions having at least one random feature is

$$w_k = 1 - \sum_{i=0}^{k-1} \binom{m}{i} p^i (1 - p)^{m-i}. \qquad (1)$$

$w_k$ is the probability of a false positive of size $k$.
If we match a fixed image triple to all possible
model triples, the probability that at least one
match leads to a false positive of size $k$ is

$$e_k = 1 - (1 - w_k)^{\binom{m'}{3}}. \qquad (2)$$
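
As a rough numerical sketch of these false positive formulas, the following Python computes p, the binomial tail w_k, and e_k. The function names are hypothetical, and two points are assumptions rather than statements from the text: w_k is evaluated as the upper binomial tail (the sum from i = k to m, which equals one minus the lower tail), and the number of model triples is taken to be the binomial coefficient of m' over 3.

```python
import math


def prob_region_hit(mu_bar, s):
    """p: chance that at least one of s random features lands in a region."""
    return 1.0 - (1.0 - mu_bar) ** s


def prob_false_positive(k, m, p):
    """w_k: chance that at least k of m regions each contain a random feature.

    Summing the upper tail directly avoids the cancellation in 1 - (lower tail).
    """
    tail = sum(math.comb(m, i) * p ** i * (1.0 - p) ** (m - i)
               for i in range(k, m + 1))
    return min(1.0, tail)


def prob_any_false_positive(k, m, m_prime, mu_bar, s):
    """e_k: chance that some model triple yields a false positive of size k."""
    w_k = prob_false_positive(k, m, prob_region_hit(mu_bar, s))
    n_triples = math.comb(m_prime, 3)  # assumed count of model triples
    return 1.0 - (1.0 - w_k) ** n_triples
```

For very large m the binomial coefficients can exceed the floating-point range; a log-domain sum would be a safer choice in that regime.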

3  A  Study  of  Uncertainty  Regions 
for  3D  Point  Features 

To  use  these  false  positive  formulas,  we  need  the 
expected  selectivity  of  point  and  line  features. 
These  can  be  computed  from  the  propagated  un¬ 
certainty  regions  for  points,  which  we  assume  are 
circular.  This  section  experimentally  justifies  this 
assumption,  and  proposes  a  method  for  comput¬ 
ing  the  bounding  circles.  Specifically,  Section  3.1 
shows  that  the  uncertainty  regions  generally  can 
be  approximated  accurately  with  “uncertainty  cir¬ 
cles”  centered  at  the  nominal  points,  although  at 
times  the  shapes  of  the  uncertainty  regions  can  be 
complex  (Section  3.2).  Section  3.3  demonstrates 
that  the  uncertainty  circles  can  be  computed  pre¬ 
cisely  with  a  small  amount  of  numerical  sampling, 
so  that  the  simple  approach  of  sampling  is  both 
accurate  and  efficient. 

3.1  Comparing  the  shapes  of  uncertainty 

regions  to  circles 

To  see  how  well  uncertainty  circles  bound  the  er¬ 
rors  in  the  image  locations  of  predicted  model 
points,  this  section  runs  two  experiments  that 
compare  the  true  regions  to  the  circular  fits.  The 
radii  of  the  circles  are  computed  using  the  maxi¬ 
mum  distance  from  the  nominal  point  to  a  bound¬ 
ary  point.  To  compare  regions,  we  use  the  fol¬ 
lowing  error  measure.  Let  At  equal  the  area  of 
the  true  region,  and  let  Ac  equal  the  area  of  the 
approximating  circle.  The  error  measure  is 


where  the  sign  is  used  to  discriminate  which  area  is 
larger.  Since  it  is  based  on  the  difference  in  areas, 
the  measure  will  be  large  when  the  fit  is  poor. 
Since  the  difference  in  areas  may  be  large  if  the 
perimeters  do  not  line  up  exactly,  the  measure  may 
also  be  large  when  the  fit  is  relatively  good.  Thus, 
the  measure  provides  a  conservative  estimate  of 
the badness of the circular fit.

Experiment 1: Accuracy of uncertainty circles
for random models

This  experiment  examines  how  often  we  can  ex¬ 
pect  the  uncertainty  circles  to  be  correct,  in  par¬ 
ticular,  how  often  the  maximum  error  is  1%,  or 
10%.  Also,  we  estimate  what  the  maximum  error 
will  be  90%,  95%,  and  99%  of  the  time.  To  do  this, 
we  run  a  series  of  trials  of  an  alignment  algorithm 
and  compute  the  error  measure  (Eq.  3)  for  each. 
The percent of time the error satisfies some criterion
is estimated by the fraction of trials over which
the error measure satisfies that criterion. For the
algorithm,  we  assume  that  the  image  points  effec¬ 
tively  arise  at  random,  which  is  reasonable  if  the 
image  has  significant  clutter. 

Method:  We  ran  100  trials  where  a  model  is  pro¬ 
jected  into  an  image  and  the  error  measure  of  Eq.  3 
is  computed  for  each  model  point.  In  each  trial,  a 
random triple of image points is matched to a random
triple of model points taken from a randomly
generated model (for details see [2]). The three-point
match is used to project the model into the
image, which gives the nominal image locations
of the model points. Except for model points in
the plane of the matched model points, there are
two possibilities for each nominal image location
[15, 1].

Using $\epsilon = 5$, the $\epsilon$-circles around the three
image points are sampled uniformly at 25 points
each. Every triple of sampled points is matched to
the three model points, and used to compute the
image locations of all the model points. This results
in a region in the image for each model point.
The area of each region is computed by counting
the pixels within the region's boundary (see [2]).
The radius of the corresponding uncertainty circle
is obtained by taking the maximum distance from
the nominal point to a boundary point.

As noted, there are two solutions for each pair
of model and image triples, which correspond to
a reflection about any plane parallel to the image
[15, 1]. From [1], let $H_1$ and $H_2$ be the differences
in the z coordinates between the first model point
and the second and third model points, respectively;
they are $-H_1$ and $-H_2$ for the reflected solution.
To distinguish the two solutions, we use the values
of $H_1$ and $H_2$ that occur when the matched
image points are at their nominal locations. If the
nominal $H_1$ is larger, we take all solutions with
the same sign for $H_1$ as being from the same region.
We do the opposite if the nominal $H_2$ is


larger.  This  method  separates  the  two  regions  cor¬ 
responding  to  the  two  weak-perspective  solutions, 
unless  they  overlap.  If  they  do  overlap,  there  re¬ 
ally  is  one  region,  and  this  method  splits  it. 

Results  and  Discussion:  Over  the  100  trials, 
1163  uncertainty  regions  were  tested.  The  aver¬ 
age  area  was  583.53  for  the  correct  uncertainty 
regions  and  662.43  for  the  approximating  circles. 
For  96.73%  of  the  uncertainty  regions,  the  error 
(using  the  error  measure)  between  the  true  region 
and  the  approximation  was  less  than  1%,  and  for 
97.94%  of  the  uncertainty  regions  the  error  was 
less than 10%. Also, the maximum error was 1%
for 90% of the regions, 1% for 95% of the regions,
10% for 98% of the regions, and 51% for 99%
of the regions. This suggests that uncertainty
circles  are  generally  very  accurate. 

Experiment  2:  Accuracy  of  the  uncertainty 
circles  for  the  telephone  model 

For  comparison,  we  ran  the  same  set  of  trials  on  a 
model  of  a  telephone  (Fig.  6). 

Method: The method is as in Experiment 1, but
using  the  telephone  model  at  every  trial. 

Results  and  Discussion:  For  100  trials  with 
the  phone  model,  1092  uncertainty  regions  were 
generated.  The  average  area  was  495.59  for  the 
correct  uncertainty  regions  and  450.13  for  the  ap¬ 
proximating  circles.  Notice  that  this  time  the  av¬ 
erage  area  for  the  overestimates  is  lower  than  for 
the  exact  areas.  This  is  because  the  method  used 
to  compute  the  true  regions  can  overestimate  them 
a  little  when  the  fit  is  good,  an  effect  which  turned 
out  to  be  stronger  than  the  overestimate  in  the  cir¬ 
cular  fit,  because  very  few  of  the  circular  fits  were 
poor.  Here,  for  98.01%  of  the  uncertainty  regions 
the  error  between  the  true  region  and  the  approx¬ 
imation  was  less  than  1%,  and  for  99.08%  of  the 
regions  it  was  less  than  10%.  The  maximum  er¬ 
ror  for  90%,  95%,  and  98%  of  the  regions  was  1%. 
Further,  for  99%  of  the  regions  the  maximum  er¬ 
ror  was  10%  instead  of  51%.  So  it  appears  the 
circular  fits  work  better  for  the  specific  model  of 
a  telephone. 

3.2  Cases  where  errors  are  greatest 

Of  the  100  trials  of  random  models,  two  had  er¬ 
rors  greater  than  25%.  Fig.  7  displays,  relative  to 
the  image,  the  uncertainty  regions  and  uncertainty 
circles corresponding to the largest errors for those
trials (78.8% and 81.7%). In both cases, the corresponding
angles between the matched model and
image  points  were  very  close.  Geometrically,  this 
means  the  plane  of  the  model  points  was  almost 
parallel  to  the  image,  a  situation  which  is  inher¬ 
ently  unstable  [1].  The  unusual  shapes  of  the  true 
uncertainty  regions  in  Fig.  7  are  due  to  the  com¬ 
putation  of  the  uncertainty  regions  (Section  3.1), 
and  represent  cases  where  the  two  regions  overlap 
and  are  split  in  two. 

For  the  phone  model,  only  one  trial  had  errors 
greater  than  25%,  specifically  27.0%.  The  uncer¬ 
tainty  region  and  uncertainty  circle  are  shown  in 
Fig.  7  (same  scale  as  other  examples). 

From the cases with large errors, we can infer
that,  in  an  alignment  system  that  tries  many  or 
all  pairs  of  point  triples  for  aligning  a  model  to 
the  image,  situations  with  large  errors  could  be 
avoided  by  checking  whether  the  angles  between 
the  points  are  similar.  However,  this  may  lead 
to relying on an arbitrary threshold. As a consequence,
it would be better to handle these cases
specially  by  sampling  extensively  and  then  walk¬ 
ing  the  boundaries  of  the  resulting  regions. 

3.3  Computing  uncertainty  circles 
efficiently 

Given  that  circles  centered  at  the  nominal  points 
approximate  well  the  uncertainty  region  bound¬ 
aries,  all  that  is  needed  is  to  compute  the  radii 
of  the  circles.  A  simple  approach  is  to  sample 
points  from  the  error  circles  around  the  matched 
image  points  and  take  the  maximum  distance  from 
the  predicted  nominal  point  as  the  radius.  This 
process  will  be  efficient  if  few  sample  points  are 
required.  This  section  infers  how  few  points  are 
needed.
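
The sampling approach can be illustrated with a small Python sketch. It is not the method used in the experiments: the weak-perspective pose solution of [1] is not reproduced here, so the sketch swaps in the planar case, where the three-point match determines an affine map and a fourth model point projects to the same barycentric combination of the three image points. All function names are hypothetical, and the samples are placed on the circle boundaries, which is an assumption about how the error circles are sampled.

```python
import math
from itertools import product


def barycentric(q, a, b, c):
    """Barycentric coordinates of 2D point q with respect to triangle (a, b, c)."""
    det = (b[0] - a[0]) * (c[1] - a[1]) - (c[0] - a[0]) * (b[1] - a[1])
    if det == 0:
        raise ValueError("model triple is collinear")
    beta = ((q[0] - a[0]) * (c[1] - a[1]) - (c[0] - a[0]) * (q[1] - a[1])) / det
    gamma = ((b[0] - a[0]) * (q[1] - a[1]) - (q[0] - a[0]) * (b[1] - a[1])) / det
    return (1.0 - beta - gamma, beta, gamma)


def circle_samples(center, eps, n):
    """n points spaced uniformly on the circle of radius eps around center."""
    return [(center[0] + eps * math.cos(2 * math.pi * j / n),
             center[1] + eps * math.sin(2 * math.pi * j / n)) for j in range(n)]


def predict(weights, image_triple):
    """Image location of the model point under the planar (affine) stand-in."""
    return (sum(wt * p[0] for wt, p in zip(weights, image_triple)),
            sum(wt * p[1] for wt, p in zip(weights, image_triple)))


def uncertainty_radius(model_triple, model_point, image_triple, eps=5.0, n=8):
    """Maximum distance from the nominal prediction over all sampled triples."""
    weights = barycentric(model_point, *model_triple)
    nominal = predict(weights, image_triple)
    rings = [circle_samples(p, eps, n) for p in image_triple]
    return max(math.dist(predict(weights, triple), nominal)
               for triple in product(*rings))
```

The cost grows as n cubed per model triple, which is why a small n matters: n = 8 needs 512 predictions per model point, against 15625 for n = 25.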

Experiment  3:  Using  fewer  sample  points 
for  random  models 

To  see  how  few  sample  points  are  needed,  this  ex¬ 
periment  tests,  for  various  numbers  of  points,  n, 
and  for  a  series  of  trials,  the  percent  of  time  (frac¬ 
tion  of  trials)  that  the  error  in  using  n  points  is 
less  than  some  limit.  This  is  compared  to  using  25 
points,  as  in  the  last  two  experiments. 

Method: A series of 100 trials was run using
random  image  triples  matched  to  random  model 
triples  from  randomly-generated  models,  as  in  Ex¬ 
periment  1.  For  each  trial,  the  error  circles  around 


the  matched  image  points  are  sampled  uniformly 
at  25  points  and  10  points.  For  each  propagated 
uncertainty  region,  the  error  in  using  the  smaller 
number  of  samples  to  using  25  samples  is  com¬ 
puted.  This  is  repeated  for  9,  8,  and  7  sample 
points  as  well. 

Results and Discussion: The results are shown
in  Table  1.  Note  that  the  percentages  do  not 
strictly  decrease  as  fewer  sample  points  are  used. 
This  is  because  the  circles  around  the  image  points 
are  sampled  uniformly,  so  that  using  different 
numbers  of  sampled  points  can  give  different  sam¬ 
ples  on  the  circles.  Hence,  when  the  percentages 
are  close,  there  may  be  cases  where  fewer  sample 
points  do  better.  Nevertheless,  this  effect  should 
be small. Notice that the average percent error
does  indeed  increase  monotonically. 

We  can  use  Table  1  to  pick  a  reasonable  number 
of  points  for  sampling  the  image  error  circles.  If 
we  permit  5%  error,  then  using  8  sample  points 
instead of 25 should be accurate over 99% of the
time. Also, the average error in using eight points
is very small (1.137%).

A  better  feel  for  how  accurate  is  the  use  of  fewer 
sample  points  is  given  by  statistics  on  the  radii, 
shown  in  Table  2.  From  the  table,  the  average  dif¬ 
ference  in  the  radii  for  eight  sample  points  was  .08 
pixels, and the worst case difference was 3.24 pixels.
Relative to the radius for twenty-five points,
the  average  difference  is  .575%,  and  the  maximum 
difference  is  8.96%. 

Experiment  4:  Using  fewer  sample  points 
for  telephone  model 

Method: This experiment is the same as Experiment
3, except that the phone model is used.

Results and Discussion: Tables 1 and 2 give
the results. From Table 1, we again can use eight
points  to  limit  errors  to  5%  over  99%  of  the  time. 
From  both  tables,  it  appears  that  using  fewer  sam¬ 
ple  points  works  slightly  better  with  the  phone 
model  than  with  random  models. 

To illustrate the use of uncertainty circles, Fig. 8
shows  an  example  of  the  propagated  uncertainty 
circles,  where  eight  sample  points  were  used.  The 
three  smallest  circles  correspond  to  the  assumed 
errors  in  the  matched  image  points,  which  in  this 
example  were  matched  correctly.  For  the  un¬ 
matched  model  points,  the  other  circles  show  the 
regions to be searched for matching image points.
The self-occluded model points were removed beforehand.
Still, some of the remaining corner
points  are  occluded  by  other  objects,  and  the  un¬ 
certainty  regions  provide  a  means  to  reason  that 
this  is  so  after  a  relatively  small  amount  of  search 
in  the  image. 

Notice  that  the  sizes  of  the  propagated  un¬ 
certainty  regions  vary  considerably  for  different 
model  points.  Consequently,  an  approach  that  re¬ 
lies  on  fixed-sized  error  bounds,  as  in  [15],  can 
lead  to  correct  matches  being  missed  (when  the 
bounds  are  too  small),  and  incorrect  matches  be¬ 
ing  accepted  (when  the  bounds  are  too  large  and 
include  spurious  image  points). 

4  Measuring  the  Sensitivity  to 
False  Positives 

4.1  Expected  selectivity  of  point  features 

We  now  use  the  analyses  of  Sections  2  and  3  to 
examine  alignment’s  sensitivity  to  false  positives. 
For  point  features,  the  expected  selectivity  has 
been  used  before  to  analyze  false  positive  rates 
for  alignment  where  the  models  are  flat  [12],  and 
also  for  alignment  with  solid  models  but  using  a 
different  uncertainty  propagation  technique  [11]. 
We  can  use  the  expected  selectivity  to  compare 
the  uncertainty  propagation  technique  used  here 
to  the  one  in  [11]. 

For  flat  models,  the  propagated  uncertainty  re¬ 
gions  can  be  computed  exactly.  It  would  be  inter¬ 
esting  to  see  how  much  the  chance  of  a  false  pos¬ 
itive  increases  from  planar  to  solid  models,  since 
the  propagated  uncertainty  regions  are  larger  for 
points  out  of  the  plane  of  the  matched  model 
points  than  for  their  corresponding  points  in  the 
plane  [1] — for  a  3D  point,  the  corresponding  point 
in the plane is the intersection of the plane and
the  perpendicular  from  the  plane  to  the  3D  point. 
Again,  we  can  use  the  expected  selectivity  for  the 
comparison. 

Experiments  5  and  6:  Expected  selectivity 
of  point  features 

Method:  To  compute  the  expected  selectivity, 
we  re-ran  1000  trials  of  the  same  type  as  in  Ex¬ 
periments  3  and  4,  except  five  was  added  to  each 
radius before computing the area, in order to account
for expanding each uncertainty region outwards
by $\epsilon = 5$ pixels.


Results  and  Discussion:  Using  random  mod¬ 
els  with  8  sample  points  over  1000  trials  gave 
11349  propagated  regions  with  average  area  973.25 
square  pixels.  Using  the  phone  model  with  8  sam¬ 
ple  points  over  1000  trials  gave  11085  propagated 
regions  with  average  area  979.78  square  pixels. 
The resulting selectivities along with those for [11]
and  [12]  are  shown  in  Table  3. 

The  expected  selectivity  for  the  uncertainty  cir¬ 
cles  is  about  half  that  for  [11],  which  implies  that 
the  uncertainty  circles  should  give  significantly 
better  performance.  Furthermore,  it  appears  that 
the  selectivities  of  solid  models  are  only  slightly 
greater  than  for  planar  ones.  We  can  infer  from 
this that, when point features are used, recognizing
solid objects with alignment is only a little
more  sensitive  to  false  positives  than  recognizing 
planar  objects. 

4.2  Expected  selectivity  of  line  features 

Experiment  7:  Expected  selectivity  of  line 
features  for  the  telephone  model 

Method:  To  compute  the  expected  selectivity,  we 
used  the  formula  given  in  Section  2.2.  We  ran  a 
series  of  the  same  trials  from  Experiments  5  and 
6  when  the  selectivity  of  point  features  was  com¬ 
puted.  For  each  trial,  we  used  each  pair  of  uncer¬ 
tainty  circles  that  corresponds  to  a  line  segment 
in  the  telephone  model  (Fig.  6)  and  computed  the 
line segment selectivity. This was repeated for various
amounts of occlusion, $\alpha$.

Results  and  Discussion:  For  1000  trials,  the  se¬ 
lectivity  of  9560  line  uncertainty  regions  was  com¬ 
puted  and  averaged.  Table  4  gives  the  selec¬ 
tivities  for  different  amounts  of  occlusion.  As  ex¬ 
pected,  the  selectivities  for  lines  are  much  less  than 
for  points  (compare  to  Table  3). 

4.3  Limits  on  Scene  Clutter 

A  recognition  scheme  based  on  extended  model 
features  will  fail  if  a  scene  becomes  extremely  clut¬ 
tered.  It  would  be  useful,  then,  to  know  how  much 
clutter  a  recognition  system  can  accommodate  be¬ 
fore  the  probability  that  it  will  fail  is  significant. 
We  can  use  Eq.  2  to  estimate  this  limit.  Specifi¬ 
cally,  given  an  image  triple,  we  can  compute  the 
maximum value of $s$ such that $e_k < \delta$, where $\delta$ is
a preset limit, and $k = fm$ for some fraction $f$ of
the model features.

Table 5 shows the results for $\delta = .001$ (the .01
and .0001 cases are similar). The limits for the
uncertainty propagation technique of [11] are very
low. Although the numbers are greatly improved
using uncertainty circles, it is only when line segments
are used that the numbers of features are in
the range of images with substantial scene clutter
(about 500 features).
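
One way to compute such a clutter limit is a bisection on the number of image features, sketched below in Python. The names are hypothetical, and several choices are assumptions rather than details from the text: k is rounded up from the product of the fraction and m, the number of model triples is the binomial coefficient of m' over 3, the false positive probability of Eq. 2 is non-decreasing in the number of features, and the search is capped at `s_cap`.

```python
import math


def e_k(k, m, m_prime, mu_bar, s):
    """Eq. 2 with w_k evaluated as an upper binomial tail."""
    p = 1.0 - (1.0 - mu_bar) ** s
    w_k = min(1.0, sum(math.comb(m, i) * p ** i * (1.0 - p) ** (m - i)
                       for i in range(k, m + 1)))
    return 1.0 - (1.0 - w_k) ** math.comb(m_prime, 3)


def max_clutter(f, m, m_prime, mu_bar, delta, s_cap=100000):
    """Largest s (up to s_cap) whose false positive probability stays below delta."""
    k = max(1, math.ceil(f * m))  # assumed rounding of k = f * m
    if e_k(k, m, m_prime, mu_bar, s_cap) < delta:
        return s_cap
    lo, hi = 0, s_cap  # invariant: e_k(lo) < delta <= e_k(hi)
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if e_k(k, m, m_prime, mu_bar, mid) < delta:
            lo = mid
        else:
            hi = mid
    return lo
```

A call such as `max_clutter(0.5, 200, 200, 1e-3, 1e-3)` uses an illustrative selectivity, not a value from Tables 3 or 4.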

4.4  Threshold  for  Accepting  a  Partial 
Match 

When  the  extended  features  of  a  model  are  used 
for  verification,  we  want  to  know  how  many  must 
be  matched  before  we  can  stop  looking  for  more 
matches.  We  can  use  Eq.  1  to  set  a  threshold  such 
that  the  chance  that  a  false  positive  will  arise  is 
less than a preset limit. Let $f$ be the percentage
of model features that must be matched to keep
the probability of a false positive at most $\delta$. Substituting
$mf$ for $k$, we want to find the minimum
$f$ such that $w_{mf} < \delta$. Table 6 shows the results
for line segments. As a check on the method, the
recognition system of [15] used $f = .5$ as a threshold
on the percentage of the model to verify. In
the examples given, anywhere from 0 to 50% of
the model was occluded, and so the thresholds predicted
by the table are in approximate agreement
with  the  experimental  threshold  chosen. 
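
A termination threshold of this kind can be found by a direct scan over k, as in the following Python sketch. The function names are hypothetical, w_k is evaluated as an upper binomial tail, and the parameter values in the usage note are illustrative rather than taken from Table 6.

```python
import math


def w_k(k, m, p):
    """Probability that at least k of m regions contain a random feature."""
    return min(1.0, sum(math.comb(m, i) * p ** i * (1.0 - p) ** (m - i)
                        for i in range(k, m + 1)))


def termination_threshold(m, mu_bar, s, delta):
    """Smallest fraction f = k / m with w_k < delta, or None if there is none."""
    p = 1.0 - (1.0 - mu_bar) ** s
    for k in range(m + 1):
        if w_k(k, m, p) < delta:
            return k / m
    return None
```

For instance, `termination_threshold(200, 1e-4, 500, 1e-3)` returns the smallest fraction of 200 model features whose chance of arising from 500 random image features stays under the limit; `None` signals that even a full match could plausibly arise by chance.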

5  Conclusion 

An important criterion for a recognition system is
that  we  can  trust  it  when  it  decides  if  an  instance 
of  the  model  occurs  in  the  image.  Once  the  sys¬ 
tem  is  guaranteed  to  not  mistakenly  discard  any 
correct  hypotheses,  its  usefulness  is  determined  by 
its  sensitivity  to  false  positives.  This  paper  gave 
a  theory  for  analyzing  the  false-positive  sensitivity 
of  alignment-style  recognition  systems  (Section  2). 
The  main  contribution  is  a  formula  for  the  selec¬ 
tivity  of  line  features  (Section  2.2).  A  feature’s 
selectivity  can  be  used  to  infer  the  expected  per¬ 
formance  of  recognition  systems.  It  can  also  be 
used  to  set  a  threshold  on  how  much  of  a  model 
must  be  identified  in  an  image  before  the  object  is 
recognized  (Section  4.4). 

We  also  provided  an  error  analysis  of  point  fea¬ 
tures  for  alignment-style  recognition  of  3D  models 
from  2D  images  (Section  3).  The  earlier  analysis 
of [11] was conservative in its bounds on the propagated
uncertainty, and Section 3 showed we can


do  better.  In  fact,  the  analysis  is  almost  always  a 
solution,  which  means  its  bounds  are  exact,  except 
where  the  3D  pose  solution  is  inherently  unstable. 
In  these  cases,  the  bounds  conservatively  overesti¬ 
mate  the  exact  bounds. 

Even  though  the  error  propagation  technique  in 
Section 3 is generally accurate, the technique has
the  disadvantage  of  being  numerical.  For  most 
recognition  problems,  however,  the  time  to  com¬ 
pute  the  solution  is  effectively  constant,  as  though 
the  solution  were  analytic. 

These  contributions  tie  together  well  for  build¬ 
ing  a  fast  and  robust  alignment  system.  The  un¬ 
certainty  analysis  provides  the  correct  minimal 
search  regions  to  guarantee  that  no  correct  hy¬ 
potheses  are  lost.  Further,  the  uncertainty  regions 
can  be  computed  quickly  using  the  error  propaga¬ 
tion  technique  and  a  fast  solution  for  the  image 
position  of  an  unmatched  model  point.  Once  com¬ 
puted,  the  uncertainty  regions  usually  are  small 
enough  to  be  searched  rapidly  for  candidate  image 
features.  Then  the  current  hypothesis  can  be  eval¬ 
uated,  using  a  predetermined  threshold  on  the  per¬ 
centage  of  model  features  that  must  be  matched. 
Table 1: Percentage of time error was less than 1%-6% for different numbers of sample points, plus
average, maximum, and minimum percent errors over all trials. Top: using 1149 propagated uncertainty
regions from random models. Bottom: using 1113 uncertainty regions from the telephone model.


sample points   ave (pixels)   max (pixels)   min (pixels)   ave percent   max percent   min percent
     10              ?              ?             -.05            ?             ?            -.17
      9             .08            3.87           -.03           .521         17.30          -.18
      8             .08            3.24           -.05           .573          8.96          -.24
      7             .13            4.21           -.02           .844         13.70          -.16

     10             .05            0.69           -.05           .365          4.37          -.17
      9             .06            1.30           -.02           .459          6.83          -.17
      8             .07            1.12           -.03           .494          6.07          -.25
      7             .10            1.23           -.03           .772          6.62          -.16

Table 2: Differences in radii for different numbers of sample points. Top results are for random models,
bottom results are for the telephone.

Table 3: Expected selectivities of point features.

Table 4: Expected selectivities of line features for different amounts of occlusion, $\alpha$, using the telephone.


Figure 1: Region to search for candidate line segments

Table 5: Approximate limits on the number of sensory features for different amounts of occlusion $\alpha$
and different fractions $f$ of model features used. Table is for $\epsilon = 5$, $\delta = .001$, for lines $m = m' = 200$
(line uncertainty regions), and for points $m = 197$ and $m' = 200$ (uncertainty circles and [11]).

Table 6: Predicted termination thresholds for different amounts of occlusion $\alpha$, and for different limits
$\delta$ on the false positive probability. Table is for $\epsilon = 5$, $m = m' = 200$, and $s = 500$.


Figure  2:  Region  of  translations  with  orientation  constraint  and  rectangular  bound. 


Figure 3: Orientations of the common outer tangents (left) and the common cross tangents to the
circles (right), which give the maximum possible angle of a line segment with an endpoint in each circle.

Figure 7: Left: Largest errors for random models: 81.7% (upper left), 78.8% (bottom right). Right:
Largest error for the telephone model: 27.0%.


Figure  8:  Propagated  uncertainty  in  a  real  image,  which  was  provided  by  David  Jacobs.  The  three  smallest 
circles correspond to assumed errors in the matched image points, and, given those errors, the larger circles show
the sets of possible locations of the other corner points of the telephone.

Abstract

Calibration  of  cameras  and  manipulators  for 
robotic  tasks  is  a  difficult  and  sensitive  pro¬ 
cess.  We  present  a  technique  that  uses  ac¬ 
tive  camera  motion  to  recover  image  space 
properties  that  can  be  used  to  accurately 
control  and  position  a  robot  hand/eye  sys¬ 
tem  that  uses  an  uncalibrated  camera.  The 
algorithm  is  verified  by  an  experiment  where 
a  robot  completes  the  task  of  inserting  a  peg 
into  a  hole  with  an  error  of  3mm. 

This  work  was  supported  in  part  by 
DARPA  contract  DACA-76-92-C-007,  NSF 
grants  IRI-86-57151,  CDA-90-24735,  North 
American  Philips  Laboratories,  Siemens 
Corporation  and  Rockwell  International. 


1  INTRODUCTION 

In  many  real  world  applications,  there  is  a 
need  to  perform  alignment  tasks  between 
two  objects.  Two  simple,  generic  tasks 
are  inserting  a  peg  into  a  hole  and  align¬ 
ing  objects  into  arbitrary  geometric  config¬ 
urations  (e.g.  robotic  assembly  tasks.)  A 
key  component  of  this  problem  is  position¬ 
ing  where  there  is  little  room  for  mechan¬ 
ical  error.  The  idea  of  precision  measure¬ 
ment  (in  our  example,  alignment)  using  a 
mechanical  device,  photographic  emulsions 
or  photo-electric  sensors,  has  been  exam¬ 
ined  in  great  detail  by  the  researchers  in 
non-topographic photogrammetry. By using
models which account for most of the
aberration and lens defects in modern lenses,
they obtain highly precise calibrations of
their camera systems. For more information
see Karara [6]. These methods are often
difficult to understand and inconvenient to
use in most robotics environments. They
usually require the minimization of several
complex, non-linear equations of multiple
variables (of which the results are not guaranteed
to be robust). Other methods for
performing camera calibration for robots include
the works of Tsai [16, 15], Young et
al. [18], Bennett et al. [1], and Holt et al.
[4], for example.

Figure 1: View of camera, robot, and multiple
target setup

Figure 2: View of targets from camera

To  give  the  reader  an  idea  of  the  align¬ 
ment/insertion  task,  figure  1  shows  our  ex¬ 
perimental  setup.  Off  the  end  of  the  end 
effector  of  our  robot  is  a  probe  with  a  sharp 
tip  (the  “peg”.)  The  target  in  this  scene  is  a 
2mm  hole  in  the  machined  aluminum  block 
located  almost  directly  below  the  probe. 
Figure  2  shows  a  view  of  the  target  objects 
taken  from  the  camera  system.  In  this  fig¬ 
ure,  the  holes  in  the  machined  block  are 
more  easily  seen.  The  goal  of  the  task  is 
to  maneuver  the  probe  to  a  position  where 
it  is  directly  above  the  target,  and  then  to 
insert  the  probe  into  the  target. 

Another  class  of  methods  revolves 
around  the  depth  from  motion  paradigm. 
This  body  of  research  tries  to  recover  the 
absolute  pixel  velocity  for  objects  in  image 
space.  Here  too  the  researchers  are  search¬ 
ing  for  an  absolute  transformation  from  a 
known  reference  (the  velocity  of  a  known  ob¬ 
ject)  and  an  unknown  system  (the  actual, 
time- varying,  intensity  data).  The  method 
we  propose  does  not  require  the  absolute  po¬ 
sitional  information  that  both  of  the  afore¬ 
mentioned  systems  require.  It  uses  simple 
image  displacement  data  (generated  from 
the  movement  of  the  camera  system)  to  gen¬ 


erate  an  estimated  position  where  it  expects 
that  the  object  motion  will  be  minimized 
with  respect  to  the  camera  movement. 

Our technique takes the typical mapping
from 3-D positions to image coordinates,
and instead of finding this mapping,
it recovers a property of the image coordinates.
The traditional mapping problem
(known  as  the  calibration  problem)  deter¬ 
mines  the  position  of  objects  based  on  rela¬ 
tive  scale  difference,  perspective  distortion, 
and/or  several  other  properties  which  ex¬ 
ist  between  a  calibrated  system  and  an  ob¬ 
served  system.  These  positional  values  can 
be  obtained  from  both  static  and  dynamic 
systems.  These  methods  do  not  exploit  the 
fact  that  a  known  movement  in  the  camera 
system  can  result  in  useful  motion  informa¬ 
tion  in  the  image  system  without  knowing 
the  exact  calibration  between  the  systems. 

We  approached  the  problem  by  asking 
the  following  question:  How  can  I  get  a 
robot  to  perform  a  given  task  using  only  un- 
calibrated  visual  input  to  direct  the  robot’s 
actions?  In  many  cases,  it  is  not  neces¬ 
sary  for  the  robot  to  have  a  completely  cal¬ 
ibrated  work  area.  (It  is  not  necessary  to 
know  the  exact  positions  of  everything  in 
the  robotic  workspace.  It  may  be  more  im¬ 
portant  to  know  only  the  exact  position  of 
certain items.) We propose a new technique,
similar to the work of Sawhney [11, 12],
which will allow the robot system to
maintain  an  arbitrary,  geometric  relation¬ 
ship  with  an  object  system,  and  as  a  result  of 
certain  operations,  the  robot-object  system 
can  “calibrate”  itself  to  or  “can  define  its  lo¬ 
cation  with  respect  to”  the  unknown  camera 
system.  The  newness  of  our  technique  arises 
from  the  fact  that  our  system  performs  the 
useful  task  of  moving  to  the  goal  position 
without  ever  really  knowing  the  true  loca¬ 
tion  of  the  camera  system. 

2  OVERVIEW  OF  METHOD 

In  order  to  perform  the  peg-in-hole  insertion 
task,  we  broke  the  task  into  two  parts:  the 
alignment  task  and  the  actual  insertion  task. 
The  alignment  task  servos  the  end  effector 
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in  a  plane  in  robot  space  until  the  alignment 
condition  occurs  (that  being  when  the  ob¬ 
ject  to  be  servoed  to  and  the  end  effector 
lie on the same axis.) The insertion task
relies  on  the  fact  that  the  alignment  stage 
has  constrained  the  solution  to  lie  along  a 
line  (thus  making  the  insertion  task  simply 
a  one  degree  of  freedom  search.) 

A  simplified  setup  is  shown  in  figure  3. 
The  task  is  to  maneuver  the  end  effector  to 
a  position  directly  over  the  target  position. 

We  started  our  investigation  by  examin¬ 
ing  what  would  happen  if  we  attached  some 
sensing  system  to  the  rotational  axis,  such 
that  the  system  could  image  the  rotational 
axis.  We  noticed  the  following  effect  as  we 
servoed  the  rotational  joint  over  a  small  an¬ 
gle  (see  figure  4.)  Those  objects  which  were 


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Figure  4:  Demonstration  of  the  observed  ef¬ 
fect 

further  away  from  the  axis  of  rotation  moved 
a  greater  distance  than  those  points  closer  to 
the  axis.  This  effect  is  not  new  and  is  very 
similar  to  the  work  done  by  researchers  on 
the  analysis  of  the  Focus  of  Expansion  for 
time  to  impact  studies. 

We  make  the  simplifying  assumption 
that  the  objects  do  not  change  their  appear¬ 
ance  as  we  perform  the  rotation.  One  simple 
way  of  doing  this  was  to  use  point-like  tar¬ 
gets.  The  point-like  targets  rely  on  the  fact 
that  the  perspective  distortion  is  highly  lo¬ 
calized  due  to  the  fact  that  the  targets  have 
a  high  level  of  spatial  coherence. 

Using  the  effect  noticed  above,  we  trans¬ 
formed  the  alignment  problem  into  one  of  a 
positioning  problem  in  a  plane.  The  simpli¬ 
fication  is  justified  by  the  following  observa¬ 
tions: 

1.  The  initial  movements  of  the 
manipulator-camera  system  are  in  the 
X-Y plane. The projection of the Z-distance
to the object on the rotational
axis is kept constant.

2. Once the alignment has been performed
in X-Y space, the only movement necessary
is a pure translation along the
rotational axis (for our scenario, the Z-axis).

To  perform  our  peg-in-hole  insertions 
we  also  make  the  following  assumptions: 

• T5 (the transform from the world coordinate
system to the end-effector of the
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robot,  minus  the  last  rotational  degree 
of  freedom)  is  known. 

•  R  (the  last  degree  of  rotational  freedom) 
is  known. 

• Tcam, the camera transform matrix, is
unknown.

• P, the perspective effect introduced by
the camera system where the missing
parameter is λ, the focal length, is unknown.

•  Once  the  camera  was  positioned,  it  re¬ 
mained  fixed  with  respect  to  the  rota¬ 
tional  axis  of  the  6th  joint  of  the  PUMA. 
The  camera  was  not  allowed  to  move  in 
its  mount  nor  was  the  mount  allowed  to 
move  with  respect  to  its  mounting  point 
on  the  robot. 

•  The  camera  can  image  the  target  object 
during  any  servoing  operation.  There 
should  not  be  a  time  where  the  object 
leaves  the  focal  plane. 

• and finally, Tobject is unknown.

The following constraints were not necessary:

•  intersection  of  the  optical  and  rotational 
axes.  Proof:  If  you  imagine  the  system 
setup  (camera,  extending  rod,  gripper 
and  task  system)  where  each  piece  is  im¬ 
movable,  you  will  notice  that  the  rota¬ 
tional  axis  projects  to  a  line  in  the  cam¬ 
era  imaging  area.  This  line,  by  defini¬ 
tion  of  the  rigid  system,  cannot  change. 
Its  position  is  dictated  by  a  fixed  pro¬ 
jection  and  since  the  line  simply  rotates 
around  its  own  symmetrical  axis,  its  po¬ 
sition  does  not  change.  Note:  if  the  ro¬ 
tational  axis  is  seen  to  have  translated, 
it  means  that  the  manipulator  did  not 
go  through  a  simple  rotation,  but  may 
actually  have  translated. 

•  knowledge  of  the  focal  length,  or 

• knowledge of Tcam.

In figure 3, the robot-camera system
is constrained to move in the plane A,
where  A  is  defined  by  the  circle  swept  by  the 
camera  around  the  rotational  axis,  R.  We 
have simplified the alignment task to
a 2 DOF problem. The goal state is one
where  the  object  simply  rotates  in  the  im¬ 
age  plane  without  translating,  hence  satis¬ 
fying  the  alignment  condition  which  is  that 
the  object  lies  on  the  rotational  axis.  The 
rotational  degree  of  freedom  is  used  as  a  free 
variable  for  the  alignment  task  and  does  not 
contribute  to  the  final  alignment  state  (for 
circularly  symmetric  objects). 

Once  an  object  has  been  selected  in  the 
camera’s  view,  the  robot  rotates  the  camera 
around  its  rotational  axis,  R.  By  slowly  ro¬ 
tating  the  camera  around  its  rotational  axis, 
we  remove  the  correspondence  problem  (the 
object moves only a slight bit between consecutive
shots, therefore making the correspondence
between two shots trivial to compute.)
If the only movement in the robot-camera
system is caused by the rotation, the
object will trace out a conic section, an ellipse
under certain conditions¹, in the camera
system. We propose to use these elliptical
parameters to recover the alignment condition.
One simple method requires that we
move  about  in  the  plane  A,  sweeping  out 
ellipses  in  camera  space.  The  further  away 
the  object  is  from  the  rotational  axis,  the 
larger  area  is  swept  out  by  its  ellipse  pro¬ 
jection  into  camera  space.  The  closer  we 
come  to  aligning  the  object  to  the  rotational 
axis,  the  smaller  the  projected  ellipses  will 
become.  The  goal  in  this  scenario  is  to  de¬ 
vise  a  method  for  maneuvering  the  end  ef¬ 
fector’s  position  in  plane  A  to  the  position 
which  causes  the  object  to  project  to  an  el¬ 
lipse  with  the  smallest  area. 
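The relationship between the object's offset from the rotational axis and the area swept in the image can be illustrated with a small simulation. This is only a sketch under stated assumptions: the pinhole model, the camera position `cam_pos`, the `tilt` angle, the `focal` length and the object `depth` are all hypothetical values chosen for illustration, and the alignment method itself never needs to know them. Because the camera is rigidly attached to the rotating joint, the object appears, in the camera rig's frame, to circle the rotational axis.

```python
import math


def project(p, cam_pos=(80.0, 0.0, 0.0), tilt=0.2, focal=500.0):
    """Pinhole projection of point p, given in the rotating joint frame.

    The camera sits off the rotational (z) axis, looks roughly along -z,
    and is tilted by `tilt` radians about the y axis (hypothetical pose).
    """
    ca, sa = math.cos(tilt), math.sin(tilt)
    x_c = (ca, 0.0, -sa)    # camera x axis expressed in the joint frame
    y_c = (0.0, -1.0, 0.0)  # camera y axis
    z_c = (-sa, 0.0, -ca)   # optical axis
    d = [p[i] - cam_pos[i] for i in range(3)]
    xc = sum(d[i] * x_c[i] for i in range(3))
    yc = sum(d[i] * y_c[i] for i in range(3))
    zc = sum(d[i] * z_c[i] for i in range(3))
    return (focal * xc / zc, focal * yc / zc)


def swept_area(offset, depth=300.0, samples=360):
    """Image-plane area enclosed by the object's track over one revolution.

    `offset` is the object's distance from the rotational axis.
    """
    track = []
    for k in range(samples):
        theta = 2.0 * math.pi * k / samples
        p = (offset * math.cos(theta), offset * math.sin(theta), -depth)
        track.append(project(p))
    # Shoelace formula over the closed polygon of projected points.
    twice_area = 0.0
    for (x0, y0), (x1, y1) in zip(track, track[1:] + track[:1]):
        twice_area += x0 * y1 - x1 * y0
    return abs(twice_area) / 2.0


if __name__ == "__main__":
    for offset in (0.0, 10.0, 20.0, 40.0):
        print(offset, swept_area(offset))
```

In this sketch the area is zero when the object lies on the axis and grows with the offset, which is the property the search in plane A relies on.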


¹The degenerate conditions are if the object:

1. is directly orthogonal to the image plane - a circle,

2. is already aligned to the rotational axis - a point,

3. is incident with a plane containing the focal point
which is parallel to the image plane - parabola,

4. passes through a plane containing the focal point
which is parallel to the image plane - hyperbola,
and

5. lies in the same plane as the optical axis (where the
optical axis and the rotational axis are perpendicular
to one another) - line.
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3  RECOVERY  OF  ELLIPTICAL 
PARAMETER  DATA 

The majority of this work was inspired by
Safaee-Rad et al. [10, 9, 14], Haralick [2],
Magee et al. [7], Sawhney et al. [11,
12] and Shin et al. [13].

While  inspired  by  these  methods,  we 
have  developed  a  new  formulation  for  de¬ 
riving  ellipses  from  scattered  point  data.  In 
our current scenario, we accumulate the (θ6,
X projection, Y projection) triplet derived
from combining the angle made between the
end effector's zero position and the current
position of the end effector and the projection
of the tracked feature into camera
space. We then parameterized the curve
traced out by the feature as:

$x(\theta_6) = A\cos(\theta_6) + B\sin(\theta_6) + C$  (1)

$y(\theta_6) = D\cos(\theta_6) + E\sin(\theta_6) + F$  (2)

The  area  enclosed  by  this  curve  (computed 
using  Green’s  Theorem)  is 

$\mathrm{Area} = \pi\,|AE - BD|$  (3)

The  full  proof  that  the  parametric  curves 
generated  by  these  equations  are  ellipses  is 
contained  in  [17]. 
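For reference, the closed form of the area follows from Green's theorem applied to equations (1) and (2) over one full revolution. The derivation below is a sketch in which θ stands for the joint angle; the offsets C and F contribute only terms linear in sin θ and cos θ, which integrate to zero.

```latex
\begin{aligned}
\text{Area} &= \frac{1}{2}\left|\oint (x\,dy - y\,dx)\right|
             = \frac{1}{2}\left|\int_0^{2\pi} \bigl(x\,y' - y\,x'\bigr)\,d\theta\right|, \\
x' &= -A\sin\theta + B\cos\theta, \qquad
y' = -D\sin\theta + E\cos\theta, \\
x\,y' - y\,x' &= (AE - BD)\,(\cos^2\theta + \sin^2\theta)
             + C\,y' - F\,x', \\
\text{Area} &= \frac{1}{2}\cdot 2\pi\,|AE - BD| = \pi\,|AE - BD| .
\end{aligned}
```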

The  problem  of  fitting  raw  data  points 
to  elliptical  data  was  covered  in  both  the 
Sawhney  and  Safaee-Rad  works  cited  earlier. 
We  were  concerned  primarily  with  develop¬ 
ing  a  method  which  did  not  require  data 
points  be  taken  from  the  entire  ellipse  and 
which  could  be  solved  linearly.  In  our  exper¬ 
iments,  the  elliptical  data  was  taken  over  a 
90  degree  sector  of  the  ellipse.  Using  only 
this  data,  we  were  able  to  fit  ellipses  quite 
well  (see  figure  5.) 
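A linear fit of this kind can be sketched as follows. Because equations (1) and (2) are linear in their coefficients once the joint angle is known, each image coordinate can be fitted separately by ordinary least squares, even when the samples cover only a 90 degree sector. The function names are hypothetical, the 3x3 normal equations are solved directly for illustration, and the area is taken as π|AE − BD|, which is what Green's theorem gives for a curve of this form; this is not necessarily the authors' exact formulation.

```python
import math


def solve3(m, v):
    """Solve a 3x3 linear system by Gaussian elimination with partial pivoting."""
    a = [row[:] + [rhs] for row, rhs in zip(m, v)]
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(col + 1, 3):
            factor = a[r][col] / a[col][col]
            for c in range(col, 4):
                a[r][c] -= factor * a[col][c]
    x = [0.0, 0.0, 0.0]
    for r in (2, 1, 0):
        s = a[r][3] - sum(a[r][c] * x[c] for c in range(r + 1, 3))
        x[r] = s / a[r][r]
    return x


def fit_component(thetas, values):
    """Least-squares fit of values ~ p*cos(theta) + q*sin(theta) + r."""
    basis = [(math.cos(t), math.sin(t), 1.0) for t in thetas]
    normal = [[sum(b[i] * b[j] for b in basis) for j in range(3)]
              for i in range(3)]
    rhs = [sum(b[i] * v for b, v in zip(basis, values)) for i in range(3)]
    return solve3(normal, rhs)


def ellipse_area(thetas, xs, ys):
    """Area of the fitted trajectory from (angle, x, y) triplets."""
    A, B, _C = fit_component(thetas, xs)
    D, E, _F = fit_component(thetas, ys)
    return math.pi * abs(A * E - B * D)


if __name__ == "__main__":
    # 16 equi-angular samples over a 90 degree sector of a synthetic track.
    thetas = [math.radians(90.0) * k / 15 for k in range(16)]
    xs = [30.0 * math.cos(t) + 5.0 * math.sin(t) + 120.0 for t in thetas]
    ys = [-4.0 * math.cos(t) + 18.0 * math.sin(t) + 90.0 for t in thetas]
    print(ellipse_area(thetas, xs, ys))
```

With noisy data a more careful solver (for example a QR-based least squares) would be preferable to forming the normal equations.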

4  THE  SIMPLE  SIMPLEX 
SEARCH  METHOD 

This  method  uses  a  version  of  the  simplex 
method  for  finding  local  minima.  We  were 
motivated  by  the  fact  that  the  solution  sur¬ 
face  was  fairly  smooth  and  by  the  idea  that 
even  a  simple  “walking”  algorithm  should 


be  able  to  find  the  solution.  We  proposed 
creating a “walker” with three legs: a simplex
(for two dimensional “walking”) requiring
three starting points. A simplex was
created  in  X-Y  space  from  the  set  of  three 
arbitrary,  non-collinear  positions  ((0.0, 0.0), 
(0.0,50.0),  and  (50.0,50.0)).  The  search 
method  using  the  simplex  simply  “walks” 
down  the  surface  by  tossing  the  “leg”  which 
is  furthest  uphill  an  equal  amount  down¬ 
hill.  Once  it  constrains  the  solution  to 
lie  between  its  “legs”  it  shrinks  itself  and 
tries  “walking”  down  the  surface  using  its 
new  position  and  new,  smaller  “legs.”  This 
method  is  similar  to  the  Simplex  method  of 
Nelder  and  Mead[8]. 

The  implemented  version  of  the  algo¬ 
rithm  for  the  simplex  search  runs  as  follows: 

1.  Initialize  simplex.  The  robot  moves  to 
each  position  in  the  initial  position  set 
in  turn.  At  each  position,  the  robot 
tracks  the  movement  of  a  point-feature 
in the image plane as the robot changes
its value of θ6. After accumulating
the  object  positions  in  the  image  plane 
(2-D  point  data),  the  computer  fits  a 
least-squares  conic  section  to  the  points. 
From  the  conic  section  parameters,  it 
computes  the  area  of  the  elliptical  tra¬ 
jectory  taken  by  the  projection  of  the 
object  in  the  image  plane.  This  process 
is  performed  once  at  each  robot  position 
given  in  the  initial  simplex  set. 

2.  While  none  of  the  areas  is  less  than 
a  predetermined  threshold  (where  zero 
area  indicates  a  perfect  alignment  be¬ 
tween  the  object  and  the  rotational 
axis), 

(a) Find the point, p1 in plane A,
whose ellipse encompasses the greatest
area.

(b) Reflect the point p1 through the line
connecting the other two points, p2
and p3.

(c)  Evaluate  the  area  of  the  ellipse  at  the 
new  point.  If  the  new  point’s  area 
is  larger  than  the  area  of  the  point 
which  generated  it,  you’ve  trapped 
the  minimum,  so  decrease  the  area 
of  your  search  space. 
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The  above  algorithm  tries  to  trap  the  global 
minimum  using  large  simplex  movements  to 
surround the minimum and, when the minimum
is  trapped,  it  reduces  its  search  space  by 
moving  the  point  with  the  largest  area  to 
the  middle  of  the  simplex  (roughly  reduc¬ 
ing  the  bounding  area  by  one-third).  The 
algorithm  is  repeated  until  the  area  of  the 
ellipse computed for a position falls below
the  area  threshold. 
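The steps above can be sketched in a few lines. In this sketch `area_at` is a hypothetical callback standing in for the whole measurement (move the robot to a position in plane A, sweep the last joint, fit the ellipse, return its area). The reflection is taken through the midpoint of the segment joining the other two vertices, which is an assumption about the implementation; with that choice, reflecting (-50, 0) about the vertices (0, 0) and (-50, -50) gives (0, -50), the position labelled 3 in the results table of section 6.

```python
def reflect(worst, a, b):
    """Reflect `worst` through the midpoint of the segment joining a and b."""
    mid_x = (a[0] + b[0]) / 2.0
    mid_y = (a[1] + b[1]) / 2.0
    return (2.0 * mid_x - worst[0], 2.0 * mid_y - worst[1])


def simple_simplex(area_at, simplex, threshold, max_evals=200):
    """Walk a three-point simplex downhill on the ellipse-area surface.

    Returns (best_point, best_area, number_of_evaluations).
    """
    vertices = [(p, area_at(p)) for p in simplex]
    evals = len(vertices)
    while min(a for _, a in vertices) >= threshold and evals < max_evals:
        vertices.sort(key=lambda v: v[1])  # smallest area first
        (p2, _), (p3, _), (p1, worst_area) = vertices
        candidate = reflect(p1, p2, p3)
        candidate_area = area_at(candidate)
        evals += 1
        if candidate_area < worst_area:
            # Still walking downhill: keep the reflected vertex.
            vertices[2] = (candidate, candidate_area)
        else:
            # Minimum trapped: move the worst vertex to the middle
            # of the simplex, shrinking the search region.
            centre = ((p1[0] + p2[0] + p3[0]) / 3.0,
                      (p1[1] + p2[1] + p3[1]) / 3.0)
            vertices[2] = (centre, area_at(centre))
            evals += 1
    best_point, best_area = min(vertices, key=lambda v: v[1])
    return best_point, best_area, evals


if __name__ == "__main__":
    # Synthetic bowl-shaped surface standing in for measured ellipse areas.
    bowl = lambda p: (p[0] - 3.0) ** 2 + (p[1] + 38.0) ** 2
    start = [(0.0, 0.0), (-50.0, 0.0), (-50.0, -50.0)]
    print(simple_simplex(bowl, start, threshold=1.0))
```

The `max_evals` cap is an added safeguard for the sketch; a walker this simple is not guaranteed to reach the threshold quickly on every surface.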

5  IMPLEMENTATION 

In  figure  3,  we  show  a  schematic  of  the  sys¬ 
tem  set  up  for  testing  the  new  alignment 
method.  We  mounted  a  Sony  XC-77  CCD 
camera  in  a  bracket  system  off  the  end  ef¬ 
fector  of  a  Puma  560  robot.  The  camera 
was  not  calibrated  or  position  constrained 
when  initially  placed.  The  system  was  con¬ 
trolled  using  RCCL  and  RCI  [3].  The  im¬ 
ages  were  digitized  at  256x242  resolution 
and  8  bits  gray  scale  at  standard  NTSC 
frame  rates  using  the  PIPE  parallel  image 
processing  engine  [5].  The  resulting  images 
were  thresholded  to  recover  a  simple  black 
object  on  a  white  background.  In  general, 
any  recovery  method  can  be  used  to  generi- 
cally  extract  object  information  from  an  im¬ 
age  array.  The  object  was  positioned  so 
the  robot  would  not  encounter  singularities 
when  moving  to  the  new  control  positions. 
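As an illustration of the kind of recovery step described here, the sketch below thresholds a gray-scale image and returns the centroid of the dark pixels. It is a minimal stand-in, assuming the image is a list of rows of 8-bit values; the function name and the threshold value are hypothetical, and the actual processing on the PIPE hardware is not described in enough detail to reproduce.

```python
def dark_centroid(image, threshold):
    """Centroid (row, col) of pixels darker than `threshold`, or None."""
    count = 0
    row_sum = 0
    col_sum = 0
    for r, row in enumerate(image):
        for c, value in enumerate(row):
            if value < threshold:
                count += 1
                row_sum += r
                col_sum += c
    if count == 0:
        return None  # no dark object in view
    return (row_sum / count, col_sum / count)


if __name__ == "__main__":
    # White 5x5 background with a 2x2 black object.
    image = [[255] * 5 for _ in range(5)]
    for r, c in ((1, 2), (1, 3), (2, 2), (2, 3)):
        image[r][c] = 0
    print(dark_centroid(image, 128))
```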

Given  that  the  only  information  neces¬ 
sary  to  constrain  the  alignment  is  the  area 
of  the  projected  ellipse  on  the  image  plane, 
it  is  not  necessary  to  know  anything  about 
the  geometry  of  the  sensor  setup. 

6  RESULTS 

In  the  experiment,  we  used  the  modified 
simplex  method  (see  section  4)  with  an  ini¬ 
tial  simplex  of  ((0.0, 0.0),  (0.0,50.0),  and 
(50.0,50.0)). 

We  built  a  feature  tracker  which  as¬ 
sumes  velocity  constrained  object  motion  in 
image  space.  At  the  beginning  of  the  ex¬ 
periment,  a  scene  was  extracted  by  the  im¬ 
age  processor  and  the  user  was  prompted  to 
move  a  pointing  device  to  the  location  of  the 


feature.  The  feature  extractor  used  a  Sobel 
operator  with  a  fixed  threshold  to  extract 
the  predominant  feature  in  the  selected  re¬ 
gion.  The  tracker  would  follow  the  feature 
over  consecutive  image  frames  as  long  as  the 
feature  moved  only  small  distances. 

After  establishing  the  feature  tracker, 
the  robot  was  instructed  to  move  to  the  first 
position, stop, and rotate its last joint 90
degrees  over  the  course  of  which  it  would  ex¬ 
tract  16  images  spaced  equi-angularly  with 
respect  to  the  robot’s  rotation. 

The  feature  tracker  tracked  the  move¬ 
ment  of  the  selected  target  position  over  the 
complete  90  degrees  (reporting  to  the  con¬ 
troller  the  position  of  the  object  at  16  equi¬ 
angular  positions  over  the  duration  of  the 
movement.) 

The centroid of the feature (in image coordinates)
was then fed to a least squares estimator
to recover the ellipse parameters associated
with the moving feature's trajectory.
These parameters were then fed into equation
3 for computing the area of the ellipse.

The  process  was  repeated  for  the  re¬ 
maining  two  points  in  the  simplex.  We  ini¬ 
tialized  the  simple  simplex  algorithm  using 
these  three  areas  and  allowed  it  to  step  its 
way to the minimum.

The halting condition was when the
area of the ellipse formed from a position
was < 1 pixel². The following table tabulates
the results of this experiment:


 #          X            Y            Area
 0      0.000000     0.000000   1951.378696
 1    -50.000000     0.000000   5811.561407
 2    -50.000000   -50.000000   4679.246920
 3      0.000000   -50.000000    284.406540
 4     50.000000     0.000000   4349.398^17
 ...
22      2.408169   -39.056546     30.307041
23      2.210029   -38.058223      3.082379
24      2.941625   -37.898186      0.380444

Figure  5  shows  the  projected  ellipsoidal 
information  taken  from  the  three,  initial 
simplex  positions. 
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Figure  5:  Ellipses  recovered  by  sweeping  the 
initial  simplex  positions. 


Figure  6:  Superimposed  ellipses  shown  for 
all  25  positions 


In  the  figure,  the  raw  data  is  displayed 
as  point  data  while  the  predicted  ellipses  are 
drawn  in  as  solid  lines. 

In figure 6, we show the ellipses generated
by all 25 positions investigated. Note that
the system is not guaranteed to be monotonically
convergent (in terms of the number
of evaluations) but the system is convergent
nonetheless.

The  system  also  can  be  fooled  by  ellipses 


Figure  7:  Robot  system  after  performing  the 
insertion  task 

generated  at  points  which  are  sampled  very 
close  to  one  another.  In  the  case  of  the  fi¬ 
nal  few  ellipses,  noise  pixels  resulted  in  the 
oddish  ellipsoid  calculations.  A  more  intelli¬ 
gent  system  would  detect  this  condition  and 
would  hypothesize  about  the  area  of  the  el¬ 
lipse  taking  this  into  account.  But  even  with 
noisy  data,  the  system  reconstructs  an  ellip¬ 
soid  which  reflects  the  general  behavior  of 
the  points. 

In  the  experiment  above  and  in  figure  7, 
the  tracked  feature  was  a  2mm  diameter 
hole.  The  robot  system  was  able  to  place  the 
“peg”  (a  tapered  probe)  within  3mm  of  the 
hole  (this  using  uncalibrated  camera  data!) 
In  addition,  trying  the  same  experiment  3 
more  times  resulted  in  about  the  same  re¬ 
sult,  that  is:  an  error  of  about  3mm  for  the 
insertion  task.  When  using  a  10mm  diam¬ 
eter  hole,  the  robot  system  almost  always 
succeeds  in  placing  the  probe  in  the  hole. 

6.1  Evaluations  for  positions  on 
the  initial  simplex 

Upon  closer  examination,  the  modified  sim¬ 
plex  method  does  converge  as  well  as  a 
method  should  taking  into  account  the 
amount  of  knowledge  we  have  given  it  about 
this  system.  (See  figure  8.)  The  modified 
simplex  method  does  suffer  from  the  fault 
of  reexamining  points  analyzed  previously. 
This  can  be  seen  in  the  overlapping  num¬ 
bers  in  figure  8.  The  only  way  the  simplex 
can  shrink  itself  is  by  covering  all  possible 
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Figure  8:  The  positions  examined  by  the 
modified  simplex  method. 


point  reflections  and  then  after  exhaustively 
examining  all  possibilities,  it  determines  the 
best  step  for  proceeding  to  the  solution  state 
is  to  contract. 

Notice  that  the  simplex  method  suffers 
from  the  fact  that  it  must  “overextend”  the 
simplex  in  all  directions  before  coming  to 
the  conclusion  that  the  simplex  should  be 
shrunk.  This  ability  allows  a  simplex  to 
normally “jump” over a local minimum and
continue  its  search  in  a  more  fruitful  valley. 
In  the  case  of  our  system,  there  exists  only 
one minimum and the simplex need not evaluate
all  positions  when  it  sees  an  increase  in  the 
area  of  an  ellipse  after  picking  a  new  point. 

This observation suggests several places
where the algorithm for computing
the next position can be improved,
and leads to two observations which
are crucial to understanding the alignment
problem. The first observation is the fact
that  we  are  tracking  the  centroid  of  the  mov¬ 
ing  object  rather  than  the  true  center  of  the 
object.  In  the  case  where  the  object  is  point¬ 
like,  the  center  of  the  object  and  the  centroid 
of  the  object  are  very  close  together,  so  the 
algorithm  works.  But,  in  the  case  of  a  fairly 
large  object  observed  through  a  fairly  wide 
angle  lens  (take  for  instance:  12.5mm  focal 
length),  the  distortion  of  the  center  of  an  ob¬ 
ject  can  be  significant  (on  the  order  of  >  1/2 
the  radius  of  the  object)  depending  on  the 
angle  the  camera  takes  with  respect  to  the 
rotational  axis. 

The  second  observation  is  the  fact  that 
by  starting  with  several  observations  where 
the  rotational  axis  is  far  from  the  object  po¬ 


sition,  resulting  in  large  ellipses,  we  can  start 
with  accurate  estimates  of  the  “family”  of 
ellipses  over  small  rotations.  This  is  in  con¬ 
trast  to  the  smaller  ellipses  which  (because 
of  numerical  inaccuracies  in  estimating  the 
center  of  the  object  and  problems  caused  by 
quantization)  need  larger  movement  arcs  to 
adequately  recover  the  parameters  of  the  el¬ 
lipse.  This  sweep  function  is  a  function  of 
the  resolution  of  the  imaging  device  as  well 
as  the  size  of  the  object  and  position  of  the 
object. 

One  possibility  for  increasing  the  effec¬ 
tiveness  of  this  process  is  to  use  more  of  the 
innate  properties  of  the  ellipses  generated 
by  the  process.  A  more  sophisticated  search 
procedure,  one  based  on  the  physical  model 
of  the  parabolic  surface  which  is  formed  by 
the ellipses' areas, would give more satisfactory
results.

In  addition,  productive  results  will 
probably  be  gained  from  the  analysis  of 
other  properties  of  the  conic  sections.  If  the 
component  values  of  the  conic  sections  are 
traced  out  as  a  function  of  the  rotation,  the 
sinusoids  generated  will  show  a  phase  an¬ 
gle  difference  with  respect  to  the  rotation 
of  the  end-effector.  The  magnitude  of  the 
sinusoids  will  determine  the  net  amount  of 
translation  of  the  object  with  respect  to  the 
rotational  axis.  These  four  values  can  prob¬ 
ably  be  used  as  a  control  signal  to  effect  a 
net  change  to  drive  all  four  values  to  zero 
which  is  a  position  where  the  sinusoids  are 
both  in  phase  and  at  zero  amplitude  with 
respect  to  the  rotations:  the  alignment  con¬ 
dition.  This  technique  needs  to  be  examined 
in  further  detail. 
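One way to read this proposal is sketched below. A component of the form p cos θ + q sin θ can be rewritten as R cos(θ − φ), with amplitude R = sqrt(p² + q²) and phase φ = atan2(q, p), so the two fitted image components yield four values. The function names are hypothetical, and using these four values as a control signal is only the suggestion made in the text, not a tested controller.

```python
import math


def amplitude_phase(cos_coeff, sin_coeff):
    """Rewrite p*cos(t) + q*sin(t) as R*cos(t - phi); return (R, phi)."""
    return (math.hypot(cos_coeff, sin_coeff),
            math.atan2(sin_coeff, cos_coeff))


def alignment_signal(A, B, D, E):
    """Amplitude and phase of the x and y image components.

    A, B are the cosine and sine coefficients of x; D, E those of y.
    Both amplitudes are zero when the object lies on the rotational axis.
    """
    amp_x, phase_x = amplitude_phase(A, B)
    amp_y, phase_y = amplitude_phase(D, E)
    return (amp_x, phase_x, amp_y, phase_y)


if __name__ == "__main__":
    print(alignment_signal(30.0, 5.0, -4.0, 18.0))
```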

Another  problem  which  must  be  faced 
is  the  problem  of  small  ellipses.  When  im¬ 
age  noise  is  of  the  same  magnitude  as  the 
centroid  data  the  Least  Squares  fit  no  longer 
captures  the  true  centroid  information  of  the 
object.  Remember  that  the  object  itself  is 
perspective  transformed  and  the  true  object 
center  can  actually  be  a  great  distance  from 
the object's projected center. It may be possible
to use the centroid information, if we
are  able  to  recover  the  varying  amounts  of 
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skew  caused  by  perspective.  It  may  also  be 
possible  to  recover  the  centroidal  informa¬ 
tion  by  using  the  parametric  description  of 
the  object  and  divining  the  focal  points  of 
the  object,  the  generating  lines,  and/or  the 
eccentricity  of  the  ellipses. 

The  final  positioning  error  may  be  im¬ 
proved  by  using  a  set  of  movement  primitive 
vectors  defined  by  a  spiral  like  the  logarith¬ 
mic  spiral  or  some  member  of  the  family  of 
spirals,  which  can  exploit  the  properties  of 
containment  and  possibly  approach  with  an 
incremental  goodness-of-fit  function  (which 
may  be  a  property  of  the  spiral). 

7  CONCLUSIONS 

We have demonstrated a method for performing
a three dimensional task in essentially
two dimensions. The peg-in-hole servoing
task and the vernier alignment
task  both  benefit  from  a  method  which  can 
constrain  the  initial  position  of  the  object 
(to  a  high  degree)  and  which  can  essen¬ 
tially  turn  a  three  dimensional  search  prob¬ 
lem  into  a  two  dimensional  search  in  uni- 
modal  space.  We  have  presented  such  a 
method  which  converges  to  a  solution  state 
even  when  using  a  very  simple  convergence 
algorithm. 

The  key  features/contributions  of  our 
system: 

•  It  does  not  require  calibrated  cameras. 

•  It  converges  with  simple  search  algo¬ 
rithms. 

•  Even  with  the  two  constraints  above, 
the  alignment  results  are  very  good  (our 
experiments  have  shown  that  we  can  po¬ 
sition  a  probe  within  3mm  of  a  2mm  fea¬ 
ture  consistently.) 

Active  camera  motion  that  recovers  image 
space  properties  of  tracked  objects  has 
shown  itself  to  be  useful  in  performing 
alignment  tasks  without  the  need  to 
calibrate  the  camera  systems. 
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Abstract 

Previous work has shown the feasibility of merging
surface material information derived from moderate
resolution multispectral imagery with estimates of
height based upon stereo matching in high resolution
panchromatic imagery. The goal is to use surface
material information, normally highly correlated with
object location in complex urban scenes, as a source of
information for small scale mapping of man-made
structures such as buildings and roads, as well as
natural features such as soil, vegetation, and water.
The  fusion  of  height  estimates  with  surface  material 
estimates  provides  a  unique  synthetic  three 
dimensional  dataset  that  is  not  directly  available  in  any 
airborne  imaging  sensor. 

The  focus  of  this  paper  is  to  present  a  performance 
evaluation  of  two  classification  techniques,  gaussian 
maximum likelihood and differential radial basis
function,  on  the  task  of  surface  material  analysis.  In 
order  to  carry  out  this  evaluation  we  have  created 
several  highly  detailed  ground  truth  segmentations 
based  upon  manual  analysis  of  the  multispectral 
imagery,  as  well  as  by  inspection  of  panchromatic 
imagery  acquired  over  the  same  area.  Tools  built  for 
the  generation  and  validation  of  the  ground-truth 
surface material map are also discussed.
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the  United  States  Government. 
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1.  Introduction 

With  the  availability  of  moderate  resolution 
multispectral  imagery,  comparable  in  spatial  resolution 
to aerial mapping imagery, opportunities exist to
exploit  the  inherent  spectral  information  of 
multispectral  imagery  to  aid  urban  scene  analysis  for 
cartographic  feature  extraction.  Moderate  resolution 
multispectral  imagery  with  spatial  resolution  ranges  of 
5 to 8 meters can be collected with existing airborne
multispectral  scanners  like  Daedalus,  AVIRIS,  and 
MEIS. 

Our  research  in  multispectral  scene  information  fusion 
utilizes  moderate  resolution  airborne  imagery  (8  meter) 
and  high  resolution  panchromatic  aerial  photography 
(1.2  meter).  Using  traditional  spectral  classification 
techniques,  surface  material  information  is  derived 
from the multispectral imagery, refined by monocular
segmentations  from  the  panchromatic  imagery  and 
fused  with  high  resolution  stereo  disparity  maps  [Ford 
and  McKeown  92a,  Ford  and  McKeown  92b]. 

The  focus  of  this  paper  is  to  present  a  performance 
evaluation  of  two  classification  techniques,  gaussian 
maximum likelihood and differential radial basis
function,  to  perform  surface  material  analysis.  In  order 
to  do  this  evaluation  we  have  created  several  highly 
detailed  ground  truth  segmentations  based  upon 
manual  analysis  of  the  multispectral  imagery,  as  well 
as  by  inspection  of  panchromatic  imagery  acquired 
over  the  same  area.  Tools  built  for  the  generation  and 
validation  of  the  ground-truth  surface  material  map  are 
also  discussed. 

In  the  remainder  of  this  section  we  give  a  brief 
overview  of  the  Daedalus  scanner  and  the  surface 
material  classification  task. 
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Figure  1:  CAOi  Site  (X  niotor) 


Figure 2: CIVIL1 Site (8 meter)
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Figure  3:  I’rban  Spectral  Classes 


1.1.  The  Daedalus  Scanner 

The Daedalus Airborne Thematic Mapper (ATM) is an
aircraft-based multispectral imaging system. It
contains eleven spectral channels, ten of which range
from the visible (.4 to .75 micron), to the near infrared
(.76 to 1.0), to the shortwave infrared (I..*’ to 2..‘'
micron). The eleventh channel is a thermal band
between 8.5 and 13 microns whose physics is
fundamentally different than the channels in the
reflective part of the electromagnetic spectrum. For
more detail on the scanner parameters see [Ford and
McKeown 92a].

By collecting reflected energy in several spectral
regions, multispectral imaging systems provide an
insight into a material's spectral signature. Using the
spectral energy measurements from a set of
multispectral bands, individual multispectral image
pixels can be assigned or classified into spectral classes
(e.g. grass, tree, water, soil, etc.) of similar
measurements, based on the multispectral image
pixel's intensity or brightness values [Sabins 87].


Since  the  Daedalus  scanner  is  aircraft-based,  the  sensor 
has  the  capability  to  acquire  moderate  resolution 
multispectral  imagery  by  flying  the  system  at  lower 
altitudes.  In  the  case  of  our  datasets,  the  scanner  was 
flown  to  achieve  a  ground  sample  distance  of 
approximately 8 meters per pixel. This imagery was
collected  for  the  United  States  SPOT  HRV  Simulation 
Campaign  [SPOT  83,  SPOT  84]  and  should  not  be 
confused  with  actual  SPOT  HRV  imagery  at  20  meter 
spatial  resolution. 

1.2.  Surface  Material  Classification 

Figures  1  (GAOl)  and  2  (CIVILI)  show  two  of  the  urban 
test  sites  with  which  we  have  been  experimenting. 
Both images are visually presented as near infrared
images  using  Daedalus  bands  7,  5,  and  3.^  The 
outlined  regions  indicate  the  area  used  for 
classification.  The  scene  content  is  representative  of  a 
complex  urban  area  with  buildings,  street  networks  and 
landscaped  areas. 

The  objective  of  our  classification  task  involves  the 
generation  of  surface  material  classmaps  at  a  coarse 
level  for  urban  multispectral  scenes.  Coarse  level 
means  we  are  initially  only  interested  in  characterizing 
the  primary  level  of  land-cover  detail.  In  our  urban 
analysis  problem,  the  primary  land-cover  types  of  most 
interest  to  us  are  water,  vegetation,  soil  and  man-made 
features.  In  Figure  3,  these  primary  land-cover  types 
are  further  divided  into  specific  spectral  classes  based 
upon visual interpretation of the Daedalus ATM
multispectral  imagery.  The  inclusion  of  a  shadow 
feature  in  the  spectral  class  hierarchy  alleviates 
misclassifications  of  shadow  pixels  as  water  spectral 
features  due  to  spectral  similarities  between  the  two 
features. 

2.  Two  Techniques  for  Multispectral 
Classification 

Traditional  multispectral  classifiers  can  be  categorized 
into  one  of  two  methods:  unsupervised  and  supervised. 
The  primary  distinction  between  the  two  multispectral 
classification  procedures  centers  around  the 
involvement  and  interaction  of  the  image  analyst  or 
domain  expert  with  the  classification  process. 
Typically,  time  must  be  spent  by  the  image  analyst  to 
identify  candidate  spectral  classes,  called  training  sets, 
prior  to  supervised  classification. 

In  this  section  we  describe  two  classification 
techniques,  Gaussian  Maximum  Likelihood  (GML) 
and  Differential  Radial  Basis  Function  (DRBF).  Both 
are  supervised  classifiers  requiring  training  sets  to 
characterize  the  spectral  classes  of  interest.  However, 
their utilization of training sets for spectral class
characterization and the determination of a
multispectral pixel's class assignment differ. We then
discuss the generation of training datasets, followed by
a discussion of our methodology for ground truth
generation for our surface material classes. The
generation of a ground truth surface material map is
critical for the quantitative analysis of classification
results discussed in Sections 3 and 4.

^ Color images have been replaced by their corresponding
luminance (Y) component from an RGB to YIQ transformation for
black and white reproduction.

2.1.  Gaussian  Maximum  Likelihood  Classifier 

In the maximum likelihood classification model, the
training  set  of  a  spectral  class  statistically  characterizes 
the  class  in  the  form  of  a  mean  vector  and  covariance 
matrix  with  the  dimensionality  being  determined  by  the 
number  of  multispectral  bands  used  in  the  statistical 
analysis.  Sufficient  training  samples  for  each  spectral 
class  must  be  present  to  allow  reasonable  estimates  of 
the  mean  vector  and  covariance  matrix  [Richards 
86,  Swain  and  Davis  78].  The  mean  vector 
characterizes  average  intensity  or  brightness  level  for 
each multispectral band in the spectral class, while the
covariance  matrix  describes  the  shape  and  orientation 
of  the  population  of  the  spectral  class,  assuming  a 
multivariate  normal  distribution.  Diagonal  entries  of 
the  covariance  matrix  contain  the  variance  or 
dispersion  of  the  brightness  levels  for  each 
multispectral  band  of  the  spectral  class  while  off- 
diagonal  entries  indicate  the  degree  of  correlation 
between  a  given  pair  of  multispectral  bands. 

The  gaussian  maximum  likelihood  (GML)  classifier 
assumes  that  the  spectral  class  probabilities  are 
multivariate  normal  distributions.  This  is  an 
assumption,  rather  than  a  demonstrable  property  of 
natural  spectral  classes  [Richards  86].  The  probability 
distribution  of  each  individual  spectral  class  is  modeled 
by  using  its  mean  vector  and  covariance  matrix  as 
calculated from its training set. When classifying a
multispectral image pixel, the probability of the pixel
belonging to each of the candidate spectral classes is
determined, and the pixel is assigned to the spectral
class with the highest probability.
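
The classification rule described above can be sketched in a few lines. This is a minimal illustration rather than the implementation used in the experiments: the function names (fit_gml, log_likelihood, classify_gml) are hypothetical, NumPy is assumed, and equal class priors are assumed so that the highest likelihood decides the class.

```python
import numpy as np

def fit_gml(training_sets):
    """Estimate a mean vector and covariance matrix per spectral class.

    training_sets maps a class name to an (n_samples, n_bands) array.
    """
    model = {}
    for name, samples in training_sets.items():
        samples = np.asarray(samples, dtype=float)
        mean = samples.mean(axis=0)
        cov = np.cov(samples, rowvar=False)  # bands are columns
        model[name] = (mean, cov)
    return model

def log_likelihood(pixel, mean, cov):
    """Log of the multivariate normal density evaluated at `pixel`."""
    diff = pixel - mean
    _, logdet = np.linalg.slogdet(cov)
    mahalanobis = diff @ np.linalg.solve(cov, diff)
    return -0.5 * (mahalanobis + logdet + mean.size * np.log(2.0 * np.pi))

def classify_gml(pixel, model):
    """Assign the pixel to the class with the highest likelihood.

    Equal prior probabilities are assumed for all classes.
    """
    pixel = np.asarray(pixel, dtype=float)
    return max(model, key=lambda name: log_likelihood(pixel, *model[name]))
```

Working with log densities avoids numerical underflow when many bands are used, and it leaves the ranking of the classes unchanged.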

2.2.  Differential  Radial  Basis  Function 
Classifier

The differential radial basis function (DRBF) classifier
is  a  modified  version  of  the  Gaussian  Radial  Basis 
Function  (RBF)  neural  network  architecture 
[Broomhead  and  Lowe  88,  Moody  and  Darken 
88,  Poggio  and  Girosi  89,  Medgassy  61].  The 
paradigm  is  identical  to  the  maximum  likelihood  model 
with  four  major  exceptions: 

•  The  DRBF  associates  a  discriminant  function  with 
each  spectral  class  rather  than  associating  an 
unparameterized distribution with each spectral class.
As  a  result,  the  DRBF  employs  discriminative  learning 
while  the  GML  uses  a  probabilistic  model. 

•  The DRBF’s discriminant functions have a peak
value of unity, whereas the GML discriminant
functions have unit area, a requirement for
probabilistic models.

•  The covariance matrix associated with each
discriminant function is diagonal, and all of its
diagonal elements have the same value (i.e., the
covariance matrix has orthonormal eigenvectors and
all of its eigenvalues are identical). For this reason,
each discriminant function has $C^2 - 1$ fewer parameters
than its maximum likelihood counterpart, where C
denotes the dimensionality of the spectral intensity
vector. In the case of an 11-element spectral intensity
vector representing 11 spectral classes, the maximum
likelihood classifier has 1452 parameters compared to
the DRBF classifier’s 132. Thus, the DRBF has an
order of magnitude fewer parameters than its
maximum likelihood counterpart.

•  The  model  is  trained  differentially  [Hampshire 
93]  using  the  classification  figure-of-merit  (CFM) 
objective  function  [Hampshire  and  Waibel  90],  rather 
than  by  the  method  of  maximum  likelihood. 
Differential  learning  via  CFM  is  a  discriminative  form 
of  learning  that  focuses  on  classifying  patterns  with 
minimum probability of error. This contrasts with
probabilistic  learning  strategies  such  as  maximum 
likelihood  and  conventional  neural  network  learning 
procedures,  which  focus  on  estimating  probabilities 
[Hampshire  and  Kumar  92]. 
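
The parameter counts quoted in the third point can be checked with a short calculation. The sketch below assumes the accounting implied by the 1452 and 132 totals: a mean vector plus a full C x C covariance matrix per class for the maximum likelihood model, and a mean vector plus a single shared variance per class for the DRBF. The discriminant shown is the usual isotropic Gaussian radial basis function with a peak value of one; its exact form in the DRBF is an assumption here, and the differential (CFM) training procedure is not shown. All function names are hypothetical.

```python
import numpy as np

def gml_params_per_class(c):
    # Mean vector (c entries) plus a full covariance matrix (c * c entries).
    return c + c * c

def drbf_params_per_class(c):
    # Mean vector (c entries) plus one variance shared by all bands.
    return c + 1

def rbf_discriminant(pixel, center, variance):
    """Isotropic Gaussian radial basis function; equals 1.0 at the center."""
    diff = np.asarray(pixel, dtype=float) - np.asarray(center, dtype=float)
    return float(np.exp(-0.5 * (diff @ diff) / variance))

C = 11          # dimensionality of the spectral intensity vector
N_CLASSES = 11  # number of spectral classes

print(N_CLASSES * gml_params_per_class(C))                  # 1452
print(N_CLASSES * drbf_params_per_class(C))                 # 132
print(gml_params_per_class(C) - drbf_params_per_class(C))   # 120, i.e. C**2 - 1
```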

2.3. Training Data Sets

Block training sets, consisting of homogeneous areas
of pure pixels representative of the candidate spectral
classes, were collected manually from various regions
distributed throughout the entire Daedalus ATM
multispectral imagery. The candidate spectral classes
are listed in Table 1 with the number of training
samples or pixels per spectral class for each of the
classifiers. One can note that these are relatively small
sample sets from the original multispectral imagery
over Washington, D.C., measuring 716 rows by 3000
columns, or about 2 million samples in each of the
eleven Daedalus bands. These samples were selected
prior to the selection of the GAOl and CIVILI test sites,
so as to cover a wide range of materials visible over the
entire swath from Virginia, across the Potomac river,
and through the center of Washington, D.C.

One artifact of our training sample selection was that
no attempt was made to acquire a balanced or equally
populated set of training data across each of the
spectral classes. One property of the DRBF classifier
is that it is an effective way of classifying multispectral
data with simple models that require relatively small
training samples (i.e., training data sets). However, a
related property of the DRBF classifier (and
differentially trained classifiers in general) is that they
require balanced training samples in which the number
of training examples for each spectral class
corresponds to the true a priori probability of that
class.


                    Number of Training Samples

Spectral Class        GML       DRBF

asphalt               740        740
concrete              720        720
coniferous tree        52         52
deciduous tree       9759       1259
deep water          16433        933
grass                2544       1044
shadow                140        140
shallow water         887        887
soil                  354        354
tile                  260        260
turbid water         2269        769

Table 1: Spectral Classes Used in Classification

Each  discriminant  function  of  the  maximum  likelihood 
model  is  trained  independent  of  the  other  functions, 
and  absent  any  a  priori  knowledge  regarding  the  prior 
probabilities  of  each  spectral  class,  the  final 
classification  of  the  maximum  likelihood  model 
assumes  that  these  priors  are  all  equal.  In  this  sense, 
maximum  likelihood  training  is  relatively  insensitive  to 
unbalanced training samples, assuming the class prior
probabilities are roughly equal. Because differential
learning requires that all discriminant functions be
trained simultaneously, the resulting classifier is quite
sensitive to unbalanced training samples. If the
empirical probability of a given spectral class does not
correspond to the true a priori probability of that class,
then the training set statistics are not representative of
the spectral intensity vector’s true probabilistic nature.
As a result of unbalanced training samples, the
differentially trained classifier tends to form
inappropriate decision boundaries, making more
classification errors for under-represented spectral
classes than a comparable maximum likelihood
classifier.

During some preliminary testing using the GML’s
training data, it was observed that the DRBF classifier
had trouble distinguishing shadow regions from water.
This was due to two factors, the first and most obvious of
which was that the spectral intensities of these
ground-truth classes are quite similar. The second factor
involved the water to shadow training sample ratio.
Because the number of water training examples was
over 19,000 and the number of shadow training
examples was 140, the training sample ratio of water to
shadow was more than 100:1. However, the ratio of
water to shadow in the test samples is less than 1:1. As
a result we reduced the training set size for the DRBF
classifier to those shown in Table 1. Significant
reductions in training set size were also performed for
vegetation classes. No special selection criteria were
employed to perform this reduction; we simply used
the first N training samples for each class in those
cases where the GML training populations were large.
Tables 2 through 5 in Section 3 include some of the
original DRBF results with the unbalanced training sets
labeled as DRBF(1). It can be observed that the
balanced training DRBF(2) performed better than the
unbalanced experiment DRBF(1).
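
The water to shadow ratios discussed above follow directly from the counts in Table 1. The sketch below recomputes them for both training sets and shows the "first N samples" truncation in its simplest form. The names TABLE_1, water_to_shadow_ratio and first_n are hypothetical, and the value of N used per class is not stated in the text, so none is assumed here.

```python
# Training sample counts from Table 1: class -> (GML, DRBF).
TABLE_1 = {
    "asphalt": (740, 740),
    "concrete": (720, 720),
    "coniferous tree": (52, 52),
    "deciduous tree": (9759, 1259),
    "deep water": (16433, 933),
    "grass": (2544, 1044),
    "shadow": (140, 140),
    "shallow water": (887, 887),
    "soil": (354, 354),
    "tile": (260, 260),
    "turbid water": (2269, 769),
}
WATER_CLASSES = ("deep water", "shallow water", "turbid water")

def water_to_shadow_ratio(column):
    """column 0 selects the GML counts, column 1 the DRBF counts."""
    water = sum(TABLE_1[name][column] for name in WATER_CLASSES)
    return water / TABLE_1["shadow"][column]

def first_n(samples, n):
    """Keep only the first n training samples of a class (no other criteria)."""
    return samples[:n]

print(round(water_to_shadow_ratio(0), 1))  # 139.9 with the GML training set
print(round(water_to_shadow_ratio(1), 1))  # 18.5 with the reduced DRBF set
```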

2.4.  Generation  of  Surface  Material  Ground 
Truth 

Our previous work in ground truth generation used
high resolution black and white aerial photographs.
These manual segmentation tools are useful for
generating  ground  truth  where  geometric  relationships, 
such  as  building  size,  shape  and  boundaries  are  the 
primary  focus.  With  regard  to  spectral  classification, 
the  ground  truth  needs  to  represent  the  material  types 
located  in  the  scene.  As  a  consequence,  our  previously 
developed  manual  hand  segmentation  tools  were 
inadequate  and  inappropriate.  We  have  addressed  this 
inadequacy  by  the  development  of  an  interactive 
supervised  classification  tool,  ICLASS,  to  generate 
surface  material  ground  truths  in  the  form  of  surface 
material  ground  truth  classmaps. 

The  ground  truth  generation  procedure  consists  of  a 
two  stage  process  using  ICLASS.  Initially,  a  surface 
material  classmap  is  generated  for  each  of  the 
individual  spectral  classes  visually  present  in  the  GAOl 
and CIVILI test sites. An individual surface material
classmap  is  built  by  manually  segmenting  the  moderate 
resolution  multispectral  image  regions  containing  the 
surface  material  of  interest  using  various  false  color 
composite  presentations.  The  near  infrared  color 
presentation  using  Daedalus  ATM  band  7,  5  and  3  has 
proven  to  be  the  most  useful  visually.  When 
difficulties  in  visually  distinguishing  between  surface 
material  types  using  the  moderate  resolution  (8  meter) 
multispectral  imagery  occurred,  collateral  imagery  in 
the  form  of  high  resolution  panchromatic  aerial 
imagery  was  referenced  in  attempts  to  resolve 
ambiguities  during  surface  material  segmentation  with 
the  multispectral  imagery.  The  collateral  imagery  was 
helpful  in  varying  degrees  due  to  the  difference  in 
scene  content  between  the  two  image  datasets;  the 
aerial  imagery  was  acquired  in  1976,  while  the 
multispectral  imagery  was  collected  in  1983. 

With  creation  of  surface  material  classmaps  for  each  of 
the  individual  spectral  classes  complete,  the  surface 
material  classmaps  are  combined  together  to  generate 
the  surface  material  ground  truth  classmap.  During  the 
segmentation process, it is quite possible that an
individual multispectral image pixel may be assigned
to multiple spectral classes, especially along surface
material transition boundaries. These conflicting
pixels are flagged, reassigned to UNCLASSIFIED and
resolved by visually displaying the conflict pixel for
assignment. The center column in Figures 4 and 5
shows the resulting surface material ground truth
classmaps. Every pixel in the GAOl (14014 pixels) and
CIVILI (12180 pixels) test sites was manually labeled.
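
The merging step described above can be sketched as follows. This is an illustration only and does not reproduce ICLASS: the function name combine_classmaps, the use of boolean masks per class, and the choice of 0 as the UNCLASSIFIED label are all assumptions. Pixels claimed by more than one class are set to UNCLASSIFIED and returned in a conflict mask so that an analyst can resolve them.

```python
import numpy as np

UNCLASSIFIED = 0  # assumed label for unassigned or conflicting pixels

def combine_classmaps(masks):
    """Merge per-class boolean masks into one ground truth classmap.

    masks maps a positive integer class id to a boolean array; all arrays
    share the same shape. Returns (classmap, conflicts).
    """
    ids = sorted(masks)
    stack = np.stack([np.asarray(masks[i], dtype=bool) for i in ids])
    votes = stack.sum(axis=0)                       # classes claiming each pixel
    labels = np.asarray(ids)[stack.argmax(axis=0)]  # first claiming class
    classmap = np.where(votes == 1, labels, UNCLASSIFIED)
    conflicts = votes > 1                           # pixels needing manual review
    return classmap, conflicts
```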

While  the  training  set  data  acquired  over  the  entire 
Daedalus  image  had  examples  from  each  of  the  eleven 
surface  material  classes,  not  all  classes  were  present  in 
the GAOl and CIVILI sites. Missing from both scenes
were  soil,  coniferous  tree,  deep  water,  shallow  water, 
and  turbid  water  classes.  A  secondary  issue,  not 
explored  in  this  paper,  is  the  issue  of  intrinsic  accuracy 
of  the  surface  material  ground  truth.  A  recent  study 
has  shown  statistically  significant  differences  in 
ground  truth  labeling  accuracy  across  multiple  analysts 
using  Landsat  Thematic  Mapper  imagery  [McGwire 
92].  Such  imagery  has  significantly  lower  spatial 
resolution  than  the  Daedalus  scanner  data  and  that  may 
relate  to  increased  variability  in  human  labeling. 

Nevertheless,  it  is  certainly  the  case  that  our  ground 
truth  class  map  contains  some  errors  due  to  our 
inability  to  discern  precise  material  boundaries  or  to 
label  small  pixel  populations  embedded  in  large  areas 
of  nearly  homogeneous  surface  materials.  For 
example,  at  times  it  was  visually  difficult  to  distinguish 
between  the  deciduous  tree  canopy  and  the  underlying 
grass areas. A standard deviation classification map
may be useful in distinguishing between the textured tree
canopy and the flat areas of grass. Similarly, it can be
very difficult to visually distinguish between asphalt
and shadow. The Daedalus thermal band 11 was not
extensively  used  during  the  manual  ground  truth 
segmentation.  A  re-examination  of  asphalt  and 
shadow  transition  areas  using  the  thermal  band  may  be 
useful  in  improving  the  ground  truth. 

3.  Classification  Test  Results 

Surface material classmaps were generated for GAOl
and CIVILI using the Gaussian Maximum Likelihood
and Differential Radial Basis Function classifiers. The
resulting surface material classmaps for the two
classifiers are shown as greylevel classification images
in the left and right columns of Figures 4 and 5. The
manually generated Ground Truth is shown in the
center column of each Figure. Each horizontal set of
images is a side-by-side comparison of the
classification results for one of the major surface
material classes: MAN-MADE, SHADOW/SOIL,
VEGETATION, and WATER. Within each class, sub-class
features are distinguished by grey shades, with
white being reserved in all cases to indicate no pixels
classified as a member of this class. All results shown
in these images represent the classification achieved
using all 11 Daedalus spectral bands for training and
classification.
The surface material classmaps generated by the
Gaussian Maximum Likelihood and Differential Radial
Basis Function classifiers were compared against the
surface material ground truth classmaps for GAOl and
CIVILI. Classification accuracies were computed for
both coarse and fine surface material class sets. In the
following sections we present overall classification
accuracies and measure of agreement for each of the
methods.


GAOl Site

Classifier    4 Bands    10 Bands    11 Bands
                (%)        (%)         (%)

GML            79.6
DRBF(1)        79.1
DRBF(2)        85.4       84.8

14014 Pixels Sampled

Table 2: GAOl Coarse Classification Accuracy


CIVILI Site

Classifier    4 Bands    10 Bands    11 Bands
                (%)        (%)         (%)

GML            75.9
DRBF(1)        76.8
DRBF(2)        79.9       80.8

12180 Pixels Sampled

Table 3: CIVILI Coarse Classification Accuracy


GAOl Site

Classifier    4 Bands    10 Bands    11 Bands
                (%)        (%)         (%)

GML            58.7
DRBF(1)        59.0
DRBF(2)        66.8       67.7

14014 Pixels Sampled

Table 4: GAOl Fine Classification Accuracy


CIVILI Site

Classifier    4 Bands    10 Bands    11 Bands
                (%)        (%)         (%)

GML            49.1
DRBF(1)        58.2
DRBF(2)        63.7       64.6

12180 Pixels Sampled

Table 5: CIVILI Fine Classification Accuracy


GAOl Site

Classifier    4 Bands    10 Bands    11 Bands
                (%)        (%)         (%)

GML            50.9       58.6
DRBF(1)        47.1
DRBF(2)        62.2       61.7

14014 Pixels Sampled

Table 6: GAOl Coarse Classification Kappa


CIVILI Site

Classifier    4 Bands    10 Bands    11 Bands
                (%)        (%)         (%)

GML            58.8       60.4        67.3
DRBF(1)        57.8        *          63.3
DRBF(2)        63.8       65.1        70.1

12180 Pixels Sampled

Table 7: CIVILI Coarse Classification Kappa


GAOl Site

Classifier    4 Bands    10 Bands    11 Bands
                (%)        (%)         (%)

GML            44.1       50.3        48.7
DRBF(1)        41.5        *          44.0
DRBF(2)        52.4       53.5        54.3

14014 Pixels Sampled

Table 8: GAOl Fine Classification Kappa


CIVILI Site

Classifier    4 Bands    10 Bands    11 Bands
                (%)        (%)         (%)

GML            38.0       41.3
DRBF(1)        46.3
DRBF(2)        53.0       53.9

12180 Pixels Sampled

Table 9: CIVILI Fine Classification Kappa
Figure 4: GAOl Generated Surface Material Classmaps
(left: Gaussian Maximum Likelihood; center: Ground Truth;
right: Differential Radial Basis Function)

Figure 5: CIVILI Generated Surface Material Classmaps
(left: Gaussian Maximum Likelihood; center: Ground Truth;
right: Differential Radial Basis Function)
3.1. Classification Accuracy

With  the  generation  of  classmaps  by  different 
classification  methods,  it  is  necessary  to  have  some 
method  for  evaluating  and  comparing  the  accuracy  of 
the  generated  classmaps  to  aid  in  the  assessment  and 
improvement  of  these  methods.  Various  techniques 
have  been  developed  and  implemented  in  the  remote 
sensing  community  to  determine  and  evaluate  the 
accuracy  of  land-use/land-cover  maps  derived  from 
remotely  sensed  data  [Aronoff  82,  Dicks  and  Lo 
90,  Fitzpatrick-Lins  81,  Story  and  Congalton  86].  The 
basic  accuracy  assessment  procedure  involves  the 
selection  of  samples  from  the  land-use/land-cover  map, 
verification of the samples using ground truth, and
statistical  evaluation  of  the  verified  samples  from  the 
land-use/land-cover  map  [Congalton,  et.  al. 
83,  Congalton,  et.  al.  84,  Congalton  and  Mead  86]. 

In  our  accuracy  assessment  process,  all  samples  (i.e. 
pixels)  contained  in  the  generated  surface  material 
classmap  are  used.  This  requires  the  availability  of  a 
surface  material  ground  truth  classmap  with  the  same 
coverage  as  the  test  area,  a  condition  rarely  found  in 
most  remote  sensing  experiments.  Our  surface 
material  ground  truth  classmap  provides  the  necessary 
ground  truth  during  the  verification  process  of  the 
generated  surface  material  classmap. 

Tables 2 through 5 each give classification
accuracies with respect to our ground truth
segmentation. The row labeled GML gives the results
for the gaussian maximum likelihood classifier. The
rows labeled DRBF(1) and DRBF(2) are the results of
the  differential  radial  basis  function  classifier.  As 
described  in  Section  2.3,  DRBF(1)  was  an  initial 
experiment  using  unbalanced  training  sets  and  was 
superseded  by  the  results  of  the  DRBF(2)  classifier, 
using  the  balanced  training  sets. 

Three  experiments  were  run.  The  first  column  shows 
accuracy  using  four  of  the  11  Daedalus  spectral  bands 
(3,  5,  7,  10)  for  classification.  These  bands  were 
selected  by  calculating  the  average  Jeffries-Matusita 
Distance  [Mausel,  et.  al.  90,  Richards  86,  Swain  and 
Davis  78]  using  the  statistics  from  the  spectral  class 
training  sets,  in  order  to  rank  spectral  class  separability 
for  all  four  band  combinations. 
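
The band selection step can be illustrated with the standard Jeffries-Matusita distance for two Gaussian classes, JM = 2(1 - exp(-B)), where B is the Bhattacharyya distance. The sketch below is an assumption-laden illustration, not the procedure actually run: the function names are hypothetical, bands are indexed from 0, and the score of a band subset is taken to be the unweighted mean of the JM distance over all class pairs.

```python
from itertools import combinations

import numpy as np

def jm_distance(mean1, cov1, mean2, cov2):
    """Jeffries-Matusita distance between two Gaussian classes (0 to 2)."""
    mean1, mean2 = np.asarray(mean1, float), np.asarray(mean2, float)
    cov1, cov2 = np.asarray(cov1, float), np.asarray(cov2, float)
    cov_avg = 0.5 * (cov1 + cov2)
    diff = mean1 - mean2
    _, logdet_avg = np.linalg.slogdet(cov_avg)
    _, logdet1 = np.linalg.slogdet(cov1)
    _, logdet2 = np.linalg.slogdet(cov2)
    # Bhattacharyya distance: mean separation term plus covariance term.
    b = 0.125 * diff @ np.linalg.solve(cov_avg, diff)
    b += 0.5 * (logdet_avg - 0.5 * (logdet1 + logdet2))
    return 2.0 * (1.0 - np.exp(-b))

def average_jm(stats, bands):
    """Mean pairwise JM distance using only the given band indices.

    stats maps a class name to (mean vector, covariance matrix).
    """
    idx = np.array(bands)
    sub = np.ix_(idx, idx)
    dists = []
    for a, b in combinations(sorted(stats), 2):
        (m1, c1), (m2, c2) = stats[a], stats[b]
        dists.append(jm_distance(np.asarray(m1)[idx], np.asarray(c1)[sub],
                                 np.asarray(m2)[idx], np.asarray(c2)[sub]))
    return float(np.mean(dists))

def rank_band_subsets(stats, n_bands, subset_size=4):
    """All band subsets of the given size, most separable first."""
    subsets = combinations(range(n_bands), subset_size)
    return sorted(subsets, key=lambda s: average_jm(stats, s), reverse=True)
```

With eleven bands and subsets of four there are 330 combinations to rank, which is small enough for an exhaustive search.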

The  second  column  shows  the  classification  accuracy 
using  all  ten  of  the  reflective  bands  of  the  Daedalus 
scanner.  The  third  column  gives  classification 
accuracy  using  all  eleven  bands,  including  the  thermal 
band.  From  a  remote  sensing  standpoint  this  generally 
would  not  be  performed,  since  the  physics  of  the 
thermal  band  is  quite  different  than  the  reflective 
bands.  Nevertheless  we  tried  this  experiment  and  were 
surprised  to  see  some  measurable  improvement  over 
the  results  using  the  ten  reflective  bands.  These 
improvements were not only evident in the
classification accuracy measure, but also in the
measure of agreement, the Kappa coefficient,
discussed in Section 3.2.

The  results  were  tabulated  in  two  ways,  first  using  the 
coarse  classification  metric  that  grouped  the  eleven 
surface  material  classes  into  five  groups:  man-made, 
vegetation,  shadow,  soil,  and  water.  The  groupings  are 
shown  in  Table  12.  The  second  tabulates  accuracy  for 
the  fine  classification  of  each  of  the  surface  material 
classes.  For  both  the  GAOl  and  CIVILI  sites  there 
appears  to  be  no  significant  difference  for  coarse 
analysis  between  the  GML  and  DRBF(2)  classifiers. 
Both  performed  quite  well,  84  to  87  percent  accuracy, 
in the 11 band case. Similar performance was
achieved  in  the  10  band  case,  with  noticeably  poorer 
results  for  the  4  band  case.  This  is  interesting  since  the 
spectral  class  separability  analysis  that  led  us  to  select 
these  four  bands  indicated  a  high  degree  of  information 
content.  This  may  well  be  the  case  with  respect  to 
spectral  class  separability,  but  it  is  clear  that  the  use  of 
additional  bands  always  improved  classification 
accuracy  in  our  experiments. 

3.2.  Measure  of  Agreement 

An additional metric, introduced to the remote sensing
community by Congalton et al. [Congalton, et. al.
83, Congalton, et. al. 84], is also commonly used as a
measure of classification accuracy. It is called the
Kappa coefficient of agreement and has been used as a
standard measure when reporting classification
accuracy [Hudson 87, Rosenfeld 86]. Cohen [Cohen
60] developed the Kappa coefficient for nominal scales
which measures the relationship of beyond chance
agreement to expected disagreement. An advantage of
the Kappa coefficient is that its calculation takes into
consideration off-diagonal entries of the error matrix,
or errors of omission and/or of commission. The
Kappa coefficient provides a measure of difference
between the observed agreement between two
classmaps and agreement that is contributed by chance.
It theoretically deflates accuracy statistics based upon
chance occurrence of correct classification [Congalton,
et. al. 83, Rosenfeld 86, Hudson 87, Dicks and Lo 90].
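
A short sketch of the Kappa calculation may help make the definition concrete. It uses the standard form of Cohen's coefficient, (observed agreement - chance agreement) / (1 - chance agreement), computed from an error matrix. The function name kappa_coefficient and the 2 x 2 matrix in the example are hypothetical and are not data from this study.

```python
import numpy as np

def kappa_coefficient(error_matrix):
    """Cohen's Kappa from a square error (confusion) matrix of pixel counts."""
    m = np.asarray(error_matrix, dtype=float)
    total = m.sum()
    observed = np.trace(m) / total
    # Chance agreement: products of matching row and column marginals.
    expected = (m.sum(axis=0) * m.sum(axis=1)).sum() / total**2
    return (observed - expected) / (1.0 - expected)

# Hypothetical two-class example: 80% overall accuracy, 50% chance agreement.
example = [[45, 5],
           [15, 35]]
print(round(kappa_coefficient(example), 3))  # 0.6
```

In this example the overall accuracy of 80% deflates to a Kappa of 0.6 because half of the agreement would be expected by chance given the marginals.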

Tables  6  through  9  show  the  Kappa  accuracies  that 
correspond  to  the  original  accuracy  analysis  in  Tables 
2  through  5.  One  can  observe  that  the  overall 
classification  accuracies  have  been  reduced 
significantly  in  most  cases,  yet  the  overall  trends  in 
relative  performance  are  maintained.  In  order  to 
understand  the  details  of  the  strengths  and  weaknesses 
of  our  classification  techniques,  we  present  a  more 
detailed  error  analysis,  in  terms  of  coarse  and  fine 
errors  of  commission  and  omission,  in  the  following 
section. 

4. Classification Performance Evaluation

In  this  section  we  describe  a  more  detailed  accuracy 
assessment  procedure  to  evaluate  the  surface  material 
classmaps  for  the  GML  and  the  DRBF(2)  classifiers. 
As we have seen in Section 3.1, the most common way
to express the accuracy of a generated classmap is by a
statement of the percentage of the classmap that has
been correctly classified when compared with a
reference classmap or ground truth. In addition to this
overview one may desire a more detailed tabulation in
the form of an error or confusion matrix. In this kind
of tally, the reference classmap (represented by the
matrix columns) is compared to the generated or test
classmap (represented by the matrix rows). The major
diagonal indicates the agreement between these two
classmaps. Overall accuracy for a particular generated
classmap is then calculated by dividing the sum of the
entries that form the major diagonal (i.e. the number of
correct classifications) by the total number of samples
(i.e. pixels) taken [Story and Congalton 86].

The off-diagonal entries indicate the omission and
commission errors. Errors of omission correspond to
pixels belonging to the spectral class of interest that the
classifier has failed to recognize (false negative),
whereas errors of commission are those that
correspond to pixels from other spectral classes that the
classifier has labeled as belonging to the spectral class
of interest (false positive). The former refer to
columns of the error matrix, whereas the latter refer to
rows [Richards 86].
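
The error matrix conventions described above (reference classmap in the columns, generated classmap in the rows) can be sketched as follows. The function names are hypothetical, and class labels are assumed to be integers from 0 to n_classes - 1.

```python
import numpy as np

def error_matrix(reference, test, n_classes):
    """Rows index the generated (test) classmap, columns the reference."""
    reference = np.asarray(reference).ravel()
    test = np.asarray(test).ravel()
    m = np.zeros((n_classes, n_classes), dtype=int)
    np.add.at(m, (test, reference), 1)  # one count per pixel
    return m

def overall_accuracy(m):
    """Sum of the major diagonal divided by the total number of pixels."""
    return np.trace(m) / m.sum()

def omission_errors(m):
    """Per reference class (column): pixels the classifier failed to recognize."""
    return m.sum(axis=0) - np.diag(m)

def commission_errors(m):
    """Per generated class (row): pixels wrongly labeled as that class."""
    return m.sum(axis=1) - np.diag(m)
```

Summed over all classes, the omission and commission counts are equal, since every misclassified pixel is an omission for its reference class and a commission for the class it was assigned to.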

4.1.  Coarse  Class  Analysis 

Due  to  our  objective  of  generating  surface  material 
classmaps  at  a  coarse  level  for  urban  multispectral 
scenes,  we  evaluated  the  performance  of  the 
classification  results  using  surface  materials  comprised 
of  man-made,  vegetation,  shadow,  soil  and  water 
features.  For  the  coarse  class  analysis,  nine  of  eleven 
spectral  classes  are  aggregated  into  three  surface 
materials  of  man-made,  vegetation  and  water  during 
performance  evaluation.  Listed  in  Table  12  are  the 
spectral  class  membership  into  the  five  coarse  surface 
materials  along  with  the  Key  used  in  the  error  matrices. 
Tables 10 and 11 contain the performance evaluation
results for the 11 band GAOl classification using GML
and DRBF(2), respectively, while Tables 13 and 14
highlight the 11 band CIVILI classification.
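
The aggregation from eleven spectral classes to five coarse surface materials can be sketched by collapsing a fine error matrix. The class membership below is an assumption inferred from the class names (Table 12 holds the actual assignment), the coarse order follows the C1 to C5 key used in the text, and the names FINE, COARSE, FINE_TO_COARSE and collapse_to_coarse are hypothetical.

```python
import numpy as np

FINE = ["asphalt", "concrete", "coniferous tree", "deciduous tree",
        "deep water", "grass", "shadow", "shallow water", "soil",
        "tile", "turbid water"]
COARSE = ["man-made", "vegetation", "shadow", "soil", "water"]  # C1..C5

# Assumed membership; nine of the eleven classes fold into three materials.
FINE_TO_COARSE = {
    "asphalt": "man-made", "concrete": "man-made", "tile": "man-made",
    "coniferous tree": "vegetation", "deciduous tree": "vegetation",
    "grass": "vegetation",
    "deep water": "water", "shallow water": "water", "turbid water": "water",
    "shadow": "shadow", "soil": "soil",
}

def collapse_to_coarse(fine_matrix):
    """Sum an 11 x 11 fine error matrix into a 5 x 5 coarse error matrix."""
    fine_matrix = np.asarray(fine_matrix)
    idx = np.array([COARSE.index(FINE_TO_COARSE[name]) for name in FINE])
    coarse = np.zeros((len(COARSE), len(COARSE)), dtype=fine_matrix.dtype)
    np.add.at(coarse, (idx[:, None], idx[None, :]), fine_matrix)
    return coarse
```

A confusion between two members of the same coarse material, such as asphalt and concrete, moves onto the diagonal after collapsing, which is one reason coarse accuracies are higher than fine accuracies.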

The  GML  overall  classification  accuracies  are  87.8% 
and  83.2%  for  GAOl  and  CIVILI  while  the  DRBF(2) 
overall  accuracies  are  85.5%  and  84.3%.  Based  on  the 
overall  accuracies  from  both  sites,  the  GML  and 
DRBF(2)  classification  results  appear  very  similar. 
Upon  inspection  of  the  error  matrices,  man-made  and 
vegetation  features  account  for  approximately  90%  of 
the  surface  materials  contained  in  both  sites’  ground 
truth. These features dominate the performance
evaluation  process  when  determining  the  overall 
classification  accuracy  which  fails  to  indicate 
differences  between  the  GML  and  DRBF(2) 
classification  results. 

Further examination of the GAOl and CIVILI error
matrices at the coarse classification level shows some
interesting trends. Referring to Tables 10 through 14,
the GML was able to distinguish between water and
man-made or shadow significantly better than the
DRBF(2). The GML classified 23 shadow (column
C3) pixels as water (row C5) in the GAOl site and only
1 shadow pixel as water for CIVILI. In comparison, the
DRBF(2) assigned 513 man-made (column C1) and
317 shadow (column C3) pixels as water (row C5)
from GAOl while labeling 196 man-made and 87
shadow pixels as water in CIVILI. The DRBF(2)’s
inability to discriminate water from man-made and
shadow was partially influenced by the unbalanced
nature of the water training sets originally presented to
it as shown in Table 1.

In  terms  of  locating  shadow  (row  C3,  column  C3) 
pixels,  the  DRBF(2)  omitted  259  and  249  fewer 
candidate  pixels  than  the  GML  in  the  GAOl  and  CIVILI 
sites,  respectively.  Almost  60%  of  shadow  (column 
C3) pixels were classified as man-made (row C1) by
the  GML. 

The DRBF(2) also incorrectly labeled significantly
fewer man-made pixels as soil when compared to the
GML for both test sites. Examining the error matrices
for CIVILI, the DRBF(2) classified 348 fewer man-made
(column C1) pixels as soil (row C4) than the
GML. An example of classifying man-made pixels as
soil by the GML is illustrated in Figure 5 along the
rooftop of the Department of Interior building in the
center-right portion of the image.

At the coarse level, it is not obvious which man-made
and water surface material members are contributing to
the classification error. Examining the fine
classification results against the ground truth at the
detailed class level provides more insight into these
discrimination errors between individual surface
materials.

4.2.  Detailed  Class  Analysis 

In the detailed class analysis, all eleven spectral classes
are examined as individual surface materials. Tables
15 and 16 contain the performance evaluation results
for the 11-band classification of GAO1 and CIVIL1 using
the GML and DRBF(2). The GML overall
classification accuracies are 63.6% and 55.3% for
GAO1 and CIVIL1 while the DRBF(2) overall accuracies
are 68.7% and 68.8%. From the overall accuracies, the
GML and DRBF(2) performed about the same for
GAO1 but are dramatically different for CIVIL1, with the
DRBF(2) out-performing the GML.

The sharp drop in GML classifier accuracy between
GAO1 and CIVIL1 is due to the major differences in
relative scene material composition. The CIVIL1 site is
composed of 33% vegetation (i.e. grass and deciduous
tree) while the GAO1 site contains 13% vegetation. The
ground truth for GAO1 and CIVIL1 in Figures 4 and 5
illustrates the increase in vegetation composition.
Referring to the GML error matrix in Table 16, it is
clear that the GML has difficulty in distinguishing

Rows: GML Surface Material Test. Columns: Surface Material Ground Truth.

                         C1      C2      C3    C4    C5   Row Total   Commission Error (%)
C1                    10648     613     756     0     0       12017                   11.4
C2                       12    1205      24     0     0        1241                    2.9
C3                       41       2     451     0     0         494                    8.7
C4                      201      38       0     0     0         239                  100.0
C5                        0       0      23     0     0          23                  100.0
Column Total          10902    1858    1254     0     0       14014
Omission Error (%)      2.3    35.1    64.0     -     -

GML Overall Accuracy = 12304 / 14014 = 87.8%


Table 10: Error Matrix Showing Results of GAO1
Coarse Classification Using GML


Rows: DRBF(2) Surface Material Test. Columns: Surface Material Ground Truth.

                         C1      C2      C3    C4    C5   Row Total   Commission Error (%)
C1                    10176     720     186     0     0       11082                    8.2
C2                       85    1089      41     0     0        1215                   10.4
C3                       73       8     710     0     0         791                   10.2
C4                       55      24       0     0     0          79                  100.0
C5                      513      17     317     0     0         847                  100.0
Column Total          10902    1858    1254     0     0       14014
Omission Error (%)      6.7    41.4    43.4     -     -

DRBF(2) Overall Accuracy = 11975 / 14014 = 85.5%


Table 11: Error Matrix Showing Results of GAO1
Coarse Classification Using DRBF(2)


Key   Surface Material   Members
C1    Man-Made           Asphalt, Concrete, Tile
C2    Vegetation         Grass, Coniferous Tree, Deciduous Tree
C3    Shadow             Shadow
C4    Soil               Soil
C5    Water              Deep Water, Shallow Water, Turbid Water

Table 12: Error Matrix Legend for
Coarse Classification


between deciduous trees and grass. It labeled 1560
pixels as grass (row C4) when they were actually
deciduous trees (column C3). To the GML, portions of


Rows: GML Surface Material Test. Columns: Surface Material Ground Truth.

                         C1      C2      C3    C4    C5   Row Total   Commission Error (%)
C1                     6703     859     502     0     0        8064                   16.9
C2                       35    3159      25     0     0        3219                    1.9
C3                       22      10     270     0     0         302                   10.6
C4                      523      60      11     0     0         594                  100.0
C5                        0       0       1     0     0           1                  100.0
Column Total           7283    4088     809     0     0       12180
Omission Error (%)      8.0    22.7    66.6     -     -

GML Overall Accuracy = 10132 / 12180 = 83.2%


Table 13: Error Matrix Showing Results of CIVIL1
Coarse Classification Using GML


Rows: DRBF(2) Surface Material Test. Columns: Surface Material Ground Truth.

                         C1      C2      C3    C4    C5   Row Total   Commission Error (%)
C1                     6740     956     154     0     0        7850                   14.1
C2                      117    3011      49     0     0        3177                    5.2
C3                       55      34     519     0     0         608                   14.6
C4                      175      74       0     0     0         249                  100.0
C5                      196      13      87     0     0         296                  100.0
Column Total           7283    4088     809     0     0       12180
Omission Error (%)      7.5    26.3    35.8     -     -

DRBF(2) Overall Accuracy = 10270 / 12180 = 84.3%


Table 14: Error Matrix Showing Results of CIVIL1
Coarse Classification Using DRBF(2)

the  deciduous  tree  canopy  appear  spectrally  similar  to 
grass  while  the  DRBF(2)  is  able  to  discriminate 
between  the  two  vegetated  surface  materials.  In  the 
coarse  analysis,  this  error  was  not  observed  due  to  the 
grouping  of  grass  and  deciduous  tree  into  vegetation. 
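
The effect of grouping can be illustrated by collapsing a fine error matrix into coarse classes. The sketch below assumes rows are classifier output and columns are ground truth; the three-class counts are made up for illustration and are not taken from Tables 15 or 16, and `collapse` is a hypothetical helper name.

```python
def collapse(matrix, labels, mapping):
    """Sum a fine error matrix into coarse classes given a fine->coarse mapping."""
    coarse_labels = []
    for name in labels:
        if mapping[name] not in coarse_labels:
            coarse_labels.append(mapping[name])
    index = {name: k for k, name in enumerate(coarse_labels)}
    coarse = [[0] * len(coarse_labels) for _ in coarse_labels]
    for i, row_name in enumerate(labels):
        for j, col_name in enumerate(labels):
            coarse[index[mapping[row_name]]][index[mapping[col_name]]] += matrix[i][j]
    return coarse_labels, coarse


def accuracy(m):
    """Diagonal sum over total count."""
    return sum(m[i][i] for i in range(len(m))) / sum(map(sum, m))


# Made-up counts for illustration only (not values from the paper).
labels = ["grass", "deciduous tree", "asphalt"]
fine = [
    [50, 30,  2],   # labeled grass
    [10, 40,  3],   # labeled deciduous tree
    [ 5,  5, 90],   # labeled asphalt
]
mapping = {"grass": "vegetation", "deciduous tree": "vegetation", "asphalt": "man-made"}

coarse_labels, coarse = collapse(fine, labels, mapping)
```

In this toy case the grass/tree confusions move onto the diagonal of the coarse matrix, so the coarse accuracy is higher than the fine accuracy even though the classifier has not changed.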

We noted in the coarse class analysis that the DRBF(2)
had difficulty in distinguishing between water and
man-made or shadow. From the error matrices for the
fine classification, the DRBF(2) is labeling asphalt
(column C1) and shadow (column C5) pixels as deep
water (row C9) and turbid water (row C11). As
previously stated, it is believed that a more balanced
training set would alleviate these errors.

The GML's higher number of soil pixels, as observed
in the coarse analysis, is related to the GML's
misclassification of concrete (column C2) pixels as soil
(row C6). The spectral characterization of soil and
concrete is evidently too similar under the GML
model.

In general, both classifiers had difficulty in

Rows: Surface Material Test. Columns: Surface Material Ground Truth.
Cells not shown are illegible in the source.

Key   Surface Material     Key   Surface Material
C1    Asphalt              C7    Tile
C2    Concrete             C8    Coniferous Tree
C3    Deciduous Tree       C9    Deep Water
C4    Grass                C10   Shallow Water
C5    Shadow               C11   Turbid Water
C6    Soil

Ground truth (both classifiers)
                        C1     C2     C3    C4     C5   C6    C7   C8   C9   C10   C11   Total
Column Total          6355   4193   1290   568   1254    0   354    0    0     0     0   14014

GML
                        C1     C2     C3    C4     C5
Test row C1           4637    873    472    70    152
Test row C2            658   2936      0
Omission Error (%)    27.0   30.0   95.0  14.3   64.0

Row Total (legible entries): C1 6222, C2 3603, C7 2192

GML Overall Accuracy = 8907 / 14014 = 63.6%

DRBF(2)
                        C1     C2     C3    C4     C5
Test row C1           4988   1069    642    69    164
Test row C2            674   2896
Omission Error (%)    21.5   30.9   74.7  21.3   43.4

Row Total (C1 to C11): 7018, 3579, 413, 729, 791, 79, 485, 73, 719, 0, 128

DRBF(2) Overall Accuracy = 9626 / 14014 = 68.7%

Table 15: Error Matrices Showing Results of GAO1 Fine Classification Using GML and DRBF(2)

Rows: Surface Material Test. Columns: Surface Material Ground Truth.
Cells not shown are illegible in the source.

Key   Surface Material     Key   Surface Material
C1    Asphalt              C7    Tile
C2    Concrete             C8    Coniferous Tree
C3    Deciduous Tree       C9    Deep Water
C4    Grass                C10   Shallow Water
C5    Shadow               C11   Turbid Water
C6    Soil

Ground truth (both classifiers)
                        C1     C2     C3     C4    C5   C6   C7   C8   C9   C10   C11   Total
Column Total          4305   2915   2927   1161   809    0   63    0    0     0     0   12180

GML
                        C1     C2     C3     C4    C5
Test row C1           3201    409    516    216    37
Omission Error (%)    25.6   40.5   80.5

GML Overall Accuracy = 6738 / 12180 = 55.3%

DRBF(2)
                        C1     C2     C3     C4    C5
Test row C1           3554    381    705    229    60
Omission Error (%)    17.4   27.7   49.6   41.3  35.8

Row Total (C1 to C11): 4945, 2476, 1729, 1326, 608, 249, 429, 122, 168, 0, 128

DRBF(2) Overall Accuracy = 8378 / 12180 = 68.8%

Table 16: Error Matrices Showing Results of CIVIL1 Fine Classification Using GML and DRBF(2)

discriminating asphalt (row C1) from deciduous tree
(column C3) and grass (column C4). These pixels occur in
transition boundaries between streets and overhanging
tree canopies or lawn areas, resulting in mixed pixels.
Mixed pixels are a problem commonly encountered in
multispectral classification and are usually rectified
in a post-processing step. We have not addressed these
issues to date in our research.

5.  Conclusions 

The goal of this paper was to present a performance
evaluation of two classification techniques, Gaussian
maximum likelihood (GML) and differential radial
basis function (DRBF). This analysis was performed
using two urban test sites in Washington D.C. for
which we have high resolution mapping photography.
What we have found is that the GML and DRBF
classifiers appear to perform at similar accuracy rates
for coarse surface material classification tasks. For a
more detailed surface material classification task, the
DRBF performed significantly better than the GML
when the amount of grass and deciduous tree scene
composition increased. The DRBF was able to
distinguish between deciduous tree and grass more
reliably than the GML as demonstrated by the CIVIL1
test site. We have not explored the extent to which the
two methods provide consistent estimates of surface
materials, and this may lead to a combined or hybrid
analysis technique.

We  have  also  shown  that  when  evaluating 
classification  accuracy,  the  overall  accuracy  can  be 
misleading  if  a  dominant  feature  is  present. 
Consultation  of  the  error  matrix  should  be  part  of  the 
performance  evaluation  process  to  gain  a  better 
understanding  of  the  classification  results. 

At  the  present  time,  all  pixels  contained  in  the  test  site 
are  used  during  performance  evaluation.  In  the  coarse 
class  analysis,  it  was  seen  that  a  dominant  surface 
material  can  bias  the  overall  classification  accuracy. 
Alternative  sampling  schemes  for  determining 
classification  accuracy  to  remove  bias  of  dominant 
scene  features  will  be  investigated. 
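
As one illustration of how a dominant class biases the overall figure, the per-class accuracies can be averaged with equal weight instead of weighting every pixel equally. This is only a sketch of one possible alternative measure, not the sampling scheme to be investigated; the counts are the coarse error matrices for the first test site from Tables 10 and 11, and `mean_class_accuracy` is a hypothetical name.

```python
# Rows: classifier output; columns: ground truth (C1..C5).
GML = [
    [10648,  613, 756, 0, 0],
    [   12, 1205,  24, 0, 0],
    [   41,    2, 451, 0, 0],
    [  201,   38,   0, 0, 0],
    [    0,    0,  23, 0, 0],
]
DRBF2 = [
    [10176,  720, 186, 0, 0],
    [   85, 1089,  41, 0, 0],
    [   73,    8, 710, 0, 0],
    [   55,   24,   0, 0, 0],
    [  513,   17, 317, 0, 0],
]


def overall_accuracy(m):
    """Pixel-weighted accuracy: diagonal sum over total count."""
    return sum(m[i][i] for i in range(len(m))) / sum(map(sum, m))


def mean_class_accuracy(m):
    """Average of (1 - omission error) over classes present in the ground truth."""
    rates = []
    for j in range(len(m)):
        col_total = sum(row[j] for row in m)
        if col_total > 0:
            rates.append(m[j][j] / col_total)
    return sum(rates) / len(rates)
```

On these counts the pixel-weighted accuracy favors the GML (87.8% against 85.5%), while the equally weighted class average is roughly 66% for the GML and roughly 70% for the DRBF(2), because the shadow class counts as much as the far larger man-made class.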

We plan to continue our research in the fusion of stereo
and monocular cues with surface material classification
to improve detection and delineation of buildings,
roads, and other man-made structures. Our previous
research has shown our ability to co-register stereo
disparity data and surface material classmaps. With a
firm basis for the accuracy of our material
classifications, we can begin to focus on higher level
analysis towards object detection, delineation, and
recognition.

Acknowledgements 

We thank SPOT Image Corporation for supplying the
Daedalus ATM multispectral imagery flown in the
1983 SPOT Simulation Campaign. Tom Ory of
Daedalus Enterprises, Inc. was very helpful in
supplying information concerning the Daedalus sensor
characteristics and image format.

We  thank  members  of  the  Digital  Mapping  Laboratory 
for  providing  an  interesting  research  environment. 
Particular  thanks  go  to  Ed  Allard  for  software  support 
for  data  exchange  and  evaluation  analysis,  and  Jeffrey 
Shufelt  for  valuable  comments  on  an  earlier  draft  of 
this  paper. 


Incorporating  Vanishing  Point  Geometry  Into  a  Building  Extraction  System 


J.  Chris  McGlone 
Jefferey  A.  Shufelt 

Digital Mapping Laboratory
School of Computer Science
Carnegie Mellon University
Pittsburgh, PA 15213-3891


Abstract 

Knowledge about the imaging geometry and
acquisition parameters provides useful geometric
constraints for the analysis and extraction of
man-made features in aerial imagery, particularly in
oblique views. In this paper, we discuss the
application of vanishing points for the
identification of vertical and horizontal lines, and
the use of multiple views for verification of these
lines. The vertical and horizontal attributions are
used to constrain the set of possible building
hypotheses. Preliminary results exploiting these
attributions are described.*

1.  Introduction 

Building extraction is a fundamental problem in
automated cartography [Nicolin 87, Huertas
88, Mohan 89, Irvin 89, Liow 90, McKeown
90, Shufelt 93]. Systems implemented to date have
had basic similarities: all have used vertical aerial
imagery, assuming simplified imaging geometry in
their calculations, and all have used intensity
features as the basic cues for feature extraction.
Several have made use of shadow geometry for
hypothesis generation and verification. Low level
boundary determination is usually region-based or
based upon geometric analysis of lines found in the
image.


* This work was sponsored by the Defense Advanced
[...] Projects Agency or the United States Government.


Many of these techniques exhibit poor
performance when building structures are
composed of complex shapes, when there is poor
contrast between object and background, and when
viewing geometry, building height, and building
density cause occlusions and partial views, or
views of surfaces other than the building roof. As
a result, even in the case of nominally nadir
imagery, the three-dimensional nature of the world
cannot be ignored. In the case of non-traditional
mapping photography, particularly oblique views
used in aerial photo-interpretation, there is a
greater need to explicitly model the viewing
geometry; such modeling needs to be performed
within the context of a rigorous photogrammetric
calculation in order to take advantage of all
geometric information available.

Our current experiments have been focused on the
modification of BABE (Builtup Area Building
Extraction), a building detection system built at
CMU [McKeown 90] based on a line-corner
analysis method. We have been experimenting
with the inclusion of geometric constraints derived
from knowledge of the full camera position and
orientation. In brief, BABE proceeds through four
major phases to incrementally generate building
hypotheses. The first phase constructs corners
from lines, under the assumption that buildings can
be modeled by straight line segments linked by
(nearly) right-angled corners. The second phase
constructs chains of edges which are linked by
corners, to serve as partial structural hypotheses.
The third phase uses these line-corner structures to
hypothesize boxes, parallelopipeds which may
delineate man-made features in the scene. The
fourth phase evaluates the boxes in terms of size
and line intensity constraints, and the best boxes
for each chain are kept, subject to shadow intensity
constraints similar to those proposed in [Nicolin
87] and [Huertas 88]. In addition, the results of the
third phase of analysis are directly used as sources
of building hypotheses for other modules that
perform grouping, shadow analysis, and stereo
matching.

Our  initial  modifications  to  the  BABE  system 
include  the  use  of  a  rigorous  photogrammetric 
camera  model,  the  incorporation  of  vanishing  point 
geometry  as  an  additional  input  to  the  building 
hypothesis  construction  process,  and  the 
substitution  of  exact  metric  calculations  for 
distances  and  angles  instead  of  approximations 
based  upon  image  scale  and  near-nadir  orientation. 
This  paper  describes  the  current  status  of  the  BABE 
system,  starting  with  an  overview  of  vanishing 
point  geometry  as  used  for  the  extraction  of 
horizontal  and  vertical  edges  and  a  brief 
description  of  the  BABE  system.  The  current 
integration  of  the  vanishing  point  information  into 
BABE  is  outlined  and  some  preliminary  results  are 
given. 

2.  Vanishing  point  geometry 

As  is  well  known  from  projective  geometry 
[Barnard  83],  parallel  lines  in  a  scene  meet  in  a 
common  point  in  an  image  of  the  scene.  This  point 
is  known  as  the  vanishing  point,  since  it  is  the 
image  of  a  point  at  infinity.  In  an  aerial  image 
vertical  lines  in  the  scene  meet  at  the  vertical 
vanishing  point,  traditionally  referred  to  as  the 
nadir  point  because  it  is  directly  below  the 
perspective  center  of  the  image.  Sets  of  parallel 
lines  at  varying  orientations  in  a  plane  have 
vanishing  points  which  lie  along  a  straight  line  in 
the  image.  If  the  sets  of  parallel  lines  are 
horizontal,  the  line  is  the  true  horizon. 

This  apparent  convergence  of  parallel  lines  gives 
important  cues  to  the  orientation  of  the  image  and 
to  the  structure  of  objects  within  the  scene. 
Previous  work  has  looked  at  using  vanishing  points 
to  determine  image  orientation  [Barnard  83]  and  to 
determine  the  structure  of  objects  within  the  scene 
[Brillault  92,  Lebegue  92]. 

Most previous work using vanishing point
geometry has been done with robotics imagery
from standard video cameras viewing objects at
close range. The applicability of vanishing point
analysis is obvious; perspective effects are strong
due to the wide angle lenses, close objects, and
often oblique viewing angles. Image edges
corresponding to hallways, doors, and structures
are numerous, long and usually have high contrast,
allowing good solutions for vanishing points and
image orientations.

Aerial  imagery  presents  different  problems.  The 
standard  vertical  viewpoint  lessens  perspective 
effects,  while  individual  objects  cover  a  much 
smaller  proportion  of  the  image.  Vertical  lines  in 
particular  are  less  prominent,  typically  only  a  few 
pixels  long.  Edge  contrast  may  be  lessened  due  to 
illumination  and  atmospheric  conditions.  It  is  well 
known  that  standard  edge  detectors  have  problems 
extracting  such  short,  weak  edges,  often  distorting 
their  geometry  or  mistakenly  combining  them  with 
intersecting  edges. 

Further, in cartographic applications it is assumed
that the aircraft position and orientation in space are
fairly well known, and camera properties such as
focal length, distortion, and sensor type (film,
scanning array, etc.) are quite well modeled. For
these reasons, our approach starts with the
assumption that the orientation of the aerial image
is known beforehand. Instead of using the
vanishing points to determine image orientation,
we focus on using the vanishing point geometry to
assist in extracting buildings. Of course, given
strong enough vanishing point information from
the image the orientation can be refined, but in this
work no refinement was attempted.

This  section  outlines  the  calculation  of  the  vertical 
vanishing  point  and  the  horizon  line,  and  the 
identification  of  vertical  and  horizontal  lines  using 
this  information. 

2.1.  Calculation  of  the  vertical  vanishing 
point 

The  image  orientation  is  specified  by  a  3  by  3 
orientation  matrix  M  which  rotates  the  ground 
coordinate  system  into  the  image  coordinate 
system.  This  matrix  is  determined  by  three 
independent  orientation  angles  or  parameters,  e.g., 
roll,  pitch,  and  yaw  [Slama  80]. 

The  vertical  vector  in  object  space  is  [0,  0,  1] 
and  is  transformed  into  the  image  coordinate 
system  by  multiplication  with  the  ground-to-image 
orientation  matrix  M. 
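
The calculation can be sketched in a few lines. The sketch assumes the convention of this section (perspective center at (0, 0, f), image plane z = 0), so that the transformed vertical vector is the third column of M and the vanishing point is where a ray along it from the perspective center meets the image plane. The function names and the single-axis tilt matrix are hypothetical and serve only as a test case.

```python
import math


def vertical_vanishing_point(M, f, eps=1e-12):
    """Image coordinates of the vertical vanishing point for a 3x3 matrix M.

    M rotates ground coordinates into image coordinates, so M applied to
    [0, 0, 1] is simply the third column of M.
    """
    vx, vy, vz = M[0][2], M[1][2], M[2][2]
    if abs(vz) < eps:
        return None  # vertical direction is parallel to the image plane
    # Ray from (0, 0, f) along (vx, vy, vz) reaches z = 0 at t = -f / vz.
    return (-f * vx / vz, -f * vy / vz)


def tilt_about_x(t):
    """Hypothetical test orientation: rotation by angle t about the x axis."""
    c, s = math.cos(t), math.sin(t)
    return [[1.0, 0.0, 0.0], [0.0, c, s], [0.0, -s, c]]


IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
```

For an untilted (nadir) image the point falls at the image origin, and as the tilt grows it moves away from the center, which matches the observation that for oblique imagery the vertical vanishing point usually lies outside the frame.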
Figure 1: Fort Hood test area RADT9WOB.


When the vector is placed at the perspective
center of the image (coordinates 0, 0, f), it pierces
the image plane z = 0 at

$$x = -f \, \frac{m_{13}}{m_{33}}, \qquad y = -f \, \frac{m_{23}}{m_{33}}$$

where $m_{ij}$ denotes the element in row i, column j of M.

Since this vector is vertical it is parallel to all other
vertical lines in the scene and its image must pass
through the vertical vanishing point. However, its
image, where the vector pierces the image plane, is
only a point; its image must therefore be the
vertical vanishing point.

2.2. Horizontal vanishing point
determination

The horizontal vanishing points are calculated
using a variant of the Gaussian sphere technique
first applied in [Barnard 83].

The Gaussian sphere represents a vector
orientation in 3-space as a point on the sphere. As
in [Barnard 83] we assume that the perspective
center of the image is at the center of the sphere,
the origin of the image coordinate system, and the
image is tangent to the sphere.

An  "interpretation  plane"  is  associated  with  each 
line  in  the  image,  passing  through  the  image  line 
and  the  perspective  center.  The  interpretation 
plane  is  represented  on  the  sphere  by  the  point 
corresponding  to  the  orientation  of  its  normal  and, 
in  a  dual  sense,  by  the  great  circle  formed  by  the 
intersection  of  the  plane  and  sphere. 

The  great  circles  cut  by  the  interpretation  planes 
corresponding  to  parallel  lines  in  the  scene 
intersect  on  the  sphere  in  the  vanishing  point  for 
that  set  of  parallel  lines.  The  vanishing  points  for 
sets  of  parallel  lines  of  different  orientations  in  a 
plane  lie  on  the  vanishing  line  for  that  plane  —  i.e., 
for  horizontal  lines,  on  the  horizon  line. 
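
The relation between image lines, interpretation planes, and vanishing points can be sketched with cross products. The sketch assumes the perspective center at the origin and the image plane at z = -f; this sign convention and all function names are assumptions made for illustration. The normal of an interpretation plane is the cross product of the rays to two points on the image line, and the common direction of two such planes is the cross product of their normals.

```python
def cross(a, b):
    """Cross product of two 3-vectors."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def interpretation_plane_normal(p, q, f):
    """Normal of the plane through the perspective center and image points p, q."""
    return cross((p[0], p[1], -f), (q[0], q[1], -f))


def vanishing_direction(n1, n2):
    """Direction shared by two interpretation planes (their line of intersection)."""
    return cross(n1, n2)


def project_to_image(d, f, eps=1e-12):
    """Image point where direction d from the perspective center meets z = -f."""
    if abs(d[2]) < eps:
        return None  # direction parallel to the image plane: point at infinity
    s = -f / d[2]
    return (s * d[0], s * d[1])
```

As a check, two image lines that meet at (10, 20) should yield a vanishing direction that projects back to that same image point.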

Using  the  known  image  orientation  we  calculate 
the  horizon  plane  and  its  corresponding  great  circle 
on  the  sphere,  the  vanishing  line  for  horizontal 
lines.  (The  normal  to  the  horizontal  plane  is,  of 
course,  the  vertical  vector.)  The  interpretation 
plane  corresponding  to  each  horizontal  line  in  the 
scene  will  intersect  the  horizon  line  at  the 
vanishing  point  for  horizontal  lines  in  that 
direction. 

To  identify  parallel  sets  of  horizontal  lines,  the 
great  circle  for  each  line  in  the  image  is  formed 
and  intersected  with  the  horizon.  The  horizon 
great  circle  is  divided  into  equal  bins  along  its  arc 
length  (instead  of  quantizing  azimuth  or  elevation) 
and  the  number  of  intersections  within  each  bin  is 
tallied.  The  bin  with  the  maximum  number  of 
intersections  corresponds  to  the  most  numerous  set 
of  parallel  lines  within  the  scene. 
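
A minimal sketch of this tallying step follows. It assumes each image line is represented by the normal n of its interpretation plane and the horizon plane by the vertical vector v, so the intersection with the horizon great circle is the direction n x v. Position along the great circle is measured as an angle in an arbitrary basis of the horizon plane, and antipodal points are folded into the same bin because they denote the same line direction. The folding, the bin count, and the function names are assumptions made for illustration.

```python
import math


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalize(a):
    n = math.sqrt(dot(a, a))
    return tuple(x / n for x in a)


def horizon_basis(v):
    """Two orthonormal vectors spanning the plane perpendicular to v."""
    v = normalize(v)
    helper = (1.0, 0.0, 0.0) if abs(v[0]) < 0.9 else (0.0, 1.0, 0.0)
    e1 = normalize(cross(v, helper))
    e2 = cross(v, e1)
    return e1, e2


def horizon_bin(n, v, nbins):
    """Bin index of the intersection of great circle n with the horizon."""
    e1, e2 = horizon_basis(v)
    d = cross(n, v)  # lies in both the interpretation plane and the horizon plane
    if dot(d, d) < 1e-18:
        return None  # interpretation plane coincides with the horizon plane
    theta = math.atan2(dot(d, e2), dot(d, e1)) % math.pi  # fold antipodes
    return int(theta / math.pi * nbins) % nbins


def horizon_histogram(normals, v, nbins=180):
    """Tally intersections per equal-arc bin along the horizon great circle."""
    hist = [0] * nbins
    for n in normals:
        k = horizon_bin(n, v, nbins)
        if k is not None:
            hist[k] += 1
    return hist
```

Equal steps in the angle correspond to equal arc lengths on the great circle, which is the quantization described above.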

3. Identification of vertical and horizontal
lines

Given the background for the determination of the
vertical vanishing point and our method for
determining parallel sets of horizontal lines, we
proceed with a demonstration and discussion of
current performance using an oblique image of a
barracks area over Fort Hood, Texas.

3.1.  Vertical  lines 

In  order  to  find  vertical  lines  in  the  scene  each 
edge  in  the  image  is  fit  to  a  line  constrained  to  pass 
through  the  vanishing  point,  leaving  only  the  slope 
of the line to be determined. The residuals are
calculated and if the rms error exceeds 2.0 pixels,
the  edge  is  eliminated.  Since  extremely  short 
edges  will  have  small  residuals  for  any  orientation 
of  line  fit,  edges  below  a  minimum  length  are 
eliminated. 
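
The constrained fit can be sketched as follows. The sketch assumes a least-squares fit that minimizes perpendicular distances to a line through the vanishing point, which is one reasonable reading of the test and not necessarily the fit used in the system. The 2.0 pixel threshold is from the text; the minimum length value is a placeholder, since no value is given, and the edge is assumed to be an ordered list of pixel coordinates.

```python
import math


def fit_through_point(points, vp):
    """Fit a line through vp to the points; return (direction angle, rms residual)."""
    sxx = syy = sxy = 0.0
    for x, y in points:
        dx, dy = x - vp[0], y - vp[1]
        sxx += dx * dx
        syy += dy * dy
        sxy += dx * dy
    # Direction that minimizes the sum of squared perpendicular distances.
    phi = 0.5 * math.atan2(2.0 * sxy, sxx - syy)
    c, s = math.cos(phi), math.sin(phi)
    sq = sum((-(x - vp[0]) * s + (y - vp[1]) * c) ** 2 for x, y in points)
    return phi, math.sqrt(sq / len(points))


def is_vertical_candidate(points, vp, max_rms=2.0, min_length=10.0):
    """Apply the length and rms tests; min_length is a placeholder value."""
    (x0, y0), (x1, y1) = points[0], points[-1]
    if math.hypot(x1 - x0, y1 - y0) < min_length:
        return False  # too short for the residuals to be meaningful
    _, rms = fit_through_point(points, vp)
    return rms <= max_rms
```

An edge lying on a line through the vanishing point passes, an edge of similar length that points elsewhere fails the rms test, and a very short edge is rejected before fitting.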


Figure 2: Edges for test area RADT9WOB.


The  same  resection  that  produces  the  image 
orientation  used  to  calculate  the  vertical  vanishing 
point  also  calculates  the  precision  of  the 
orientation  angles,  from  which  the  precision  of  the 
vanishing  point  location  can  be  determined  and 
used  to  set  the  acceptance  criteria  for  slopes  and 
line  fitting.  For  oblique  imagery,  where  the 
vanishing  point  is  usually  outside  the  image  area 
itself,  the  precision  has  a  small  effect.  For  vertical 
images,  however,  the  vertical  vanishing  point  is 
near  the  center  of  the  frame  and  is  close  to  the 
edges  being  tested.  Error  in  its  location  can  change 
the  slope  of  the  test  line  significantly  and  should  be 
taken  into  account  in  the  line  fitting  procedure. 

As  a  further  test  a  line  not  constrained  to  pass 
through  the  vanishing  point  is  also  fitted  to 
accepted  edges  and  the  slope  of  that  line  compared 
to  the  direction  from  the  centroid  of  the  edge  to  the 
vanishing  point.  If  the  slopes  do  not  agree  within 
an  angular  tolerance  of  0.2  radians,  the  line  is 
eliminated. 
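
The vertical-line tests described above can be sketched in Python as follows. This is an illustrative sketch only, not the authors' implementation: the total-least-squares fit, the function names, and the 5-pixel minimum length are assumptions, while the 2.0 pixel rms limit and the 0.2 radian tolerance are taken from the text.

```python
import math


def principal_angle(points, origin):
    """Angle of the total-least-squares line through `origin` (assumed fit)."""
    sxx = sum((x - origin[0]) ** 2 for x, y in points)
    syy = sum((y - origin[1]) ** 2 for x, y in points)
    sxy = sum((x - origin[0]) * (y - origin[1]) for x, y in points)
    return 0.5 * math.atan2(2.0 * sxy, sxx - syy)


def rms_residual(points, origin, angle):
    """RMS perpendicular distance of the points from the line through `origin`."""
    nx, ny = -math.sin(angle), math.cos(angle)  # unit normal of the line
    total = sum(((x - origin[0]) * nx + (y - origin[1]) * ny) ** 2
                for x, y in points)
    return math.sqrt(total / len(points))


def is_vertical_candidate(points, vp, min_length=5.0, max_rms=2.0,
                          max_angle=0.2):
    """Apply the length, constrained-fit, and free-fit tests to one edge.

    `points` are the (x, y) pixels of the edge, `vp` the vertical
    vanishing point. `min_length` is a hypothetical value.
    """
    # Very short edges fit any orientation, so drop them first.
    if math.dist(points[0], points[-1]) < min_length:
        return False
    # Fit a line forced through the vanishing point; only its angle is free.
    theta_vp = principal_angle(points, vp)
    if rms_residual(points, vp, theta_vp) > max_rms:
        return False
    # Fit an unconstrained line through the centroid and compare its
    # direction with the direction from the centroid to the vanishing point.
    cx = sum(x for x, _ in points) / len(points)
    cy = sum(y for _, y in points) / len(points)
    theta_free = principal_angle(points, (cx, cy))
    theta_to_vp = math.atan2(vp[1] - cy, vp[0] - cx)
    diff = abs(theta_free - theta_to_vp) % math.pi
    diff = min(diff, math.pi - diff)  # lines have no sign, so fold to [0, pi/2]
    return diff <= max_angle
```

An edge lying along the direction to the vanishing point passes all three tests, while an edge of the same length at right angles to that direction is rejected by the rms test.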

Figure 3: Horizontal edges.

3.2.  Horizontal  edge  extraction 

Our goal is to identify man-made structures, which
are usually defined by perpendicular sets of
parallel lines. We therefore examine the histogram
bins (as described in section 2.2) and, instead of
choosing the single bin with the maximum score,
we add the score of each bin to the scores of the
bins representing directions perpendicular to it.
The  maximum  of  this  sum  indicates  the  directions 
of  the  strongest  mutually  perpendicular  sets  of 
parallel  lines  in  the  scene.  In  areas  where 
buildings  and  roads  are  all  on  a  common  grid,  this 
is  sufficient;  in  areas  where  not  all  the  buildings 
are  parallel  to  each  other,  secondary  maxima  can 
be  determined  to  label  buildings.  Bins 
corresponding  to  perpendicular  directions  can  be 
easily  determined,  since  right  angles  at  the  center 
of  the  sphere  between  points  on  the  horizon  great 
circle  correspond  to  true  right  angles  between 
horizontal  lines  in  object  space. 
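
A minimal sketch of the perpendicular-bin summation is given below. It assumes, purely for illustration, that the histogram bins are evenly spaced over 180 degrees of direction along the horizon great circle, so that the bin perpendicular to bin i lies half the histogram away; the function name and the bin layout are hypothetical and not taken from section 2.2.

```python
def strongest_perpendicular_pair(scores):
    """Return (bin, perpendicular_bin, summed_score) with the largest sum.

    `scores` holds one score per direction bin, with the bins assumed to be
    evenly spaced over 180 degrees, so bin i and bin i + n/2 are at right
    angles to each other.
    """
    n = len(scores)
    if n % 2 != 0:
        raise ValueError("an even number of bins is assumed")
    half = n // 2
    # Each perpendicular pair is visited once, via its lower-index bin.
    best = max(range(half), key=lambda i: scores[i] + scores[i + half])
    return best, best + half, scores[best] + scores[best + half]


# 18 bins of 10 degrees each. Bin 5 has the single highest score, but
# bins 2 and 11 together form the strongest perpendicular pair.
scores = [1] * 18
scores[2], scores[11] = 40, 35
scores[5], scores[14] = 50, 5
pair = strongest_perpendicular_pair(scores)
```

Secondary maxima for buildings off the dominant grid could be found by removing the winning pair and repeating the search.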

Figure 1 shows an oblique image of a barracks area
within Fort Hood, Texas. Such scenes are not
atypical for military bases or, with some
architectural modifications, for houses in a
suburban development. Figure 2 shows the edges
extracted by an implementation of the Nevatia-Babu
line finder [Nevatia 80], while candidate
horizontal and vertical edges are shown in Figures
3 and 4. Some edges are labeled as both horizontal
and vertical due to the viewing angle of the image,
which happened to align many of the horizontal
edges with the vertical vanishing point. In such


Figure  4:  Vertical  edges. 

ambiguous cases, external information or other
views must be used to decide between these labels.

4.  Horizontal  and  vertical  line  verification 

Given a single view and only geometric
information, the inherent ambiguities of
perspective projection prevent an absolute
determination of whether a given line is horizontal
or vertical. False positive identifications due to
accidental alignments are unavoidable. Since these
false positives increase the number of edges
flagged for later analysis and the computational
effort required, we would like to eliminate as many
as possible.

A  first  step  is  filtering  against  a  minimum  length  or 
height  threshold.  Highly  textured  areas  produce  a 
large  number  of  short,  randomly  oriented  edges, 
some of which will align with the vanishing point
of interest. Using the assumed horizontal or
vertical orientation for the line, we can calculate an
approximate length or height and compare it to the
minimum values we would expect to see. For
example, if we are looking for buildings, heights
will typically be greater than 3 meters and lengths
greater than 10 meters. Such constraints can be
easily  modified  by  world  knowledge  to  search  for  a 
specific  set  of  buildings  within  a  range  of  heights 
or  volumes.  Currently  we  view  this  process  as  one 
of  filtering  rather  than  selection.  Each  edge 
segment  that  passes  these  filters  is  given  an 
attribution  as  either  horizontal  or  vertical.  The 
entire  collection  of  edges  can  then  be  used  in  a 


variety of ways to construct plausible building
hypotheses. In the following section we describe
the use of attributed edge segments to detect and
construct possible building corners.
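
The size filter can be sketched as below. The `Edge` record, its field names, and the default thresholds are illustrative assumptions; how the approximate object-space length or height is computed from the camera model is not shown here.

```python
from dataclasses import dataclass


@dataclass
class Edge:
    label: str     # "H" for horizontal, "V" for vertical (hypothetical encoding)
    size_m: float  # approximate object-space length (H) or height (V), in meters


def passes_size_filter(edge, min_height_m=3.0, min_length_m=10.0):
    """Keep an edge only if it is at least as large as the expected minimum."""
    threshold = min_height_m if edge.label == "V" else min_length_m
    return edge.size_m >= threshold


edges = [Edge("V", 4.0), Edge("H", 4.0), Edge("H", 12.5)]
kept = [e for e in edges if passes_size_filter(e)]
```

Because this is a filter rather than a selection step, the thresholds can be loosened or tightened from world knowledge without changing the later hypothesis stages.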

If  multiple  views  of  the  scene  are  available,  we  can 
use  the  epipolar  condition  to  determine  if 
consistent  edges  appear  in  both  images.  For  each 
edge  in  the  image,  we  calculate  the  epipolar  plane 
through  its  midpoint  and  determine  which  edges,  if 
any, are intersected by the epipolar line on the
other  image.  For  lines  which  lie  on  corresponding 
epipolar  lines,  we  can  also  compare  their 
calculated  dimensions.  Horizontal  lines  can  also 
be  matched  using  their  calculated  directions  in 
object  space,  obtained  from  the  horizontal  line 
extraction  procedure  described  above. 
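
The epipolar test can be sketched as follows. The sketch assumes the relative geometry of the two images is available as a 3x3 fundamental matrix F with the convention x2^T F x1 = 0; the paper derives this geometry from camera resection, so the matrix form, the function names, and the example values are assumptions made for illustration.

```python
def epipolar_line(F, point):
    """Line (a, b, c) in the second image for a point in the first image."""
    h = (point[0], point[1], 1.0)
    return tuple(sum(F[i][j] * h[j] for j in range(3)) for i in range(3))


def signed_side(line, point):
    """Sign tells on which side of the line a*x + b*y + c = 0 the point lies."""
    a, b, c = line
    return a * point[0] + b * point[1] + c


def edges_on_epipolar_line(F, midpoint, edges_other):
    """Indices of edges in the other image crossed by the epipolar line."""
    line = epipolar_line(F, midpoint)
    hits = []
    for idx, (p, q) in enumerate(edges_other):
        # Endpoints on opposite sides (or on the line) mean an intersection.
        if signed_side(line, p) * signed_side(line, q) <= 0.0:
            hits.append(idx)
    return hits


# Hypothetical rectified pair: epipolar lines are image rows (y2 = y1).
F = [[0, 0, 0], [0, 0, -1], [0, 1, 0]]
edges_other = [((10, 0), (10, 20)), ((0, 30), (50, 40))]
hits = edges_on_epipolar_line(F, (3, 5), edges_other)
```

Edges returned by this test are only candidates; their calculated dimensions and object-space directions would still need to be compared as described above.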

5.  Corner  detection  with  line  attributions 

The  vanishing-point  geometry  of  a  scene  can 
provide  important  additional  cues  for  feature 
extraction.  Under  the  assumption  that  man-made 
features  in  aerial  photography  can  be  modeled  by 
parallelopipeds  joined  at  edges,  horizontal  and 
vertical  edge  segment  attributions  are  useful  cues 
in  assembling  building  hypotheses.  We  illustrate 
the  utility  of  these  attributions  in  the  context  of  a 
building extraction system, BABE, originally
designed  for  analysis  of  mapping  photography 
having  nadir  and  near-nadir  acquisition  geometries. 

BABE begins processing by generating intensity
edges for an image, using a Nevatia-Babu edge
finder [Nevatia 80]. BABE applies a range search
to locate and connect collinear edges whose
endpoints are in close proximity, to address the
possibility of fragmented edges. These edges are
then used as the basis for corner detection.

BABE performs another range search on the edges,
to locate edges which meet at approximately right
angles. The intersections of these edges represent
the corner points. BABE then uses these corner
points to link sequences of edges such that the
direction of rotation along a sequence is either
clockwise or counterclockwise, but not both, since
building structure is assumed to be well modeled
by parallelopipeds.

Even when a building can be modeled perfectly by
a rectangle, the chain of edges representing it may
not be a closed structure, due to extraneous or
missing corners in the chain. BABE addresses this
problem by generating building hypotheses, i.e.,
boxes, for every subchain of edges in a chain. This
is accomplished by taking every subchain of at
least two edges and completing them to four-sided
boxes.

Figure 5: Simple building model.


Typically,  only  about  10%  of  the  boxes  generated 
for  a  scene  correspond  to  buildings.  BABE’s 
verification  phase  selects  building  candidates  from 
the  boxes  generated  in  the  previous  phase.  It 
performs  this  task  by  examining  the  boxes  for 
indications  of  a  shadow  region  along  the  shadow 
casting  edges. 

Under an oblique viewing geometry, BABE’s model
first breaks down in the corner detection phase
where right-angled corners in the scene may not
translate to right-angled corners in the image. In
fact,  the  actual  angle  depends  not  only  on  the 
obliquity  of  the  viewing  geometry,  but  on  the 
relative  position  and  orientation  of  the  building  in 
the  scene. 

Using the horizontal and vertical line identification
techniques described in Section 3, we can assign
attributions to each edge prior to corner generation.
We can then make use of a simple building model,
outlined in Figure 5. This model presents two
simple and common classes of buildings, those
with flat roofs and those with peaked roofs. The
two types of buildings are shown from various
viewpoints (symmetric cases are omitted for
brevity).

Figure 6: BABE hypotheses, RADT9WOB.

Each  distinct  line  segment  in  the  diagram  has  been 
assigned  a  label,  indicating  whether  it  is  a  vertical 
or  horizontal  line  in  object  space,  or  whether  it  is 
neither.  In  object  space,  we  observe  that  for  flat- 
roof  structures,  side  and  front  facets  of  buildings 
are  instances  of  rectangles  composed  of  alternating 
horizontal  and  vertical  segments,  and  roof  facets 
are  instances  of  rectangles  formed  by  four 
horizontal segments. For peaked-roof structures,
each side facet is again represented by a rectangle
of alternating horizontal and vertical segments;
roof facets are now instances of rectangles of
alternating horizontal and unlabeled segments. A
front facet of a peaked-roof structure is a pentagon,
composed of two unlabeled segments, two
verticals, and a horizontal segment.

It is worth noting that BABE does not explicitly use
this simple model in its processing phases; there is
nothing in principle that prohibits an extension to
BABE for constructing more complex shapes by
joining these rectangular or pentagonal facets. The
model is useful, however, for visualizing the
relationships between horizontal, vertical, and
unlabeled lines in typical man-made structures.

These  properties  of  building  facets  suggest  the 
following  set  of  heuristics  for  corner  detection: 


Figure  7:  Geometrically  consistent  hypotheses. 


• Two intersecting verticals never form a valid
corner in object space.

• A horizontal-vertical intersection is allowed to
form a corner.

• Two intersecting horizontals are allowed to form
a corner, if their intersection in object space forms
a right angle.

• An unlabeled line intersecting with a labeled line
is allowed as a corner, since it is potentially part of
a peaked roof.

• Two intersecting unlabeled lines are allowed to
form a corner, as they may be part of a pentagonal
facet; it should be noted, however, that the current
version of BABE will not generate pentagonal
descriptions. We intend to pursue more general
shape constructions in future work.

These  heuristics  must  take  into  account  the  fact 
that  a  given  line  may  be  labeled  as  both  horizontal 
and  vertical,  if  the  imaging  geometry  is  such  that 
the direction of the horizontal vanishing point for
some set of lines is the same as the vertical
vanishing point. They do so by allowing such lines
to be regarded as both horizontal and vertical lines
during corner formation.
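
The corner heuristics, including the handling of lines that carry both labels, can be summarized in a short sketch. The encoding of labels as Python sets ("H", "V", or an empty set for an unlabeled line) and the function name are assumptions made for illustration, not part of BABE.

```python
def can_form_corner(labels_a, labels_b, right_angle_in_object_space=False):
    """Decide whether two intersecting lines may form a corner.

    Each argument is a set of labels: {"H"}, {"V"}, {"H", "V"} for a line
    labeled both ways, or an empty set for an unlabeled line.
    """
    # Any intersection involving an unlabeled line is allowed, since it
    # may belong to a peaked roof or a pentagonal facet.
    if not labels_a or not labels_b:
        return True
    # A doubly labeled line may be read either way: accept the corner if
    # any reading of the two lines is valid.
    for a in labels_a:
        for b in labels_b:
            if a == "V" and b == "V":
                continue  # two verticals never form a valid corner
            if a == "H" and b == "H":
                if right_angle_in_object_space:
                    return True
                continue  # horizontals must meet at a right angle
            return True  # horizontal-vertical intersection
    return False
```

For example, `can_form_corner({"H", "V"}, {"V"})` is accepted through the horizontal reading of the first line, while `can_form_corner({"V"}, {"V"})` is rejected.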

6.  Building  hypothesis  generation 

Given the ability to generate corners in oblique
imagery, BABE can be used to generate structural
hypotheses, boxes which delineate structure in the
scene. In the original implementation of BABE, the
only geometric constraint applied during
line-corner linking and box formation was the
right-angle constraint on corners. In the new
implementation, we can apply our simple building
model at this stage to prune geometrically
inconsistent hypotheses.

For  each  box  generated  by  BABE,  we  examine  the 
horizontal  and  vertical  line  attributions  assigned  to 
each  line  segment  of  the  box.  If  the  four 
attributions  are  consistent  with  the  labelings  of  any 
building  facet  in  the  building  model,  the  box  is 
accepted.  One  example  would  be  a  facet  with 
alternating  horizontal  and  vertical  lines,  which  is 
consistent  with  a  side  facet  of  a  building.  If  the 
four  attributions  do  not  match  any  of  the  allowable 
building  facets,  the  box  is  rejected  as  being 
geometrically  inconsistent.  One  example  of  this 
situation  would  be  a  box  comprised  of  four  vertical 
lines;  such  a  facet  is  impossible  in  the  model. 
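
A sketch of this consistency check on four-sided boxes is shown below. The three allowed label cycles follow the rectangular facets of the simple building model; the set-based label encoding, the "U" marker for unlabeled sides, and the treatment of doubly labeled sides are assumptions made for illustration.

```python
from itertools import product

# Label cycles of the rectangular facets in the simple building model.
ALLOWED = [
    ("H", "V", "H", "V"),  # side or front facet: alternating H and V
    ("H", "H", "H", "H"),  # flat roof: four horizontals
    ("H", "U", "H", "U"),  # peaked roof facet: alternating H and unlabeled
]


def _rotations(seq):
    """All cyclic rotations of a label sequence, as tuples."""
    return {tuple(seq[i:] + seq[:i]) for i in range(len(seq))}


# The starting side of a box is arbitrary, so accept any rotation.
ALLOWED_CYCLES = set().union(*(_rotations(list(p)) for p in ALLOWED))


def box_is_consistent(side_labels):
    """Accept a box if any reading of its side labels matches a facet.

    `side_labels` lists the four sides in order; each entry is a set of
    labels, with an empty set meaning the side is unlabeled.
    """
    options = [labels if labels else {"U"} for labels in side_labels]
    return any(combo in ALLOWED_CYCLES for combo in product(*options))
```

A box of four verticals, `[{"V"}] * 4`, matches no facet and is pruned, whereas `[{"H"}, {"V"}, {"H"}, {"V"}]` is kept as a possible side facet.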

Figure  6  shows  the  complete  set  of  boxes 
generated  by  BABE  prior  to  the  application  of 
geometric  labeling  constraints;  in  this  case,  there 
are  2899  boxes.  Figure  7  shows  the  set  of  628 
boxes  left  after  the  labeling  constraints  have  been 
exercised.  As  the  figures  show,  the  labeling 
constraints  alone  provide  a  strong  constraint  on  the 
permissible  hypothesis  geometries. 

After the application of the labeling constraints, the
boxes  are  passed  through  BABE’s  verification 
phase,  which  estimates  shadow  intensity  and  sun 
illumination  direction  and  uses  this  knowledge  to 
score  each  hypothesis  based  on  its  conformance 
with  these  parameters.  At  this  time,  the 
verification  phase  makes  no  use  of  the 
photogrammetric  information,  and  hence  treats  all 
hypotheses  as  though  they  represented  features  in  a 
nadir-acquisition  geometry.  We  intend  to  address 
this  shortcoming  in  future  work. 

At this stage, we are left with a set of hypotheses
which are presumed to be geometrically consistent,
in that they are composed of corners exhibiting
valid angles in image space and that they possess
valid labelings with respect to our simple building
model, and which are presumed to be
photometrically consistent, in that they exhibit a
combination of strong intensity gradient across
edge boundaries and are adjacent to dark regions in
the image which could plausibly be the shadows of
the hypothesized structures.


Given  these  presumptions,  it  is  reasonable  to 
regard  these  hypotheses  as  verified  facets  of  three- 
dimensional  structure  in  the  scene.  Using  the 
scene  geometry  in  conjunction  with  our  building 
model,  it  becomes  possible  to  extrapolate  these 
partial delineations of building structure into more
complete building models. We consider one such
extrapolation here, that of completing partially
peaked roofs to cover the entire roof. Using our
model, we know that facets with alternating
unlabeled and horizontal lines must be peaked roof
facets;  we  can  detect  these  facets  by  examining  the 
line  labelings  and  applying  geometric  constraints 
to  extrapolate  the  other  peaked  roof  facet  in  the 
pair. 

Figure  8  illustrates  the  situation  at  hand.  The 
hypothesized  facet  represents  a  BABE  hypothesis 
which  we  wish  to  use  as  a  guide  for  hypothesizing 
the  other  half  of  the  rooftop.  We  begin  by 
computing  the  line  perpendicular  to  the  horizontal 
line  R  in  object  space,  and  projecting  this 
perpendicular  into  image  space  (line  C).  Next,  we 
intersect  that  line  with  the  line  drawn  through  the 
roof  peak  point  p  and  the  vertical  vanishing  point 
vp,  to  obtain  a  point  x.  In  object  space,  the 
distance  between  x  and  e  is  equal  to  the  distance 
between  x  and  n;  we  assume  that  these  distances 
are  equal  in  image  space  as  well,  and  complete  the 
new building facet by using the roof peak point p,
points  n  and  /,  and  the  application  of  symmetry  to 
generate  g. 
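
The intersection and reflection steps of this construction can be sketched with homogeneous coordinates. The sketch assumes line C is already available as two image points and that e lies on it; the function names and the example coordinates are hypothetical, and the final symmetry step that generates g is not reproduced.

```python
def cross(a, b):
    """Cross product of two 3-vectors."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def line_through(p, q):
    """Homogeneous line through two image points."""
    return cross((p[0], p[1], 1.0), (q[0], q[1], 1.0))


def intersect(l1, l2, eps=1e-9):
    """Intersection of two homogeneous lines, or None if they are parallel."""
    x, y, w = cross(l1, l2)
    if abs(w) < eps:
        return None
    return (x / w, y / w)


def project_peak(c_points, e, p, vp):
    """Return (x, n): the point under the peak on line C, and e mirrored in x.

    `c_points` are two points on line C, `e` the known roof-edge point on C,
    `p` the roof peak point, and `vp` the vertical vanishing point.
    """
    x = intersect(line_through(*c_points), line_through(p, vp))
    if x is None:
        return None  # C is parallel to the line through p and vp
    # Assume, as in the text, that |x - e| = |x - n| also holds in image space.
    n = (2.0 * x[0] - e[0], 2.0 * x[1] - e[1])
    return x, n


x, n = project_peak(((0, 0), (10, 0)), (2, 0), (5, 3), (5, -100))
```

When the two lines are close to parallel the intersection moves far from the original facet, so a practical version would probably bound the distance between x and e rather than test only for exact parallelism.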

Figure 9: Original BABE results.


Figure 9 shows the original BABE results for the
scene; Figure 10 illustrates the final result
generated by our current extensions to BABE. Two
improvements are apparent from the figures: the
geometric consistency pruning has eliminated
many spurious boxes generated by accidental
alignments of small edge features, and the peak
projection technique has improved the modeling of
peaked structures in this scene, correctly
hypothesizing roof facets that were either lost in
the shadow evaluation phase of BABE, or were
never generated due to a lack of edge information.
There are still problems; one of the false
hypotheses on the right side of Figure 10 has a
long narrow facet, generated by the peak projection
technique. The immediate problem is that the
supposed horizontal roof edge line is perpendicular
to the vertical lines in image space, and hence the
line through the roof peak and the vanishing point
will be nearly parallel to the roof edge line,
resulting in a facet hypothesis whose roof edge will
be very far from the original facet.

The problem just described arises in part from the
fact that the facet projection technique is only an
approximation to the true image space
relationships between edges, and this
approximation breaks down when structures
possess horizontal roof lines and vertical lines with
the same orientation in image space. Ultimately,
however, the problem is due to issues in modeling
and hypothesis generation. In a full
implementation of a general viewpoint BABE, it
would be desirable to maintain the generate-and-test
paradigm used in the original version of BABE.
During the line-corner chain forming phase, one
would like to construct full three-dimensional
structural models in object space, rather than
two-dimensional models in image space. These models
would then be subjected to a verification process
similar in spirit to the shadow constraint
algorithms BABE now employs, but with the added
information provided by scene geometry and
illumination constraints on adjacent planar
surfaces. This point will be discussed again in the
final section.

Figure 10: New BABE results.

7.  Analysis  and  future  work 

Preliminary  results  from  the  inclusion  of  geometric 
and  metric  knowledge  into  the  building  extraction 
system  have  been  promising,  although  they  have 
highlighted the limitations of the current implicit
building models within the BABE system. We
believe  that  these  limitations  are  typical  of  other 
building  extraction  research  based  upon  nadir  view 
assumptions. 

Our experimentation has been limited to five test
areas visible in each of four images of Fort Hood.
Two of the images have near-nadir geometry,
while two are oblique with a relatively wide angle
field of view. We expect to continue to refine and
validate our research on a wider set of imagery.
Some specific observations regarding our work are
as follows:

Figure 11: Multiple view verification.


• The height estimates for the candidate vertical
lines are good refinement and information fusion
cues, since the object-space measurements can be
directly compared with other sources of heights,
such as shadows. A next step is to incorporate
precision information on the measurements into an
information fusion framework [Shufelt 93] to
allow for relative weighting of the measurements.

•  The  height  and  length  attributions  can  also  be 
used  to  constrain  hypothesis  search  by  pruning 
unrealistic  building  sizes. 

• A small catalog of structural formation
constraints (Figure 5) can be a powerful tool for
pruning hypotheses.

• Verification of horizontal and vertical lines
across multiple views can reduce the number of
hypotheses for later processing stages, thereby
increasing efficiency. A future step is to perform
multiple view verification of corners, which
should also decrease the number of hypotheses and
improve their quality. Figure 11 shows the BABE
results using horizontal edges verified with
another image. From 1634 original horizontal
edges, 478 were eliminated because of length (less
than 10 meters) and 141 were eliminated because
they did not match another horizontal edge on the
overlapping image. A comparison of Figure 11
with Figure 10 shows 30 fewer building
hypotheses generated, with only one of the
structures eliminated actually corresponding to a
real structure in the scene.

• Although BABE has been designed as a
monoscopic system, the capability of precisely
combining multiple views using the
photogrammetric information allows the
hypothesis generation and verification to take
place completely in object space. Figure 12 shows
the BABE results for test area RADT9WOB projected
into another image, using the camera resection
results for the two images. This illustrates the
advantages of being able to tie images together
with rigorous camera models, especially required
for oblique imagery.

Figure 12: Reprojected BABE results.

• The BABE model can also be extended to handle
illumination constraints on the building facets
(such as variation across peaked roofs of uniform
material, given the sun angle), more rigorous
shadow detection and verification, and stereo
disparity.

Results thus far have shown the need for true
three-dimensional modeling of object structure.
Figures 13 and 14 compare the original and new
BABE hypotheses for another of our test scenes. In
this case, the new data are qualitatively much
worse, despite the system's ability in a few
instances to correctly extrapolate roof structure.
The problem here arises in BABE's hypothesis
verification algorithms. In the old version of
BABE, no geometric consistency checks are
performed on the set of boxes before they are
passed to the verification phase. In the new
version, geometric pruning presents the
verification phase with a smaller set of boxes for
statistical analysis, and the verification algorithms
choose a higher cutoff for plausible hypotheses,




Figure 13: Original BABE results, RADT5WOB.


eliminating many true boxes. This can in part be
blamed on the adaptive nature of the verification
scoring algorithms in BABE, but the problem is
symptomatic of the larger issue of modeling.
Ideally, the generation and verification algorithms
would work with three-dimensional models in
object space. This strategy allows all feasible
models to reach verification, where precise
geometric information permits rigorous testing of
illumination constraints across adjacent planar
surfaces; prediction and verification of cast
shadows [Irvin 89]; and the application of
stereoscopic information for consistency
constraints across multiple views. Understanding
these issues and the evaluation of rigorous
techniques to address these problems are the
subject of current research.


Figure 14: New BABE results, RADT5WOB.


8.  Conclusion 

We  have  described  initial  experiments  in 
incorporating photogrammetric calculations in an
existing building detection system, analyzed the
results  on  a  small  set  of  oblique  aerial  images,  and 
raised  several  issues  of  modeling,  hypothesis 
generation,  and  hypothesis  verification  that  must 
ultimately  be  addressed  in  a  complete 
implementation  of  a  photogrammetrically  rigorous 
feature  extractor.  Although  the  work  is  certainly  in 
its  preliminary  stages,  and  many  issues  and  flaws 
remain  to  be  addressed,  we  believe  that  the 
qualitative  results  show  that  the  combination  of 
precise  camera  modeling  and  geometric 
information  with  existing  feature  extraction 
algorithms  provides  a  powerful  approach  for 
increasing  the  performance  of  building  detectors 
on complex aerial imagery.
Abstract

Image deformations due to relative motion between
an observer and an object may be used to infer 3-D
structure. Up to first order these deformations can be
written in terms of an affine transform. This paper
presents a method for measuring the affine transform
locally between two image patches that correctly
handles the difficulty of finding the correspondence
between deformed patches. The method uses weighted
moments of brightness. It is shown that the moments
of deformed patches are related through functions of
affine transforms. In the special case where the affine
transform can be written as a scale change and an
in-plane rotation, the zeroth and first moment equations
are solved for the scale. The robustness of the method
is demonstrated experimentally.


1  Introduction 


Changes in the relative orientation of a surface with
respect to a camera cause deformations in the image
of the surface. Deformations can be used to infer
local surface geometry from motion [Koenderink
and van Doorn, 1987; Sawhney and Hanson, 1991;
Cipolla and Blake, 1992; Jones and Malik, 1992b].
Since a repeating texture pattern can be thought of
as a pattern in motion, shape from texture can be
derived from deformations [Kanade and Kender, 1983;
Super and Bovik, 1993]. Constraints on the shape of
the undeformed structure also allow the computation
of shape from texture [Brown and Shvaytser, 1990;
Garding, 1990].

To first order, the image deformation and translation
due to relative motion can be described using a six
parameter affine transformation

$$\mathbf{r}' = \mathbf{A}\mathbf{r} + \mathbf{t} \qquad (1)$$

where $\mathbf{r}'$ and $\mathbf{r}$ are the image coordinates and
$\mathbf{t} = (u_0, v_0)^T$ the image translation. This is a valid
approximation assuming local planarity and weak perspective
projection [Kanade and Kender, 1983]. Even in situations
where full-perspective projection must be used, it can
be shown that if the change in relative orientation of
the surface patches is small, the image projections can
again be related by an affine transform [Adiv, 1985].

The recovery of 3-D structure from the affine transform
requires robust local estimates of the affine
parameters. Consider the case where an image patch $F_1$
is deformed into a patch $F_2$ (either in the same
image or in another image) by an unknown affine
transform. The problem of measuring the affine transform
is to first find the corresponding patch $F_2$ given $F_1$ and
second to recover the affine parameters from the two
patches. Even if the centroids of the image patches
are matched, the precise size and shape of $F_2$ is
difficult to determine since it is a function of the unknown
deformation. If this correspondence is not precisely
done, the affine parameters will be determined
incorrectly. Thus the problem is more difficult than in
standard correspondence problems, e.g. the determination
of optical flow.

Existing methods using image patches usually ignore
this problem. A number of techniques assume that
the affine parameters are small and then linearize the
brightness function or filtered versions of it with
respect to the spatial coordinates [Bergen et al., 1992;
Koenderink and van Doorn, 1987; Campani and Verri,
1993; Werkhoven and Koenderink, 1990]. Thus these
methods are restricted to cases where the affine
transform is small, which in turn requires that the 3-D
motion be small. [Jones and Malik, 1992b] do not assume
that the affine transform is small. Their method,
however, uses brute force search techniques and again
ignores the determination of precise correspondence. A
natural way to find correspondence is to use straight
lines [Sawhney and Hanson, 1991] or closed boundary
contours [Cipolla and Blake, 1992], with the change in
the size and shape of the enclosed area defining the
affine transform. These methods, however, fail when
such structures are absent, as in many richly textured
scenes. Further, their use has only been demonstrated
on homogeneous image regions with closed boundaries.

This paper presents a technique for reliably measuring
affine transforms that correctly handles the
difficulty of corresponding deformed image patches. The
image patches are filtered using gaussians and
derivatives of gaussians. Measuring the affine transform is
then recast as a problem of finding the deformation
parameters of the filters rather than the patches. For
example, let $F_1$ and $F_2$ be related by a scale change
$s$. Then the output of $F_1$ filtered with a gaussian of
$\sigma$ will be equal to the output of $F_2$ filtered with a
gaussian of $s\sigma$. Similar relationships hold for arbitrary
affine transforms and filters described by derivatives of
gaussians. These equations are exact for any arbitrary
affine transform in arbitrary dimensions.
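
The scale example can be checked numerically in one dimension. In the sketch below the test signal `f2`, the scale `s`, the integration range, and the step size are all arbitrary assumptions; the gaussians are unit-area, and the filter output is evaluated only at the origin, where the two patches are assumed to be in correspondence.

```python
import math


def gaussian(u, sigma):
    """Unit-area 1-D gaussian with standard deviation sigma."""
    return math.exp(-u * u / (2.0 * sigma * sigma)) / (math.sqrt(2.0 * math.pi) * sigma)


def filtered_at_origin(f, sigma, half_width=30.0, step=0.005):
    """Gaussian-weighted integral of f, i.e. the filter output at the origin."""
    n = int(round(half_width / step))
    return sum(f(k * step) * gaussian(k * step, sigma)
               for k in range(-n, n + 1)) * step


def f2(u):
    """Hypothetical brightness profile of the second patch."""
    return 1.0 + 0.5 * math.cos(2.0 * u) + math.exp(-(u - 1.0) ** 2)


s = 1.7  # assumed scale change


def f1(r):
    """First patch: F1(r) = F2(s r)."""
    return f2(s * r)


sigma = 1.2
out_f1 = filtered_at_origin(f1, sigma)             # F1 with a gaussian of sigma
out_f2_scaled = filtered_at_origin(f2, s * sigma)  # F2 with a gaussian of s*sigma
out_f2_same = filtered_at_origin(f2, sigma)        # F2 with the undeformed filter
```

Here `out_f1` and `out_f2_scaled` agree to within the integration error, while `out_f2_same` differs noticeably, which is why the filter rather than the patch is deformed.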

The  second  part  of  the  paper  focuses  on  solving  for 
the  affine  transform  when  it  can  be  written  as  the 
product  of  a  scale  change  and  a  rotation  (the  solution 
for  the  general  case  will  be  considered  in  future  pa¬ 
pers).  For  example,  this  situation  arises  in  the  case  of 
mostly  translational  camera  motion  and  shallow  struc¬ 
tures  (i.e.  structures  whose  extent  in  depth  is  small 
compared  to  their  distance  from  the  camera  [Sawhney 
and  Hanson,  1991]). 

The equation can be solved by sampling the $\sigma$ space.
Rather than use a brute force search technique, the
search space is sampled for a few different $\sigma_i$ and one
of the $\sigma_i$ is picked as the operating point. The scale is
recovered by linearizing the gaussian filter with respect
to $\sigma$ about this operating point using the diffusion
equation. Consistency is used to establish the correct
operating point. Note that linearization is done with
respect to $\sigma$ as opposed to linearization with respect
to the image coordinates done by other methods. As
a result, scale changes of arbitrary magnitude can be
dealt with by choosing different operating points. The
rotations can also be arbitrary. In contrast, linearizing
with respect to the image coordinates is a valid
approximation only for small affine transforms.

The gaussian (zeroth moment) equation is linear in
the  scale  parameter.  By  sampling  at  several  scales, 
an  overconstrained  linear  system  is  obtained.  This 
is  solved  for  scale  using  singular  value  decomposition. 
Using  the  first  moment  an  equation  which  is  nonlin¬ 
ear  with  respect  to  scale  is  obtained.  Again,  this  may 
be  sampled  at  multiple  scales  to  provide  an  overcon¬ 
strained  system  of  equations.  This  non-linear  system 
is  solved  using  the  Gauss-Newton  technique.  The  first 
moment  equation  also  allows  the  computation  of  the 
rotation.  Both  the  formulation  and  the  solution  are 
done  for  arbitrary  dimensions,  not  just  2.  Experimen¬ 
tal  results  are  shown  on  both  synthetic  and  real  im¬ 
ages  attesting  to  the  robustness  and  simplicity  of  the 
method. 


2  Deformation  of  Filters:  Zero 
Moments 

Notation  Vectors  will  be  represented  by  lowercase 
letters  in  boldface  while  matrices  will  be  represented 
by  uppercase  letters  in  boldface. 

We will assume that the image translation is known
and has been set to zero. Methods for finding the image
translation are briefly discussed in section 4.2.4.
Then the affine transform has only four deformation
parameters. It is also assumed that shading and illumination
effects can be ignored. These can, however,
be taken care of by incorporating an additional constant
factor in the equations. For simplicity, we focus
on the 2-D case although the discussion is
dimension-independent.

Our discussion is based on two observations. First, the
result of a filtering operation on two image patches will
be different in general unless the filter is appropriately
deformed for the second image patch, the deformation
being a function of the affine transform. Second,
moments of the image patches are related by simple
functions of the affine transforms, and this can be
exploited to compute the affine transform.

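Consider two functions $F_1$ and $F_2$ related by an affine
transform of the underlying coordinate system. Then

$$F_1(\mathbf{r}) = F_2(\mathbf{A}\mathbf{r}) \tag{2}$$

Their integrals over some finite interval are related by:

$$\int_{-\mathbf{m}}^{\mathbf{m}} F_1(\mathbf{r})\,d\mathbf{r}
  = \int_{-\mathbf{m}}^{\mathbf{m}} F_2(\mathbf{A}\mathbf{r})\,d\mathbf{r}
  = \int_{-\mathbf{A}\mathbf{m}}^{\mathbf{A}\mathbf{m}} F_2(\mathbf{A}\mathbf{r})\,d(\mathbf{A}\mathbf{r})\,\det\mathbf{A}^{-1} \tag{3}$$

expressed succinctly as

$$\nu_1 = \nu_2 \det\mathbf{A}^{-1}. \tag{4}$$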

Let $s_i$ be the scale change along the $i$th dimension and
$n$ the number of dimensions. Then $\det(\mathbf{A}) = \prod_{i=1}^{n} s_i$.
This can be intuitively understood as follows. Consider
the 1-D case ($n = 1$), where the affine transform
reduces to a scale change. Let the function $F_1$
be graphed on a rubber sheet. The graph of $F_2$ is
obtained by stretching the sheet and attached coordinate
system. The determinant term is equal to the
stretching undergone by the coordinates. Note that
the integral of a function may also be viewed as its
zeroth moment.

(3) cannot be used directly because the limits on the
right-hand side depend on the affine transform and are
therefore unknown. This crucial point has not been
handled correctly before. On the other hand, taking
the limits from $-\infty$ to $\infty$ would not preserve
localization. The solution to this problem is to weight the
function by another which decays rapidly; here the
gaussian is used.

We present the weighted equations analogous to (3)
first for the case where $\mathbf{A} = s\mathbf{R}$ (i.e. the affine
transform equals a scale change $s$ times a rotation $\mathbf{R}$),
followed by the general case.

2.1  Case  A  =  sR 

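Denote the unnormalized gaussian by

$$H(\mathbf{r}, \sigma^2) = \exp(-\mathbf{r}^T\mathbf{r}/2\sigma^2). \tag{5}$$

Multiply both sides of (2) by $H(\mathbf{r}, \sigma^2)$ to obtain

$$F_1(\mathbf{r})H(\mathbf{r}, \sigma^2) = F_2(\mathbf{A}\mathbf{r})H(\mathbf{r}, \sigma^2). \tag{6}$$

From the orthonormality of rotations it follows that

$$H(\mathbf{r}, \sigma^2) = H(\mathbf{A}\mathbf{r}, (s\sigma)^2), \tag{7}$$

which allows (6) to be rewritten as

$$F_1(\mathbf{r})H(\mathbf{r}, \sigma^2) = F_2(\mathbf{A}\mathbf{r})H(\mathbf{A}\mathbf{r}, (s\sigma)^2). \tag{8}$$

The weighted zeroth moment is therefore

$$\int F_1(\mathbf{r})H(\mathbf{r},\sigma^2)\,d\mathbf{r}
  = \int F_2(\mathbf{A}\mathbf{r})H(\mathbf{A}\mathbf{r}, (s\sigma)^2)\,d\mathbf{r}
  = \int F_2(\mathbf{A}\mathbf{r})H(\mathbf{A}\mathbf{r}, (s\sigma)^2)\,d(s\mathbf{R}\mathbf{r})\,s^{-n}, \tag{9}$$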

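where the limits are taken from $-\infty$ to $\infty$. The factor
$s^{-n} = \det\mathbf{A}^{-1}$ can be eliminated by using normalized
gaussians

$$G(\mathbf{r}, \sigma^2) = \frac{1}{(2\pi)^{n/2}\sigma^n}\exp(-\mathbf{r}^T\mathbf{r}/2\sigma^2) \tag{10}$$

in place of $H$. The moment equation then becomes

$$\int F_1(\mathbf{r})G(\mathbf{r},\sigma^2)\,d\mathbf{r}
  = \int F_2(\mathbf{A}\mathbf{r})G(\mathbf{A}\mathbf{r}, (s\sigma)^2)\,d(\mathbf{A}\mathbf{r}). \tag{11}$$

The integral may be interpreted as a gaussian convolution
or filtering at the origin. Thus we write (11) as

$$F_1 * G(\mathbf{r}, \sigma^2) = F_2 * G(\mathbf{r}_1, (s\sigma)^2), \tag{12}$$

where $\mathbf{r}_1 = \mathbf{A}\mathbf{r}$.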

(12), the weighted analog of (3), is exact and valid for
arbitrary dimensions. The problem of recovering the
affine parameters has been reduced to finding the
deformation of a known function, the gaussian, rather
than that of the unknown brightness functions. However,
since (12) is invariant to rotation it can only be
used for recovering the scale (the recovery of rotation
by other means is discussed later). Note that although
the limits are infinite, since the gaussian is a rapidly
decaying function, it suffices in practice to take limits
from $-4\sigma$ to $4\sigma$ (and correspondingly from $-4s\sigma$ to
$4s\sigma$ on the right-hand side).
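
As a numerical sanity check of (12), the following minimal sketch works in 1-D, where the rotation is trivial and $\mathbf{A}$ reduces to the scalar $s$. The test signal, the values of `s` and `sigma`, the integration step, and all function names are assumptions chosen for illustration; they are not taken from the paper. The sums run over $\pm 8\sigma$ rather than $\pm 4\sigma$ only to make the comparison tight.

```python
import math

def gaussian(x, sigma):
    """Normalized 1-D gaussian G(x, sigma^2)."""
    return math.exp(-x * x / (2.0 * sigma * sigma)) / (math.sqrt(2.0 * math.pi) * sigma)

def weighted_zeroth_moment(f, sigma, step=0.01):
    """Riemann-sum estimate of the integral of f(x) G(x, sigma^2) over [-8 sigma, 8 sigma]."""
    n = int(8.0 * sigma / step)
    return sum(f(k * step) * gaussian(k * step, sigma) for k in range(-n, n + 1)) * step

s = 1.3        # assumed scale change
sigma = 2.0    # assumed filter scale

def f2(x):
    # Arbitrary test signal, neither purely even nor purely odd.
    return math.cos(0.7 * x) + 0.5 * math.sin(0.3 * x) + 0.2 * x

def f1(x):
    # Equation (2) in 1-D: F1(x) = F2(s x).
    return f2(s * x)

lhs = weighted_zeroth_moment(f1, sigma)        # F1 filtered with G(., sigma^2)
rhs = weighted_zeroth_moment(f2, s * sigma)    # F2 filtered with the deformed G(., (s sigma)^2)
naive = weighted_zeroth_moment(f2, sigma)      # F2 filtered with the undeformed filter

print(lhs, rhs, naive)
```

The first two numbers agree to within integration error, while the undeformed filter gives a clearly different response, which is the point of deforming the filter.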


2.2  General  Case 


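A similar equation can be shown to hold for arbitrary
affine transforms, provided generalized gaussians are
used. Define a generalized gaussian as

$$G(\mathbf{r}, \mathbf{M}) = \frac{1}{(2\pi)^{n/2}\det(\mathbf{M})^{1/2}}
  \exp\left(-\frac{\mathbf{r}^T\mathbf{M}^{-1}\mathbf{r}}{2}\right) \tag{13}$$

where $\mathbf{M}$ is a symmetric positive semi-definite matrix.
Then

$$G(\mathbf{r}, \sigma^2\mathbf{I})
  = \frac{1}{(2\pi)^{n/2}\det(\sigma^2\mathbf{I})^{1/2}}
    \exp\left(-\frac{\mathbf{r}^T\mathbf{r}}{2\sigma^2}\right)
  = \frac{1}{(2\pi)^{n/2}\sigma^n}
    \exp\left(-\frac{(\mathbf{A}\mathbf{r})^T(\mathbf{A}\mathbf{A}^T)^{-1}(\mathbf{A}\mathbf{r})}{2\sigma^2}\right)
  = \det(\mathbf{A}\mathbf{A}^T)^{1/2}\,G(\mathbf{A}\mathbf{r}, \sigma^2(\mathbf{A}\mathbf{A}^T)). \tag{14}$$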

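Thus the weighted moment equation may be written

$$\int F_1(\mathbf{r})G(\mathbf{r},\sigma^2\mathbf{I})\,d\mathbf{r}
  = \det(\mathbf{A}) \times \int F_2(\mathbf{A}\mathbf{r})G(\mathbf{A}\mathbf{r}, \sigma^2(\mathbf{A}\mathbf{A}^T))\,d(\mathbf{A}\mathbf{r})\,\det(\mathbf{A}^{-1})
  = \int F_2(\mathbf{A}\mathbf{r})G(\mathbf{A}\mathbf{r}, \sigma^2(\mathbf{A}\mathbf{A}^T))\,d(\mathbf{A}\mathbf{r}), \tag{15}$$

where the identity $\det(\mathbf{A}\mathbf{A}^T)^{1/2} = \det(\mathbf{A})$ has been
used. The matrix $\mathbf{A}\mathbf{A}^T$ is a symmetric, positive
semi-definite matrix and may therefore be written

$$\sigma^2\mathbf{A}\mathbf{A}^T = \mathbf{R}\boldsymbol{\Sigma}\mathbf{R}^T, \tag{16}$$

where $\mathbf{R}$ is a rotation matrix and $\boldsymbol{\Sigma}$ a diagonal matrix
with entries $(s_1\sigma)^2, (s_2\sigma)^2, \ldots, (s_n\sigma)^2$ ($s_i > 0$). Thus

$$\int F_1(\mathbf{r})G(\mathbf{r},\sigma^2\mathbf{I})\,d\mathbf{r}
  = \int F_2(\mathbf{A}\mathbf{r})G(\mathbf{A}\mathbf{r}, \mathbf{R}\boldsymbol{\Sigma}\mathbf{R}^T)\,d(\mathbf{A}\mathbf{r}). \tag{17}$$

Again, to show the connections to convolution and filtering,
this may be written as

$$F_1 * G(\mathbf{r}, \sigma^2\mathbf{I}) = F_2 * G(\mathbf{r}_1, \mathbf{R}\boldsymbol{\Sigma}\mathbf{R}^T). \tag{18}$$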


(18) is the analog of the zero moment equation (3),
and can be used for determining the affine transform.
The level contours of the generalized gaussian are
ellipsoids rather than spheres. The tilt of the ellipsoid
is given by the rotation matrix while its eccentricity
is given by the matrix $\boldsymbol{\Sigma}$, which is actually a function
of the scales along each dimension. (18) clearly
shows that to recover affine transforms by filtering,
one must deform the filter appropriately; a point ignored
in previous work [Bergen et al., 1992; Koenderink
and van Doorn, 1987; Campani and Verri, 1992;
Werkhoven and Koenderink, 1990; Jones and Malik,
1992b]. The zero moment equation (18) alone does
not permit the recovery of the complete affine matrix,
only the scales and the tilt. To find the complete affine
transform, higher order moments need to be considered.
Using higher order moments also permits the
use of more overconstrained equations.
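3 Higher Order Moments

The first order moments of $F_1$ and $F_2$ are related by

$$\int_{-\mathbf{m}}^{\mathbf{m}} \mathbf{r}F_1(\mathbf{r})\,d\mathbf{r}
  = \int_{-\mathbf{m}}^{\mathbf{m}} \mathbf{r}F_2(\mathbf{A}\mathbf{r})\,d\mathbf{r}
  = \mathbf{A}^{-1}\int_{-\mathbf{A}\mathbf{m}}^{\mathbf{A}\mathbf{m}} \mathbf{A}\mathbf{r}\,F_2(\mathbf{A}\mathbf{r})\,d(\mathbf{A}\mathbf{r})\,\det\mathbf{A}^{-1} \tag{19}$$

or

$$\boldsymbol{\mu}_1 = \mathbf{A}^{-1}\boldsymbol{\mu}_2 \times \det\mathbf{A}^{-1}. \tag{20}$$

The second order moments are given by

$$\int_{-\mathbf{m}}^{\mathbf{m}} \mathbf{r}\mathbf{r}^T F_1(\mathbf{r})\,d\mathbf{r}
  = \mathbf{A}^{-1}\int_{-\mathbf{A}\mathbf{m}}^{\mathbf{A}\mathbf{m}} \mathbf{A}\mathbf{r}(\mathbf{A}\mathbf{r})^T F_2(\mathbf{A}\mathbf{r})\,d(\mathbf{A}\mathbf{r})\,(\mathbf{A}^{-1})^T\det\mathbf{A}^{-1}$$

and this may be expressed as

$$\boldsymbol{\Gamma}_1 = \mathbf{A}^{-1}\boldsymbol{\Gamma}_2(\mathbf{A}^{-1})^T\det(\mathbf{A}^{-1}). \tag{21}$$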

Note that the zeroth moment equation is a scalar equation,
(20) a vector one, and (21) is a matrix equation.
As before, the moment equations (20) and (21) are not
directly usable due to the difficulty that the limits of
the patches integrated over depend on the deformation.
Therefore we again employ gaussian weighted
moments, using the fact that the derivatives of gaussians
are closely related to moments weighted with
gaussians.

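The effect of filtering with derivatives of gaussians can
be obtained by differentiating the gaussian (13). First
write $\mathbf{r}_1 = \mathbf{A}\mathbf{r}$. Differentiating (13) gives

$$F_1(\mathbf{r}) * \frac{dG(\mathbf{r},\sigma^2\mathbf{I})}{d\mathbf{r}}
  = \mathbf{A}^T F_2(\mathbf{r}_1) * \frac{dG(\mathbf{r}_1, \mathbf{A}\mathbf{A}^T\sigma^2)}{d\mathbf{r}_1} \tag{22}$$

where

$$\frac{dG(\mathbf{r}, \sigma^2\mathbf{I})}{d\mathbf{r}} = -\frac{\mathbf{r}}{\sigma^2}\,G(\mathbf{r}, \sigma^2\mathbf{I}) \tag{23}$$

and

$$\frac{dG(\mathbf{r}_1, \mathbf{A}\mathbf{A}^T\sigma^2)}{d\mathbf{r}_1}
  = -(\mathbf{A}\mathbf{A}^T\sigma^2)^{-1}\mathbf{r}_1\,G(\mathbf{r}_1, \mathbf{A}\mathbf{A}^T\sigma^2). \tag{24}$$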

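This equation looks different from the first moment
(20) because the first derivative of the gaussian has
been normalized. Convolving with second derivatives
of a gaussian gives

$$F_1(\mathbf{r}) * \frac{d^2G(\mathbf{r},\sigma^2\mathbf{I})}{d(\mathbf{r}\mathbf{r}^T)}
  = \mathbf{A}^T F_2(\mathbf{r}_1) * \frac{d^2G(\mathbf{r}_1, \mathbf{A}\mathbf{A}^T\sigma^2)}{d(\mathbf{r}_1\mathbf{r}_1^T)}\,\mathbf{A} \tag{25}$$

where

$$\frac{d^2G(\mathbf{r}, \sigma^2\mathbf{I})}{d(\mathbf{r}\mathbf{r}^T)}
  = \frac{\mathbf{r}\mathbf{r}^T - \sigma^2\mathbf{I}}{\sigma^4}\,G(\mathbf{r}, \sigma^2\mathbf{I}) \tag{26}$$

and

$$\frac{d^2G(\mathbf{r}_1, \mathbf{A}\mathbf{A}^T\sigma^2)}{d(\mathbf{r}_1\mathbf{r}_1^T)}
  = \left[(\mathbf{A}\mathbf{A}^T\sigma^2)^{-1}\mathbf{r}_1\mathbf{r}_1^T(\mathbf{A}\mathbf{A}^T\sigma^2)^{-1} - (\mathbf{A}\mathbf{A}^T\sigma^2)^{-1}\right]
    \times G(\mathbf{r}_1, \mathbf{A}\mathbf{A}^T\sigma^2). \tag{27}$$

(25) and (21) are seen to be closely related; the differences
are the additional term due to the gaussian in
(25) and due to normalization.

Since convolutions with gaussians and derivatives of
gaussians are so closely related to the original weighted
moment equations, they will often be referred to as
moments in the rest of the paper.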

4  Solving  the  Moment  Equations 

4.1  The  problem  of  even  and  odd 
functions  F 

If the value of the moments is zero (or near zero in
practice), the moment equations are ill-conditioned
and cannot be solved. This can occur in two ways:
either the signal strength is too low (i.e. the magnitude
of F is small) or the function F is purely even or
purely odd causing some of the moment equations to
be zero. There is little that can be done in the first
case. The latter case, however, provides insight into
the number of moment equations required to solve for
the affine parameters.

Consider first the 1-D case. It is easy to see that the
even moments of any odd function will be zero while
the odd moments of any even function are zero. Since
the zeroth moment is even, and the first moment is
odd and only one parameter (the scale) needs to be
determined, these two equations suffice to find it.

The situation is a little more complicated in higher
dimensions. One way of stating the problem is to consider
each dimension separately. Then if a function is
odd along any dimension, its contribution to the even
moment from that dimension will be zero and hence
inferences along that dimension cannot be made. Similarly,
if a function is even along any dimension, its
contribution is zero to the odd moments along that
dimension. Note that typically a function is even or
odd only at a few points over its domain, so this may
not be a significant problem.

How many moments are required in 2-D to solve for
the affine transform? In general four affine parameters
need to be determined. Straightforward equation
counting seems to show that there are four even equations
(1 from the zeroth moment and 3 from the second
moment), and there are six odd equations (2 from the
first moment and 4 if the third moment is used). Thus
even if the function is purely even or odd, moments up
to third order suffice to solve for the affine parameters.

However, the third moment is actually not required,
since the previous analysis ignores the information
available from the deformation of a gaussian. Consider
the zeroth moment when an arbitrary affine transform
$\mathbf{A}$ needs to be measured. In this case, the zeroth moment
may be used to find the matrix $\mathbf{A}\mathbf{A}^T$ (i.e. 3
parameters may be computed). Accounting for the
additional information available from the deformation
of the gaussian, there are at least 6 even equations (3
from the zeroth moment and at least 3 from the second
moment) and at least 4 odd equations (from the
first moment).

The function F may also be transformed so that some
of the moments are always non-zero. For example, if
instead of the function F, the magnitude of its
auto-correlation is used, the zeroth and the second moment
are always nonzero. This follows from the radial symmetry
of the auto-correlation function, which implies
that the odd moments are all zero while the even moments
are nonzero.

A different transformation uses certain algebraic tricks
to convert any function to an odd or an even function
thus ensuring that every moment equation is
well-defined. For example, consider the 1-D case again.
Every function F can be written as the sum of an
even part EF and an odd part OF. The odd part OF
can be converted into an even function by taking its
magnitude. The even part EF can be converted to
an odd function by flipping one half of the function.
The problem is somewhat more complicated in higher
dimensions. The 2-D case will be dealt with in the
solution section.

4.2  Solving  the  Zeroth  Moment 
Equation 

In the remainder of this paper, only the case where
the affine transform $\mathbf{A} = s\mathbf{R}$ (i.e. a scale change and
a rotation) will be considered (see section 2.1); the
general affine transform will be considered in a later
paper. The zeroth moment equation will be dealt with
first followed by the first moment equation.

When the affine transform is described by $\mathbf{A} = s\mathbf{R}$,
(12) can be written as

$$F_1(\mathbf{r}) * G(\mathbf{r}, \sigma^2) = F_2(\mathbf{r}_1) * G(\mathbf{r}_1, (\sigma')^2) \tag{28}$$

where $\sigma' = s\sigma$. The important point here is that
the rotation matrix does not figure in $\sigma'$. The
problem of finding the scale parameter has therefore
been converted into the problem of finding the value
of $\sigma'$. Older methods [Bergen et al., 1992; Koenderink
and van Doorn, 1987; Campani and Verri, 1992;
Werkhoven and Koenderink, 1990; Jones and Malik,
1992b] have instead concentrated on the much more
difficult problem of trying to correspond the functions
$F_1$ and $F_2$.

The equation can be solved by sampling the space of
possible values of $\sigma'$, filtering for each sampled value
and declaring the solution to be that value of $\sigma'$ for
which the above equation has smallest residual error
according to some norm. A more elegant approach
uses the fact that the affine transform can be analytically
interpolated. The idea is to sample over a small
set of $\sigma$ and then interpolate using a Taylor series
approximation. Consider first a given $\sigma$. The Taylor
series approximation to first order gives

$$G(\mathbf{r}_1, (s\sigma)^2) \approx G(\mathbf{r}_1, \sigma^2) + \alpha\sigma\,\frac{\partial G(\mathbf{r}_1, \sigma^2)}{\partial\sigma} \tag{29}$$

$$= G(\mathbf{r}_1, \sigma^2) + \alpha\sigma^2\,\nabla^2 G(\mathbf{r}_1, \sigma^2) \tag{30}$$

where $s = 1 + \alpha$. The last equality follows from the
diffusion equation $\partial G/\partial\sigma = \sigma\nabla^2 G$. This allows the
convolution (28) to be written as

$$F_1 * G(\mathbf{r}, \sigma^2) \approx F_2 * G(\mathbf{r}_1, \sigma^2) + \alpha\sigma^2\,F_2 * \nabla^2 G(\mathbf{r}_1, \sigma^2) \tag{31}$$

The above equation is linear in $s$. To find $s$, three
filtering operations need to be performed: two gaussian
filtering operations and one laplacian operation.
Note that the above equation expresses the well-known
result that a laplacian can be approximated by a
difference of gaussians.
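
The single-scale solution of (31) can be sketched in 1-D as follows. This is only an illustration under stated assumptions: the cosine test signal, the choice $\sigma = 5$, the integration step, and every function name are hypothetical, and filtering "at the origin" is approximated by a Riemann sum over $\pm 8\sigma$.

```python
import math

def gaussian(x, sigma):
    """Normalized 1-D gaussian G(x, sigma^2)."""
    return math.exp(-x * x / (2.0 * sigma * sigma)) / (math.sqrt(2.0 * math.pi) * sigma)

def laplacian_of_gaussian(x, sigma):
    """Second derivative of the 1-D gaussian, i.e. the 1-D laplacian of G."""
    return (x * x / sigma ** 4 - 1.0 / sigma ** 2) * gaussian(x, sigma)

def filter_at_origin(f, kernel, sigma, step=0.01):
    """Value at the origin of f convolved with an even kernel (Riemann sum)."""
    n = int(8.0 * sigma / step)
    return sum(f(k * step) * kernel(k * step, sigma) for k in range(-n, n + 1)) * step

def recover_scale(f1, f2, sigma):
    """Solve equation (31) for alpha and return s = 1 + alpha."""
    g1 = filter_at_origin(f1, gaussian, sigma)
    g2 = filter_at_origin(f2, gaussian, sigma)
    lap2 = filter_at_origin(f2, laplacian_of_gaussian, sigma)
    alpha = (g1 - g2) / (sigma ** 2 * lap2)
    return 1.0 + alpha

def f2(x):
    return 127.0 * math.cos(0.2 * x)

def make_f1(s):
    # Equation (2) in 1-D: F1(x) = F2(s x).
    return lambda x: f2(s * x)

estimate_small = recover_scale(make_f1(1.1), f2, sigma=5.0)   # small scale change
estimate_large = recover_scale(make_f1(1.6), f2, sigma=5.0)   # large scale change

print(estimate_small, estimate_large)
```

For the small scale change the first-order expansion recovers the scale closely; for the larger one the estimate is visibly worse, since the expansion is made far from the true scale.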


4.2.1  Issues  of  Scale 


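Information in an image is scale dependent. There
may be information present at several different scales
or at only one of them. A method which does not
take this into account is not likely to be robust. Thus
it is desirable to solve the above equation at several
different scales ($\sigma_i$). Let a set of $\sigma_i$ be chosen. For each
such $\sigma_i$ an equation of the form (31) may be written
giving the following system of equations

$$\begin{aligned}
F_1 * G(\mathbf{r}, \sigma_0^2) &\approx F_2 * G(\mathbf{r}_1, \sigma_0^2) + \alpha\sigma_0^2\,F_2 * \nabla^2 G(\mathbf{r}_1, \sigma_0^2) \\
F_1 * G(\mathbf{r}, \sigma_1^2) &\approx F_2 * G(\mathbf{r}_1, \sigma_1^2) + \alpha\sigma_1^2\,F_2 * \nabla^2 G(\mathbf{r}_1, \sigma_1^2) \\
&\;\;\vdots \\
F_1 * G(\mathbf{r}, \sigma_i^2) &\approx F_2 * G(\mathbf{r}_1, \sigma_i^2) + \alpha\sigma_i^2\,F_2 * \nabla^2 G(\mathbf{r}_1, \sigma_i^2) \\
&\;\;\vdots \\
F_1 * G(\mathbf{r}, \sigma_N^2) &\approx F_2 * G(\mathbf{r}_1, \sigma_N^2) + \alpha\sigma_N^2\,F_2 * \nabla^2 G(\mathbf{r}_1, \sigma_N^2)
\end{aligned} \tag{32}$$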

This is an overconstrained set of equations in the
unknown $\alpha$. The redundancy offered by the overconstrained
problem also makes it more robust with
respect to noise.

The particular choice of the $\sigma_i$ is to some extent arbitrary
although some general criteria may be specified.
Too small a $\sigma_i$ will make the system sensitive
to noise while localization requires that $\sigma_i$ not be too
large. The actual values are not very crucial. In practice,
a set of eight different $\sigma_i$ were chosen. They
were all spaced apart by half an octave (a factor of
1.4). The filter width = $8\sigma_i$ (since the filters need to
range from $-4\sigma_i$ to $4\sigma_i$). The widths actually chosen
were (3,5,7,10,14,20,28,40) (see also [Jones and Malik,
1992a]).
The above system of equations was cast into the following
linear least squares problem

$$\min_{\alpha}\sum_i \left\| F_1 * G(\mathbf{r}, \sigma_i^2) - F_2 * G(\mathbf{r}_1, \sigma_i^2) - \alpha\sigma_i^2\,F_2 * \nabla^2 G(\mathbf{r}_1, \sigma_i^2) \right\|^2 \tag{33}$$

and was solved using Singular Value Decomposition
(SVD). It was found that the lowest filters (widths
= 3,5,7) were noisy and hence they were disregarded
(one reason may be that the laplacian is noisy when
the filter size is small). The scale was recovered fairly
accurately using the other widths (see the experimental
section for details). This set of filter widths worked
better than another one where 8 filters were used with
their $\sigma_i$ spaced apart by a factor of 1.2; presumably
because with a larger variation in scale, there is more
information available at multiple scales.
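
A minimal sketch of the multi-scale least squares solution follows, under several assumptions that are not from the paper: the signal is a 1-D cosine, for which the filter outputs at the origin have closed forms; the value `OMEGA = 0.2` and the function names are illustrative; and, because there is a single unknown $\alpha$, the least squares solution is written as the ratio $\sum_i a_i b_i / \sum_i a_i^2$ rather than computed through an explicit SVD (the two coincide for a one-column system).

```python
import math

OMEGA = 0.2                          # assumed spatial frequency of the test cosine
WIDTHS = (10, 14, 20, 28, 40)        # the filter widths retained in the text
SIGMAS = [w / 8.0 for w in WIDTHS]   # filter width = 8 sigma

def cosine_gaussian_response(omega, sigma):
    """cos(omega x) filtered with G(x, sigma^2), evaluated at the origin."""
    return math.exp(-0.5 * (omega * sigma) ** 2)

def cosine_laplacian_response(omega, sigma):
    """cos(omega x) filtered with the laplacian of G(x, sigma^2), at the origin."""
    return -omega ** 2 * math.exp(-0.5 * (omega * sigma) ** 2)

def least_squares_scale(true_s):
    """Stack one equation of the form (31) per sigma_i and solve for alpha."""
    num = 0.0
    den = 0.0
    for sigma in SIGMAS:
        # b_i: F1 * G - F2 * G, with F1(x) = F2(s x) = cos(s omega x).
        b = (cosine_gaussian_response(true_s * OMEGA, sigma)
             - cosine_gaussian_response(OMEGA, sigma))
        # a_i: sigma_i^2 F2 * laplacian(G).
        a = sigma ** 2 * cosine_laplacian_response(OMEGA, sigma)
        num += a * b
        den += a * a
    return 1.0 + num / den

recovered = {s: least_squares_scale(s) for s in (1.05, 1.1, 1.2)}
print(recovered)
```

With noise-free inputs the recovered values stay close to the true scales; the benefit of stacking several scales shows up mainly once noise is added to the filter outputs.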

4.2.2  Choosing  a  Different  Operating 
Point 

For large $s$ (say $s > 1.3$) the recovered scale sometimes
tends to be poor. This is because the Taylor series
approximation is good only for a small change in $\sigma$.
The problem arises because in (31) the right-hand side
is expanded around the same $\sigma_i$ as on the left-hand
side. A better approximation is obtained by expanding
around a $\sigma$ as close to the correct scale as possible. An example
should clarify this point. Assume that the left-hand
side uses $\sigma = \sigma_0$ and that the scale change $s$ is 1.3,
then it is better to expand the right-hand side around
$\sigma_1 = 1.4\sigma_0$ (i.e. the half-octave step closest to the
actual scale) rather than expanding at $\sigma_0$. In this case,
(31) may therefore be modified to

$$F_1 * G(\mathbf{r}, \sigma_i^2) \approx F_2 * G(\mathbf{r}_1, \sigma_{i+1}^2) + \alpha'\sigma_{i+1}^2\,F_2 * \nabla^2 G(\mathbf{r}_1, \sigma_{i+1}^2) \tag{34}$$

where $s = 1.4(1 + \alpha')$. Since filtering by a set of $\sigma$ is already
being performed for the overconstrained system
no additional filtering operations are required. Again,
an overconstrained system may be implemented easily.
For each value of $\sigma_i$ on the left-hand side of (34),
expand around $\sigma_{i+1}$ on the right-hand side.

A similar scheme may be implemented if the scale $s$
< 0.8 by expanding around a $\sigma_j$ which is smaller by a
factor of 1.4.

A priori the $\sigma_i$ around which the expansion should be
done is not known since the value of the scale is not
available. The solution is to expand around all three
of them (i.e. $\sigma_i$, $1.4\sigma_i$ and $\sigma_i/1.4$) and then pick the
correct answer to be the one which gives $\sigma'$ close to
the operating point. Again an example will clarify this
point. Assume that the correct scale is again 1.3 and
that the three different operating points return the
following values of $s$ (1.25, 1.32, 1.18). Consistency
decides the correct answer here. 1.18 is inconsistent
with expanding around $\sigma/1.4$ and can be rejected. The
other answers are both between 1 and 1.4 and closer to
1.4. Therefore, the appropriate operating point to pick
is $1.4\sigma_i = \sigma_{i+1}$. Experimentally, this method seems to
work well. An alternative is to compare the residual
error after SVD minimization; this does not seem to
work as well, partly because the different errors are
not really comparable: they have different numbers
of equations. Another technique that has been tried is
to make all the equations into a single overconstrained
system and solve it using SVD; based on the answer
obtained, some of the equations may be dropped and
the system re-solved.
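
One possible reading of the consistency test is sketched below: among the candidate operating points, keep the one whose recovered scale lies closest, in ratio, to the operating point it was expanded around. The log-ratio distance and the function name `pick_operating_point` are assumptions made for this illustration; the text does not specify the exact criterion used.

```python
import math

def pick_operating_point(estimates):
    """Choose the (ratio, scale) pair whose scale is most consistent with its ratio.

    `estimates` maps the operating-point ratio (1.0, 1.4 or 1/1.4 relative to
    sigma_i) to the scale recovered when expanding around that point.
    """
    return min(estimates.items(), key=lambda item: abs(math.log(item[1] / item[0])))

# The worked example from the text: true scale 1.3, three operating points.
estimates = {1.0: 1.25, 1.4: 1.32, 1.0 / 1.4: 1.18}
ratio, scale = pick_operating_point(estimates)
print(ratio, scale)
```

On the example values this selects the operating point $1.4\sigma_i$ and the estimate 1.32, in agreement with the choice made in the text.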

In principle the same technique can be used to expand
around non-nearest neighbor operating points $\sigma_j$,
$|j - i| > 1$, if the scale gets very large (or small). The
range of scales to be expected depends on the application.
For structure from motion, a scale change of more
than 1.4 almost never happens in practice. In finding
shape from texture, in principle any scale change
can occur. If the surface is smooth, it is expected that
there is likely to be a neighbouring texture patch whose
scale change is less than 2.5. In this case one should
also expand around $2.0\sigma$ and $0.5\sigma$ in addition to $\sigma$, $1.4\sigma$
and $\sigma/1.4$. Very high scale changes are probably difficult
to measure in any case because of the extreme
foreshortening that this implies. We reemphasize that,
apart from such inherent limitations, our approach can
in principle handle large magnitude affine transforms
with little approximation, whereas previous methods
were limited to small transforms.

4.2.3  Ensuring  that  Fi  is  always  even 

The method does not work if the output of the gaussian
convolution is zero (or close to zero in practice).
This can happen either if the signal is weak or if the
signal shows odd symmetry along any dimension.

The 1-D case was dealt with in section 4.1. Here it
is shown how a function in 2-D may always be converted
into an even function. One cannot consider each
dimension separately for this would destroy the rotational
symmetry of the gaussian. Instead, the function
is decomposed into parts which are radially even $E_rF_i$
and radially odd $O_rF_i$ where

$$E_rF_i = F_i(\mathbf{r}) + F_i(-\mathbf{r}) \tag{35}$$

$$O_rF_i = F_i(\mathbf{r}) - F_i(-\mathbf{r}) \tag{36}$$

The magnitude of both functions ($|E_rF_i|$, $|O_rF_i|$) is
then taken. The resulting functions are both even and
the gaussian convolution is nonzero for both. The SVD
is performed on each set of these functions separately
and the one with the lower error is then used (this
ensures that if either the even or odd components is
really small, it is ignored).
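
The decomposition (35) and (36) can be sketched directly on functions of two variables. The brightness function `patch` and the helper names below are hypothetical and serve only to show that the magnitudes of both parts are radially even.

```python
import math

def radial_even(f):
    """E_r F of equation (35): F(r) + F(-r)."""
    return lambda x, y: f(x, y) + f(-x, -y)

def radial_odd(f):
    """O_r F of equation (36): F(r) - F(-r)."""
    return lambda x, y: f(x, y) - f(-x, -y)

def magnitude(f):
    """Pointwise absolute value of a function of (x, y)."""
    return lambda x, y: abs(f(x, y))

def patch(x, y):
    # Hypothetical brightness function with both radially even and odd content.
    return 100.0 + 40.0 * math.sin(0.2 * x) + 25.0 * math.cos(0.1 * y) + 3.0 * x * y

even_part = magnitude(radial_even(patch))
odd_part = magnitude(radial_odd(patch))

# Both magnitudes take the same value at r and -r, i.e. they are radially even.
samples = [(1.0, 2.0), (-3.5, 0.5), (4.0, -7.0)]
symmetric = all(
    math.isclose(g(x, y), g(-x, -y))
    for g in (even_part, odd_part)
    for x, y in samples
)
print(symmetric)
```

Each of the two resulting functions would then be filtered and solved for scale separately, as described in the text.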

4.2.4 Finding Image Translation

Before scale can be recovered, the two patches must
be aligned by finding the image translation. This can
be found using traditional optical flow or displacement
schemes [Anandan, 1989]. Alternatively, the residual
of the SVD error can be used to localize the image
translation to ±0.5 pixels. This is done in the following
manner. The first image patch is filtered with
the set of gaussians. The second patch is filtered with
gaussians at every pixel in a small window centered
at the first patch's location and the SVD computed.
That pixel for which the SVD residual is minimized is
declared to be the correct image translation. Experimentally,
this method was found to work satisfactorily.
Note that in general no additional filtering operations
are required since the filtering operations are done at
every point in the image anyway.

4.3  Experimental  Results 

Experiments were carried out both on synthetic images
as well as a pair of real images. The first synthetic
image (Figure 1) shows a cosine wave generated by
the equation $F_1(x,y) = 127\cos(\omega_x x + \phi_x)$. The cosine
was picked for the following interesting properties. It
can be made even or odd at any point depending on
the value of $\phi_x$. Further, there is no information along
the y direction (the so-called aperture problem). In
spite of that the scale can be recovered.

For the first experiment, $\phi_x$ was chosen to be zero, so
that the function was even. $\omega_x = 0.2$ was chosen. A
second cosine function was generated using the following
function $F_2(x,y) = 127\cos(s\omega_x y + \phi_x)$ (Figure 2).
$F_2$ is rotated 90° with respect to $F_1$ and also scaled by
the factor $s$. For various values of $s$, the scale was recovered
using the zeroth moment. The results are tabulated
in Table 1. The experiment was repeated with
noise added. First, uniform noise ranging from −10
to 10 was added to $F_2$. Second, gaussian noise with
a standard deviation of 10 was added to $F_2$. These
results are also tabulated in Table 1. Two operating
points were used: $\sigma$ and $1.4\sigma$. The appropriate operating
point was picked as discussed in the text.

Table 1 is to be read as follows. The first column in
Table 1 is the actual scale while column 2 shows the
recovered scale in the noise-free case. Two different
percentage errors are tabulated and they arise from the
following considerations. Assume that an object is at a
depth of $z_0$ and after a translation $T_z$ in the z direction,
its new depth is $z_1 = z_0 + T_z$. Then the percentage
error in finding the quantity $z_1/z_0$ is given by
$\frac{\hat{s} - s}{s} \times 100$, where $\hat{s}$ denotes the recovered scale,
and this is tabulated in column 3 for the noise-free
case. On the other hand, the percentage error in
finding the quantity $T_z/z_0$ is given by $\frac{\hat{s} - s}{s - 1} \times 100$ and this
is tabulated for the noise-free case in column 4. Which
of these quantities is more important? Since the depth
$z_0$ is a priori unknown, the quantity of relevance at
least in the motion case is $T_z/z_0$ and the corresponding
percentage error is more significant. Similar values are
tabulated when gaussian noise (columns 5, 6 and 7) and
uniform noise (columns 8, 9, and 10) are added.

The results show that even with noise depth reconstruction
effectively has an accuracy on the order of
several percent. The results are excellent in the noise-free
case. The percentage errors in column 3 are all less
than about 3% while even in column 4 the percentage
errors do not exceed 7%. Note that the method recovers
scale accurately in spite of the large rotation.

With noise added, the results are as good except for
the lowest scales (1.05 and 1.10, corresponding to the
largest depths). These results for the lower scales
might be improved by using operating points separated
by ratios smaller than 1.4.

The experiment was repeated using $\phi_x = \pi/2$ for both
images. The method failed because the function now
becomes an odd function at the origin and thus the
result of gaussian filtering is zero. However, if the
function is transformed into an even function using
the methods discussed in the text, the zeroth moment
can once again be applied and the results are similar.

The experiment was repeated with random dot images.
A random dot image of size 64 by 84 was generated
(Figure 3). The image was then affine transformed and
smoothed using a cubic interpolation scheme. For various
values of the scale factor $s$, the scale was recovered
using the zeroth moment method. The results are tabulated
in Table 2. The highest error in column 4 (relative
depth error) is less than 9% if the smallest scale
(1.05) is ignored. Again the relative error in $s$ (column
3) is much lower. The error is somewhat larger in this
case because the program that affine transforms the
image does interpolation which tends to destroy image
structure. This is more serious at the lower scales.

Finally the algorithm was tested on a pair of real images
from a sequence [Sawhney and Hanson, 1991].
The images were taken with a Sony CCD camera using
a robot moving straight ahead. The robot moved
about 1.4 ft between frames. Since the original images
were taken with the intent of using a line based algorithm,
most of the objects have little intensity variation
in their interior. However, the posters on the back
wall show some intensity variation and can therefore
be used. Points 1 and 2 were picked by hand in the first
image (Figure 4). The corresponding points in the second
image (Figure 5) were determined using the SVD
residual error. For point 1, the recovered scale was
1.07 which corresponds to a distance of 1.4/(1.07 − 1)
= 20 ft. For point 2 the recovered scale was 1.065 which
corresponds to a distance of 1.4/(1.065 − 1) = 21.5 ft.
The measured distance to the back wall is 20.3 ft. The
accuracy is thus within 6%.
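
The distance computation above amounts to dividing the known camera translation by the recovered scale minus one. The short sketch below reproduces that arithmetic with the values quoted in the text; the function name and the error handling are illustrative assumptions.

```python
def distance_from_scale(translation, scale):
    """Distance implied by a recovered scale for a known translation along the optical axis.

    Follows the relation used in the text: distance = translation / (scale - 1).
    """
    if scale <= 1.0:
        raise ValueError("this relation needs a scale greater than 1")
    return translation / (scale - 1.0)

TRANSLATION_FT = 1.4        # robot motion between frames, from the text
MEASURED_FT = 20.3          # measured distance to the back wall, from the text

d1 = distance_from_scale(TRANSLATION_FT, 1.07)    # point 1
d2 = distance_from_scale(TRANSLATION_FT, 1.065)   # point 2

rel_err_1 = abs(d1 - MEASURED_FT) / MEASURED_FT
rel_err_2 = abs(d2 - MEASURED_FT) / MEASURED_FT
print(d1, d2, rel_err_1, rel_err_2)
```

Because the scale enters through $s - 1$, a small error in the recovered scale produces a much larger relative error in distance, which is why the text treats the error in $T_z/z_0$ as the more significant one.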

4.4  Solving  the  First  Moment 
Equation 

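Again for the case where the affine transform is described
by $\mathbf{A} = s\mathbf{R}$, the first moment equation (22)
may be written

$$F_1(\mathbf{r}) * \frac{dG(\mathbf{r}, \sigma^2)}{d\mathbf{r}}
  = s\mathbf{R}\,F_2(\mathbf{r}_1) * \frac{dG(\mathbf{r}_1, (s\sigma)^2)}{d\mathbf{r}_1} \tag{37}$$

The diffusion equation applied to the derivatives of the
gaussian gives the following identity

$$\frac{\partial G_{r_j}}{\partial\sigma} = \sigma\nabla^2 G_{r_j}, \tag{38}$$

where $G_{r_j}$ denotes the derivative of the gaussian with
respect to the $j$th coordinate. Using this identity,
the right-hand side of (37) is expanded around $\sigma$ and
rewritten as

$$F_1(\mathbf{r}) * G_{r_j}(\mathbf{r}, \sigma^2) \approx
  s\mathbf{R}_j\,F_2(\mathbf{r}_1) * \left[\frac{dG(\mathbf{r}_1, \sigma^2)}{d\mathbf{r}_1}
  + \alpha\sigma^2\,\frac{d(\nabla^2 G(\mathbf{r}_1, \sigma^2))}{d\mathbf{r}_1}\right] \tag{39}$$

where $\mathbf{R}_j$ is the $j$th row of $\mathbf{R}$ and $s = 1 + \alpha$. Note
that there are 2 such equations. This may be more
conveniently written as

$$\boldsymbol{\mu}_1(\sigma) = (1 + \alpha)\mathbf{R}\left[\boldsymbol{\mu}_2(\sigma) + \alpha\sigma^2\boldsymbol{\xi}(\sigma)\right], \tag{40}$$

where

$$\boldsymbol{\xi}(\sigma) = F_2(\mathbf{r}_1) * \frac{d(\nabla^2 G(\mathbf{r}_1, \sigma^2))}{d\mathbf{r}_1} \tag{41}$$

and

$$\boldsymbol{\mu}_1(\sigma) = F_1(\mathbf{r}) * \frac{dG(\mathbf{r}, \sigma^2)}{d\mathbf{r}}, \qquad
  \boldsymbol{\mu}_2(\sigma) = F_2(\mathbf{r}_1) * \frac{dG(\mathbf{r}_1, \sigma^2)}{d\mathbf{r}_1}. \tag{42}$$

The rotation matrix can be eliminated by taking the
dot-product (i.e. the magnitude) of both sides of (40)
and equating them. This gives

$$\boldsymbol{\mu}_1^T\boldsymbol{\mu}_1 = (1 + \alpha)^2\left[\boldsymbol{\mu}_2 + \alpha\sigma^2\boldsymbol{\xi}\right]^T\left[\boldsymbol{\mu}_2 + \alpha\sigma^2\boldsymbol{\xi}\right] \tag{43}$$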

This is a polynomial equation in the unknown $\alpha$. As
before, several different scales $\sigma_i$ are used to give an
overconstrained system. The resulting system can be
solved using the Gauss-Newton technique [Gill et al.,
1981]. The Gauss-Newton procedure works by linearising
the system around the current estimate of the
solution, reducing the problem to a linear least-squares
problem. Define a vector function $e(\alpha)$ where the
i-th component is given by

$$e_i(\alpha) = \mu_{1i}\cdot\mu_{1i} - (1+\alpha)^2\left[\mu_{2i}+\alpha\sigma_i^2\xi_i\right]\cdot\left[\mu_{2i}+\alpha\sigma_i^2\xi_i\right] \qquad (44)$$

with $\mu_{1i} = \mu_1(\sigma_i)$, and similarly for $\mu_{2i}$ and $\xi_i$.

Then if $\alpha_k$ is the current estimate of $\alpha$, $\alpha_{k+1} =
\alpha_k + p_k$, where $p_k$ is the solution of the linear least
squares problem

$$\min_{p}\ \lVert J_k\,p + e_k\rVert_2^2 \qquad (45)$$

where a quantity subscripted by k denotes that quantity
evaluated at $\alpha_k$ and $J(\alpha)$ is the Jacobian matrix
of $e(\alpha)$.

The least squares problem was solved using SVD and
convergence was found to be rapid, within a couple
of iterations. The method was tested on a sine-wave
pattern.
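The Gauss-Newton iteration for the scale can be sketched as follows. This is a minimal illustration, not the authors' implementation: the filter outputs are synthetic numbers rather than Gaussian-filtered images, all function names are hypothetical, and because there is a single unknown the linear least-squares step is solved in closed form instead of with SVD.

```python
import math


def residuals_and_jacobian(alpha, mu1, mu2, xi, sigmas):
    """Residuals e_i(alpha) of equation (44) and their derivatives d e_i / d alpha."""
    e, jac = [], []
    for m1, m2, x, s in zip(mu1, mu2, xi, sigmas):
        v = (m2[0] + alpha * s * s * x[0], m2[1] + alpha * s * s * x[1])
        vv = v[0] ** 2 + v[1] ** 2
        dvv = 2.0 * s * s * (v[0] * x[0] + v[1] * x[1])  # d(v.v)/d alpha
        e.append(m1[0] ** 2 + m1[1] ** 2 - (1 + alpha) ** 2 * vv)
        jac.append(-(2 * (1 + alpha) * vv + (1 + alpha) ** 2 * dvv))
    return e, jac


def gauss_newton_scale(mu1, mu2, xi, sigmas, alpha0=0.0, iters=20, tol=1e-12):
    """Iterate alpha <- alpha + p, where p minimises ||J p + e||^2 (one unknown)."""
    alpha = alpha0
    for _ in range(iters):
        e, jac = residuals_and_jacobian(alpha, mu1, mu2, xi, sigmas)
        p = -sum(j * r for j, r in zip(jac, e)) / sum(j * j for j in jac)
        alpha += p
        if abs(p) < tol:
            break
    return alpha


def synthesize_mu1(alpha, theta, mu2, xi, sigmas):
    """Build mu1 from equation (40) for a known scale change and rotation."""
    c, s = math.cos(theta), math.sin(theta)
    out = []
    for m2, x, sg in zip(mu2, xi, sigmas):
        v = (m2[0] + alpha * sg * sg * x[0], m2[1] + alpha * sg * sg * x[1])
        out.append(((1 + alpha) * (c * v[0] - s * v[1]),
                    (1 + alpha) * (s * v[0] + c * v[1])))
    return out


# Synthetic stand-ins for the filter outputs at three scales (assumed values).
MU2 = [(1.0, 0.5), (0.8, -0.3), (0.4, 0.9)]
XI = [(0.05, -0.02), (0.03, 0.04), (-0.01, 0.02)]
SIGMAS = [1.0, 2.0, 3.0]
MU1 = synthesize_mu1(0.07, 0.3, MU2, XI, SIGMAS)
print(gauss_newton_scale(MU1, MU2, XI, SIGMAS))
```

The residual does not depend on the rotation angle used to synthesize the data, which is the point of taking magnitudes in (43).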


In two dimensions, the rotation may be computed in
the following manner. Consider (40) again. This may
be rewritten as

$$\mu_1(\sigma) = R\,\mathbf{b}, \qquad (46)$$

where

$$\mathbf{b} = (1+\alpha)\left[\mu_2(\sigma) + \alpha\sigma^2\xi(\sigma)\right] \qquad (47)$$

is a known quantity (since $\alpha$ is now known). Let
$\mathbf{b} = (b_1, b_2)'$. Then (46) can be transformed into the
following form

$$\mu_1(\sigma) = B\,\mathbf{u} \qquad (48)$$

where

$$B = \begin{pmatrix} b_1 & -b_2 \\ b_2 & b_1 \end{pmatrix},$$

$\mathbf{u} = (\cos\theta, \sin\theta)'$ and $\theta$ is the rotation angle. Using
the identities $\cos\theta = \sqrt{1 - \sin^2\theta}$ and $B^{-1} =
B'/(\det B)$, (48) can be transformed into the following
pair of equations each linear in the unknown $\sin\theta$.

$$1 - \left[\frac{b_1 e_1 + b_2 e_2}{\det B}\right]^2 = \sin^2\theta$$

$$\frac{-b_2 e_1 + b_1 e_2}{\det B} = \sin\theta \qquad (51)$$

where $\mu_1 = (e_1, e_2)'$. Such pairs of equations can be
written for every $\sigma_i$ and the resulting linear system of
overconstrained equations can be solved using SVD for
the rotation angle $\theta$.
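A small sketch of the rotation recovery, under stated assumptions: the vectors b and mu_1 are taken as given for each scale, R is the counter-clockwise rotation matrix, and the estimates of cos(theta) and sin(theta) are simply averaged across scales and combined with atan2. That averaging is a substitute chosen for brevity here, not the SVD solution described in the text, and the function names are hypothetical.

```python
import math


def rotate(theta, v):
    """Apply the 2D rotation R(theta) to the vector v."""
    c, s = math.cos(theta), math.sin(theta)
    return (c * v[0] - s * v[1], s * v[0] + c * v[1])


def rotation_angle(mu1_list, b_list):
    """Estimate theta from pairs (mu1, b) satisfying mu1 = R(theta) b, one pair per scale."""
    cos_sum = sin_sum = 0.0
    for (e1, e2), (b1, b2) in zip(mu1_list, b_list):
        det_b = b1 * b1 + b2 * b2                  # det of B = [[b1, -b2], [b2, b1]]
        cos_sum += (b1 * e1 + b2 * e2) / det_b     # first row of B^{-1} mu1
        sin_sum += (-b2 * e1 + b1 * e2) / det_b    # second row, as in equation (51)
    return math.atan2(sin_sum, cos_sum)


B_LIST = [(1.0, 2.0), (0.5, -1.5)]                 # assumed values of b at two scales
MU1_LIST = [rotate(0.4, b) for b in B_LIST]
print(rotation_angle(MU1_LIST, B_LIST))
```

Using atan2 on both components avoids the sign ambiguity that arises when only sin(theta) is solved for.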
5 Future Extensions

Future work includes the solution for the case of the
general affine transform as well as the use of the second
moment equation. Other possibilities include the
automation of the process over the entire image and the
detection of occlusions. Finally, the use of the affine
transform to find surface orientation from both texture
and motion cues will be explored.

Figure 1: Cosine Image

Figure 2: Scaled and Rotated Cosine Image

Figure 4: Real Image 1

Abstract

We consider the problem of matching perspective
views of coplanar structures composed of line segments.
Both model-to-image and image-to-image correspondence
matching are given a consistent treatment.
Although these matching scenarios generally require
discovery of an eight parameter projective mapping,
when the horizon line of the object plane can
be found in the image, using vanishing point analysis,
for instance, these problems can be reduced to a
simpler six parameter affine matching problem. When
the intrinsic lens parameters of the camera are known,
the problem further reduces to four parameter affine
similarity matching.

1  Introduction 

Matching is a ubiquitous problem in computer vision.
Correspondence matching can be broken into
two general areas: model-to-image matching, where
correspondences are determined between known 3D
model features and their 2D counterparts in an image,
and image-to-image matching, where corresponding
features in two images of the same scene must be
identified. Fast and reliable matching techniques exist
when good initial guesses of pose or camera motion are
available [Beve90] or when the distance between views
is small [Anan87]. What is lacking are good methods
for finding matches in monocular images, formed
by perspective projection, and taken from arbitrary
viewpoints.

This paper examines the problem of matching coplanar
structures consisting of line segments. A simple
method is presented that, when applicable, allows fast
and accurate matching of coplanar structures across
multiple images, and of locating structures that correspond
to a model consisting of significant planar
patches. The main point of this paper is that the full
perspective matching problem for coplanar structures
can often be reduced to a simpler four parameter affine
matching problem when the horizon line of the planar
structure can be determined. Given the horizon line,
it is possible to transform the image to show what it
would have looked like if the camera's line of sight had
been perpendicular to the object plane. This process
is called rectification in aerial photogrammetry.

contract DAAE07-91-C-R03B and by the RADIUS project
under DARPA/Army contract TEC DACA7B-92-1U0039.

2 Planar Transformations

Essentially all matching problems involve solving for
both a discrete correspondence between two sets of features
(model-image or image-image) and an associated
transformation that maps one set of features into
registration with the other. These two solutions together
constitute matching: a match being a correspondence
plus a transformation. For planar structures under a
perspective camera model, the relevant set of transformations
is the eight parameter projective transformation
group [Faug88].

More restrictive transformations are worth special attention.
Often these transformations are more easily
computed, thus making matching easier. One such
special case occurs for frontal planes: planar structures
viewed "head-on" with the viewing direction of the
camera held perpendicular to the object plane. When
the intrinsic camera parameters are known, perspective
mapping of a frontal plane to its appearance in
the image can be described with just four affine
parameters: an image rotation angle, a 2D translation
vector, and an image scale [Sawh92].

2.1 Frontal Planes

Under the standard pinhole camera model, the image
projection of a 3D world point (X, Y, Z) is the image
point (X/Z, Y/Z). In this case, the appearance of any
3D object is governed only by the relative position and
orientation of the camera with respect to the object,
i.e., the camera pose. There are 6 degrees of freedom
for camera pose: three rotation angles of roll, pan and
tilt, and three translations T_x, T_y and T_z. If the camera
is constrained to point directly perpendicular to
an object plane, yielding a frontal view of the plane,
its two pan and tilt angles must stay fixed. This leaves
one free camera rotation about the normal of the object
plane, and the three unconstrained translations,
four parameters in all. The effect of camera roll on the
image plane is an in-plane rotation about the origin.
Translation parallel to the planar surface shows up
as a 2D translation of the image. Finally, translation
directly towards or away from the object plane manifests
itself as a uniform change in scale of the projected
image. These are precisely the effects of the four
parameter affine similarity mapping. The pinhole camera
projection of a frontal plane is therefore described by
four affine parameters that are directly related to the
physical pose of the camera with respect to the plane.
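As a concrete illustration of the four parameter similarity mapping, the sketch below applies a rotation, a uniform scale, and a 2D translation to image points. The function name and argument order are hypothetical; the mapping itself is the standard x' = s R(theta) x + t.

```python
import math


def similarity(points, scale, theta, tx, ty):
    """Apply x' = scale * R(theta) x + (tx, ty) to a list of 2D points."""
    c, s = math.cos(theta), math.sin(theta)
    return [(scale * (c * x - s * y) + tx, scale * (s * x + c * y) + ty)
            for x, y in points]


# Camera roll -> theta, motion parallel to the plane -> (tx, ty),
# motion along the viewing direction -> scale.
square = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
print(similarity(square, 2.0, math.pi / 2, 1.0, 1.0))
```

Angles and length ratios are preserved by this mapping, which is what distinguishes it from the general six parameter affine case discussed next.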

A more realistic camera model must take into account
the intrinsic parameters of the camera lens. To a first
approximation, lens effects are often modeled by a set
of linear parameters including focal length, lens aspect
ratio, optical center, and optical axis skew, whose combined
effects can be described by a six dimensional
affine mapping of the pinhole image onto the actual
raster image [Horn86]. A more realistic model of the
projection of a frontal plane is thus a four parameter
affine mapping of the plane onto an idealised pinhole
plane, followed by a six parameter affine mapping onto
the actual measured image.

In summary, the perspective projection of a frontal
plane is described in general by a six parameter affine
transformation. When a calibrated camera is used, its
intrinsic lens effects are known and can be inverted
to recover the ideal pinhole projection image. After
correction for lens effects, the frontal view of a plane
can be described by a four parameter affine similarity
mapping.


2.2  Arbitrary  Orientations 

For  planes  viewed  at  arbitrary  orientations,  the  full 
six  degrees  of  pose  freedom  may  manifest  themselves 
in  the  image.  The  two  camera  rotation  angles,  pan  and 
tilt,  not  used  for  frontal  images,  are  directly  related  to 
the  tilt  of  the  object  plane  with  respect  to  the  camera’s 
line  of  sight.  Under  perspective  projection,  lines  that 
are  parallel  on  a  tilted  object  plane  appear  to  converge 
in the image plane, intersecting at a vanishing point.
The locus of vanishing points of coplanar parallel lines
of all orientations forms a line in the image called the
vanishing line or horizon line of the plane.

For frontal planes, all parallel lines on the object plane
remain parallel in the image. By convention a set of
parallel lines in the image is said to intersect in a point
"at infinity." When all vanishing points appear at infinity,
the vanishing line passing through them is also
said to be at infinity. Since a transformation is affine
if and only if all parallel lines of arbitrary orientation
remain parallel, it follows that the defining feature of
a frontal view of a coplanar structure is that the vanishing
line of that structure appears at infinity.

Conversely, by applying a projective mapping taking
the vanishing line of an image of a coplanar structure
to the line at infinity, the vanishing points of all lines
in the plane will now be at infinity, hence all parallel
lines in the planar structure must now be parallel in
the image. This implies that the new image is a frontal
view of the original set of coplanar lines.
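The idea of sending a vanishing line to infinity can be illustrated in homogeneous coordinates. In the sketch below the two vanishing points are made-up values, the vanishing line is their cross product, and the matrix H is one arbitrary projective map whose third row is that line; it is chosen only because it is the simplest such map, not because it is the rectifying transformation the paper selects.

```python
def cross(p, q):
    """Cross product of two homogeneous 3-vectors (line through two points)."""
    return (p[1] * q[2] - p[2] * q[1],
            p[2] * q[0] - p[0] * q[2],
            p[0] * q[1] - p[1] * q[0])


def apply_h(h, p):
    """Multiply a 3x3 matrix (list of rows) by a homogeneous point."""
    return tuple(sum(h[i][j] * p[j] for j in range(3)) for i in range(3))


V1 = (4.0, 1.0, 1.0)       # assumed vanishing point (x, y, 1)
V2 = (-2.0, 3.0, 1.0)      # assumed second vanishing point
LINE = cross(V1, V2)       # vanishing line (a, b, c): a*x + b*y + c = 0

# Any H whose third row is proportional to (a, b, c) maps the line to w = 0.
H = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0],
     list(LINE)]

print(LINE, apply_h(H, V1), apply_h(H, V2))
```

Points on the vanishing line come out with a zero third coordinate, i.e. at infinity, while points off the line stay finite.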


2.3  Rectification 


We  have  seen  that  the  vanishing  line  of  a  frontal  plane 
appears  at  infinity  in  the  image  plane,  and  further¬ 
more,  that  it  is  possible  to  recover  a  frontal  view  from 
the image of a tilted object plane by applying a projective
transformation that maps the object's vanishing
line to infinity. There is, however, a whole
six-dimensional space of projective transformations that
all map a given line in the image off to infinity. How
to choose a "best" one is described in this section.

For a pinhole camera image, the position and orientation
of the vanishing line of an object plane determines
the true 3D orientation of the plane with respect to the
camera's line of sight. When the equation of the vanishing
line is ax + by + c = 0, the normal to the object
plane, in camera coordinates, is

n = (a, b, c)/||(a, b, c)||. (1)


For a frontal plane, the normal of the plane must be
parallel to the Z camera axis, since the plane is perpendicular
to the line of sight. If the camera could
move, the image of a frontal plane could be recovered
from the image of a tilted plane by merely rotating
the camera to point directly towards the plane. The
camera can no longer be moved physically, of course,
but the image can be transformed artificially to achieve
the desired 3D rotation.


Assume the unit orientation of the object plane has
been determined to be n, as in equation 1, oriented
into the image (c > 0). To bring this vector into
coincidence with the positive Z axis requires a rotation of
angle cos^{-1}(n · (0, 0, 1)) about the axis n × (0, 0, 1).
The effects of this camera rotation on the image can
be simulated by an invertible projective transformation
in the image plane [Kana88]. In homogeneous
coordinates,

where

$$a^2c + b^2, \qquad ab(c-1), \qquad a^2 + b^2c$$

The image is transformed to appear as it would have
if the camera had been pointing directly towards the
object plane. The result therefore is a frontal view of
the object plane, as seen by a pinhole camera, i.e. a
rectified four parameter affine view.
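The rotation described above can be written out explicitly. The sketch below builds it with the Rodrigues formula for the rotation about n × (0, 0, 1) that carries the unit normal n onto the Z axis; it assumes a pinhole image with unit focal length, and the function names and matrix layout are this example's own, not taken from the paper.

```python
import math


def plane_normal(a, b, c):
    """Unit normal from the vanishing line a*x + b*y + c = 0, oriented so that c > 0."""
    norm = math.sqrt(a * a + b * b + c * c)
    n = (a / norm, b / norm, c / norm)
    return n if n[2] > 0 else (-n[0], -n[1], -n[2])


def rectifying_rotation(n):
    """Rotation about n x (0, 0, 1) taking the unit vector n to (0, 0, 1)."""
    a, b, c = n
    k = 1.0 / (1.0 + c)                 # equals (1 - c) / (a^2 + b^2) for a unit vector
    return [[1.0 - a * a * k, -a * b * k, -a],
            [-a * b * k, 1.0 - b * b * k, -b],
            [a, b, c]]


def apply(h, p):
    """Multiply a 3x3 matrix (list of rows) by a homogeneous 3-vector."""
    return tuple(sum(h[i][j] * p[j] for j in range(3)) for i in range(3))


N = plane_normal(1.0, 2.0, 2.0)
R = rectifying_rotation(N)
print(apply(R, N))
```

Since R is orthonormal, line coefficients transform by R as well, so the vanishing line (a, b, c) is carried to (0, 0, 1), the line at infinity. The upper-left entries reduce to (a^2 c + b^2)/(a^2 + b^2), ab(c - 1)/(a^2 + b^2) and (a^2 + b^2 c)/(a^2 + b^2) for unit n.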

This mapping can be used to map a vanishing line to
infinity even when the intrinsic calibration parameters
are not known. However, when the original image is
not a pure pinhole image, the position of the vanishing
line in the image can no longer be interpreted
geometrically in terms of 3D plane orientation, and the
resulting unwarped image will be in general some six
parameter affine view of the object plane.

3  Correspondence  Matching 

Plane rectification forms the essence of our approach
to matching perspective images of coplanar structures.
The idea is to find the vanishing line of an object plane
in the image by any means possible, then apply a projective
transformation that maps it to the line at infinity,
thereby producing an affine frontal view of the
original object plane. The effect is to reduce a perspective
matching problem to a simpler affine matching
problem.

To search for the optimal affine map and correspondence
between two sets of line segments, an efficient
and robust local search algorithm is used [Beve90].
The local search matching algorithm searches the
discrete space of correspondence mappings between
model and image features for one that minimises a
match error function. The match error depends upon
the relative placement implied by the correspondence.
More particularly, to compute the match error the
model is placed in the scene so that the appearance
of model features is most similar to the appearance of
corresponding image features. The more similar the
appearance, the lower the match error.

To find the optimal match, probabilistic local search
relies upon a combination of iterative improvement
and random sampling. Iterative improvement refers
to a repeated generate-and-test procedure by which
the algorithm moves from an initial match to one that
is locally optimal. This is done by repeatedly testing
a local neighborhood of matches defined with respect
to the current match. Each neighbor is a distinct
correspondence mapping between model and image features.
Tractable neighborhood sizes, for instance n
neighbors in a space of 2^n possible matches, tend to
yield tractable algorithms. However, there is an art
to designing small neighborhoods that do not induce
a profusion of local optima. New neighborhood definitions
have been developed that are particularly well
suited to shape-matching [Beve90].
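The generate-and-test loop and the random restarts can be sketched on a toy problem. Everything specific here is an assumption made for illustration: a match is encoded as n booleans (bit i says whether candidate pair i is in the correspondence), the neighborhood is the n single-bit flips, and the error function is a stand-in Hamming distance rather than the geometric match error or the neighborhoods of [Beve90].

```python
import random


def local_search(error, n, rng):
    """Iterative improvement: accept any single-bit flip that lowers the error."""
    match = [rng.random() < 0.5 for _ in range(n)]
    best = error(match)
    improved = True
    while improved:
        improved = False
        for i in range(n):                  # the n neighbors of the current match
            match[i] = not match[i]
            e = error(match)
            if e < best:
                best, improved = e, True
            else:
                match[i] = not match[i]     # undo a flip that did not help
    return match, best


def random_restart_search(error, n, trials, seed=0):
    """Run local search from several random initial matches and keep the best result."""
    rng = random.Random(seed)
    results = [local_search(error, n, rng) for _ in range(trials)]
    return min(results, key=lambda r: r[1])


TARGET = [True, False, True, True, False, False]    # hypothetical true correspondence


def toy_error(match):
    """Stand-in match error: number of bits that disagree with TARGET."""
    return sum(m != t for m, t in zip(match, TARGET))


print(random_restart_search(toy_error, len(TARGET), trials=5))
```

This toy error has a single optimum, so one trial already suffices; the restarts matter for error functions with many local optima, which is the situation the text describes.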

Despite clever neighborhood definitions, local search
can become stuck on local optima. Random sampling
offers a probabilistic solution to this problem. The
probability of finding the globally optimal match starting
from a randomly chosen initial match is analogous
to the probability of getting heads when flipping an
unfair coin. Even with an unfair coin it is a good bet
that heads will appear at least once in a large number
of throws. For instance, using a coin that only comes
up heads in 1 out of 10 throws, the odds of getting
heads 1 or more times in 50 throws are 99 out of 100.
Similarly for local search matching, even if the probability
of seeing the optimal match on a single trial is
low, the probability of seeing the optimal match in a
large number of trials is high.
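The coin analogy is the complement rule for independent trials: the chance of at least one success in t trials is 1 - (1 - p)^t. A brief sketch, with hypothetical function names, that also inverts the rule to ask how many trials are needed for a target confidence:

```python
import math


def prob_at_least_one(p_single, trials):
    """Probability of at least one success in independent trials."""
    return 1.0 - (1.0 - p_single) ** trials


def trials_needed(p_single, confidence):
    """Smallest number of trials giving at least the requested confidence."""
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p_single))


print(prob_at_least_one(0.1, 50))   # the 1-in-10 coin thrown 50 times
print(trials_needed(0.1, 0.99))
```

The independence assumption corresponds to drawing each initial match at random without regard to earlier trials.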

The combination of iterative refinement and random
sampling has proven to be very effective. This basic
form of algorithm reliably finds excellent, and
usually globally optimal, matches under difficult
circumstances. The algorithm performs well even when
scenes are highly cluttered and significant portions of
a model instance are fragmented or missing entirely.

4  Examples 

Although other methods are available (see discussion
in Section 5), the results in this paper rely exclusively
on vanishing point analysis for finding vanishing lines
in the image. This simple approach works surprisingly
well for many man-made scenes: indoor,
outdoor, and aerial. Vanishing points are found using
a standard Hough transform approach [Barn83]. Each
line in the image is entered into a two dimensional
Hough array representing the surface of a unit hemisphere.
Each image line "votes" in a great (semi)circle
of accumulators, and potential vanishing points are
detected as peaks in the array where several great circles
intersect. For most man-made scenes, either two or
three clusters will dominate the Hough array; these clusters
correspond to the vanishing points of the two or
three dominant line directions in the scene. Each pair
of vanishing points defines a vanishing line for planes
consistent with those line orientations.

At present, only a four parameter affine version of
the local search system is implemented. We
therefore needed to know the calibration parameters of
the camera for each experiment. It should be stressed
that only rough knowledge of the calibration parameters
is generally needed to find acceptable matches.
The most important parameters to determine are focal
length and aspect ratio. We assumed for all our
experiments that the image center was at the center
of the image, and the optical X and Y axes were perpendicular
(no skew) and aligned with the row and
column axes of the raster image. Aspect ratio was
determined from the camera manufacturer's specifications,
when available; otherwise it was assumed to be
one-to-one (square). The focal length for each experiment
was computed as a byproduct of vanishing point
analysis and a priori knowledge that the actual angle
made by the two dominant line directions is 90 degrees
[Capr90]. This focal length was chosen by finding the
distance of the focal point from the image that resulted
in perpendicularity of the two vectors from the
(variable) focal point towards the two (fixed) vanishing
points in the image.

Figure 1: Model to image matching example on a
poster image: a) data lines from poster image, b)
poster model, c) rectified poster data lines, d) poster
model registered with the image data.
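The perpendicularity condition has a closed form when pixels are square and the image center is known, as assumed in the text. The vectors from the focal point to the two vanishing points are (x1 - cx, y1 - cy, f) and (x2 - cx, y2 - cy, f); setting their dot product to zero gives f^2 = -((x1 - cx)(x2 - cx) + (y1 - cy)(y2 - cy)). The sketch below uses a hypothetical function name and made-up coordinates.

```python
import math


def focal_length(v1, v2, center=(0.0, 0.0)):
    """Focal length (in pixels) from the vanishing points of two perpendicular directions."""
    dx1, dy1 = v1[0] - center[0], v1[1] - center[1]
    dx2, dy2 = v2[0] - center[0], v2[1] - center[1]
    dot = dx1 * dx2 + dy1 * dy2
    if dot >= 0.0:
        raise ValueError("vanishing points are inconsistent with perpendicular directions")
    return math.sqrt(-dot)


# Vanishing points of directions (1, 2, 2) and (2, 1, -2) seen with f = 800
# and image center (320, 240); the two directions are orthogonal.
print(focal_length((720.0, 1040.0), (-480.0, -160.0), center=(320.0, 240.0)))
```

The estimate degrades when either vanishing point is very far from the image center, since small errors in its position then change the dot product substantially.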

4.1  Model  to  Image  Matching 

Figures 1a) and b) show a set of straight line segments
extracted from an image of a wall poster using the
Burns algorithm [Burn86], and a set of model lines to
be matched to the image. The first stage in matching
is to detect two clusters of lines converging to the
two main vanishing points in the image, and from the
resulting vanishing line rectify the image to present a
frontal view of the poster (Figure 1c).

The four parameter affine match found by the local
search matching algorithm yielded a set of correspondences
between model lines and image lines.
These correspondences were used to estimate an eight
parameter planar projective transformation to bring
the model lines into registration with the image data
lines, using the least-squares estimation procedure of
[Faug88]. Figure 1d shows the transformed model
overlaid on the input image lines.


Figure 2: Image to image matching example on an
aerial image: a) image lines from nadir view, b) image
lines from oblique view, c) unwarped oblique view, d)
registration of nadir view with unwarped oblique view.


4.2  Image  to  Image  Matching 

Because it does not rely on computing 3D object pose,
the current formalism extends easily to image to image
correspondence matching. In this case, both images
are rectified using the techniques of the last section,
and one is treated as the model while the other
becomes the data to be matched. The goal is to discover
a transformation that maps one set of rectified
image lines into another.

When both cameras are calibrated, both images can be
rectified into frontal views of the object plane. Since
the mapping from one image to another can be written
by inverting one transformation and composing it with
the other, and since the four parameter affine group is
closed under inversion and composition, the resulting
image to image transformation can be described by
a four parameter affine similarity map. As may be
expected, when either camera is uncalibrated, the resulting
transformation between unwarped views is a
general six parameter affine mapping.

Figure 2 shows an example of image to image matching
in the context of aerial image registration. Figures 2a)
and b) show sets of extracted straight line segments
from two aerial photographs. In the first image, the
camera is nearly perpendicular to the ground plane,
a fact verified by vanishing point analysis, which finds
two orthogonal sets of nearly parallel lines. The second
image is clearly not a frontal view.^1 Figure 2c shows
the image after rectification based on vanishing point
analysis.

To apply the local search matching algorithm, image 1
was assumed to be the model and the unwarped lines
from image 2 the data. Both line sets were filtered
to only include lines greater than 100 pixels long,
reducing the matching problem to 55 long lines in one
image and 68 lines in the other. The best match found
is displayed in Figure 2d.

5  Issues  and  Extensions 

The domains we anticipate are scenes depicting either
indoor or urban outdoor environments with much planar
and parallel linear structure. Such scenes often
contain lines and planes in two or three dominant
directions. The approach to matching taken in this
paper requires each plane to be matched separately;
therefore there needs to be some way to partition lines
in the image into sets belonging to planes in the world.
This would be nearly impossible in monocular images,
were it not for the rich structure of man-made environments,
suggesting domain-specific heuristics based
on corners and perpendicularity. In particular,
L-junctions made of two lines from different vanishing
point clusters are good candidates for coplanar corners.
We are currently exploring heuristic geometric
methods, as well as more formal approaches based on
projective invariance, for partitioning image lines into
coplanar groups.

We are also exploring other methods besides vanishing
point analysis for detecting the horizon line of an
object plane's image projection. Some possibilities are
analysis of texture gradients [Gard91], and exploitation
of the perspective properties of convex planar
curves [Arns89].

When structures are present in the scene that deviate
significantly from coplanarity with respect to the
viewing distance, then their correspondences may not
be adequately found using the above techniques. However,
to the extent that some scene features are coplanar
and are found, this initial set of planar correspondences
provides strong constraints on the remaining
features, particularly when the cameras are calibrated,
in which case the relative rotation and direction
of translation between the two camera positions
can be computed from the planar perspective transformation
[Faug88]. This reduces the problem to that
of induced stereo, where point correspondences must
lie along known epipolar lines. Even for uncalibrated
camera systems, knowledge of the perspective transformation
relating the image features of a single plane
in the scene constrains the positions of point features
in one image to lie along epipolar lines in the other
image.

^1 The term "frontal" was coined with terrestrial robotics
in mind; in the aerial domain the correct term is "nadir".

In its current form, the local search affine matcher described
in this paper is used for image to image feature
matching simply by declaring the features in one image
to be a model. This is not ideal, since the current
treatment of model and image lines is not symmetric.
Future work on the affine matcher may include
developing a more symmetric error metric for image
to image matching, and extending the range of the
match transformation space to handle six parameter
affine matching so that images from uncalibrated camera
systems can be used.

References

[Anan87] P. Anandan, "Measuring Visual Motion from
Image Sequences," Ph.D. Thesis and COINS Tech
Report 87-21, University of Massachusetts, Amherst,
MA, 1987.

[Arns89] J. Arnspang, "Moving Towards the Horizon of
a Planar Curve," IEEE Workshop on Visual Motion,
1989, pp. 84-69.

[Barn83] S.T. Barnard, "Interpreting Perspective Images,"
AI Journal, Vol. 21, No. 4, November 1983, pp. 435-
462.

[Beve90] J.R. Beveridge, R. Weiss and E.M. Riseman,
"Combinatorial Optimization Applied to Variable
Scale 2D Model Matching," Proceedings IEEE International
Conference on Pattern Recognition, Atlantic
City, June 1990, pp. 18-23.

[Burn86] J.B. Burns, A.R. Hanson and E.M. Riseman,
"Extracting Straight Lines," IEEE Transactions on
Pattern Analysis and Machine Intelligence, Vol. 8,
No. 4, July 1986, pp. 425-455.

[Capr90] B. Caprile and V. Torre, "Using Vanishing Points
for Camera Calibration," International Journal of
Computer Vision, Vol. 4, 1990, pp. 127-140.

[Faug88] O.D. Faugeras and F. Lustman, "Motion and
Structure from Motion in a Piecewise Planar Environment,"
International Journal of Pattern Recognition
and Artificial Intelligence, Vol. 2, 1988, pp. 485-508.

[Gard91] J. Garding, "Shape from Surface Markings,"
Ph.D. dissertation, Royal Institute of Technology,
S-100 44 Stockholm, Sweden, May 1991.

[Horn86] B.K.P. Horn, Robot Vision, MIT Press, Cambridge,
MA, 1986.

[Kana88] K. Kanatani, "Constraints on Length and Angle,"
Computer Vision, Graphics, and Image Processing,
Vol. 41, 1988, pp. 28-42.

[Sawh92] H.S. Sawhney, Ph.D. Thesis, Computer Science
Department, University of Massachusetts, Amherst,
MA, 1992.


Automatic  Finding  Of  Main  Roads  In  Aerial  Images  By  Using 
Geometric  -  Stochastic  Models  and  Estimation  * 

Meir Barzohar, David B. Cooper

Laboratory for Engineering Man/Machine Systems
Division of Engineering, Brown University,
Providence, RI 02912


Abstract 

This paper presents an automated approach to finding
main roads in aerial images. The approach is to
build geometric-probabilistic models for road image
generation. Then, given an image, roads are found
by MAP (maximum a posteriori probability) estimation.
The MAP estimation is handled by partitioning
an image into windows, realizing the estimation in
each window through the use of dynamic programming,
and then, starting with the windows containing
high confidence estimates, using dynamic programming
again to obtain global estimates of the
roads present. The approach is model-based from
the outset and is completely different from those
appearing in the published literature. It produces two
boundaries for each road, or four boundaries when a
mid-road barrier is present.

1  Introduction 

In this paper we introduce a new approach to building
models for main roads in aerial images and a
new computational approach to finding them. The
goal is a completely automated system. The idea is
that this approach could then be extended to finding
other types of objects in aerial images of the ground.
In recent years a number of papers have appeared in
the published literature dealing with semi-automatic
extraction of roads from aerial photos. In general a
human operator gives the road starting points and
the road directions at the starting points. This is
a huge help to the road finding algorithm. This interaction
has been necessary because road images
can be very complicated. Examination of just the
two images in this paper, Figs 6 and 8, reveals that
image intensity across a road can vary noticeably.
There may be a barrier running along the center
of the road. Road width can vary appreciably, especially
when a barrier is present. Image intensity
edges along a road boundary may disappear, especially
when there is a building entrance with a very
small parking area alongside the road. The image
intensity structure at road intersections can be very
complicated. There may be cars or markings on
roads, or shadows cast by buildings or trees, etc.
Three major types of road finder were used:
edge linkers, correlation trackers, and region based
followers.

Edge  linkers,  based  on  an  edge  operator  output  for 
magnitude  and  angle  for  ea^  point  in  the  image 
and  then  linking  the  edges  according  to  some  crite¬ 
ria,  were  used  first  by  [6]  and  later  by  [8]  and  by  [2]. 

*This work was partially supported by NSF-DARPA Grant #IRI-8905436


Correlation trackers, based on the assumption that there exists a pattern or texture on the road surface, were used first by [7], and later by [2] in combination with edge linkers. A region based method assuming constant intensity in the region and in the background was used first by [4] with a correlation follower.

In [1] and [3], a Bayesian approach to low-level boundary estimation and object recognition was introduced. The problem considered in this paper is the automatic extraction of main roads when road curvature, width, image intensity and edge strength can vary considerably and when a barrier along the road center may or may not be present. The approach is general, and we feel that it can be extended to handle the full range of road image variability. The approach is to build geometric-stochastic models for representing road images, and then use maximum a posteriori probability (MAP) estimation for estimating the road boundaries (and other important features) in an image. The modeling approach forces the designer to model all significant phenomena, and the model is generative so that its representation power can be assessed. The MAP estimation provides for the most accurate road finding. Global MAP estimation is realized in a computationally reasonable way by using dynamic programming to search small windows to obtain initial road candidates, and then dynamic programming again with small windows in order to obtain global estimates.

2  Road  Generation 

2.1  Road  Geometry  And  Internal  Grey 
Level  Model 

We  build  a  geometric-stochastic  road  model  based 
on  the  following  assumptions: 

1)  Road  width  variance  is  small  and  road  width 
change  is  likely  to  be  slow. 

2) Road  direction  changes  are  likely  to  be  slow. 

3) Road  grey  level  is  likely  to  vary  only  slowly. 

4)  Grey  level  variation  between  road  and  background 
is  likely  to  be  large. 

5)  Roads  are  unlikely  to  be  short. 

A stochastic process model is built exhibiting the preceding behavior. Specifically, autoregressive processes are designed to model road center line, road width, grey level within the road, edge strength at the road boundary, and regions outside the roads and adjacent to the boundaries. We refer to these regions as background. Note that the road geometry processes are hidden, i.e., they are not observed directly in the data. For this purpose, consideration is restricted to an $N \times N$ pixel window. The stochastic processes are functions of a discrete parameter $i$ which can be thought of as time or distance, in pixels, along a horizontal or a vertical axis. As an example, consider Fig 1. The $i$ axis here is horizontal. The road center line at $i$ is $x_i$, which takes values in the vertical direction. This variable is not quantized; it takes arbitrary real values. The $\{x_i\}$ process is given by (1), where $\epsilon_{x_i}$ is a zero mean, white, Gaussian driving noise. This process is designed to generate a straight line if the driving noise is zero.

$$x_i = 2x_{i-1} - x_{i-2} + \epsilon_{x_i} \qquad (1)$$

$$d_i = d_{i-1} + \epsilon_{d_i} \qquad (2)$$

Equation (2) describes road width, where $d_i$ is the perpendicular road width through the unquantized center point $[x_i, i]$. The perpendicular direction is measured as perpendicular to the line segment from the point $[x_{i-2}, i-2]$ to the point $[x_{i-1}, i-1]$. The stochastic processes $\epsilon_{x_i}$ and $\epsilon_{d_i}$ are independent Gaussian white noise sequences with zero means and variances $\sigma_x^2$ and $\sigma_d^2$, respectively. The road obtained for the unforced solution ($\epsilon_{x_i} = 0$, $\epsilon_{d_i} = 0$) will be two parallel lines, as seen in Fig 1. Road boundary locations are uniquely determined by the $x_i$ and $d_i$. The road boundary locations on the grid are determined as in Figure 1, where $x_{i,1}$ denotes the upper unquantized boundary and $x_{i,2}$ denotes the lower unquantized boundary.

Figure 1: (A) A road in a window for the unforced solution. (B) Two roads in a window.

We use a second order Markov process to model the mean intensity of the image data in sequential vertical strips of the road, to be consistent with feature 3 of our road model.

$$u_i = \frac{1}{2}u_{i-1} + \frac{1}{2}u_{i-2} + \epsilon_{u_i} \qquad (3)$$

The variable $u_i$ is the mean intensity in column $i$ of the window, and $\epsilon_{u_i}$ is a Gaussian white noise sequence with zero mean and variance $\sigma_u^2$, and is independent of the processes $\epsilon_{x_i}$ and $\epsilon_{d_i}$. The intensity varies across the vertical direction of the road too. We model this by adding white noise. Therefore, the observed picture function at the $(j,i)$th pixel ($j$th row, $i$th column) in the road is $y_{j,i}$, given in (4), where $n_{j,i}$ is Gaussian white noise with zero mean and variance $\sigma^2$.

$$y_{j,i} = u_i + n_{j,i}, \quad i = 1..N, \quad j = x_{i,1}..x_{i,2} \qquad (4)$$

Assume for now that we deal with a step edge (we also deal with the blurred edge, but this is more involved). The observed image intensities $y_{x_{i,2}+1,i}$ and $y_{x_{i,1}-1,i}$ immediately outside the lower and upper road boundary, respectively, are determined by:

$$y_{x_{i,2}+1,i} = u_i + r_2\,\theta_{l,i}, \quad i = 1..N \qquad (5)$$

$$y_{x_{i,1}-1,i} = u_i + r_1\,\theta_{u,i}, \quad i = 1..N \qquad (6)$$

where $r_2$ and $r_1$ are random variables taking values $\pm 1$ with equal probabilities, and $\theta_{l,i}$ and $\theta_{u,i}$ are white Gaussian iid sequences having some nonzero mean. The purpose of $r_2$ and $r_1$ is to model whether each of the background grey levels is lighter or darker than that in the road.

Assume that the image intensities in the background regions above and below the road boundaries are modeled by different Markov processes. These are causal Gaussian AR (autoregressive) models. The lower background AR model is:

$$y_{j,i} - \mu_{j,i} = \sum_{k=0}^{1}\sum_{l=0}^{1} \beta_{k,l}\,\bigl(y_{j-k,i-l} - \mu_{j-k,i-l}\bigr) + \epsilon_{y_{j,i}}, \qquad (k,l) \neq (0,0)$$

where $\beta_{k,l}$ are the model parameters, $\mu_{j,i}$ is the mean value function, and $\epsilon_{y_{j,i}}$ is a Gaussian white noise driving sequence with zero mean and variance $\sigma_y^2$. Note that this model generates image data in raster scan mode, left to right, top to bottom. The conditional distribution of the process at pixel $(j,i)$ given the previously generated process depends only on the three pixels immediately above and to the left of $(j,i)$. A similar model generates the background above the road, but here the process generation is left to right, bottom to top. The reason for using causal AR models is that they are computationally well suited to the dynamic programming road estimation algorithm used. With the specified road edge and background models, it is now possible to specify the probability of the background data. We are expecting to detect roads having lengths between $L_{min}$ and $L_{max}$ with uniform distribution. That feature is very useful in the high level processing, whereas the other features are more useful in the low level processing. In Fig 2 are displayed various synthetic images which were generated by the road model described in this section.
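To make the generative road model more concrete, the following is a minimal Python sketch of sampling one road in a window from recursions of the kind used in this section. It is an illustration under stated assumptions, not the authors' implementation: the function name `generate_road_window`, the `slope` argument, and every default parameter value are hypothetical; the width is assumed to follow a first-order random walk and the column mean grey level an equal-weight second-order average; and the boundaries are placed vertically at the center line plus or minus half the width, rather than along the true perpendicular.

```python
import random


def generate_road_window(n=32, x0=16.0, slope=0.0, d0=6.0, u0=100.0,
                         sigma_x=0.05, sigma_d=0.1, sigma_u=1.0, sigma_n=2.0,
                         seed=0):
    """Sample a road in an n-column window (all defaults are hypothetical)."""
    rng = random.Random(seed)
    x = [x0, x0 + slope]   # center line, two start values for the 2nd-order recursion
    d = [d0, d0]           # road width
    u = [u0, u0]           # mean grey level of each road column
    for i in range(2, n):
        # Second-order recursion: a straight line when the driving noise is zero.
        x.append(2.0 * x[i - 1] - x[i - 2] + rng.gauss(0.0, sigma_x))
        # Assumed first-order random walk for the width.
        d.append(d[i - 1] + rng.gauss(0.0, sigma_d))
        # Assumed equal-weight second-order average for the mean grey level.
        u.append(0.5 * u[i - 1] + 0.5 * u[i - 2] + rng.gauss(0.0, sigma_u))

    columns = []
    for i in range(n):
        # Simplification: boundaries taken vertically, not perpendicular to the road.
        top = int(round(x[i] - d[i] / 2.0))
        bottom = int(round(x[i] + d[i] / 2.0))
        # Grey level inside the road: column mean plus white noise.
        columns.append({j: u[i] + rng.gauss(0.0, sigma_n)
                        for j in range(top, bottom + 1)})
    return x, d, u, columns


if __name__ == "__main__":
    x, d, u, columns = generate_road_window()
    print("center line:", [round(v, 2) for v in x[:5]])
    print("rows in first column:", sorted(columns[0]))
```

With all noise standard deviations set to zero the sketch produces a straight road of constant width and constant grey level, which corresponds to the unforced solution described above.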


3 Road finding as a MAP estimation problem

Our general framework for viewing road finding is to estimate the geometry of the road by formulating the problem as MAP estimation. We look for that road for which $P(\text{hypothesized road model} \mid \text{image data})$ is a maximum. Since this posterior likelihood of a hypothesized road model given the image data can be written as

$$\frac{P(\text{hypothesized road model},\ \text{image data})}{P(\text{image data})}$$

and the denominator is not a function of the hypothesized road model, the road model estimate can be found as that which maximizes the numerator, i.e., the joint likelihood of a hypothesized road model and the image data. The MAP estimation is handled by partitioning the image into windows and realizing the estimation parameters in each window through the use of dynamic programming (details are given in [5]). Our approach also includes decision rules for starting points of a road in a window, and stopping points of a road in a window.
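The role of dynamic programming in the window search can be pictured with a small example. The sketch below is a simplified stand-in and not the algorithm of [5]: it assumes the hidden state is only the quantized row of the road center in each column, that `cost[i][j]` is a precomputed negative log-likelihood of the data for row `j` in column `i`, and that the prior is a hypothetical first-order penalty on row changes (the paper's model is second order and also carries width and grey level). The names `map_path`, `jump_penalty` and `max_jump` are illustrative only.

```python
def map_path(cost, jump_penalty=1.0, max_jump=1):
    """Return the row sequence minimising data cost plus a row-change penalty.

    Minimising this sum of negative log terms is equivalent to maximising the
    joint likelihood of the path and the data under the simplified model.
    """
    n_cols = len(cost)
    n_rows = len(cost[0])
    best = list(cost[0])   # best accumulated cost ending at each row
    back = []              # back-pointers, one list per column after the first
    for i in range(1, n_cols):
        new_best = []
        pointers = []
        for j in range(n_rows):
            lo = max(0, j - max_jump)
            hi = min(n_rows, j + max_jump + 1)
            # Cheapest predecessor row k for landing on row j.
            prev_cost, prev_row = min(
                (best[k] + jump_penalty * abs(j - k), k) for k in range(lo, hi)
            )
            new_best.append(prev_cost + cost[i][j])
            pointers.append(prev_row)
        best = new_best
        back.append(pointers)

    # Backtrack from the cheapest final row.
    row = min(range(n_rows), key=lambda j: best[j])
    path = [row]
    for pointers in reversed(back):
        row = pointers[row]
        path.append(row)
    path.reverse()
    return path


if __name__ == "__main__":
    toy_cost = [[0, 5, 5], [5, 0, 5], [5, 0, 5], [5, 5, 0]]
    print(map_path(toy_cost))
```

The work grows linearly with the number of columns and with the number of allowed transitions per row, which is what makes a search over all paths in a small window computationally reasonable.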

3.1  The  complete  low  level  processing 

In the low level processing we are searching for seeds of roads. The image is divided into square windows, $N \times N$ pixels in each window, and the system searches for road candidates that fit the road model with high probability by using the dynamic programming structure. We run the road search four times, with a separate search starting at each of the four sides of the window, to cover all the possible road geometries. Fig 3a shows examples. There may be more than one candidate road within a window. This is handled by searching a window for the best road, and then repeating the window search for the next best road. We do not want the second-best road to be a variation of the best road. This is handled by not permitting boundary points for the best road also to be boundary points for roads in subsequent searches. In this way, the system will find a pair of roads such as in Figure 1b. The procedure is repeated until all road candidates are found. Fig 3b illustrates another example, a road junction where a main road forks into two roads. In a first search, starting on the right side of the window, road1 between points H, K, E, F will be found, then road2 between points L, M, F, G will be found, and road3 between points A, B, C, D will be found by a search starting at the left side of the window. The splitting area represented by points C, D, E, F will not be found at this level. The hidden structure for every road candidate is found and passed to the high-level processing.


4 Combining road candidates

A high level processing stage is now used to extend each road candidate produced by the low level window search in order to obtain global road estimates. This is done by using shifting windows, as illustrated in Fig 4, where a new window is introduced at an end of a road candidate, centered on one of the sides of the window. The best extension of the road candidate through the window is estimated by using the dynamic programming algorithm. This process is repeated until the stopping criterion stops the road search or until the estimated road hits the image border. In the process of extending a road through a new shifted window, the dynamic programming algorithm begins by using the last estimated state of the road and the associated road image data. Upon termination of the road search, if the length of the estimated road from the initial road candidate is greater than a threshold, the estimated road is accepted. If the length is less than the threshold, the estimated road is rejected, unless there is other supporting evidence. The supporting evidence we have used in the experiments is that if three long roads enter an intersection and the short road is adjacent to the intersection and appears to be an extension of one of them, i.e., has roughly the direction and width of one of them, then the short road is accepted.

5 Experimental Road Results

This section describes the results of road finding in the synthetic images (Fig 2), and two different real images called Rad1 (Fig 6) and Rad2 (Fig 8) that were obtained from the Radius Program. The goal is to find the roads in the real and synthetic images using our approach. The image field is partitioned into an array of square windows (32x32). The road finder runs simultaneously within the windows to find initial road seeds in the images (section 3). It then combines all the local results to obtain the final main roads in the images by using the high-level approach. Using the low level approach only for the synthetic images, the results in Figure 5 were obtained. The recognized road boundary points are indicated by black dots. The recognized roads for Rad1 are indicated in Fig 7 (details are given in [5]). The recognized barriers and roads for Rad2 are indicated in Fig 9. The system starts first with recognizing the road barriers, using knowledge that the barrier intensity is lighter than the road surrounding it. The initial road barrier seeds in the image were obtained by low-level processing within the windows, and with the high-level following stage it combines the local results to obtain the final barrier (details are given in [5]). Each side of the barrier is bordered by a road; therefore, by knowing the boundaries of the barriers, the system also knows the corresponding boundaries along one side of each main road. The other boundaries along the second side of each road are estimated by using the high-level approach for road finding.


Figure 2: Various synthetic images generated by our road model.

Figure 3: (a) Disparate roads with different starting and terminating side locations in a window. (b) A main road split into two roads.

Figure 4: Following road candidate through its

Figure 5: The recognized roads, marked by black dots.

Figure 6: Real image number 1, called "Rad1", taken from the Radius program.

Figure 7: The recognized and estimated roads of our road finder in image "Rad1".

Figure 9: The recognized and estimated roads and barriers of our road finder in image "Rad2".


Abstract

The use of artificial neural networks in the domain of autonomous vehicle navigation has produced promising results. ALVINN [Pomerleau, 1991] has shown that a neural system can drive a vehicle reliably and safely on many different types of roads, ranging from paved paths to interstate highways. Even with these impressive results, several areas within the neural paradigm for autonomous road following still need to be addressed. These include transparent navigation between roads of different type, simultaneous use of different sensors, and generalization to road types which the neural system has never seen. The system presented here addresses these issues with a modular neural architecture which uses pretrained ALVINN networks and a connectionist superstructure to robustly drive on many different types of roads.


1.  Introduction 

ALVINN (Autonomous Land Vehicle In A Neural Network) [Pomerleau, 1992] has shown that neural techniques hold much promise for the field of autonomous road following. Using simple color image preprocessing to create a grayscale input image and a 3 layer neural network architecture consisting of 960 input units, 4 hidden units, and 50 output units, ALVINN can quickly learn, using back-propagation, the correct mapping from input image to output steering direction. See Figure 1. This steering direction can then be used to control our testbed vehicles, the Navlab 1 [Thorpe, 1991] and a converted U.S. Army HMMWV called the Navlab 2.

ALVINN  has  many  characteristics  which  make 
it  desirable  as  a  robust,  general  purpose  road 
following  system.  They  include: 

• ALVINN learns the features that are required for driving on the particular road type for which it is trained.

• ALVINN is computationally simple.

• ALVINN learns features that are intuitively plausible when viewed by a human.

•  ALVINN  has  been  shown  to  work  in  a 
variety  of  situations. 

These features make ALVINN an excellent candidate as the building block of a neural system which can overcome some of the problems which limit its use. The major problem this research addresses is ALVINN's lack of ability to learn features which would allow the system to drive on road types other than that on which it was trained. In addition to overcoming this problem, the system must meet the current needs of the autonomous vehicle community, which include:

• The ability to robustly and transparently navigate between many different road types.

• Graceful performance degradation.

• The ability to incorporate many different sensors, which can lead to a much wider range of operating conditions.

From these requirements we have begun developing a modular neural system, called MANIAC for Multiple ALVINN Networks In Autonomous Control. MANIAC is composed of several ALVINN networks, each trained for a single road type that is expected to be encountered during driving. See Figure 1. This system will allow for transparent navigation between roads of different types by using these pretrained ALVINN networks along with a connectionist integrating superstructure. Our hope is that the superstructure will learn to combine data from each of the ALVINN networks and not simply select the best one. Additionally, this system may be able to achieve better performance than a single ALVINN network because of the extra data available from the different ALVINN networks.

2.  The  MANIAC  System 

The MANIAC system consists of multiple ALVINN networks, each of which has been pretrained for a particular road type. They serve as road feature detectors. Output from each of the ALVINN networks is combined into one vector which is placed on the input units of the MANIAC network. The output from the ALVINN networks can be taken from either their output or hidden units. We have found that using activation levels from hidden units provides better generalization results and have conducted all our experiments with this connectivity. The MANIAC system is trained off-line using the back-propagation learning algorithm [Rumelhart, 1986] on image/steering direction pairs stored from prior ALVINN training sessions.

2.1.  MANIAC  Network  Architecture 

The architecture of a MANIAC system which incorporates multiple ALVINN networks consists of a 30x32 input unit retina which is connected to two or more sets of four hidden units. (The M1 connections in Figure 2.) This hidden layer is connected to a second hidden layer by the M2 connections. The second hidden layer contains four units for every ALVINN network that the system is integrating. Finally, the second hidden layer is connected to an output layer of 50 units through the M3 connections. All units in a particular layer are fully connected to the units in the layer below it and use the hyperbolic tangent function as their activation function. Also, a bias unit with constant activation of 1.0 is connected to every hidden and output unit. The architecture of a MANIAC system incorporating two ALVINN networks is shown in Figure 2.
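The layer sizes described above can be summarized in a short forward-pass sketch. This is a plain Python illustration, not the simulator used in this work: the function names, the dictionary layout, and the random initialization range are hypothetical, and the M1 weights are drawn at random here only so that the example is self-contained (in the system described they would be ALVINN A1 weights).

```python
import math
import random

N_INPUT = 30 * 32      # input retina
HIDDEN_PER_NET = 4     # hidden units contributed by each ALVINN network
N_OUTPUT = 50          # steering direction output units


def make_layer(n_in, n_out, rng, scale=0.1):
    """n_out weight rows, each with n_in weights plus one bias weight."""
    return [[rng.uniform(-scale, scale) for _ in range(n_in + 1)]
            for _ in range(n_out)]


def forward_layer(weights, inputs):
    """Fully connected layer with tanh units and a bias unit fixed at 1.0."""
    extended = list(inputs) + [1.0]
    return [math.tanh(sum(w * a for w, a in zip(row, extended)))
            for row in weights]


def build_maniac(n_alvinn, rng):
    n_hidden = HIDDEN_PER_NET * n_alvinn
    return {
        "M1": make_layer(N_INPUT, n_hidden, rng),   # placeholder for loaded A1 weights
        "M2": make_layer(n_hidden, n_hidden, rng),
        "M3": make_layer(n_hidden, N_OUTPUT, rng),
    }


def maniac_forward(net, image):
    """Propagate a flattened 30x32 image through M1, M2 and M3."""
    first_hidden = forward_layer(net["M1"], image)
    second_hidden = forward_layer(net["M2"], first_hidden)
    return forward_layer(net["M3"], second_hidden)


if __name__ == "__main__":
    net = build_maniac(2, random.Random(0))
    output = maniac_forward(net, [0.0] * N_INPUT)
    print(len(output), "output activations")
```

For two ALVINN networks this gives 8 units in each hidden layer, matching the counts stated in the text.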

The topology of the input retina and M1 connections of the MANIAC system is identical to that of the A1 connection topology of an ALVINN network. See Figure 1. This allows us to incorporate an entire MANIAC system into one compact network because the A1 connection weights can be directly loaded onto the M1 connections for a particular set of first layer hidden units of the MANIAC network. Simulating the entire MANIAC system, then, does not entail data transfer from ALVINN hidden units to MANIAC input units, but only a basic forward propagation through the network.

It is the A1 connection weights of the ALVINN network that extract vital features from the input image for accurate driving. So in addition to allowing easy implementation of the MANIAC network, the network topology of the M1 connections allows us to capture important weight information in the MANIAC system that the ALVINN hidden units have learned. These features can be interpreted graphically in two dimensional views of the A1 connection weight values. Typically, a network trained for one lane roads learns a matched filter that looks for the road body, while a network trained on multi-lane roads is sensitive to painted lines and shoulders.

2.2. Training the MANIAC Network

To train the MANIAC network, stored image/steering direction pairs from ALVINN training runs are collated into a large training sequence. These pairs consist of a preprocessed 30x32 image which has been shifted and rotated to create multiple views of the original image, along with the appropriate steering direction as derived by monitoring the human driver during ALVINN training. See [Pomerleau, 92] for an in-depth discussion of the image preprocessing and transformation techniques. After collation, the sequence of pairs is randomly permuted so that all exemplars of a particular road type are not seen consecutively. The current size of this training sequence for a two ALVINN MANIAC network is 600. If additional ALVINN networks are used, 300 images per new ALVINN network are added to the training sequence. This sequence is stored for use in our neural network simulator.
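The collation and permutation step can be sketched as follows. This is an assumed data layout for illustration only: the function name `build_training_sequence`, the dictionary keyed by road type, and the fixed random seed are hypothetical, and the stored pairs are represented by placeholders rather than real images.

```python
import random


def build_training_sequence(pairs_by_road_type, per_network=300, seed=0):
    """Collate stored (image, steering) pairs and randomly permute them.

    pairs_by_road_type maps a road type name to the list of pairs stored
    from the corresponding ALVINN training run.
    """
    rng = random.Random(seed)
    sequence = []
    for road_type, pairs in pairs_by_road_type.items():
        for image, steering in pairs[:per_network]:
            sequence.append((road_type, image, steering))
    # Permute so that exemplars of one road type are not seen consecutively.
    rng.shuffle(sequence)
    return sequence


if __name__ == "__main__":
    stored = {
        "one_lane_path": [("image-%d" % k, 0.0) for k in range(300)],
        "two_lane_road": [("image-%d" % k, 0.0) for k in range(300)],
    }
    sequence = build_training_sequence(stored)
    print(len(sequence), "training pairs")
```

With two road types and 300 pairs each, the sequence has the 600 entries mentioned above; adding a third road type would bring it to 900.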

Next, weights on each of the connections in the MANIAC network must be initialized. Because the MANIAC M1 connections consist of precomputed ALVINN A1 connection weights, they must be loaded from stored weight files. After this is done, the M2 and M3 connection weights in the MANIAC network are randomized. This weight set is then ready for use as the initial starting point for learning.

To do the actual training, the network architecture, along with the weight set created as discussed in the previous paragraph and the stored training sequence, is loaded into our neural network simulator. Because the MANIAC M1 connection weights are actually the pretrained ALVINN weights which serve as feature detectors, the M1 connections are frozen so that no modification during training can occur to them. See Figure 2.

Initially, training is done using small learning and momentum rates. These values are used for 10 epochs. At this point they are increased (approximately doubled) for the remainder of training. This technique seems to prevent the network from getting stuck in local minima and is an adaptation of a technique used in ALVINN training.
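The two training details described in this section, frozen M1 connections and a rate schedule that steps up after 10 epochs, can be sketched in a few lines. The numeric rates below are hypothetical (the text gives no values), the doubling factor stands in for "approximately doubled", and the function names are illustrative; the update rule is the standard gradient descent with momentum form, assumed here rather than taken from the simulator.

```python
def training_rates(epoch, base_lr=0.01, base_momentum=0.4,
                   warmup_epochs=10, factor=2.0):
    """Small rates for the first epochs, then roughly doubled (values hypothetical)."""
    if epoch < warmup_epochs:
        return base_lr, base_momentum
    return base_lr * factor, base_momentum * factor


def momentum_step(weights, grads, velocity, lr, momentum, frozen=False):
    """One gradient descent step with momentum; frozen layers are left unchanged."""
    if frozen:
        return list(weights), list(velocity)
    new_velocity = [momentum * v - lr * g for v, g in zip(velocity, grads)]
    new_weights = [w + v for w, v in zip(weights, new_velocity)]
    return new_weights, new_velocity


if __name__ == "__main__":
    for epoch in (0, 9, 10, 59):
        print(epoch, training_rates(epoch))
    # M1 would be updated with frozen=True, M2 and M3 with frozen=False.
    print(momentum_step([1.0], [0.5], [0.0], lr=0.1, momentum=0.0))
    print(momentum_step([1.0], [0.5], [0.0], lr=0.1, momentum=0.0, frozen=True))
```

Passing `frozen=True` for the M1 weights keeps the pretrained feature detectors intact while M2 and M3 are trained.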

The back-propagation learning algorithm is used to train the network. The stored images are placed on the input units of the MANIAC network while a Gaussian peak of activation is centered at the correct steering direction on the 50 output units of the network. After about 60 epochs, the network has converged to an acceptable state and its weights are saved. This takes approximately 10 minutes on a Sun Sparcstation 2.
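A target vector of this kind can be built as in the sketch below. The peak width and the activation range from 0 to 1 are assumptions made for illustration; the text does not state the values used, and the function name `gaussian_target` is hypothetical.

```python
import math


def gaussian_target(peak_position, n_units=50, width=2.0):
    """Target activations with a Gaussian bump centered at peak_position.

    peak_position is measured in output units and may be fractional;
    width is the standard deviation of the bump in output units (assumed).
    """
    return [math.exp(-((k - peak_position) ** 2) / (2.0 * width ** 2))
            for k in range(n_units)]


if __name__ == "__main__":
    target = gaussian_target(20.0)
    peak_unit = max(range(len(target)), key=lambda k: target[k])
    print("peak at unit", peak_unit)
```

Spreading the target over neighbouring units, rather than activating a single unit, means that nearby steering directions produce similar target vectors.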

It  should  be  noted  that  MANIAC  uses  the 
same  output  vector  representation  as  ALVINN. 
This  allows  the  output  of  the  MANIAC  net¬ 
work  to  easily  be  compared  with  that  of 
ALVINN  for  quantitative  study  and  also 
allows  for  the  use  of  existing  software  in  the 
MANIAC-vehicle  interface. 

2.3.  Simulating  the  MANIAC  network 

Once the network has been trained, we use it in our existing neural network road following system to produce output steering directions at approximately 10 Hz.


3.  Results 

Empirical results of a MANIAC system composed of two ALVINN networks have been encouraging. For this system, one ALVINN network was trained to drive the vehicle on a one lane path while the other learned to drive on a two lane, lined, city street. The resultant MANIAC network was able to drive on both of these road types satisfactorily.

To determine more quantitative results, image/steering direction pairs from the same two road types as well as from a four lane, lined, city street were captured. See Figure 3. Using these stored images, ALVINN networks were trained in the lab to drive on the one lane paved path and the two lane, lined city street. Also, a MANIAC network integrating the same two ALVINN networks was trained. The results of these experiments are summarized in Table 1. In Table 1 the columns represent the average error per test image for a particular road type and the rows represent the type of network that is being used. The errors computed are of two types, SSD error and Output Peak error. SSD error is the sum of squared differences error, while Output Peak error is the absolute distance between the position of the Gaussian peak in the desired output activation and the peak in the actual output activation. SSD error can be thought of as a measure of the network's ability to accurately reproduce the target vector, while Output Peak error is a measure of the ability of the network to produce the correct steering direction.
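The two error measures can be written down directly. In the sketch below the peak position is taken to be the index of the most active output unit, which is an assumption: the text does not say how the peak is located, and a sub-unit estimate would give fractional distances. The function names are illustrative.

```python
def ssd_error(target, output):
    """Sum of squared differences between target and actual activations."""
    return sum((t - o) ** 2 for t, o in zip(target, output))


def peak_index(vector):
    """Index of the most active unit (assumed definition of the peak position)."""
    return max(range(len(vector)), key=lambda k: vector[k])


def output_peak_error(target, output):
    """Absolute distance, in output units, between target and actual peaks."""
    return abs(peak_index(target) - peak_index(output))


if __name__ == "__main__":
    target = [0.0, 1.0, 0.0, 0.0]
    output = [0.0, 0.0, 0.5, 0.0]
    print(ssd_error(target, output), output_peak_error(target, output))
```

The small example shows why both measures are reported: an output can have a modest peak displacement while still differing substantially from the target vector, and the reverse is also possible.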

The initial comparison to notice in the table is that the ALVINN network trained for a particular road type always performs significantly better (> 50%) than the ALVINN network trained for the other road type when presented test images of the type of road on which it is trained. This is to be expected. Also notice that the single MANIAC network, which has been trained to respond properly to both road types, typically compares well to the correct ALVINN network (within 11% in all cases). As mentioned earlier, this amount of error is acceptable to properly drive the vehicle.
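The percentages quoted above can be checked against the entries of Table 1. The sketch below assumes that each percentage is the amount by which an error exceeds the error of the ALVINN network trained for that road type; the text does not state the reference explicitly, so this reading, like the names used in the code, is an assumption.

```python
# (SSD, Output Peak) errors copied from Table 1 for the two trained road types.
TABLE_1 = {
    "one_lane_alvinn": {"one_lane_path": (5.913, 2.045), "two_lane_road": (5.570, 2.228)},
    "two_lane_alvinn": {"one_lane_path": (11.360, 3.076), "two_lane_road": (3.621, 1.375)},
    "maniac": {"one_lane_path": (6.263, 2.167), "two_lane_road": (3.907, 1.532)},
}

# The ALVINN network trained for each road type.
MATCHING = {"one_lane_path": "one_lane_alvinn", "two_lane_road": "two_lane_alvinn"}


def percent_excess(value, reference):
    """Percentage by which value exceeds reference."""
    return 100.0 * (value - reference) / reference


def compare(network):
    """Excess error of a network over the matching ALVINN, per road type and metric."""
    result = {}
    for road, matching in MATCHING.items():
        reference = TABLE_1[matching][road]
        value = TABLE_1[network][road]
        result[road] = tuple(percent_excess(v, r) for v, r in zip(value, reference))
    return result


if __name__ == "__main__":
    for name in TABLE_1:
        rounded = {road: tuple(round(p, 1) for p in pair)
                   for road, pair in compare(name).items()}
        print(name, rounded)
```

Under this reading, the mismatched ALVINN errors exceed the matching ones by roughly 50% to 92%, and the MANIAC errors exceed them by roughly 6% to 11%.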

The case of the four lane road is unique in that neither of the ALVINN networks nor the MANIAC network saw a road of this type during training. In this case, the response of the one lane path ALVINN network is nearly identical to when it was presented a two lane, lined, city street. Because this type of network typically responds to the body of the road, and because the two and four lane roads are both significantly wider than the one lane path, i.e. have a larger body area, this response was expected. A more interesting response is that of the two lane road ALVINN network. It seems to respond better to the four lane road images than it does to the two lane road test images. A possible explanation of why this is occurring can be seen in Figure 3. The four lane road and the two lane road look almost identical. One slight difference, though, is that the contrast of the road/offroad boundary is slightly higher in the four lane road case than it is in the two lane road case. This difference could help the network localize the road better, and because we want the vehicle to drive in the left lane, close to the yellow line, the correct output is identical to the two lane road case.

Figure 3: Typical road input images. (Panels: Two Lane Road, Four Lane Road.)

The most interesting result, though, is that when presented with four lane road images, the MANIAC network actually performs better than either the one lane path ALVINN network or the two lane road ALVINN network. In both the prior cases, the MANIAC network performed slightly worse than the best ALVINN network for a particular road. This could imply that the MANIAC network is using information from both networks to create a reasonable steering direction at its output. This will be discussed more in the following section.

4.  Discussion 

A  central  idea  that  this  research  is  trying  to 
examine  is  that  of  improving  performance  and 
making  connectionist  systems  mote  robust  by 
using  multiple  networks  -  some  of  which 


One  Lane  Path 

Two  Lane  Road 

Four  Lane  Road 

SSD 

Output 

Peak 

SSD 

Output 

Peak 

SSD 

Output 

Peak 

One  Lane 
Path 

ALVINN 

5.913 

2.045 

5.570 

2.228 

5.469 

2.225 

Two  Lane 

Road 

ALVINN 

11.360 

3.076 

3.621 

1.375 

1.287 

0.823 

MANIAC 

6.263 

2.167 

3.907 

1.532 

1.243 

0.774 

Table  1:  The  average  output  error  of  ALVINN  and  MANIAC  systems  are  shown  for  a  variety  of  road  types  using  two 
different  metrics.  TTie  lower  the  value  in  the  table,  the  better  the  accuracy  of  the  network. 
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might  be  producing  incorrect  results.  In  our 
system  the  key  point  to  notice  is  that  although 
a  particular  ALVINN  netwotk  may  not  be  able 
to  drive  accurately  in  a  given  situation,  its  hid¬ 
den  units  still  detect  useful  features  in  the  input 
image.  For  example,  consider  an  ALVINN  net¬ 
work  that  was  trained  to  drive  on  a  two  lane, 
lined  road.  The  features  that  it  learns  are 
important  for  accurate  driving  are  the  lines  on 
the  road  and  the  road/non-road  division.  Now 
present  this  network  with  a  paved,  unlined 
bike  path.  The  ALVINN  network  will  respond 
in  its  output  vector  with  two  steering  direction 
peaks. The reason for this is that one of the
features  that  the  network  is  looking  for  in  the 
input  image  is  the  delineation  between  road 
and  non-road.  Because  this  occurs  at  two 
places  in  the  image  of  the  paved  bike  path,  the 
feature  detecting  hidden  units  produce  a 
response  which  indicates  that  the  road/non¬ 
road  edge  is  present  at  two  locations.  If  these 
hidden unit activations were allowed to propagate
to the output of the network, the characteristic
two peak response would appear.
Although  in  reality  this  is  the  incorrect 
response,  it  is  a  consistent  response  to  this 
input  stimulus.  A  similar  scenario  holds  for 
other ALVINN networks given input images of
road  types  for  which  they  haven’t  been  trained. 
Because the response of a particular ALVINN
network  is  consistent  when  presented  with 
similar  images,  the  MANIAC  network  can  use 
this  ‘extra’  data  to  produce  a  correct,  perhaps 
better,  steering  direction  than  a  single 
ALVINN  network.  It  is  possible  that  this  is 
what  is  happening  in  the  case  of  MANIAC 
driving  better  on  the  four  lane  road  than  either 
of  the  ALVINN  networks. 

5.  Future  Work 

There  are  many  directions  this  research  can 
take  but  perhaps  the  most  interesting  is  that  of 
developing  self-training  systems.  In  the  cur¬ 
rent  implementation  of  the  MANIAC  system, 
ALVINN  networks  must  be  trained  separately 
on their respective road types and then the
MANIAC  system  must  be  trained  using  stored 
exemplars  from  the  ALVINN  training  runs.  If 
a new ALVINN network is added to the system,
MANIAC must be retrained. It would be
desirable to have a system that, when given
initial  or  new  ALVINN  networks,  created  its 
own  training  exemplars  and  was  able  to  auto¬ 
matically  learn  the  correct  MANIAC  network 
weights.  Creating  training  exemplars  from 
existing network weights is essentially the network
inversion problem. Techniques such as
those developed by [Linden, 1989] may provide
clues as to how to do this one-to-many mapping
that can create an input exemplar from an
output target. It can be argued that this task is
extremely difficult, even impossible, due to the
high dimensionality of most networks, but perhaps
it is worth taking a hard look at implementing
some network inversion techniques
because of the benefits that can be obtained by
having self-training modular neural networks.

Another  area  in  which  modular  neural  systems 
such  as  MANIAC  may  be  useful  is  that  of 
incorporating  information  from  different 
sources.  An  example  of  this  idea  is  to  use 
MANIAC  as  a  framework  in  which  to  add 
sensing  modalities  other  than  video.  In  addi¬ 
tion  to  a  video  camera,  our  testbed  vehicle,  the 
Navlab  2,  is  equipped  with  an  infrared  camera 
and  two  laser  rangefinders.  If  these  devices  can 
be  used  as  input  to  ALVINN-like  systems 
which  produce  a  steering  angle  as  output,  it  is 
reasonable  to  assume  that  a  training  technique 
similar to the one used in the current video-only
MANIAC system will result in a network
which will be robust in all of the component
network domains. This could lead to highly
robust  autonomous  systems  which  could  oper¬ 
ate  in  a  variety  of  situations  in  which  current 
systems  fail.  Driving  with  the  same  system  in 
both  daylight  and  at  night  is  an  example.  In 
this  scenario  video  images  provide  sufficient 
information  to  drive  in  the  daytime  but  at  night 
sensors  such  as  infrared  cameras  would  be 
necessary.  The  infrared  cameras  need  not  go 
unused  in  the  day  though,  as  their  output 
would provide additional information to the
modular network.

In  addition  to  the  previous  areas  of  work,  there 
is  much  to  be  done  with  developing  systems 
which  can  allocate  their  resources  and  group 
relevant  features  together.  It  has  been  shown 
that modular neural networks can learn to allocate
their resources to match a given problem,
such as locating and identifying objects in an
input retina [Jacobs, 1990], while the cascade
correlation algorithm provides a way to produce
appropriately sized networks [Fahlman,
1990]. By using similar techniques in a
MANIAC-like system, the need to pretrain
ALVINN networks would be eliminated. It is
not clear, though, how new information would
be incorporated into this type of system once it
has been trained.


6.  Conclusion 

This  research  has  focused  on  developing  a 
modular neural system which can transparently
navigate different road types by incorporating
knowledge stored in pretrained
networks. Initial results from the autonomous
navigation domain are promising. Although
the system is simplistic, it provides a starting
point from which we can explore many different
areas of the connectionist paradigm such as
self-training modular networks and network
resource  allocation.  In  addition  to  these  areas, 
autonomous  navigation  tasks  such  as  multi¬ 
modal  perception  can  be  studied. 
Abstract

This  paper  summarizes  several  recent  results 
from  perceptual  experiments  that  have  poten¬ 
tial  relevance  for  image  understanding  methods 
in  vision-based  navigation. 

1  Introduction. 

Experiments with expert map users have provided insights
into strategies useful in automated systems [Heinrichs
et al., 1992]. More recent experiments are investigating
perceptual competence in extracting navigationally
salient information from realistic terrain. Results
are  helping  us  to  understand  the  relevance  of  human 
perception  to  the  development  of  machine  solutions  for 
localization  problems.  In  addition,  computational  anal¬ 
ysis  of  these  results  may  help  in  developing  better  navi¬ 
gational  assists  and  training  procedures. 

2  Perception  of  Distance  and  Slope. 

When  experts  solve  difficult  localization  problems,  they 
appear  to  describe  topographic  features  in  terms  of  dis¬ 
tinctive  properties  such  as  relative  size,  elevation,  slope, 
etc.  These  are  typically  specified  in  bipolar  qualitative 
terms  such  as  large  or  small,  narrow  or  wide,  steep  or 
shallow.  Comparison  among  features  is  common  with 
terms  such  as  larger,  broader,  and  steeper.  As  best  as 
we  can  determine,  metric  descriptions  in  terms  of  units 
of  distance  or  angle  of  slope  are  rarely  used,  despite  the 
fact  that  metric  information  is  readily  available  on  maps 
with  a  distance  scale  and  contour  lines  at  standard  in¬ 
tervals. 

The  psychophysical  literature  on  perception  of  spatial 
layout  suggests  that  at  least  in  the  case  of  distance  esti¬ 
mation,  judgments  can  be  quite  accurate.  Generally,  the 
results  of  studies  of  perceived  distance  are  represented  in 
the form of a power function relating judgments of perceived
distance to actual distance: judged distance =
K * (actual distance)^n, where K is a constant which depends
on the scale used and n is an exponent which
defines  the  form  of  the  function  as  decelerating,  linear, 
or  accelerating.  A  surprising  number  of  studies  yield 

'This  work  was  supported  by  National  Science  Foundation 
grant  IRI-9196146,  with  partial  funding  from  the  Defense  Ad¬ 
vanced  Research  Projects  Agency. 


results  with  exponents  very  close  to  unity,  indicating  a 
linear  function  with  a  high  degree  of  “size  constancy.” 
While  many  of  these  studies  were  conducted  in  small 
spaces  in  laboratory  environments  or  building  corridors, 
a  number  were  conducted  in  outdoor  natural  environ¬ 
ments.  For  example,  [Da  Silva,  1985]  obtained  functions 
with  exponents  ranging  on  the  average  from  0.87  to  0.98 
depending  on  the  particular  method  of  estimation  used. 
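
As an illustration of the power-function model above, the following minimal Python sketch estimates K and n by least-squares regression in log-log space. The function name `fit_power_function` and the sample data are hypothetical and are not taken from the experiments reported here; the returned r-squared is computed on the log-transformed values, which is an assumption about how "variance accounted for" might be measured.

```python
import math

def fit_power_function(actual, judged):
    """Fit judged = K * actual**n by least squares on log-log data.

    Returns (K, n, r_squared). All inputs must be positive.
    """
    xs = [math.log(a) for a in actual]
    ys = [math.log(j) for j in judged]
    m = len(xs)
    mean_x = sum(xs) / m
    mean_y = sum(ys) / m
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    syy = sum((y - mean_y) ** 2 for y in ys)
    n = sxy / sxx                      # exponent = slope in log-log space
    K = math.exp(mean_y - n * mean_x)  # scale constant = exp(intercept)
    r_squared = (sxy * sxy) / (sxx * syy) if syy > 0 else 1.0
    return K, n, r_squared

# Hypothetical, noise-free judgments generated with K = 1.2 and n = 0.9.
actual = [8.5, 20.0, 50.0, 120.0, 357.0]
judged = [1.2 * a ** 0.9 for a in actual]
K, n, r2 = fit_power_function(actual, judged)
```

An exponent n near 1 corresponds to the near-linear judgments reported in the literature, while n below 1 gives a decelerating function.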

Although  the  studies  conducted  in  such  outdoor  en¬ 
vironments  suggest  impressive  precision  in  perception  of 
distance,  the  environments  used  have  without  exception 
been  flat  homogeneous  spaces  such  as  grassy  fields,  ath¬ 
letic  fields,  or  open  water.  We  conducted  three  experi¬ 
ments  to  determine  if  these  results  generalized  to  situ¬ 
ations  more  likely  to  occur  in  actual  navigation  tasks. 
The  terrain  used  was  part  of  a  ski  slope  area  and  a  large 
nature  park  with  distances  ranging  from  8.5m  to  357 
m.  Our  principal  conclusion  is  that  variability  was  very 
large,  with  the  previously  accepted  power  function  model 
accounting  for  only  40%  to  60%  of  the  variance  in  dis¬ 
tance  judgments  in  some  conditions. 

Scatter diagrams of subjects’ judgments of distance
under two of the conditions used provide some idea of
such variability. Figure 1 indicates actual and estimated
lateral distances between two points, both at approximately
the same distance from the observer, under one of the
viewing conditions. Figure 2 indicates actual and estimated
radial distances between two points which lay along a
single line of sight.

An  additional  experiment  was  carried  out  in  flat  ho¬ 
mogeneous  terrain  and  yielded  results  similar  to  those 
reported  in  the  literature.  In  addition,  all  four  experi¬ 
ments  replicated  a  result  often  reported  in  the  literature, 
that  of  relative  underestimation  of  radial  distance  com¬ 
pared  to  lateral  distance.  This  is  often  attributed  to  the 
foreshortening  of  visual  projections  of  distance  along  the 
line-of-sight. 

As  in  the  case  of  research  on  perceived  distance,  there 
is  considerable  laboratory  research  on  the  perception  of 
slope.  However,  there  is  almost  no  research  on  the  per¬ 
ception  of  slope  in  natural  terrain.  A  common  observa¬ 
tion  of  the  laboratory  research  is  an  overestimation  of 
the  slant  of  surfaces  from  the  horizontal.  Of  51  such 
studies overestimation was reported in 31 and underestimation
in only one, with the others reporting approximately
veridical judgments. In the one report of perceived
slant in natural terrain a similar observation was
made [Smith and Smith, 1965]. Adults were asked to
estimate the slope of a road going up a hillside. The
hill, several miles away, faced the observers. All subjects
grossly overestimated the slope of the hill; the measured
slope was 3° but the modal estimate was 45°!

Figure 1: Estimated vs. actual lateral distances. (Legend: Subjects 1 to 6.)

Figure 2: Estimated vs. actual radial distances.

Figure 3: Actual and estimated lateral slopes.

Figure 4: Actual and estimated radial slopes.

We  carried  out  three  experiments  to  assess  the  percep¬ 
tion  of  slope  in  natural  terrain.  Subjects  were  asked  to 
judge  the  average  slope  of  the  ground  between  two  tar¬ 
get  locations.  The  slopes  were  on  hillsides  and  level  areas 
near  a  local  ski  area,  on  a  river  bank,  and  on  streets  in 
a  residential  environment.  The  particular  slopes  varied 
from horizontal to 50°. Included were hills rising in the
medial plane (e.g., lateral slopes in which the slope of the
hill cut across the line of sight) and hills rising in the fronto-parallel
plane (e.g., radial slopes which were slanting into
the line-of-sight).

Results  of  all  three  experiments  indicated  that  sub¬ 
jects  markedly  overestimated  the  slopes  with  amount  of 
overestimation varying up to 30°. Figure 3 shows the
best  linear  fit  between  actual  and  observed  slope  esti¬ 
mates  for  one  experimental  condition.  In  general  the 
amount  of  overestimation  was  less  for  slopes  close  to  hor¬ 
izontal.  Furthermore,  radial  slopes  tended  to  be  overes¬ 
timated  to  a  greater  degree  than  lateral  slopes.  There 
was  also  some  indication  that  there  was  greater  overesti¬ 
mation  of  radial  slopes  when  they  were  far  from  the  ob¬ 
server  than  when  they  were  close  to  the  observer.  Both 
the  greater  overestimation  of  radial  as  opposed  to  lateral 


slopes  and  the  greater  overestimation  of  more  distant  ra¬ 
dial  slopes  may  be  due  to  inaccuracy  in  radial  distance 
judgments.  If  the  underestimation  of  radial  distances 
is  greater  than  the  underestimation  of  lateral  distances, 
exactly this sort of foreshortening will occur. A similar
effect may also account for the overestimation of radial
slopes in laboratory studies where reduced viewing conditions
are typically used.

In  summary,  the  perception  of  metric  characteristics 
of  natural  terrain  is  quite  variable  and  subject  to  consid¬ 
erable  error.  Map  users  may  know  this  from  experience 
and  not  rely  on  strategies  employing  such  information. 
It  is  not  known  to  what  degree  systematic  training  could 
reduce  these  errors  and  provide  map  users  with  another 
useful  tool. 

3  Localization  Based  on  Visual  Angle. 

[Levitt et al., 1987] describe how the apparent order of
landmarks constrains the viewpoint to a half plane defined
by landmark-pair-boundaries. On a plane, non-linear
triads of landmarks constrain the viewpoint to
one of seven possible orientation regions. If, in addition,
quantitative information about the visual angles between
landmarks is available, two landmarks constrain the
viewpoint to a partial circle and, in most cases, three
landmarks uniquely determine the viewpoint. [Sutherland,
1992] extends the analysis by showing how uncertainty
in viewpoint determination is related to errors in
visual angle determination and the actual positions of
the  landmarks  used.  Uncertainty  in  viewpoint  increases 
with  uncertainty  in  visual  angle,  though  not  necessarily 
in  a  linear  manner.  For  a  given  accuracy  in  angle  deter¬ 
mination,  uncertainty  can  vary  significantly  for  different 
configurations of landmarks. For example, the uncertainty
associated with three landmarks on a line and the
actual  viewpoint  off  to  the  side  is  much  greater  than  if 
the  center  landmark  were  closer  to  the  viewpoint. 
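
The partial-circle constraint described above follows from the inscribed angle theorem: every viewpoint on a given circular arc through two landmarks sees them under the same visual angle. The Python sketch below illustrates this in two dimensions. The function names and coordinates are hypothetical and serve only as an illustration under that assumption.

```python
import math

def visual_angle(v, p, q):
    """Angle (radians) at viewpoint v between the rays from v to p and to q."""
    a1 = math.atan2(p[1] - v[1], p[0] - v[0])
    a2 = math.atan2(q[1] - v[1], q[0] - v[0])
    diff = abs(a1 - a2)
    return min(diff, 2 * math.pi - diff)

def constraint_circle_radius(p, q, angle):
    """Radius of the circle of viewpoints that see p and q under `angle`.

    Inscribed angle theorem: chord length = 2 * radius * sin(angle).
    """
    return math.dist(p, q) / (2 * math.sin(angle))

# Two hypothetical landmarks and a viewpoint that sees them 90 degrees apart.
A, B = (-1.0, 0.0), (1.0, 0.0)
theta = visual_angle((0.0, 1.0), A, B)
radius = constraint_circle_radius(A, B, theta)

# A different viewpoint on the same circle sees A and B under the same angle,
# so one visual angle alone cannot distinguish the two positions.
theta_other = visual_angle((0.6, 0.8), A, B)
```

A second visual angle to a third landmark yields a second circle, and the viewpoint lies where the two circles intersect.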

The  only  study  in  the  literature  investigating  human 
localization  of  viewpoint  based  on  ordering  of  landmarks 
and  something  equivalent  to  landmark-pair-boundaries 
is  that  of  [Peruch  and  Pailhous,  1986].  Subjects  were 
given  a  map  with  22  distinctive  object  locations  iden¬ 
tified.  They  were  then  given  a  series  of  trials  in  which 
they  were  shown  a  left-right  ordering  of  a  subset  of  the 
map  objects,  as  if  from  an  eye  level  view  of  the  space 
of  the  map.  The  ordering  of  the  objects  was  equivalent 
to the order that would be seen if one were at a particular
location on the map and the distance between the
different pairs of objects corresponded to the visual angles
that would be subtended by them from that location
on the map. The task was to find in each case this
viewpoint  on  the  map.  Subjects  found  correct  solutions 
in  about  90%  of  the  trials.  Their  behavior  and  com¬ 
ments  indicated  that  they  were  using  information  about 
which  landmarks  were  in  view  in  the  ordering  and  about 
at  least  large  differences  in  visual  angle  subtended  by 
pairs  of  object  locations.  The  study  demonstrates  that, 
in  principle,  human  subjects  can  operate  on  the  basis 
of  such  qualitative  navigation  techniques.  However,  the 
large  number  of  target  locations  available  on  any  given 
trial  makes  possible  localization  on  the  basis  of  intersec¬ 
tion  of  landmark-pair-boundaries  without  necessary  at¬ 
tention  to  visual  angle.  In  addition,  performance  on  this 
paper-and-pencil  assessment  doesn’t  necessarily  indicate 
what  people  might  do  in  estimating  outdoor  locations. 

An  initial  attempt  was  made  to  assess  whether  people 
were  sensitive  to  the  information  provided  by  the  visual 
angle  separation  of  landmarks  and  whether  their  accu¬ 
racy  would  be  subject  to  the  configuration  constraints 
identified  by  [Sutherland,  1992].  Subjects  were  asked  to 
locate  on  a  map  their  own  viewpoint,  in  relation  to  three 
environmental  landmarks.  The  landmarks  were  build¬ 
ings  in  a  metropolitan  skyline,  viewed  from  a  vantage 
point  a  mile  or  more  across  a  river.  The  river  and  in¬ 
tervening  urban  clutter  prevented  accurate  perception  of 
even  the  relative  distance  of  the  individual  landmarks, 
leaving  visual  angle  as  potentially  the  best  information 
for  localization.  The  maps  were  blank  sheets  of  paper  ex¬ 
cept  for  marks  indicating  the  location  of  the  three  land¬ 
marks.  The  three  buildings  on  the  skyline  were  identified 
to  a  subject  and  then  they  were  given  a  map  with  three 
marks  corresponding  to  the  buildings.  The  subject’s  task 
was  to  mark  their  own  position  on  the  map.  In  half  of 
the  configurations  the  landmarks  were  colinear.  In  the 
other  half,  one  of  the  landmarks  was  displaced  toward  or 
away  from  the  observer  to  see  if  this  increased  the  preci¬ 
sion  of  localization  as  might  be  implied  by  Sutherland’s 
analysis.  There  were  ten  sets  of  the  three  landmarks  of 
each  type. 


Results  indicated,  first  of  all,  that  people  are  rather 
poor  at  using  visual  angle  information  in  this  way.  The 
average  error  in  estimating  one’s  position,  relative  to  the 
distance  to  the  configuration  of  landmarks,  ranged  from 
30%  for  a  configuration  whose  distance  was  on  the  av¬ 
erage  1,700m  to  65%  for  a  configuration  approximately 
1,100m  away.  Thus  observers  were  making  errors  of  the 
order of 500m to 700m. For a sample of visual angles
across the various configurations ranging from 14°
to 70°, the visual angles associated with the estimated
viewpoints  differed  from  the  actual  angles  in  the  scene 
by an average of almost 30%. Either the observers were
not  able  to  accurately  make  use  of  the  visual  angle  in¬ 
formation  or  they  had  significant  difficulty  in  estimating 
the  angles  with  precision. 
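
The step from relative to absolute error in the paragraph above is simple arithmetic. A quick check in Python, using only the figures reported in the text (the variable names are illustrative):

```python
# (relative error, approximate distance to the landmark configuration in metres)
reported = [(0.30, 1700.0), (0.65, 1100.0)]

# Absolute localization error = relative error * distance to the configuration.
absolute_errors_m = [rel * dist for rel, dist in reported]
rounded_m = [round(e) for e in absolute_errors_m]
print(rounded_m)
```

The two products, roughly 510 m and 715 m, are consistent with the stated range of about 500m to 700m.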

Nevertheless,  there  was  some  indication  that  subjects 
could  use  the  visual  angle  information.  In  the  first  place, 
there  was  a  high  correlation  between  the  actual  visual  an¬ 
gles  and  the  angles  associated  with  the  estimated  view¬ 
points.  Secondly,  in  spite  of  the  large  variability  in  per¬ 
formance,  for  eight  of  the  ten  sets  of  landmarks  the  av¬ 
erage  error  was  greater  for  the  configurations  with  the 
colinear  landmarks  than  for  the  corresponding  sets  with 
the  medial  landmark  non-colinear.  Performance  of  occa¬ 
sional  individual  subjects  is  very  congruent  with  visual 
angle  information.  It  is  also  interesting  to  note  that  in 
no  case  were  landmark-pair-boundaries  violated.  Thus 
while  performance  is  quite  variable  across  subjects  there 
are  individual  subjects  who  perform  at  a  very  high  level 
and  on  the  average  subjects’  performance  does  seem  to 
reflect  use  of  visual  angle.  Training  of  human  attention 
to  visual  angle  information  for  position  may  be  feasible 
and  is  worth  trying. 
Abstract

Robot  navigation  is  often  based  on  determining  the 
relative  position  of  visible  landmarks  in  the  envi¬ 
ronment  and  establishing  a  match  between  those 
landmarks  and  features  on  a  map  to  determine 
map  position.  Exact  measurements  are  seldom 
known,  and  combinations  of  approximate  measures 
can  lead  to  large  errors  in  self-localization.  The 
conventional  approach  to  this  problem  has  been  to 
deal  with  the  errors  after  they  occur.  We  show 
how  a  careful  choice  of  landmark  configurations  on 
which  to  base  localization  will  lead  to  significant 
improvement  in  robot  performance. 

1  Introduction 

A  robot,  using  a  map  to  navigate,  is  faced  with  a  double 
challenge:  it  must  be  able  to  determine  its  own  posi¬ 
tion  on  the  map  at  any  given  time  as  well  as  stay  close 
to  a  specified  path  as  it  moves.  This  is  usually  done 
by  establishing  a  match  between  landmarks  in  the  en¬ 
vironment  and  map  features  and  then  using  geometric 
relationships  between  the  navigator  and  the  landmarks 
to  determine  map  position.  Errors  in  position  determi¬ 
nation  are  a  function  of  errors  in  the  measurement  of 
landmark  properties  and  the  particular  method  used  to 
estimate  the  viewpoint.  To  the  extent  that  such  errors 
have  been  considered  at  all,  it  has  usually  been  in  a  con¬ 
text  of  developing  ways  to  estimate  uncertainty  regions 
associated  with  particular  viewpoints.  Our  interest  is  in 
selecting  landmarks  in  such  a  way  that  the  errors  which 
can occur are minimized, rather than having to accommodate
uncertainty regions after the fact. This is particularly
important when sensory processing is expensive
in  time  or  other  resources,  making  it  desirable  to  utilize 
only  those  landmarks  which  are  most  useful. 

An  unstructured  outdoor  environment  complicates  the 
localization  process  significantly.  Actual  distances  to 
landmarks  are  often  difficult  or  impossible  to  estimate, 
forcing  solutions  that  depend  heavily  on  visual  bearing. 
Absolute  bearings,  registered  to  a  map,  are  frequently 
unavailable.  In  this  case,  triangulation,  which  requires 
absolute  bearings  to  two  or  more  landmarks,  cannot  be 

'This  work  was  supported  by  National  Science  Foundation 
grant  IRI-9196146,  with  partial  funding  from  the  Defense  Ad¬ 
vanced  Research  Projects  Agency. 


used.  In  this  paper,  we  therefore  utilize  more  generally 
applicable  methods  based  only  on  relative  angular  mea¬ 
surements  between  landmarks. 

We  define  the  visual  angle  from  a  navigator  to  two 
point  features  as  the  angle  formed  by  the  rays  from  the 
navigator  to  each  feature.  A  perfect  estimate  of  visual 
angle between two points in three-dimensional space constrains
viewpoint to a surface of revolution somewhat resembling
a torus [Levitt et al., 1987]. When the points
can  be  ordered  with  respect  to  viewpoint  position,  the 
viewpoint  is  restricted  to  half  the  surface.  It  follows  that, 
for  a  navigator  traveling  on  terrain,  exact  knowledge  of 
the  visual  angles  between  three  points  constrains  view¬ 
point  to  the  intersection  of  three  surfaces  and  the  terrain. 
In most cases, this intersection is a single point [Sutherland
and Thompson, 1993]. For an analysis of error-free
localization when a two-dimensional approximation
of the environment is assumed, see [Levitt et al., 1987,
Sugihara,  1988,  Krotkov,  1989,  Sutherland,  1992]. 

We  are  making  the  assumption  in  this  paper  that  land¬ 
marks  are  point  features  and  can  be  ordered.  Mountain 
peaks  are  one  example  of  such  features,  though  the  re¬ 
sults  hold  for  any  point  landmarks  or  beacons.  We  are 
also  assuming  that  the  navigator  is  traveling  on  terrain 
(as  opposed  to  being  in  space),  and  that  perfect  mea¬ 
surement  of  visual  angles  to  three  landmarks  will  provide 
exact  localization.  When  visual  angle  measure  is  not  ex¬ 
act  but  within  a  given  range,  location  is  constrained  to 
an area on the terrain. We define the area of uncertainty
to  be  the  area  in  which  the  navigator  may  self-locate  for 
any  given  error  range  in  visual  angle  measure.  We  have 
shown  that  a  wise  choice  of  landmarks  will  lead  to  a  de¬ 
crease  in  the  resulting  area  of  uncertainty  and  significant 
improvement  in  localization. 


2 Choosing Good Landmarks

When a two-dimensional approximation of the environment
is assumed, any given error bound on visual
angle estimate will constrain viewpoint to a thickened
ring, the thickness of the ring determined by the
amount of error [Levitt et al., 1987, Levitt et al., 1988,
Krotkov, 1989]. When three landmarks are used, any
given error in estimate constrains viewpoint to the intersection
of two such rings* [Kuipers and Levitt, 1988,
Sutherland, 1992]. In Figure 1a, the visual angles from
the observer V to AB and BC are both 45°. The dark
lines surround the area of uncertainty which represents
an error bound of ±13.5° or ±30% in both visual angles.
Although the landmarks will not always be in a straight
line, the visual angles will not always be identical and
the navigator will not always make the same error in estimate
of each angle, the resulting area of uncertainty
will always equal the intersection of these two rings.

* A third ring passing through the two landmarks lying at
greatest distance from each other can be computed, but it


Figure  1:  In  a  two-dimensional  approximation  of  the  en¬ 
vironment,  error  in  visual  angle  estimate  to  two  points 
constrains  viewpoint  V  to  a  thickened  ring.  When  three 
points  are  used,  viewpoint  is  constrained  to  (a)  the  in¬ 
tersection  of  two  such  rings.  In  a  three-dimensional  en¬ 
vironment  (b),  landmark  elevation  affects  the  size  of  this 
intersection. 
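
The two-dimensional area of uncertainty can be approximated numerically. The Python sketch below samples a grid of candidate viewpoints and keeps those whose visual angles to AB and BC both fall within the error bound and from which the landmarks appear in the expected left-to-right order. The landmark coordinates mimic the collinear, 45° configuration of Figure 1a but are otherwise hypothetical, as are the function names, the grid bounds, and the step size. This is a brute-force illustration, not the method used in the paper.

```python
import math

def visual_angle(v, p, q):
    """Angle (radians) at viewpoint v between the rays from v to p and to q."""
    a1 = math.atan2(p[1] - v[1], p[0] - v[0])
    a2 = math.atan2(q[1] - v[1], q[0] - v[0])
    diff = abs(a1 - a2)
    return min(diff, 2 * math.pi - diff)

def sees_in_order(v, landmarks):
    """True if the landmarks appear left to right, in the listed order, from v."""
    for p, q in zip(landmarks, landmarks[1:]):
        cross = (p[0] - v[0]) * (q[1] - v[1]) - (p[1] - v[1]) * (q[0] - v[0])
        if cross >= 0:  # q is not clockwise (to the right) of p
            return False
    return True

def area_of_uncertainty(landmarks, measured, rel_error, bounds, step):
    """Approximate area where both visual angles lie within +/- rel_error."""
    a, b, c = landmarks
    xmin, xmax, ymin, ymax = bounds
    nx = int(round((xmax - xmin) / step))
    ny = int(round((ymax - ymin) / step))
    cells = 0
    for i in range(nx):
        for j in range(ny):
            v = (xmin + (i + 0.5) * step, ymin + (j + 0.5) * step)
            if not sees_in_order(v, landmarks):
                continue
            angles = (visual_angle(v, a, b), visual_angle(v, b, c))
            if all(abs(ang - m) <= rel_error * m
                   for ang, m in zip(angles, measured)):
                cells += 1
    return cells * step * step

# Collinear landmarks A, B, C; from the origin both visual angles are 45 degrees.
landmarks = [(-1.0, 1.0), (0.0, 1.0), (1.0, 1.0)]
measured = (math.radians(45.0), math.radians(45.0))
bounds = (-3.0, 3.0, -3.0, 1.0)
area_30 = area_of_uncertainty(landmarks, measured, 0.30, bounds, 0.05)
area_10 = area_of_uncertainty(landmarks, measured, 0.10, bounds, 0.05)
```

Tightening the error bound from ±30% to ±10% shrinks the estimated area, and moving the landmark coordinates lets different configurations be compared under the same error bound.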

In  a  three-dimensional  environment,  landmarks  may 
differ  in  elevation  from  each  other  as  well  as  from  the 
navigator.  This  difference  will  affect  the  size  and  shape 
of the area of uncertainty. Figure 1b shows an example
of  the  difference  that  elevated  landmarks  can  make  in  the 
area  of  uncertainty.  The  visual  angles  to  AB  and  BC  are 
both  45°.  The  smaller  area  on  the  plane  is  the  area  of 
uncertainty  for  planar  angles  of  45°  and  error  bound  of 
±10°  or  ±22%  if  the  landmark  points  were  at  the  same 
elevation  as  the  viewpoint.  The  larger  area  is  the  actual 
area  of  uncertainty  for  this  configuration  given  the  same 
error  bound. 

In  order  to  make  the  best  use  of  available  informa¬ 
tion,  the  successful  navigator  must  choose  landmarks 
which  will  give  the  least  localization  error  regardless 
of  amount  of  error  in  visual  angle  measure.  The  area 
of  uncertainty  corresponding  to  a  given  visual  angle 
and  error  in  that  visual  angle  varies  greatly  for  dif¬ 
ferent  configurations  of  landmarks  [Sutherland,  1992, 
Sutherland and Thompson, 1993]. We claim that knowledge
of landmark map location and landmark viewing
order,  together  with  knowing  that  viewpoint  is  located 
somewhere  on  the  map,  is  sufficient  for  choosing  good 
configurations. 

It  has  been  shown  [Levitt  et  al.,  1987,  Levitt  et  al., 
1988,  Kuipers  and  Levitt,  1988]  that  the  lines  joining 
pairs of landmarks divide navigation space into distinguishable
areas (orientation regions). Levitt et al. called
these  lines  LPB’s  (linear  pair  boundaries).  When  the 
robot  passes  from  one  orientation  region  to  another, 

does  not  affect  area  size. 


landmark  order  changes.  Since  the  computation  of  the 
area  of  uncertainty  is  dependent  on  landmark  order,  the 
area  will  always  be  bounded  by  the  orientation  region 
formed  by  those  three  landmarks  used  for  localization. 
Our  algorithm  begins  by  picking  the  triple  of  landmarks 
which  produce  the  smallest  orientation  region  on  the 
map.  (See  Figure  2.) 
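
Orientation regions can be identified computationally by recording on which side of each landmark-pair line a viewpoint lies. The Python sketch below does this for one triple of non-collinear landmarks in the plane. The function names, coordinates, and sampling grid are hypothetical. For three such landmarks, seven of the eight possible sign combinations occur, one per region.

```python
def side(p, q, v):
    """+1 if v is left of the directed line p->q, -1 if right, 0 if on it."""
    cross = (q[0] - p[0]) * (v[1] - p[1]) - (q[1] - p[1]) * (v[0] - p[0])
    return (cross > 0) - (cross < 0)

def orientation_region(v, a, b, c):
    """Signature of the orientation region containing viewpoint v."""
    return (side(a, b, v), side(b, c, v), side(c, a, v))

# Three hypothetical landmarks listed counter-clockwise.
A, B, C = (0.0, 0.0), (4.0, 0.0), (1.0, 3.0)

# Sample viewpoints on a grid and collect the distinct region signatures,
# skipping any sample that falls exactly on a boundary line.
signatures = set()
for i in range(40):
    for j in range(40):
        v = (-10.0 + 0.37 + 0.5 * i, -10.0 + 0.11 + 0.5 * j)
        sig = orientation_region(v, A, B, C)
        if 0 not in sig:
            signatures.add(sig)
```

Crossing a line joining two landmarks flips one component of the signature, which corresponds to the change in landmark order described above.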

We  have  previously  shown  [Sutherland,  1992]  that  the 
closer a configuration is to a single circle (i.e., all landmarks
and viewpoint on one circle), the greater the error
in localization. An ongoing rule of thumb is to avoid
anything  near  a  single  circle  configuration.  This  can  be 
done  by  avoiding  any  configuration  which  results  in  a  cir¬ 
cle  which  passes  through  all  three  landmarks  also  passing 
through  the  orientation  region  in  which  the  viewpoint  is 
required to lie. If all triples produce the same orientation
region (e.g. all landmarks lie on a straight line), the
most  widely  spaced  landmarks  should  be  chosen. 
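
One simple way to flag near single-circle configurations is to measure how far the estimated viewpoint lies from the circle through the three landmarks. The Python sketch below computes that distance relative to the circle's radius. The function names and the use of this margin as a proxy are assumptions made for illustration; the rule of thumb in the text is stated in terms of the circle passing through the orientation region, which this sketch does not test directly.

```python
import math

def circumcircle(a, b, c):
    """Center and radius of the circle through three 2-D points.

    Returns None when the points are (numerically) collinear.
    """
    d = 2.0 * (a[0] * (b[1] - c[1]) + b[0] * (c[1] - a[1]) + c[0] * (a[1] - b[1]))
    if abs(d) < 1e-12:
        return None
    sa, sb, sc = (p[0] ** 2 + p[1] ** 2 for p in (a, b, c))
    ux = (sa * (b[1] - c[1]) + sb * (c[1] - a[1]) + sc * (a[1] - b[1])) / d
    uy = (sa * (c[0] - b[0]) + sb * (a[0] - c[0]) + sc * (b[0] - a[0])) / d
    center = (ux, uy)
    return center, math.dist(center, a)

def single_circle_margin(a, b, c, v):
    """Distance from v to the landmark circle, as a fraction of its radius.

    Values near 0 indicate a near single-circle configuration; collinear
    landmarks have no finite circle, so the margin is infinite.
    """
    circle = circumcircle(a, b, c)
    if circle is None:
        return math.inf
    center, radius = circle
    return abs(math.dist(center, v) - radius) / radius

# Hypothetical landmarks on the unit circle.
A, B, C = (1.0, 0.0), (0.0, 1.0), (-1.0, 0.0)
on_circle = single_circle_margin(A, B, C, (0.0, -1.0))
off_circle = single_circle_margin(A, B, C, (0.0, -3.0))
collinear = single_circle_margin((-1.0, 0.0), (0.0, 0.0), (1.0, 0.0), (0.0, -3.0))
```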


Figure  2:  Lines  joining  the  landmark  points  divide  space 
into  orientation  regions  such  as  the  shaded  area  in  the 
foreground. 

Incorporating these constraints, we have developed a
“goodness” function to weight configurations. It uses
the locations of landmarks A, B, C and estimated viewpoint
V0. The larger the function value, the better the
configuration. Although V0 is not necessarily the true
viewpoint, our experiments have shown that this function
discriminates in such a way that the best configuration
to be used for localization can be determined using
this estimate. Let A = (Ax, Ay, Az), B = (Bx, By, Bz),
C = (Cx, Cy, Cz), V = (Vx, Vy, Vz) be the projections
of the landmark points and V0 on a horizontal plane.
Let I be the point of intersection of the line through V and
B with the circle through A, C, and V, L be the point of
intersection of the line through A and C with the line
through V and B, and d(p,q) be the distance between any
two points p and q. (See Figure 3.)

Then:

$$G(A, B, C, V_0) = h + f$$

where

$$h = \left(k \left| \frac{A_z + B_z + C_z}{3} - V_z \right| + 1\right)^{-1}$$

$$f = \begin{cases} [\ldots] & \text{if } d(V,B) > d(V,I) \\ [\ldots] & \text{if } d(V,L) < d(V,B) < d(V,I) \\ [\ldots] & \text{if } d(V,B) < d(V,L) \end{cases}$$

Figure 3: Simple geometric relations can be used to rank
landmark configurations.


The function consists of two parts. The h function
weighs the elevation of the landmarks compared to the
elevation at point V_0. It is non-negative and attains its
maximum of 1 when the average elevation of the landmarks
is equal to the elevation at V_0. The constant k
was set to 0.005 in the experiments we describe here.
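As a concrete illustration of the elevation term, the following Python sketch evaluates h under the assumption that it has the reciprocal form 1 / (k |mean landmark elevation - V_z| + 1), which agrees with the properties stated above (non-negative, maximum of 1 when the elevations match). The function name and the sample elevations are hypothetical.

```python
def elevation_term(landmark_elevations, viewpoint_elevation, k=0.005):
    """Assumed form of h: 1 / (k * |mean landmark elevation - viewpoint elevation| + 1)."""
    mean_elevation = sum(landmark_elevations) / len(landmark_elevations)
    return 1.0 / (k * abs(mean_elevation - viewpoint_elevation) + 1.0)


# Landmarks whose average elevation equals the viewpoint elevation give the maximum.
print(elevation_term([2000.0, 2100.0, 2200.0], 2100.0))
# Landmarks averaging 1000 m above the viewpoint are weighted much lower.
print(elevation_term([3000.0, 3100.0, 3200.0], 2100.0))
```

With k = 0.005, an average elevation difference of 200 m already halves the term, so the constant sets how quickly configurations at a different elevation are penalized.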

The f function, also non-negative and defined piecewise,
has the major effect on the goodness measure. It
is based on the size of the area of uncertainty for the
projected points. Note in Figure 3 that as B approaches
the circle, the measure approaches zero. If B lies on line
AC, the measure is the ratio [illegible]. The function
increases in value as B is pulled away from the circle and
estimated viewpoint. The factor of [illegible] in the first
piece of f causes this increase to occur at a rate such that
when B is pulled back to the point that the area of uncertainty
is the same size as for a straight line configuration, the
function value is the same. The function also increases in
value as B moves nearer the viewpoint.
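The geometric quantities that the piecewise definition compares can be computed directly from the projected points. The Python sketch below finds the circle through A, C, and V, the point I where the line from V through B meets that circle again, and the point L where that line meets line AC, then reports which of the three cases applies. It does not evaluate f itself. All function names and the case labels are hypothetical, and distances are taken as unsigned, which assumes I and L lie on the same side of V as B.

```python
import math


def cross(u, w):
    """z-component of the 2D cross product."""
    return u[0] * w[1] - u[1] * w[0]


def circumcenter(p, q, r):
    """Center of the circle through three non-collinear 2D points."""
    (ax, ay), (bx, by), (cx, cy) = p, q, r
    d = 2.0 * (ax * (by - cy) + bx * (cy - ay) + cx * (ay - by))
    if d == 0.0:
        raise ValueError("points are collinear; no circle exists")
    a2, b2, c2 = ax * ax + ay * ay, bx * bx + by * by, cx * cx + cy * cy
    ox = (a2 * (by - cy) + b2 * (cy - ay) + c2 * (ay - by)) / d
    oy = (a2 * (cx - bx) + b2 * (ax - cx) + c2 * (bx - ax)) / d
    return ox, oy


def configuration_case(A, B, C, V):
    """Return (case label, d(V,B), d(V,I), d(V,L)) for projected 2D points."""
    d_vb = math.dist(V, B)
    u = ((B[0] - V[0]) / d_vb, (B[1] - V[1]) / d_vb)  # unit vector from V toward B
    O = circumcenter(A, C, V)
    # V is on the circle, so the other intersection along u is at t = 2 u.(O - V).
    t_i = 2.0 * (u[0] * (O[0] - V[0]) + u[1] * (O[1] - V[1]))
    w = (C[0] - A[0], C[1] - A[1])
    denom = cross(u, w)
    if denom == 0.0:
        raise ValueError("line VB is parallel to line AC")
    t_l = cross((A[0] - V[0], A[1] - V[1]), w) / denom
    d_vi, d_vl = abs(t_i), abs(t_l)
    if d_vb > d_vi:
        case = "beyond_circle"
    elif d_vb >= d_vl:
        case = "between_line_and_circle"
    else:
        case = "before_line"
    return case, d_vb, d_vi, d_vl


print(configuration_case(A=(-1.0, 1.0), B=(0.0, 3.0), C=(1.0, 1.0), V=(0.0, 0.0)))
```

In this example the circle through A, C, and V has center (0, 1) and radius 1, so I = (0, 2) and L = (0, 1), and a landmark B at (0, 3) falls in the first case.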

3  Experimental  Results 

We compared the performance of our algorithm with randomly
choosing landmarks to be used for localization.
All experiments were run in simulation using real topographic
data. It was assumed that the navigator had a
map of the area and knew map locations of points which
defined both the path and the landmarks as well as the
order of landmarks with respect to initial navigator location.
Results for one example are shown in Figure 4
and in the sequence of frames in Figure 5. Each frame
in this example represents an area approximately 18 by
12 kilometers with the lower left corner corresponding
to UTM coordinates 427020E, 4497780N, southeast of
Salt Lake City, UT. North is to the left of each frame,
and east is toward the top. All landmarks are mountain
peaks which are visible from the given path.* Identifying
landmarks in large-scale space is difficult and time-consuming
[Thompson and Pick, 1992]. For that reason, we
use a small set of landmarks. Additional landmarks can
be identified if they are needed. The eight landmarks
used for these trials provided 56 different combinations
of ordered landmark triples.

*Landmark locations and elevations were taken from
USGS 30m DEM data.
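The count of 56 triples follows from choosing 3 of the 8 landmarks, which the short Python check below confirms. The landmark labels are hypothetical placeholders, listed in an assumed left-to-right order so that each combination comes out as an ordered triple.

```python
from itertools import combinations

# Hypothetical labels for the eight landmarks, in left-to-right order.
landmarks = ["P1", "P2", "P3", "P4", "P5", "P6", "P7", "P8"]

# combinations() preserves input order, so each triple is already ordered.
triples = list(combinations(landmarks, 3))
print(len(triples))
print(triples[0], triples[-1])
```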


Consider two navigators moving along a path toward
a goal. They have identified visible landmarks on a map
and know the left-to-right order of those landmarks.
Both begin by using their knowledge of landmark order
to determine the smallest orientation region in which
they are located. They use the landmarks which form
that region to estimate their initial location. Those three
landmarks are shown as triangles in Figure 4. The estimated
location (same for both navigators) is shown by
the empty square. The desired path is shown by a dotted
line. The goal is marked by a star. The sequence
of frames in Figure 5 shows each step as the navigators
progress toward the goal. A configuration of three landmarks
to use for localization (triangles) is chosen. Viewpoint
(empty square) is estimated and a move is made
toward the next path point (line ending in solid square).
The sequence on the left shows a wise choice of landmarks.
Landmarks are chosen randomly in the sequence
on the right.




Figure 4: The eight points at the top of the figure represent
the eight landmarks used for localization. Both
navigators start at the solid square on the lower left.
Viewpoint is estimated (empty square) using the three
landmarks (triangles) which produce the smallest orientation
region. Desired path is shown as a dotted line.
The goal is marked by a star.

Landmarks used by the navigator on the right in the
first frame are not as widely spaced as those used on the
left. In addition, the center landmark lies behind (with
respect to the navigator) the line joining the outer two
landmarks whereas the center landmark on the left lies
in front of that line. These conditions result in a larger
area of uncertainty for the configuration on the right and
somewhat poor localization. This error is made up for in
the second frame, but a large error in estimation occurs
in the last frame. The reason for this is that actual
navigator location (from which the estimate was made)
and the three landmarks chosen are very close to being
on a single circle. The visual angles themselves in the
corresponding third frames are quite similar: 28° and
45° on the left and 42° and 18° on the right.*

*Landmark elevation affects visual angle measure. That
is why the sums of the angles are not equal even though the
outer landmarks are the same.

Figure 5: The sequence on the left shows the path taken by the navigator using our algorithm. The sequence on
the right shows the path taken when landmarks used for localization are chosen randomly. Landmarks used for
localization are shown as triangles. Desired path is a dotted line. Path taken is a solid line. Viewpoint is estimated
at empty square, and navigator moves to next path point (end of solid line furthest to right).

Figure 6: After fifty trials, clustering on the left shows how better localization results when landmarks are chosen
wisely. Error bounds were ±20% in visual angle for the top pair of frames, ±30% in visual angle for the second pair
of frames, and ±20% in both visual angle and direction and distance of move for the third set of frames.




                                Our algorithm                  Random choice
Error bounds    Angle           ±20%     ±30%     ±20%         ±20%     ±30%     ±20%
                Move            0        0        ±20%         0        0        ±20%
Mean Extra Distance Traveled    344      2883     474          4273     18657    4576
Mean Distance to Path           452      513      387          1106     1227     861
Mean Distance to Goal           711      1166     769          3239     4781     3290

Table 1: Results after 100 trials. Total path length was 11352 meters. All distances have been rounded to the nearest
meter.


The statistical distribution of sensor errors and
whether they affect measurements in an additive, multiplicative,
or non-linear manner are heavily dependent
on the specific sensor technologies used. In navigation,
these can range from very wide field image sensors to
sensors which mechanically move to scan for landmarks.
In order to illustrate our approach, we ran a simulation
experiment using multiplicative error, uniformly distributed
over a fixed range. Error amounts were generated
using an implementation of the Wichmann-Hill
algorithm [Wichmann and Hill, 1982].
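To make the error model concrete, the Python sketch below pairs a hand-written Wichmann-Hill generator (the three combined linear congruential generators of the published algorithm) with a multiplicative perturbation. The interpretation measured = true * (1 + e), with e uniform in the stated bound, is an assumption about how the multiplicative error was applied; the class and function names and the seeds are hypothetical.

```python
class WichmannHill:
    """Wichmann-Hill generator: three small congruential generators combined."""

    def __init__(self, s1=1, s2=2, s3=3):
        # Seeds are assumed to be integers between 1 and 30000.
        self.s1, self.s2, self.s3 = s1, s2, s3

    def random(self):
        """Return the next pseudo-random float in [0, 1)."""
        self.s1 = (171 * self.s1) % 30269
        self.s2 = (172 * self.s2) % 30307
        self.s3 = (170 * self.s3) % 30323
        return (self.s1 / 30269.0 + self.s2 / 30307.0 + self.s3 / 30323.0) % 1.0


def perturb(true_value, bound, rng):
    """Apply multiplicative error drawn uniformly from [-bound, +bound)."""
    e = (2.0 * rng.random() - 1.0) * bound
    return true_value * (1.0 + e)


rng = WichmannHill(1, 2, 3)
# A true visual angle of 30 degrees measured with a +/-20% error bound.
print([round(perturb(30.0, 0.20, rng), 2) for _ in range(5)])
```

Under this interpretation a ±20% bound keeps a 30 degree visual angle between 24 and 36 degrees.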

The three pairs of frames in Figure 6 show navigator
positions for 50 trials, assuming uniform distribution of
error within ±20% in visual angle measure and no error
in movement, error within ±30% in visual angle measure
and no error in movement, and error within ±20% in
both visual angle and direction and distance of move.*
The clustering around the path points is quite marked
on the left, the result of using our algorithm to choose
landmark configurations.

Table 1 gives results for all three cases after 100 trials
each. Distances have been rounded to the nearest
meter. "Mean Extra Distance Traveled" is the average
number of meters beyond the total path length that each
navigator traveled. Because paths in unstructured
environments are seldom straight, total distance traveled
does not necessarily reflect how well the navigator stayed
on the desired path. For that reason, we also recorded
the distance of each path segment of the desired path to the
corresponding path taken. The perpendicular distance
of the midpoint of the desired path segment to the path
segment taken was computed for each segment. The average
of all these distances is given in the table as "Mean
Distance to Path". This gives an indication of the lateral
distance of each navigator to the desired path. "Mean
Distance to Goal" is the average distance to the goal.
The navigator which used our algorithm traveled less,
remained closer to the path and ended closer to the goal
than the second navigator. It is important in this type
of environment that, when better localization at the goal
is needed, the navigator is close enough to that goal to
exploit local constraints. The navigator who chose landmarks
wisely is close enough to use local constraints in all
three sets of trials. It is questionable if the second navigator,
averaging a minimum of two miles away from the
goal, will be able to take advantage of such constraints.
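The path measures described above can be sketched in Python as follows. The sketch assumes that the desired path and the path taken are given as equal-length lists of 2D waypoints, and that "perpendicular distance" means the distance from the midpoint of a desired segment to the infinite line through the corresponding segment taken; both are assumptions, and the function names are hypothetical.

```python
import math


def path_length(points):
    """Total length of a polyline given as a list of (x, y) waypoints."""
    return sum(math.dist(p, q) for p, q in zip(points, points[1:]))


def midpoint_to_line(desired_seg, taken_seg):
    """Perpendicular distance from the desired segment's midpoint to the taken segment's line."""
    (p0, p1), (q0, q1) = desired_seg, taken_seg
    mx, my = (p0[0] + p1[0]) / 2.0, (p0[1] + p1[1]) / 2.0
    dx, dy = q1[0] - q0[0], q1[1] - q0[1]
    length = math.hypot(dx, dy)
    if length == 0.0:  # degenerate segment: fall back to point distance
        return math.dist((mx, my), q0)
    return abs(dx * (my - q0[1]) - dy * (mx - q0[0])) / length


def mean_distance_to_path(desired, taken):
    """Average midpoint-to-line distance over corresponding segments."""
    desired_segs = list(zip(desired, desired[1:]))
    taken_segs = list(zip(taken, taken[1:]))
    dists = [midpoint_to_line(d, t) for d, t in zip(desired_segs, taken_segs)]
    return sum(dists) / len(dists)


def extra_distance(desired, taken):
    """Distance traveled beyond the length of the desired path."""
    return path_length(taken) - path_length(desired)


desired = [(0.0, 0.0), (10.0, 0.0), (20.0, 0.0)]
taken = [(0.0, 0.0), (10.0, 5.0), (20.0, 0.0)]
print(extra_distance(desired, taken), mean_distance_to_path(desired, taken))
```

Averaging these per-trial values over all trials would give figures of the kind reported in the table.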


*The point was picked from a uniform distribution within
a circle of radius 20% of path segment length around the
desired path point.

Abstract

Localization based on visual landmarks requires
feature extraction from views and map,
matching of features between views and map,
and viewpoint hypothesis generation and verification.
In this paper, we describe lower-level
image and map understanding procedures
for extracting features and higher-level problem
solving methods for establishing feature correspondences
and making inferences about the
viewpoint. Each of these processes, including
the interaction of high-level and low-level subsystems,
is demonstrated on real data.

1  Introduction. 

An essential aspect of map-based navigation is the determination
of an agent's current location based on sensed
data from the environment. Formally, this amounts to
specifying the current viewpoint in some world model
coordinate system. This localization process has two
distinct components: one involving the establishment of
correspondences between aspects of the sensed data and
the map or model and the other involving derivation
of constraints on the viewpoint based on the correspondences
that have been determined.

Correspondences can be established at the signal or
feature level. Signal-level matching correlates sensed
data with predictions of how the sensed data should appear.
It works best when the uncertainty in the viewpoint
is small and when it is relatively easy to accurately
generate expected sensor data. For example, in
the TERCOM and SITAN cruise missile guidance systems,
a digital elevation model is matched against a
downward-looking, radar-sensed elevation profile [Andreas
et al., 1978, Baird and Abramson, 1984]. Several
researchers have addressed the more difficult problem
of signal-based localization at or near ground level using
horizontally oriented imaging systems and passive
sensing. In [Ernst and Flinchbaugh, 1989], deviations
between expected and observed views are determined
using curve matching algorithms and a weak perspective
model is used to update the position estimate. [Yacoob
and Davis, 1991] and [Talluri and Aggarwal, 1992] determine
viewpoint under the assumption that viewpoint
elevation is known with high precision in the reference
frame of the map, a situation which dramatically reduces
complexity but is unfortunately not likely to hold
in practice. [Stein and Medioni, 1992] proposed an alternate
method for determining viewpoint based on the
observed horizon line which is similar to the characteristic
view approach in object recognition.

*This work was supported by National Science Foundation
grant IRI-9196146, with partial funding from the Defense
Advanced Research Projects Agency.

Vision-based navigation in unstructured terrain can
violate many of the assumptions used in the approaches
described above. Often there is limited a priori knowledge
about the viewpoint due to travel through indistinct
terrain, temporary occlusion of landmark features,
or errors in position updating processes. The view of
the world at or near ground level is difficult to generate
from map data with sufficient fidelity to allow signal-level
matching. Furthermore, available digital cartographic
data sets often contain inaccuracies that can cause serious
problems for correlation-based analysis. For example,
in one of the USGS DEMs that make up our test
data, the location of the high point of a significant peak
is off by over 200m. It is not surprising that most of
the published work on vision-based localization from a
ground-level perspective has been demonstrated only on
synthetic data, where these problems do not occur.

With signal-based techniques, actual viewpoint determination
is done using the same types of methods involved
in photogrammetry (which solves the same problem)
[Sanso, 1973, Thompson, 1958] or in alignment approaches
to object recognition [Huttenlocher and Ullman,
1987, Grimson, 1990]. The principal shortcoming
in both these methods is the difficulty of introducing
realistic error models or effective representations of the
uncertainty in viewpoint estimates.

Feature-based approaches hold the potential for avoiding
many of these problems. Features are extracted independently
from sensed data and maps and then matched
symbolically. As a result, there is no longer a need to
be able to synthesize an accurate rendition of expected
sensed data. The symbolic nature of matching and viewpoint
inference allows the introduction of sophisticated
problem solving methods which are able to deal with
issues such as ambiguity and complex error models.

In the remainder of this paper, we describe one possible
approach to feature-based localization in unstructured,
outdoor terrain. We outline methods for extracting
terrain features from maps and image data, show how
matching can be performed, and describe a collection of
qualitative geometric reasoning procedures for determining
viewpoint while maintaining an explicit representation
of the uncertainty associated with that determination.
The approach is demonstrated on a real example
involving imagery obtained with a video camera and map
data provided by the USGS.

2  Feature  Extraction. 

Three  classes  of  entities  are  central  to  the  localization 
process:  Terrain  is  the  physical  layout  of  the  land.  Maps 
are  geometric  representations  of  a  particular  region  of 
terrain,  typically  from  a  downward-looking  perspective 
and  possibly  augmented  with  information  about  culture 
and/or  vegetation.  Views  are  visually  sensed  images  of 
a  particular  region  of  terrain. 

Each class of entities can be described in terms of features.
In the case of terrain, features are commonly used
geographic properties: hills, valleys, ridges, etc. These
features can exist across a range of scales, specified in
terms of physical extent. (We never actually deal with
terrain features, only with manifestations of such features
in the map and view.) In the case of maps and
views, we need to distinguish between data-level and
terrain-level features: Data-level features are distinctive
patterns in the data (e.g., a configuration of edge fragments
in a view or a locally defined topographic structure
in a map). Terrain-level features are patterns of
data-level features likely to correspond to some particular
terrain feature. Terrain features, terrain-level map
features, and terrain-level view features are distinct, even
though they may have common names.

2.1  View  features. 

Currently, we are concentrating on those view features
associated with occluding contours. Because the imagery
is acquired from a horizontal perspective, these
typically correspond to ridge lines. Ridge line extraction
is a classical segmentation problem. The type of data
we are working with, however, causes significant difficulties.
Image contours corresponding to actual ridge lines
should be long, connected, and relatively smooth. Except
in pathological cases, they should never fold back
on themselves. While this might suggest an approach
which looks for "large scale" image features, things are
not so simple. Contrast variations across edges that correspond
to actual ridge lines can be small and of limited
spatial extent. For portions of many ridge lines, contrast
variations can be lacking altogether. As a result,
scale-space approaches will not succeed.

Instead, we use an approach similar to [Sha’ashua and
Ullman, 1988, Nevatia et al., 1992]. An initial edge map
is computed using a zero-crossing edge detector. Edge
segments are alternately filtered to remove portions inconsistent
with the geometric properties of ridge lines
and augmented using properties of good continuation to
account for locally indistinct ridge segments. Extraction
of longer ridge lines is done using A* search [Martelli,
1972] which allows the specification of more global optimization
criteria. This is particularly important for
the horizon line, which can be difficult to find due to
clouds or aerial perspective. Once these operations are
completed, junctions, end points, and vertical contour
extrema are located, since these often correspond to topographically
relevant features.

Figure 1: Original Image

V - peaks, A - saddles, □ - junctions, o - endpoints

Figure 4: Extracted features

Figure 1 shows an image of mountainous terrain. Figure
2 shows the actual ridge lines apparent from the
viewpoint associated with Figure 1, as determined from
DEM data. Figure 3 shows four stages in the edge
filtering/gap filling process. The first frame is the output of a
hysteresis thresholded zero-crossing edge detector. The
next two frames show intermediate results, with filled
gaps indicated by darker lines. The last frame shows
the final edges. Figure 4 shows extracted line segments,
peaks, saddles, T-junctions, and end points.

2.2  Map  features. 

The extraction of map features involves different problems
than those associated with the view, but many of
the processing steps are similar. (The cartographic community
has done related work, but not specifically in
support of localization.) Since we are operating directly
on elevation data, we do not need to deal with the ambiguity
associated with low-level contrast features in the
view. However, we do have to find long ridge contours
that may not be immediately apparent at a given scale.
The analysis starts with a characterization of local surface
shape of the map in terms of ridges, valleys, peaks,
and saddles using the method described in [Haralick et
al., 1983]. Instead of resampling to produce precise ridge
lines, we found it sufficient to impose thresholds when
extracting ridge lines and use a thinning algorithm to
extract ridge contours.

Navigationally salient ridge lines and peaks cannot
be detected from an analysis of features extracted from
local differential properties alone. Visually prominent
ridge lines often contain broad sections where the spatial
derivatives of elevation are low, resulting in a classification
as flat ground and so creating breaks in the
ridge lines. Local maxima in elevation may or may not
correspond to visually identifiable peaks, depending on
the nature of surrounding peaks and saddles. Relatively
simple gap filling and filtering operations can significantly
improve the utility of features extracted using local
methods.

The features resulting from this process span a wide
range of spatial scales. Again, the linear nature of
ridge lines limits the value of a straightforward scale-space
analysis. A local analysis of ridge line junctions
has proven adequate for distinguishing between dominant
and subsidiary ridges. This allows the creation
of a graph-like description of ridge structure, since spur
ridges can in turn contain sub-spurs. Access to this hierarchy
can prove significantly beneficial in feature matching.
At initial stages of the matching process, only main
ridges should be considered. When precise localization
hypotheses are being evaluated, however, the detailed
structure of the ridge line may become relevant. The
hierarchical description makes it easy to avoid this level
of detail unless needed.

Figure 5: Extracted peaks and ridges.

Figure 6: Extracted peaks and ridges at coarser scale.


Figure 7: Alternate, viewpoint dependent ridge hierarchies.

Local analysis at times is unable to distinguish with
confidence the major and minor ridges coming into a
junction. Rather than being a deficiency of the approach,
this may provide useful information. These situations
are exactly those where there is a viewpoint dependence
in the ridges which needs to be attended to. As
a result, we can isolate the viewpoint dependent aspects
of the representation, use with greater confidence those
parts that show no obvious ambiguity, and distinguish
between dominant and subsidiary features if and when
hypotheses about the viewpoint are available.

Figure 5 shows significant peaks and ridges extracted
at one particular spatial scale overlaid onto the corresponding
contour map. Figure 6 shows peaks and
ridges extracted at a coarser spatial scale. Figure 7 illustrates
the hierarchical nature of extracted ridge features.
"Dominant" ridge lines are often viewpoint dependent,
as shown in the two parts of the figure, each with a view
position indicated by a black dot.

3  Matching  and  Geometric  Inference. 

Feature-based localization involves problem solving
[Heinrichs et al., 1992]. The integration of symbolic
problem solving with signal-level image analysis has long
been a goal for many in the computer vision community.
Few successful examples exist, however. In our
case, we are able to effect this integration by restricting
ourselves to a specific task and establishing a protocol
for the interaction between high and low level analysis
routines that is tailored to that task. The problem solving
component of the system interacts with the feature
extraction modules as if they were databases. Query
and response languages were defined that make it possible
to easily express relevant information about terrain
features. Geometric inference is integrated in a similar
manner. The result is a system in which the individual
components can be constructed in a nearly independent
manner, without a need to understand the details of internal
representations and algorithms of other modules.
Figure 8 shows the basic organization.

Overall control is determined by the high-level matching
and inference system. Both top-down and bottom-up
feature extraction are easily accomplished, however.
For example, early in the localization process reconnaissance
queries can request a general examination of map
or view to determine significant features. Later, expectations
can be verified in a top-down manner by generating
highly constrained queries and examining whether or not
any items are returned.

3.1  Matching. 

One key observation arising from our study of how expert
map users solve difficult localization problems is that
they organize map and view features into configurations
before attempting to match them [Pick et al., in press].
Configurations are small groupings of features (typically
two or three) that are close together and often satisfy
particular topographic and/or geometric properties that
make them distinctive. Matching configurations rather
than individual features significantly reduces the combinatorics
in two ways: there are fewer candidates for
matching and each potential match involves a richer set
of properties which can be evaluated for compatibility.

Figure 8: Interaction of high-level and low-level subsystems.

Complexity and possible ambiguity are further reduced
by forming target configurations that are likely
to be matched. Each map and view feature has a set of
associated properties which constitute a geometric description
of its shape and position. Particular property
values such as high or sharp peaks or near level ridges are
an indication that the feature is likely to be easy to find
in both map and view. Specific combinations of feature
properties are used to compute a set of prominence values,
each represented using a number in the range [0.0, 1.0].
Prominence alone is a poor criterion by which to select
features for forming configurations, however, since it is
computed on a per-feature basis and it may turn out for
a particular case that there are many features high in
some particular property prominence. The distinctiveness
of a particular prominence type is characterized by
a value that is large when the population of features is
such that a few have large values for the prominence in
question and the rest have small values. The saliency
of each feature property is computed by multiplying the
property prominence by the property distinctiveness. Finally,
the overall saliency for features is computed using
a simple product rule that favors features with several
highly salient prominences:

$$S_{\text{overall}}(f_i) = 1.0 - \prod_j \left(1.0 - S_j(f_i)\right)$$

where $S_{\text{overall}}(f_i)$ is the overall saliency of feature $f_i$
and $S_j(f_i)$ is the individual saliency of the $j$-th property
of feature $f_i$.
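The product rule can be illustrated with a few lines of Python. The sketch assumes the rule is one minus the product, over a feature's properties, of one minus each property saliency, and that each property saliency is prominence times distinctiveness as described above. How distinctiveness itself is computed is not specified here, so it is passed in as a given number; the function names and sample values are hypothetical.

```python
import math


def property_saliency(prominence, distinctiveness):
    """Saliency of one property: prominence weighted by distinctiveness."""
    return prominence * distinctiveness


def overall_saliency(property_saliencies):
    """Product rule: 1 - prod(1 - s_j) over the per-property saliencies."""
    return 1.0 - math.prod(1.0 - s for s in property_saliencies)


# One strongly salient property dominates the result.
print(overall_saliency([0.9, 0.1, 0.0]))
# Several moderately salient properties reinforce each other.
print(overall_saliency([0.5, 0.5]))
```

Because each factor (1 - s_j) lies in [0, 1], the overall value stays in [0, 1], never falls below the largest individual saliency, and grows as more properties become salient.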

The formation of configurations is implemented by
first sending a reconnaissance query to either the map or
view feature extraction subsystems, requesting that individual
features with high prominence be returned. These
are filtered to remove all but the features with the highest
overall saliency. Configurations are then formed with
a combination query to the same low-level subsystem,
requesting any sets of features that contain at least one
of the salient individual features and satisfy particular
geometric and/or topographic properties. Any configurations
that result from this process can be searched for
using a similar combination query to the other low-level
feature extraction subsystem.

3.2  Inference. 

Localization  involves  the  determination  of  viewpoint 
constraints  based  on  possible  correspondences  between 
image  and  map  features.  These  constraints  are  used  in 
geometric  reasoning  operations  that  either  hypothesize  a 
possible  viewpoint  or  evaluate  a  hypothesis  by  predicting 
additional  constraints  that  should  be  satisfied.  There  are 
distinct  categories  of  information  about  feature  position 
that  in  turn  lead  to  distinct  constraints  on  viewpoint: 

Absolute  bearing:  This  is  the  “standard”  way  to  solve 
localization  problems.  It  requires  an  accurate  compass 
registered  to  the  map  coordinate  system.  Determination 
of  viewpoint  is  done  using  straightforward  trigonometry. 

Relative bearing: Relative bearings between three or
more image features with known map positions lead to a
classical "pose estimation" problem. Well established
numerical techniques exist for solving such problems.
[Levitt et al., 1987] describe an alternate method in
which only two features are considered at a time. The visual
angle between the two features constrains the viewpoint
to lie on a particular circle on the map. Using
multiple pairs of features usually allows a unique viewpoint
to be found by intersection.

Ordinal view: [Levitt et al., 1987] show how ordinal position
of two features (e.g., "A is left-of B") can be used
to constrain the viewpoint to lie on one side of a line
through the positions of A and B [Levitt et al., 1987,
Levitt et al., 1988]. They suggest intersecting this constraint
for many different pairs of features.

Exact Alignment: If two features line up along a line of
sight, then the viewpoint is constrained to lie on a line
connecting the two features. In almost all circumstances
encountered in outdoor navigation, it is possible to determine
which of the two features is more distant and as
a result the viewpoint can be constrained to a half-line.

Approximate Alignment: If two features are much closer
laterally (i.e., perpendicular to the line of sight) than in
depth (i.e., parallel to the line of sight), then not only is
the viewpoint constrained to lie on one side of a line connecting
the two features, but it will be "near" this line.
This constraint appears to be used with some frequency
by expert map users solving real navigation problems.

Viewpoint  terrain  type:  A  locomoting  agent  often  has 
more  complete  information  about  its  immediate  envi¬ 
ronment  than  it  does  about  more  distant  aspects  of  the 
terrain.  This  information  relates  in  a  direct  manner  to 
the  determination  of  viewpoint.  (E.g.,  “I’m  standing  on 
a  hill.  Therefore,  the  viewpoint  must  be  located  at  one 
of  the  hills  found  represented  on  the  map.”) 

Constraints  arising  from  the  information  sources  listed 
above  can  be  divided  into  two  categories.  All  but  the 


Figure  9:  Reasoning  about  viewpoint  involves  interact¬ 
ing  constraints. 


Figure 10: Distant constraints on viewpoint: a) absolute
bearing, b) relative bearing if approximate depth
information is available, c) relative bearing if no depth
information is available, d) ordinal position, e) exact
alignment, f) approximate alignment.

last  constitute  distant  constraints,  since  they  are  based 
on  features  distant  from  the  viewpoint.  Trigonometric 
relations  are  required  to  relate  distant  constraints  to 
viewpoint,  although  qualitative  as  well  as  quantitative 
solutions  exist.  Viewpoint  terrain  type  leads  to  local 
constraints  which  limit  the  viewpoint  to  compatible  ter¬ 
rain  features  on  the  map.  Distant  and  local  constraints 
can  be  used  in  three  kinds  of  reasoning  about  viewpoint 
(Figure 9):

Distant constraints => constraints on viewpoint: Map-view
feature correspondences for sets of distant features
can  be  used  to  determine  constraints  on  viewpoint  using 
geometric  reasoning  methods  applied  to  any  combination 
of  the  information  sources  described  above. 

Local  constraints  =>  constraints  on  viewpoint:  Local  con¬ 
straints  allow  for  the  enumeration  of  possible  viewpoints. 
Such  an  enumeration  can  be  intersected  with  the  con¬ 
straint  regions  that  usually  arise  from  consideration  of 
distant  features. 

Constraints  on  viewpoint  =>  expectations  about  distant 
constraints:  Hypotheses  about  viewpoint  can  be  evalu¬ 
ated  by  examining  distant  features  using  the  geometric 
Figure 11: View.


reasoning  methods  applied  to  any  combination  of  the 
information sources described above. Positional and/or
orientational  constraints  on  viewpoint  can  be  exploited. 

In  order  to  implement  the  constraint  satisfaction 
shown  in  Figure  9,  geometric  representations  are  needed 
for  viewpoint  regions  (areas  in  the  map  corresponding 
to  possible  viewpoints),  map  search  regions  (map  re¬ 
gions  possibly  containing  terrain  features  visible  in  the 
view),  and  view  search  regions  (portions  of  the  view  in 
which  particular  terrain  features  indicated  by  the  map 
are  expected  to  be  found).  A  variety  of  representational 
formalisms  are  possible,  each  with  advantages  and  dis¬ 
advantages.  [Sutherland  and  Thompson,  1993]  use  an 
analytical  description  of  the  bounding  curve  associated 
with  the  region  in  which  there  is  any  chance  that  the 
viewpoint  is  located.  The  localization  example  shown 
in  section  4  uses  a  much  simpler  convex  polygon  repre¬ 
sentation.  This  provides  a  compact  description  that  is 
efficient  to  manipulate.  It  also  fits  fairly  well  with  peo¬ 
ple’s  intuition  about  the  geometry  involved.  With  a  few 
exceptions,  convex  polygons  have  proven  to  be  an  ade¬ 
quate  basis  on  which  to  build  the  geometric  constraint 
satisfaction  algorithms,  though  they  sometimes  lead  to 
a  very  conservative  approach  such  as  describing  the  rela¬ 
tive  bearing  constraint  using  a  circle  rather  than  a  cres¬ 
cent  (see  Figure  10).  Figure  10  indicates  how  distant 
constraints  on  viewpoint  can  be  represented  using  the 
convex  polygon  approach.  Similar  regions  have  been  de¬ 
fined  for  using  viewpoint  to  develop  constraints  on  view 
and  map  positions  [Thompson,  1993]. 

Geometric inference is implemented using a query language
similar in principle to that used to interact with
the  low-level  feature  extraction  subsystems.  Viewpoint 
regions  are  hypothesized  or  refined  using  a  query  con¬ 
taining  a  (possibly  empty)  current  viewpoint  region  hy¬ 
pothesis,  a  set  of  corresponding  map  and  view  feature 
locations,  and  a  particular  inference  method  to  use.  The 
geometric  inference  subsystem  applies  the  method  to 
produce  a  viewpoint  region  constraint,  intersects  this 
with  the  initial  regions  supplied,  and  returns  the  result. 
Map  search  regions  result  from  queries  which  specify  a 
current  viewpoint  region  hypothesis,  a  set  of  view  fea¬ 
ture  locations,  and  an  inference  method.  View  search 
regions  are  obtained  in  an  analogous  manner. 
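
As an illustration of the "apply a method, intersect with the current region, return the result" pattern on convex polygons, the sketch below clips one convex polygon against another, one half-plane at a time. It is a minimal sketch in Python (the system described above was written in Lisp); the function names and the use of `None` for an empty hypothesis are assumptions made for the example.

```python
def clip_halfplane(poly, a, b):
    """Part of convex polygon `poly` on or to the left of the directed line a -> b."""
    def side(p):
        return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])
    out = []
    for i in range(len(poly)):
        p, q = poly[i], poly[(i + 1) % len(poly)]
        sp, sq = side(p), side(q)
        if sp >= 0:
            out.append(p)
        if (sp > 0 and sq < 0) or (sp < 0 and sq > 0):
            t = sp / (sp - sq)  # the edge p -> q crosses the clipping line
            out.append((p[0] + t * (q[0] - p[0]), p[1] + t * (q[1] - p[1])))
    return out

def refine_region(current, constraint):
    """Intersect the current viewpoint region with a new constraint region.

    Both are convex polygons with counter-clockwise vertices; `current` may be
    None, meaning there is no hypothesis yet. An empty list means no viewpoint
    is consistent with both regions.
    """
    if current is None:
        return list(constraint)
    region = list(current)
    for i in range(len(constraint)):
        if not region:
            break
        region = clip_halfplane(region, constraint[i],
                                constraint[(i + 1) % len(constraint)])
    return region

def area(poly):
    """Polygon area by the shoelace formula."""
    n = len(poly)
    return abs(sum(poly[i][0] * poly[(i + 1) % n][1]
                   - poly[(i + 1) % n][0] * poly[i][1] for i in range(n))) / 2.0
```

Because each clip of a convex polygon by a half-plane is again convex, repeated refinement never leaves the representation.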

4  Example. 

We  have  demonstrated  the  sufficiency  of  the  approach 
described  above  by  applying  it  to  a  real  example  from 
the  mountains  just  southeast  of  Salt  Lake  City,  UT.  View 
features  were  extracted  from  the  panorama  image  shown 
in  Figure  11.  Note  that,  rather  than  the  synthesized 
views  commonly  used  in  much  of  the  reported  research 
on  outdoor  localization,  we  are  using  an  actual  terrain 
image. Map features were extracted from USGS 30m
DEM data covering the equivalent of four 1:24000 7.5'
quadrangles  (approximately  21.4  km  by  28  km),  the  up¬ 
per  half  of  which  appears  in  Figures  5  and  6.  The  com¬ 
pass  orientation  of  the  view  was  known,  but  no  infor¬ 
mation  about  viewpoint  was  provided  other  than  that  it 
was  somewhere  within  the  available  map  area. 

The  problem  solving  subsystem  responsible  for  feature 
matching  and  hypothesis  generation  and  evaluation  was 
implemented  in  Lisp.  The  geometric  inference  subsys¬ 
tem  was  also  implemented  in  Lisp  and  was  interfaced 
via  function  calls.  Both  of  the  low-level  feature  extrac¬ 
tion  subsystems  were  implemented  in  C,  ran  on  different 
machines  than  the  Lisp  processes,  and  were  interfaced 
using  simple  database-like  query/response  techniques. 

The  example  used  four  of  the  six  high-level  reasoning 
strategies described in [Heinrichs et al., 1992]: concentrate
on the view first, organize features into configurations,
pursue multiple hypotheses, and evaluate hypotheses
using a disconfirmation process. Two types of
distant  constraints  on  viewpoint  (section  3.2)  and  one 
type  of  constraint  for  determining  map  search  regions 
based  on  hypothesized  viewpoint  were  used.  Execution 
proceeded in five stages:

View  reconnaissance:  The  view  (image)  was  searched  for 
significant  features.  The  highest  peak  in  the  image  stood 
out  well  above  other  features  in  overall  saliency. 

Form  view  configurations:  View  configurations  were 
formed  from  the  selected  peak  feature  and  other  nearby 
features  that  were  prominent.  Prominence  rather  than 
saliency  was  used,  since  the  configuration  is  more  dis¬ 
tinct  than  its  components.  Two  dual-feature  configura¬ 
tions  resulted,  one  involving  the  horizontal  ridge  segment 
to  the  left  of  the  peak  and  one  involving  the  ridge  seg¬ 
ment  to  the  right.  In  fact,  the  second  of  these  was  a 
bad  choice.  The  ridge  to  the  right  is  actually  quite  dis¬ 
tant  from  the  peak,  but  this  was  not  detected  by  the  low 
level  image  analysis  routines  since  the  corresponding  T 
junction  was  not  found.  Simultaneous  consideration  of 
multiple  hypotheses  combined  with  the  disconfirmation 
strategy  resulted  in  a  system  tolerant  of  such  errors. 

Search  for  configurations  in  map:  Configurations  con¬ 
sisting  of  a  high,  sharp  peak  and  a  nearby  horizontal 
ridge  were  searched  for  in  the  map.  Three  such  configu¬ 
rations  were  found. 

Generate  initial  hypotheses:  Six  configuration  match¬ 
ing  hypotheses  were  postulated  (two  view  configurations 
times  three  map  configurations).  Each  configuration 
match  specified  two  feature  matches  which  were  used 
to  generate  a  hypothesized  viewpoint  region  using  the 
relative  bearing  constraint,  as  shown  in  Figure  12. 

Figure 12: Viewpoint regions corresponding to the six
initial hypotheses.

Refine hypotheses and evaluate using disconfirmation
strategy: For each hypothesis, highly salient view features
were considered in turn and searched for in the
map. If exactly one map feature of the correct type was
found in the expected location, a match was established
and the absolute bearing constraint was used to refine
the viewpoint region. If two or more map features were
found, no inferences were drawn and the next view feature
was processed. If no map feature was found where
one was expected, the hypothesis was disconfirmed.
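
The three-way rule above (one match refines, several matches are ignored, no match disconfirms) can be sketched as a short loop. This is an illustrative outline only: `find_map_matches` and `refine` are hypothetical callbacks standing in for the map query and the absolute bearing refinement, and `None` is used here to signal a disconfirmed hypothesis.

```python
def evaluate_hypothesis(region, view_features, find_map_matches, refine):
    """Refine a viewpoint hypothesis or disconfirm it.

    view_features    -- view features, most salient first
    find_map_matches -- callable(region, feature) -> list of candidate map features
    refine           -- callable(region, feature, map_feature) -> refined region
    Returns the refined region, or None if the hypothesis is disconfirmed.
    """
    for feature in view_features:
        matches = find_map_matches(region, feature)
        if len(matches) == 0:
            return None  # an expected feature is missing: disconfirmed
        if len(matches) == 1:
            region = refine(region, feature, matches[0])
        # Two or more matches: ambiguous, so draw no inference and move on.
    return region
```

Treating ambiguity as "no information" rather than as failure is what lets a hypothesis survive cluttered parts of the map.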

Figure  13  shows  the  refinement  of  the  hypothesis  cor¬ 
responding  to  the  actual  viewpoint.  Four  view  fea¬ 
tures  were  searched  for  in  the  map:  the  high  peak  men¬ 
tioned  previously,  two  other  peaks  towards  the  left  of 
the  panorama,  and  the  long  ridge  line  that  wraps  around 
from  the  right  edge  to  the  left  edge  of  the  panorama  im¬ 
age.  Three  unique  matches  were  found,  involving  two  of 
the  peaks  and  the  long  ridge.  The  remaining  peak  was 
ambiguous,  with  two  possible  peaks  in  the  map  located 
in  positions  that  could  plausibly  correspond  to  the  lo¬ 
cation  of  the  view  feature.  On  the  left  of  the  figure  is 
shown  the  search  for  corresponding  map  features.  The 
current viewpoint hypothesis is shown together with the
search  region  predicted  from  the  bearing  to  the  chosen 
view  feature.  Black  dots  indicate  map  features  that  were 
found.  For  the  first  three  view  features  considered,  a 
unique  map  feature  was  found.  On  the  right  is  shown 
the current viewpoint, the constraint regions associated
with  the  map  feature  just  found,  and  the  intersection 
which  becomes  the  refined  viewpoint  region.  The  last 
map  search  returned  an  ambiguous  result,  as  can  be  seen 
by  the  two  features  present  within  the  search  region.  As 
a  result,  no  refinement  of  the  viewpoint  region  was  pos¬ 
sible.  The  plot  on  the  lower  right  of  the  figure  shows 
the  final  region,  with  the  actual  viewpoint  marked.  The 
original viewpoint hypothesis had an area of approximately
1.489 km². After the first absolute bearing constraint
was imposed, the size of the region was down to
72,800 m². The second absolute bearing constraint left
this unchanged. The final constraint reduced the area to
less than 71,700 m², or about 0.07 km².


Figure 13: Viewpoint refinement and final region.


Figure 14 shows an attempt to refine one of the incorrect
hypotheses. The upper left panel shows the map
search  region  used  to  look  for  the  most  salient  view  peak. 
The  feature  being  searched  for  is  shown  as  an  open  cir¬ 
cle.  Due  to  the  fact  that  the  hypothesized  viewpoint  is 
wrong,  a  different  feature  was  found,  as  indicated  by  the 
black  dot.  The  viewpoint  region  was  refined  based  on 
this  match  as  shown  to  the  right.  The  next  most  salient 
view  feature  was  the  long  ridge  line.  As  shown  in  the 
lower  left  panel,  this  was  not  found  where  expected  and 
so the hypothesis was rejected. All five hypotheses not
including  the  true  viewpoint  were  rejected  in  a  similar 
manner. 
Figure 14: Validation of “incorrect” hypothesis.

Abstract

In  this  paper  I  will  discuss  a  system  which  uses 
vision  to  guide  a  mobile  robot  through  corridors 
and  freespace  channels.  The  system  runs  in  an 
unmodified  office  environment  in  the  presence 
of both static and dynamic obstacles (e.g., people).
The system is among the simplest, most
effective,  and  best  tested  systems  for  vision- 
based  navigation  to  date.  The  performance  of 
the  system  is  dependent  on  an  analysis  of  the 
special properties of the robot’s environment. I will
describe these properties and discuss how they
simplify the computational problems facing the
robot.^1

1  Introduction 

Navigation  is  one  of  the  most  basic  problems  in  robotics. 
Since  the  ability  to  safely  move  about  the  world  is  a  pre¬ 
requisite  of  most  other  activities,  navigation  has  received 
a  great  deal  of  attention  in  the  AI,  robotics,  and  com¬ 
puter  vision  communities.  One  of  the  limiting  factors  in 
the  design  of  current  navigation  systems,  as  with  many 
other  robotic  systems,  has  been  the  availability  of  reli¬ 
able  sensor  data.  Most  systems  have  relied  on  the  use 
of  sonar  data  [7][8],  or  on  vision  [9][3][5][6][13][1].  In  all 
cases,  the  unreliability  of  the  available  sensor  data  was 
a  major  concern  in  the  research.  Some  researchers  have 
even  avoided  the  use  of  sensor  data  entirely  in  favor  of 
precompiled  maps  [10]. 

In  this  paper,  I  will  discuss  a  very  simple  vision-based 
corridor  following  system  which  is  in  day-to-day  use  here 
at  MIT.  The  system  is  notable  in  that  it  is  very  fast 
(15  frames  per  second  in  the  current  system),  very  well 
tested, and uses only cheap “off-the-shelf” hardware.
A  major  source  of  this  simplicity  is  an  analysis  of  the 
agent’s  niche.  Such  an  analysis  helps  make  clear  the  de¬ 
pendence  of  an  agent  on  its  environment  and  provides 
guidance  for  the  design  of  future  systems. 

1.1  The  Polly  project 


^1 Support for this research was provided in part by the
University  Research  Initiative  under  Office  of  Naval  Research 
contract  N00014-86-K-0685,  and  in  part  by  the  Advanced 
Research  Projects  Agency  under  Office  of  Naval  Research 
contract  N00014-85-K-0124. 


Figure  1:  Hardware  architecture  of  Polly. 


Polly  is  a  low  cost,  vision-based,  autonomous  robot  built 
to help study how properties of the environment can
simplify the computational problems facing an agent.^2 The
theoretical  goal  of  the  project  is  to  articulate  a  num¬ 
ber  of  useful  computational  properties  of  office  environ¬ 
ments,  and  to  develop  a  theory  of  how  those  properties 
can  simplify  the  design  of  an  agent.  For  example,  one 
such  property  is  that  office  environments  do  not  gener¬ 
ally  have  any  moving  objects  other  than  people;  thus, 
motion  is  a  cue  to  agency.  Rather  than  having  to  do 
a  complicated  analysis  of  the  various  objects  in  view  to 
determine  if  they  look  or  act  like  agents,  the  robot  can 
simply  check  if  they’re  moving,  a  much  simpler  test.  This 
property does not hold of all habitats (trees blow in the
wind and waves crash upon the shore, but they are not
agents), and so the property partly determines the set
of habitats in which an agent that assumes it may
survive. For this reason, I will refer to such properties as
habitat  constraints  (see  Horswill  [4]  for  a  more  detailed 
discussion). 

In  this  paper,  I  will  discuss  how  habitat  constraints 
can  allow  “optimizations”  to  be  performed  on  the  con¬ 
trol  systems  of  simple  agents.  The  optimizations  we  will 
examine  take  the  form  of  replacing  a  subsystem  of  the 
agent  with  another  system  which  is  in  some  way  less  ex¬ 
pensive,  but  which  nonetheless  does  the  same  job,  pro¬ 
vided  that  the  habitat  constraint  holds.  The  motion 
constraint  outlined  above  allows  a  motion  detector  to  be 
substituted  for  a  more  complicated  recognition  system. 
This  substitution  is  an  optimization  in  the  sense  that 
the  motion  detector  is  less  expensive  than  the  recogni¬ 
tion  system.  Depending  on  the  context,  we  may  mea¬ 
sure  expense  as  the  actual  monetary  cost  of  building  a 


^2 Polly was built for $20K (parts cost in the U.S.), but
today, a roughly comparable, or perhaps even faster, system
could be bought for roughly $10K US.
Figure 2: Conceptual architecture of the current version of the navigation system, as implemented in software.


robot,  or  the  biological  cost  of  growing  and  feeding  extra 
neurons,  or  some  other  measure  entirely.  By  analyzing 
an  agent’s  specialization  in  terms  of  habitat  constraints 
and  optimizations,  we  can  place  specific  properties  of 
the  environment  into  correspondence  with  specific  com¬ 
putational  problems  facing  the  agent  and  the  solutions 
which  those  properties  allow.  Such  an  analysis  makes 
very  explicit  the  way  in  which  an  agent  is  adapted  to  its 
environment,  the  possible  consequences  of  changing  that 
environment,  and  what  facets  of  the  agent  would  have 
to  change  to  adapt  to  the  new  environment. 

The  implementation  goal  of  the  project  is  to  use  this 
approach  to  design  to  develop  an  efficient  visual  system 
which  will  allow  the  robot  to  run  unattended  for  ex¬ 
tended  periods  (hours)  and  to  give  primitive  “tours”  of 
the  MIT  AI  lab.  Our  general  approach  to  design  has 
been  to  determine  what  particular  pieces  of  information 
are needed by the agent to perform its activities, and
then  to  design  complete  visual  systems  for  extracting 
each  piece  of  information.  Thus,  the  agent  might  have 
distinct  systems  for  answering  questions  such  as  “am  I 
about  to  hit  something?”  and  “what  is  the  axis  of  this 
corridor?”  If  these  systems  are  simple  enough,  they  can 
be  run  in  parallel  very  efficiently.  In  the  system  pre¬ 
sented  here,  they  are  indeed  simple  enough  so  that  each 
one  can  be  run  for  each  image  frame,  using  an  inexpen¬ 
sive  computer. 

The  computational  hardware  on  Polly  consists  of 
a 16 MIP digital signal processor (Texas Instruments
TMS320C30) with 64K 32-bit words of high speed RAM^3,
a video frame buffer/grabber, a simple 8-bit microcontroller
(M68HC11) for I/O tasks, and commercial microcontrollers
for voice synthesis and motor control (see
figure 1). Nearly all computation is done on the DSP.
The  fact  that  Polly  uses  only  vision  for  sensing  was  due 
to  lack  of  engineering  time  and  experience,  not  because 
we  feel  that  vision  should  be  used  for  everything. 

1.2  Work  to  date 
At  present,  Polly  consists  of: 

•  A  system  for  navigating  corridors  and  relatively  un¬ 
cluttered  spaces  (reported  on  here) 

•  A  rudimentary  place  recognition  system 

^3 The DSP includes an additional 1Mb of low speed RAM,
which is not presently in use. At present, only approximately
10KW of RAM are in use.


•  A  plan  executive  for  forcing  the  robot  to  perform 
fixed  sequences  of  actions  (useful  for  debugging) 

•  A  simple  person  detector  based  on  bilateral  symme¬ 
try 

•  An  “unwedger,”  which  pilots  the  robot  out  of  cul- 
de-sacs  and  dead  ends. 

•  A  unit  which  overrides  the  corridor  follower  to  per¬ 
form  open-loop  turns 

•  A  carpet-boundary  detector  (see  section  7) 

The  connectivity  of  the  navigation  components  is  shown 
in  figure  2.  All  these  components  are  run  in  pseudo¬ 
parallel  fashion  on  the  DSP:  at  each  moment  in  time, 
the  robot  grabs  a  new  frame  from  the  camera,  runs  each 
of  the  components,  yielding  a  motor  command,  issues 
the  motor  command,  and  repeats  the  cycle. 
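
The pseudo-parallel cycle can be pictured as a plain loop body. The sketch below is only an illustration of that structure: the way components are combined (each one sees the frame and the command produced so far, and may replace it) is an assumption made for the example, as are all of the names.

```python
def run_cycle(grab_frame, components, issue_command):
    """One pass of the control cycle: grab a frame, run every component, act.

    components -- callables(frame, command_so_far) -> command, run in order;
                  a later component may override an earlier one (assumed here).
    """
    frame = grab_frame()
    command = None
    for component in components:
        command = component(frame, command)
    issue_command(command)
    return command
```

Running every component on every frame keeps the scheduling trivial, which is affordable only because each component is cheap.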

The  corridor  follower  nearly  always  controls  the  robot 
when  it  is  moving.  Even  when  the  robot  is  not  in  a 
corridor,  the  corridor  follower  is  still  used  to  attempt  to 
go  forward  without  hitting  obstacles.  All  other  modules 
are  built  upon  the  corridor  follower:  the  turn  box  can 
override  the  corridor  follower  to  turn  a  corner;  the  place 
recognition  system,  the  unwedger,  and  the  plan  execu¬ 
tive  can  issue  turn  requests  to  the  turn  box.  As  of  this 
writing,  the  corridor  follower,  unwedger,  carpet  bound¬ 
ary  detector,  and  plan  executive  are  essentially  finished. 
The  place  recognition  system  and  person  detector  are 
still  under  active  development. 

2  Corridor  following 

Corridor  following  is  a  common  navigation  task  in  office 
environments.  Office  buildings  tend  to  consist  of  long 
corridors  lined  with  rooms  on  either  side,  thus  much  of 
the  work  of  getting  from  one  room  to  another  consists  of 
driving  along  a  series  of  corridors.  The  corridor  follower 
described  here  is  intended  to  be  used  as  one  component 
among  many  which  cooperate  to  allow  the  robot  to  par¬ 
ticipate  in  its  projects. 

Corridor  following  can  be  broken  into  the  complemen¬ 
tary  problems  of  keeping  aligned  with  the  axis  of  the 
corridor  and  keeping  away  from  the  walls.  This  amounts 
to keeping the variable θ in figure 3 small, while simultaneously
keeping l and r comfortably large. Since Polly
can  only  move  in  the  direction  in  which  it’s  pointed, 
these  variables  are  coupled.  In  particular,  if  the  speed  of 
possible architectures. The choice of what architecture is
best  will  depend  on  the  resources  available  to  the  agent, 
the  other  tasks  which  the  agent  may  have  to  perform, 
and  the  relative  expense  of  the  design. 


Figure  3:  Corridor  following. 


the robot is s, then we have that:

$$\frac{dl}{dt} = -\frac{dr}{dt} = s \sin\theta$$

so moving away from a wall requires that Polly temporarily
turn away from the axis of the corridor. Thus
the problem for the control system amounts to controlling
s and dθ/dt so as to keep l and r large and θ small, and
the problem for the visual system amounts to determining
when one of these conditions is violated.

There  is  a  huge  space  of  possible  solutions  to  this  prob¬ 
lem.  Consider  the  subtask  of  aligning  with  the  corridor. 
An  obvious  way  of  performing  this  task  would  be  to  use 
a  system  that  first  constructs  a  3D  model  of  the  environ¬ 
ment,  then  finds  the  walls  of  the  corridor  in  the  model, 
then computes the axis, θ, of the corridor from the walls,
and finally, multiplies the axis by some gain to drive the
turning motor. We can represent this schematically as:

=> [3D model] -> [find walls] -> [compute θ] -> [gain] -> motor

Here the double arrow at the beginning represents input
from  the  sensors  and  the  boxes  are  successive  transfor¬ 
mations  of  the  sensory  data.  This  is  not  a  particularly 
efficient  design  however,  since  3D  models  are  both  diffi¬ 
cult  and  computationally  expensive  to  build.  Intuitively, 
building  an  entire  model  of  the  environment,  only  to 
compress down to a single number, θ, seems a waste of
energy. 

Of course, any system which computes θ, in whatever
manner, and then turns to minimize θ, that is, any system
of the form

=> [compute θ] -> [gain] -> motor



will do the trick. Furthermore, it can be shown that
given the right conditions, we don’t even need to compute
θ per se; we can substitute any monotonic function
of θ which is zero when θ is zero.^4 Thus we can use any
system of the form:

=> [f(θ)] -> [gain] -> motor

where f is the monotonic function.
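
A small numerical experiment illustrates why the exact form of the function does not matter under the conditions stated in the footnote. The sketch below is an assumption-laden illustration, not the robot's controller: it integrates a velocity-controlled turn with zero delay, uses the sign convention that the turn rate is minus the gain times the function of θ, and the gain, time step, and test functions are arbitrary choices.

```python
import math

def simulate(f, theta0, gain=2.0, dt=0.01, steps=2000):
    """Euler-integrate d(theta)/dt = -gain * f(theta) and return the final theta.

    f must be monotonically increasing with f(0) == 0 on the range visited.
    """
    theta = theta0
    for _ in range(steps):
        theta -= gain * f(theta) * dt
    return theta

# Three stand-ins for the measured quantity: theta itself, tan(theta)
# (monotonic only on (-pi/2, pi/2)), and atan(theta).
CANDIDATES = (lambda t: t, math.tan, math.atan)
```

With a starting misalignment of 0.8 radians, all three candidates drive θ to zero; they differ only in how quickly they get there.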

The  point  of  this  analysis  is  simply  that  the  constraints 
imposed by the task on the architecture of the agent are
actually quite modest, and that they admit a huge space of

^4 The conditions are that the motor must be controlled in
velocity  space,  meaning  that  we  have  direct  control  over  its 
speed,  rather  than  just  its  acceleration,  and  that  the  control 
system  must  be  fast  enough  so  that  we  can  treat  it  as  having 
zero  delay.  If  these  conditions  are  not  met,  then  the  system 
may  oscillate. 


3  The  control  system 

At  any  given  time,  the  corridor  follower  has  to  make  a 
decision  about  what  direction  to  turn,  if  at  all,  how  fast 
to  turn,  and  how  fast  to  move  forward.  Since  our  robot 
has  independent  motors  for  turning  and  driving  forward, 
we  can  treat  these  problems  separately.  In  this  section,  I 
will  discuss  how  the  corridor  follower  computes  the  turn 
and drive rates from five numbers describing the situation:
l', r', and θ', which are estimates of l, r, and θ,
respectively; c', a measure of the distance to the nearest
obstacle in front of the robot; and σ, a measure of the
visual system’s confidence in its estimate of θ. The control
problem is actually much easier than the perceptual
problem of estimating these numbers. In section 4, I will
discuss  two  special  properties  of  the  environment  which 
can  make  it  easier  for  the  visual  system  to  estimate  these 
numbers,  and  then  in  section  5,  I  will  discuss  the  actual 
design  of  the  visual  system. 

3.1  Steering  for  corridor  following 

The steering rate dθ/dt is controlled by θ', σ, l', and r',
which are measured by the visual system. The system
drives the steering motor at a rotational velocity of

$$\frac{d\theta}{dt} = \alpha(l' - r') + \beta\theta'$$

if it is confident of its measure of θ', otherwise with a
velocity of α(l' - r'). Here α and β are gains (constants)
adjusted empirically for good results.

In practice, we have not measured the variable θ' directly,
but instead have used the image plane x coordinate
of the projection of the axis of the corridor, which
is more easily measured. The projection x is equal to
$k\tan^{-1}\theta$, where k is determined by the focal length and
resolution of the camera. Thus the actual control law used
in our system is:

$$\frac{d\theta}{dt} = \alpha(l' - r') + \beta\tan^{-1}\theta'$$

One could solve for θ given x and an accurately calibrated
camera, but in our experience, this control system
has worked perfectly well.
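
Written as code, the steering rule is a two-branch expression. The sketch below follows the formula as printed, including its sign convention; the parameter names, the boolean confidence flag, and the default gains are assumptions for illustration (in the text the gains are tuned empirically and the confidence is a numeric measure).

```python
def steering_rate(l_est, r_est, axis_term, confident, alpha=1.0, beta=1.0):
    """Rotational velocity command for the steering motor.

    l_est, r_est -- estimated distances to the left and right walls
    axis_term    -- the measurement standing in for the corridor-axis angle
                    (for example the image-plane x coordinate of the axis)
    confident    -- whether the visual system trusts axis_term
    """
    wall_term = alpha * (l_est - r_est)
    if confident:
        return wall_term + beta * axis_term
    return wall_term
```

Dropping the axis term when confidence is low leaves the wall-balancing term in sole control, so the robot still centers itself in the corridor.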

3.2  Controlling  forward  motion 

The  two  constraints  on  forward  velocity  are  that  the 
robot  should  move  forward  when  there’s  nothing  in  its 
way,  and  that  it  should  stop  when  there  is  something 
close to it. We want the robot to move at full speed
when the nearest obstacle is more than some safe distance,
d_safe, and to stop when it’s less than some distance
d_stop. We use the rule

$$s = \min\left(v_{max},\ \frac{v_{max}}{d_{safe} - d_{stop}}\,(c' - d_{stop})\right)$$

which  causes  the  robot  to  smoothly  decelerate  as  it  ap¬ 
proaches  an  obstacle,  and  to  back  up  if  it  gets  too  close 
to  an  obstacle.  Backing  up  is  useful  because  it  allows 
one  to  herd  the  robot  from  place  to  place. 
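
The behaviour described (full speed beyond the safe distance, a smooth slowdown, a stop at the stopping distance, and reversing inside it) corresponds to a clamped linear ramp. The sketch below is one reading of the rule under that interpretation; the parameter names and the example numbers are hypothetical.

```python
def forward_speed(obstacle_dist, d_stop, d_safe, v_max):
    """Forward speed from the measured distance to the nearest obstacle ahead.

    Linear in obstacle_dist, capped at v_max; zero at d_stop and negative
    (backing up) when the obstacle is closer than d_stop.
    """
    ramp = v_max * (obstacle_dist - d_stop) / (d_safe - d_stop)
    return min(v_max, ramp)
```

With d_stop = 1.0, d_safe = 2.0, and v_max = 0.5, an obstacle at 3.0 gives full speed 0.5, one at 1.5 gives 0.25, one at 1.0 gives 0.0, and one at 0.5 gives -0.25, so a person stepping toward the robot pushes it backwards.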
Figure 4: Habitat constraints assumed by the visual system and the problems they helped to simplify. Note that
“known  camera  tilt”  is  more  a  constraint  on  the  agent,  than  on  the  habitat. 


4  Computational  properties  of  office 
environments 

Estimating  distance  and  parsing  the  visual  world  into 
objects  are  both  very  difficult  problems  in  the  general 
case.  For  example,  figure/ground  separation  can  be  ar¬ 
bitrarily  difficult  if  we  consider  pathological  situations 
such as camouflage or crypsis, which can require predators
to learn to recognize prey on a case by case basis
(see  Roitblat  [12],  p.  260).  Fortunately,  office  environ¬ 
ments,  Polly’s  habitat,  are  actively  structured  by  both 
designers  and  inhabitants  so  as  to  facilitate  their  legi¬ 
bility  (see  Passini  [11]  for  an  interdisciplinary  discussion 
of  the  navigational  properties  of  buildings).  The  special 
properties  of  office  environments  allow  much  simpler  ar¬ 
chitectures  to  be  used  than  would  be  necessary  for  an 
agent  which  was  to  follow  arbitrary  paths  in  arbitrary 
environments. 

One  such  property  is  that  office  environments  have  a 
flat  ground  plane,  the  floor,  upon  which  most  objects 
rest.  For  a  given  height  and  orientation  of  the  camera, 
the  distance  of  a  point  P  from  the  camera  will  be  a 
strictly  increasing  function  of  the  height  of  P’s  projection 
in  the  image  plane,  and  so  image  plane  height  can  be 
used  as  a  measure  of  distance  to  objects  resting  on  the 
floor^5. While this is not a linear measure, and the exact
correspondence between heights and distances cannot be
known without first knowing the specifics of the camera, it is
still  a  perfectly  useful  measure  for  determining  which  of 
two  objects  is  closer,  whether  an  object  is  closer  than  a 
certain  threshold,  or  even  as  an  (uncalibrated)  measure 
of  absolute  distance.  This  property  was  referred  to  as 
the  ground  plane  constraint  in  [4]. 
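
For a calibrated pinhole camera the relation between image height and floor distance can be written down directly. The sketch below is an illustration under assumed conventions (camera height and downward tilt known, focal length in pixels, image height measured upward from the image center); none of these names or values come from the robot itself.

```python
import math

def floor_distance(y, cam_height, tilt, focal_px):
    """Distance along the floor to a floor point imaged at height y.

    y          -- pixels above the image center (negative below it)
    cam_height -- camera height above the floor
    tilt       -- angle of the optical axis below horizontal, in radians
    focal_px   -- focal length in pixels
    """
    depression = tilt - math.atan2(y, focal_px)  # ray angle below horizontal
    if depression <= 0:
        raise ValueError("ray at or above the horizon never meets the floor")
    return cam_height / math.tan(depression)
```

Raising y lowers the depression angle, so the distance grows strictly with image height, which is all that comparisons and thresholds need.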

Another  important  property  of  office  environments  is 
that  they  are  generally  carpeted,  and  their  carpets  gen¬ 
erally  have  only  fine-scale  texture.  That  is  to  say,  that 
from  a  distance,  the  carpet  will  appear  to  have  a  uniform 
reflectance.  If  this  is  true,  and  if  the  carpet  is  uniformly 
illuminated,  then  the  areas  of  the  image  which  corre¬ 
spond  to  the  floor  should  have  uniform  image  bright¬ 
ness,  and  so  any  violation  of  this  uniformity  must  be  an 
object  other  than  the  floor.  This  property,  called  the 
background  texture  constraint  in  [4],  can  greatly  simplify 
the  computational  problem  of  figure-ground  separation. 
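
A minimal sketch of how the two constraints combine: scan each image column upward from the bottom until the brightness departs from the floor's, and use the height reached as an uncalibrated distance. The brightness tolerance, the assumption that the bottom row shows floor, and the function name are all choices made for this illustration; the system's actual design is the subject of section 5.

```python
def free_floor_heights(image, tolerance=10):
    """Per-column count of floor-like pixels, scanning from the bottom row up.

    image -- list of rows (top row first) of grey values; the bottom pixel of
             each column is assumed to show floor and is used as the reference.
    Returns one height per column; larger means the nearest obstacle is farther.
    """
    rows, cols = len(image), len(image[0])
    heights = []
    for c in range(cols):
        reference = image[rows - 1][c]
        h = 0
        while h < rows and abs(image[rows - 1 - h][c] - reference) <= tolerance:
            h += 1
        heights.append(h)
    return heights
```

Each column costs at most one comparison per pixel, so the whole figure-ground step is linear in image size.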

5  Design  of  the  visual  system 

The visual system estimates the axis of the corridor and
three distance measures from 64 x 48 pixel grey-scale
images covering a field of view of 110 degrees (1.9 radians).
All vision computations are performed for each frame.
The overall structure of the visual system is shown in
figure 5. The constraints used in the optimization of the
system are given in figure 4.

^5 This observation goes back at least to Euclid. See [2].

5.1  Computing  the  vanishing  point 

As was mentioned above, the axis of the corridor is represented
by the x coordinate of its image-plane projection.
This can be estimated by finding the vanishing point of
the parallel lines forming the edges of the corridor. Bellutta
et al. [1] report on such a system, which extracts
vanishing points by running an edge finder, extracting
straight line segments, and performing 2D clustering on
the pairwise intersections of the edge segments.* We can
represent this schematically as:


This  algorithm,  while  less  computationally  expensive 
than  3D  modeling,  is  still  rather  expensive.  We  can  sim¬ 
plify  the  system  if  we  make  stronger  assumptions  about 
the  environment.  We  can  remove  the  step  of  grouping 
edge  pixels  into  segments  by  treating  each  edge  pixel  as 
its  own  tiny  segment: 


Note  that  this  system  will  effectively  weight  segments  by 
their  length.  This  will  be  fine  provided  that  the  long  lines 
in  the  scene  are  the  lines  directed  toward  the  vanishing 
point. 

Edge detectors can also be extremely expensive. Since
the edges we’re looking for should be very strong and
straight, we should be able to use a very simple edge
detector, such as a gradient threshold. Computing the
intensity gradient (the rate of change in image brightness,
denoted by ∇) at a pixel and testing its magnitude
can be done using only a few machine instructions. The
resulting system is then:


If  the  tilt-angle  of  the  camera  is  held  constant  by  the 
camera  mount,  then  the  vanishing  point  will  always  have 

*The algorithm of Bellutta et al. is actually more complicated
than this in that it extracts multiple vanishing points,
but for our purposes we can treat it as extracting only the
forward vanishing point.
Figure 5: The portion of the visual system devoted to corridor following. Note that smoothing is performed prior
to edge detection to remove noise. The vanishing point unit is shown as a single box because all the steps of the
vanishing-point computation are performed simultaneously. The carpet boundary detector is discussed in section 7.


the same y coordinate, that is, it will lie on the line
y = y0 for some y0. We can then replace the pairwise
intersections with the intersections of the edges with
y = y0. This reduces the number of points to consider
from O(n^2) to O(n), where n is the number of edge pixels,
and also reduces the clustering to a 1D problem,
which is also more efficient:


Finally, if we assume that the intersections of the edges
not pointing toward the vanishing point are uniformly
distributed, then we can replace the clustering operation,
which looks for modes, with the mean of the x
coordinate:


This  does  have  the  disadvantage  that  if  there  are  many 
non-corridor  edges  in  view,  then  the  mean  will  tend  to 
move  toward  the  center  of  the  screen.  The  sign  will  still 
be  correct  however,  and  so  the  robot  will  still  steer  in 
the  correct  direction. 
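
The fully simplified estimator derived in this section can be sketched in a few lines. The following Python is only an illustration under stated assumptions, not the original DSP implementation: the function names, the central-difference gradient, and the list-of-rows image layout (row 0 at the top) are all choices made for the example.

```python
def gradient(img, x, y):
    """Central-difference intensity gradient at an interior pixel (assumed operator)."""
    gx = (img[y][x + 1] - img[y][x - 1]) / 2.0
    gy = (img[y + 1][x] - img[y - 1][x]) / 2.0
    return gx, gy


def vanishing_point_x(img, y0, threshold):
    """Mean x coordinate at which thresholded edge pixels cross the row y = y0.

    Each edge pixel is treated as its own tiny line segment, perpendicular
    to the local gradient. Returns None if no usable edge pixel is found.
    """
    total, count = 0.0, 0
    for y in range(1, len(img) - 1):
        for x in range(1, len(img[0]) - 1):
            gx, gy = gradient(img, x, y)
            if gx * gx + gy * gy < threshold * threshold:
                continue  # gradient too weak: not an edge pixel
            if gx == 0:
                continue  # horizontal edge: it never crosses y = y0
            # The edge direction is (-gy, gx); follow it from (x, y) to row y0.
            total += x - gy * (y0 - y) / gx
            count += 1
    return total / count if count else None
```

Because every edge pixel contributes exactly one intersection, the work is linear in the number of edge pixels, and a long line automatically receives more votes than a short one.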

5.2  Computing  distances 

There are two problems involved in estimating the left,
right, and center distances: figure/ground separation to
find the walls in the image, and depth estimation to determine
the distances to them. Suppose we had computed
a radial depth map, RDM, of the scene. A radial
depth map gives the distance of the nearest object in
each direction. Then we could find the distance to the
left wall by finding the minimum distance given in the
left side entries of the radial depth map:

=> RDM -> min_{left side} ->

If we had already somehow solved the figure-ground
problem, that is, if we had already labeled every pixel as
being either “floor” or “not floor”, then we could use the
ground-plane constraint to generate the RDM: the distance
to the closest object in a given direction is a monotonic
function of the height in the image plane of the
lowest non-floor pixel in that direction. Since columns
of the image correspond to directions, the distance is
then simply the height of the lowest non-floor pixel in a
given column. Thus:

RDM(x) = min{y | the point (x, y) isn't floor}

so now our system looks like this:


where “F/G” is the figure/ground system. If the floor
is textureless but the walls generate edges where they
meet with the floor, then, by the background-texture
constraint, it too can be replaced, in this case, by an
edge detector (e.g. thresholded gradients):


A number of things are worth noting here. First of all,
l' and r' are not necessarily the distances to the walls.
They are simply the distances to the nearest non-floor
objects on the left and right sides of the image. This is
not actually a problem, however, since if there are other
objects in the way it will simply cause the robot to steer
around them, thus conferring on the robot a limited object
avoidance capability. If there are no such objects,
then there is no difference anyhow. Thus having the system
make no distinctions between walls and other obstacles
is actually advantageous in this situation. The second
thing worth noting is that the distance measures are
nonlinear functions of the actual distances^. For some
applications this might be unacceptable, but for this application
we are mostly just concerned with whether a
given object is too close or whether the left side or the
right side is closer, for which these nonlinear measures
are quite adequate. Finally, since no camera has a 180
degree field of view, l' and r' are not even measures of
the perpendicular distances to the walls, l and r, but
rather are measures of the distance to the closest point in
view. Again, this is not a problem in practice, partly because
our camera has a relatively wide field of view, and
partly because for a given orientation of the robot, the
perpendicular distance is another monotonic and strictly
increasing function of the measured distance, and vice
versa.
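
The radial depth map and the side distances derived from it can be illustrated as follows. This is a minimal sketch that assumes the figure/ground step has already produced a boolean mask; the names `radial_depth_map` and `side_distances`, the row-0-at-top layout, and the simple split of the image into a left and a right half are assumptions made for the example.

```python
def radial_depth_map(not_floor):
    """RDM(x): height above the image bottom of the lowest non-floor pixel.

    not_floor is a list of rows (row 0 at the top), truthy where the pixel
    is not floor. A column that is entirely floor gets the image height,
    which acts as "nothing in view".
    """
    rows = len(not_floor)
    cols = len(not_floor[0])
    rdm = []
    for x in range(cols):
        height = rows  # sentinel for an all-floor column
        for y in range(rows):  # y counts upward from the bottom row
            if not_floor[rows - 1 - y][x]:
                height = y
                break
        rdm.append(height)
    return rdm


def side_distances(rdm):
    """Uncalibrated left and right distances: minima over each image half."""
    mid = len(rdm) // 2
    return min(rdm[:mid]), min(rdm[mid:])
```

The returned values are pixel heights, not metric distances; as noted above, they are only monotonic in the true distance, which is enough to decide which side is closer or whether something is nearer than a threshold.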

^The  actual  function  is  a  quotient  of  linear  equations 
whose  coefficients  are  determined  by  the  camera  parameters. 
Figure 6: Execution times for 1000 frames. “Full system”
is all code presently implemented, including the
person detector. “No I/O” is the corridor follower without
any frame grabbing or output to the base (a single
frame is grabbed at the beginning and processed repeatedly).
“No VP” is the collision avoidance system run
without I/O or the vanishing-point box. All execution
times are for a Texas Instruments TMS-320C30-based
DSP board (a Pentek 4283) running with no wait states.
The processor has a 60ns instruction time. The first
three lines are the same because the system cannot digitize
frames faster than 15 FPS.


6  Evaluation 

The  corridor  follower  has  been  running  for  nine  months 
and  is  quite  stable.  It  has  seen  at  least  200  hours  of 
service  with  many  continuous  runs  of  one  hour  or  more. 
This  makes  it  one  of  the  most  extensively  tested  and 
reliable  visual  navigation  systems  to  date.  We  have  been 
able  to  run  the  system  as  fast  as  our  robot  base  could 
run  without  shaking  itself  apart  (approximately  1  m/s). 
While  there  are  cases  which  will  fool  the  braking  system 
(see  below),  we  have  found  the  system  to  be  quite  reliable 
in  general. 

6.1  Efficiency 

The system is very efficient computationally. The
present implementation runs at 15 frames per second,
which is as fast as the system can read data from the
camera (see figure 6). All computation boxes in figure
5 are run on every frame in (simulated) parallel. This
implementation is heavily I/O bound, however, and so it
spends much of its time waiting for the serial port and
doing transfers over the VMEbus to the frame grabber
and display. We expect that performance would be noticeably
better on a system with a more tightly coupled
DSP and frame grabber. The efficiency of the system allows
it to be implemented with a relatively simple and inexpensive
computer such as the DSP. The modest power
requirements of the computer allow the entire computer
system to run off a single motorcycle battery for as
long as nine hours.

The simplicity and efficiency of the system make it
quite inexpensive compared to other real-time vision systems.
C30 DSP boards are now available for personal
computers for approximately $1-2K US and frame grabbers
can be obtained for as little as $400 US. Thus the
corridor follower would be quite cheap to install on an
existing system. We are also working on a very inexpensive
hardware platform for the system which we hope
will cost less than $200 US.


Figure 7: A forest environment.


6.2  Failure  modes 

The  system  runs  on  all  floors  of  the  AI  lab  building  on 
which  it  has  been  tested  (floors  3-9)  except  for  the  9th 
floor,  which  has  very  shiny  floors.  There  the  system 
brakes  for  the  reflections  of  the  overhead  lights  in  the 
floor.  The  present  system  also  has  no  memory  and  so 
cannot  brake  for  an  object  unless  it  is  actually  in  its 
field  of  view.  This  sometimes  causes  problems.  The  sys¬ 
tem  also  cannot  brake  for  an  object  unless  it  can  detect 
an  edge  on  or  around  it,  but  this  can  more  or  less  be  ex¬ 
pected  of  all  vision  systems.  The  system’s  major  failure 
mode  is  braking  for  shadows.  If  shadows  are  sufficiently 
strong  they  will  cause  the  robot  to  brake  when  there  is  in 
fact  no  obstacle.  This  is  less  of  a  problem  than  one  would 
expect  because  shadows  are  generally  quite  diffuse  and 
so  will  not  necessarily  trigger  the  edge  detector.  Finally, 
the  7th  floor  of  the  lab,  where  the  robot  spends  most 
of  its  time,  does  not  have  a  single  carpet,  but  several 
carpets,  each  with  a  different  color.  The  boundaries  be¬ 
tween  these  carpets  can  thus  be  mistaken  for  obstacles. 
This  problem  was  dealt  with  by  explicitly  recognizing 
the  boundary  (see  section  7). 


6.3  Performance  outside  corridors 

While  the  system  was  designed  to  navigate  corridors, 
it  is  also  capable  of  moving  through  more  complicated 
spaces.  Its  major  deficiency  in  this  situation  is  that 
there  is  no  way  of  specifying  a  desired  destination  to 
the  system.  Effectively,  the  system  acts  in  a  “point  and 
shoot”  mode:  it  moves  forward  as  far  as  possible,  veer¬ 
ing  away  from  obstacles,  and  continuing  until  it  reaches 
a  dead  end  or  is  stopped  externally.  The  system  is  also 
non-deterministic  in  these  situations.  When  the  robot 
is  blocked  by  an  object,  it  will  turn  either  left  or  right 
depending  on  the  exact  position  and  orientation  of  the 
robot and object. Since these are never exactly repeatable,
the robot is effectively non-deterministic. Thus
in a “forest environment” such as figure 7, the robot
could emerge at any point or even get turned around
completely. The system’s performance is good enough,
however, that a higher-level system can lead it through
a series of corridors and junctions by forcing the robot
to make a small open-loop turn when the higher-level
system wants to take a new corridor at a junction. The
corridor follower then realigns with the new corridor and
continues on its way.
7 Extensions

A  number  of  minor  modifications  to  the  algorithm  de¬ 
scribed  above  are  worthwhile. 

Vertical  biasing 

As  discussed  above,  shadows  and  bright  lights  radiat¬ 
ing  from  office  doors  can  sometimes  be  sufficiently  in¬ 
tense  to  trigger  the  edge  detector.  Since  these  shadows 
always  radiate  perpendicular  to  the  wall,  they  appear 
horizontal  in  the  image  when  the  robot  is  successfully 
aligned  with  the  corridor.  By  biasing  the  edge  detector 
toward  vertical  lines,  we  can  make  it  less  sensitive  to 
these  shadows.  A  previous  version  of  the  system  dealt 
with  the  problem  by  weighting  vertical  lines  twice  as 
much  as  horizontal  lines.  The  system  now  explicitly 
searches  for  carpet  boundaries  and  briefly  disables  the 
detection  of  horizontal  lines  when  a  carpet  boundary  is 
found.  The  criterion  for  a  carpet  boundary  is  that  it 
must  be  a  weak  horizontal  line  with  no  surrounding  tex¬ 
ture. 

Fear  of  the  dark 

The  system,  as  described  above,  will  happily  drive 
through  a  dark  room  and  hit  the  nearest  obstacle.  Simi¬ 
larly,  if  the  robot  somehow  misses  the  boundary  between 
the  floor  and  a  blank  wall,  and  drives  close  enough  to  the 
wall  for  it  to  fill  the  robot’s  visual  field,  then  it  will  treat 
the  blank  wall  as  being  an  empty  floor  and  attempt  to 
drive up the wall. Polly avoids these problems by treating
dark pixels as obstacles to be avoided, and by refusing
to drive forward if there is insufficient texture in the
image.  A  better  solution  would  be  to  add  other  sensory 
modalities  to  the  system,  such  as  touch  or  sonar,  but 
those  sensors  were  not  available  on  the  robot. 

Wall  following 

When  the  system  described  above  reaches  a  large  open 
space,  the  single  wall  which  is  in  view  will  act  as  a  repul¬ 
sive  force,  causing  the  robot  to  turn  away  from  it  until 
there  is  nothing  in  view  whatsoever.  Thus  it  naturally 
tends  toward  situations  in  which  it  is  effectively  blind. 
While  this  is  sufficient  for  following  corridors,  and  is  in 
fact  a  very  good  way  of  avoiding  obstacles,  it  is  a  very 
bad  way  of  actually  getting  to  a  destination  unless  the 
world  consists  only  of  nice  straight  corridors.  This  prob¬ 
lem  can  be  fixed  by  modifying  the  steering  control  so 
that  when  only  a  single  wall  is  in  view,  the  robot  will  try 
to  keep  that  wall  at  a  constant  distance.  Thus,  in  the 
case  where  only  the  left  wall  is  in  view,  the  control  law 
becomes; 

Where  d  is  the  desired  distance  to  the  wall. 


8  Conclusions 

Curiously,  the  most  significant  things  about  the  system 
are  the  things  which  it  does  not  do.  It  does  not  build 
or  use  detailed  models  of  its  environment;  it  does  not 
use  carefully  calibrated  depth  data;  it  does  not  use  high 
resolution  imagery;  and  it  is  not  designed  to  run  in  ar¬ 
bitrary  environments.  Indeed,  much  of  its  power  comes 
from  its  specialization. 

One  may  be  tempted  to  object  that  this  system  is  too 
domain  specific  and  that  more  complicated  techniques 
are  necessary  to  build  practical  systems  for  use  in  the 
real  world.  I  think  that  this  is  misguided  however.  To 
begin  with,  even  if  one  had  a  single  truly  general  naviga¬ 
tion  algorithm,  its  very  generality  would  likely  make  it 
much  slower  than  the  system  discussed  here.  The  gen¬ 
eral  system  may  also  require  allocating  scarce  cognitive 
or  attentive  resources  which  would  be  better  used  for 
other  concurrent  tasks.  One  approach  would  be  to  build 
a  hybrid  which  used  the  simple  system  when  possible, 
and  the  more  cumbersome  system  only  when  necessary. 

Another  possibility  would  be  to  build  a  system  which 
could  rapidly  switch  between  a  number  of  domain- 
specific  strategies.  Ullman’s  Visual  Routine  Processor 
[14]  is  a  particularly  attractive  architecture  for  this  ap¬ 
proach.  A  VRP  could  be  quickly  configured  by  the  cen¬ 
tral  system  to  use  different  strategies  for  different  sit¬ 
uations.  Ideally,  such  a  system  would  be  able  to  recog¬ 
nize  and  learn  to  use  domain-specific  strategies  for  visual 
tasks,  thus  making  it  truly  adaptive  (see  Whitehead  and 
Ballard  for  an  interesting  example  of  learning  visual  rou¬ 
tines  [15]). 

It  remains  to  be  seen  how  far  we  can  go  with  simple, 
domain-specific  strategies  which  rely  on  special  prop¬ 
erties  of  the  environment.  I  suspect  that  quite  a  lot 
can  be  done  with  them.  In  either  case,  the  system  de¬ 
scribed  here  is  a  demonstration  that  it  is  practical  to 
build  simple,  inexpensive  vision  systems  which  perform 
useful  tasks,  and  that  the  solutions  to  vision  problems  do 
not  necessarily  involve  buying  better  cameras  and  bigger 
computers. 
ABSTRACT

Least-squares and robust methods for determining outliers
have been effective in determining the location and orientation of
a mobile robot from visual measurements of modeled 3D
landmarks. However, building the initial 3D landmark
models is a time-consuming and tedious process. For
landmark-based navigation methods to be widely applicable,
automatic methods have to be developed to build
new 3D models and enhance the existing models. Ideally,
a robot would continuously build and update its
world model as it explores the environment. This paper
presents techniques to determine the 3D location of
image features from a sequence of monocular 2D images
captured by a camera mounted on the robot.

The approach adopted here is to first build a partial
model (possibly noisy) by tracking and reconstructing
shallow structures over a sequence of images using the
constraint of affine trackability. This model is subsequently
used to compute the pose that relates the model
coordinate system and the camera coordinate system of
the image frames in the sequence. Unmodeled 3D features
are then tracked over the image sequence and their
3D locations recovered by a pseudo-triangulation process,
a form of “induced stereo”. The triangulation process
is also used to make new 3D measurements of the
initial model points. These measurements are then fused
with the previous estimates to refine the set of initial
model points.¹

1  Introduction 

Techniques are presented for initial model acquisition,
and then model extension using a partial model to relate
the camera and world/model coordinate systems.
The partial model is derived from the reconstruction
of shallow² environmental structure [Sawhney and Hanson,
1992a]. Model extension results are also presented
for one sequence where the partial model was manually
built.

¹This work was supported in part by DARPA (via
TACOM) under contract number DAA£07-91-C-1U)3S, and
by NSF under grant number CDA-9822572.

²Shallow structures have small extent in depth compared
to their distance from the camera.

1.1  Related  Work 

Previous research on multi-frame 3D reconstruction can
be categorized into two broad classes. The first class assumes
that a model of 3D inter-frame motion is known,
rather than independent motion parameters
between consecutive frames. Broida [Broida and Chellappa,
1991] assumes constant velocity motion and estimates
the 3D location of a set of points tracked over a
monocular image sequence. [Chandrasekhar, 1991] extended
Broida's technique to deal with data sets where
the 3D location of a few points is known. The objective
function, which Broida and Chandrasekhar et al.
minimize, has the motion model parameters and the unknown
structure location parameters as unknowns. Thus
the dimension of the objective function grows with the
number of unknown points. An even more basic limitation
of this approach lies in the model of motion being
adopted and its suitability to the motion being observed.

The second class of techniques does not assume any
model of motion. The rigid structure of the world is
carried forward by the depth estimates from frame to
frame. These techniques are sequential in nature and
typically use Kalman filtering to compute the depth
estimates [Cui, et al, 1990], [Oliensis, 1991], [Sawhney,
1991]. Oliensis and Thomas [Oliensis, 1991] solve for the
motion parameters between consecutive image frames in
a monocular image sequence. With each image pair,
new measurements are made for depth values of features
and these are integrated with previous estimates in the
Kalman filter framework. The new observation Oliensis
and Thomas [Oliensis, 1991] make is that the depth
estimates of different feature points are correlated since
the same noisy motion parameters are used to compute
the depth. Because of this correlation, they estimate
the depth parameters of all points simultaneously. This
gives them fairly good depth estimates for camera motions
having some T_z (i.e. translation along the optical
axis) component. The cost, however, is that for estimating
the depths of m points, a covariance matrix of size
3m x 3m must be inverted with each new frame.

1.2  Overview 

All  of  these  approaches  rely  on  the  basic  principle  of  tri¬ 
angulation  to  reconstruct  new  3D  points.  However,  re¬ 
construction  by  triangulation  is  highly  sensitive  to  errors 
in  estimating  the  relative  orientation  between  consecu¬ 
tive  camera  frames.  In  this  paper,  the  reconstruction  of 
3D  structure  is  accomplished  in  two  steps  to  overcome 
this  limitation. 

The  first  step  is  a  partial  reconstruction  of  a  scene  in 
terms  of  shallow  3D  environmental  structure;  structures 
whose  extent  in  depth  is  small  compared  to  their  dis¬ 
tance  from  the  camera.  The  3D  motion  and  structure  of 
a  shallow  object  in  motion,  relative  to  the  camera,  can 
be well approximated by an affine transformation. In
[Sawhney,  1991],  a  framework  was  presented  for  tracking 
shallow  objects  over  time  under  the  affine  constraint,  and 
in  [Sawhney,  1992b]  an  algorithm  for  identification  and 
3D  reconstruction  of  these  structures  is  presented.  An 
important  advantage  of  this  approach  is  that  3D  struc¬ 
ture  is  derived  reliably  without  the  intermediate  step  of 
explicit  computation  of  the  3D  motion  parameters. 

Shallow structure reconstruction provides only a partial
3D model for the scene. However, this partial model is
adequate for the second part of the technique presented
in this work, namely model extension and refinement.
The partial model is used to compute the pose that relates
the model coordinate system and the camera coordinate
system of the image frames in the sequence. The
unmodeled 3D features (those not recovered by the shallow
structure reconstruction) are tracked over the image
sequence using an optic-flow-based line tracking algorithm
[Williams, 1989]. Using correspondences of image
features, and the poses computed from model-to-image
feature correspondences for a sequence, new 3D points
are located by triangulation (see Figure 1). The estimation
of the new 3D points is done using both batch
and quasi-batch or sequential methods. The triangulation
process is also used to make new 3D measurements
of the initial model points, which are then fused with
the previous estimates to refine the set of initial model
points. Results are presented for real sequences where
new 3D points are located with average errors of 1.76 %.

2  Shallow  Structure  Reconstruction 

This section presents a brief summary of identification,
tracking and 3D reconstruction of shallow structures.
The details can be found in [Sawhney, 1992a].

Given a 3D structure that can be well approximated by a
fronto-parallel plane (shallow structure), its image projections
at two closely spaced time instants are related
through:

$p' = s R_z p + t$    (1)

where p and p' are the corresponding image points of
a shallow structure at times n and n + 1 respectively,
s is the scale defined as the ratio of average depths at
the two time instants, R_z is the 2x2 rotation matrix
for the rotation around the optical axis (z-axis), and t is
the translation in the image plane. The translation t is
expressed in terms of the vectors representing the x and y
components of the 3D rotational and translational vectors,
the average depth Z_0 at the second time instant, and
the focal length f of the camera.

A set of noisy line correspondences is used to compute
the best affine motion parameters in the image plane.
An error measure that is a weighted sum of the parallel
and perpendicular components of the vectors joining the
corresponding endpoints of the line in frame n + 1 and
the affine transformed line in frame n is formulated:

$E_i = \sum_{j=1}^{2} \left\{ w_{1i} [(D_j r_z + t - p'_j) \cdot n_i]^2 + w_{2i} [(D_j r_z + t - p'_j) \cdot l_i]^2 \right\}$    (2)

where i is the ith corresponding pair, j refers to endpoint
1 or 2, w_{1i} and w_{2i} are the weights for the perpendicular
and parallel error components, $D = \begin{bmatrix} x & -y \\ y & x \end{bmatrix}$ is the
data matrix which is constructed using the endpoint p =
[x y]^T in frame n, vector r_z = [s cos ω_z, s sin ω_z]^T is the
product of scale s and rotation, ω_z, around the optical
axis, and n_i and l_i are the unit normal and direction,
respectively, of the line in frame n + 1.

For a set of line correspondences, the unknown parameters
r_z and t can be found by minimizing $\sum_i E_i$, which
leads to a linear system:

$M_{tot} v_{aff} = v_{tot}$    (3)

where M_tot and v_tot are the data matrix and vector, respectively,
and v_aff is the vector of the unknown four
affine parameters (for full details, see [Sawhney, 1992a]).
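
To make the four-parameter fit concrete, the sketch below solves a simplified version of the problem. It is not the weighted line-endpoint formulation of [Sawhney, 1992a]: it assumes equally weighted point correspondences, for which the least-squares solution has a closed form. The function name and the (a, b) = (s cos ω_z, s sin ω_z) return convention are assumptions made for the example.

```python
def fit_scaled_rotation(src, dst):
    """Least-squares fit of q = s*R(w)*p + t to 2D point pairs (p, q).

    Returns ((a, b), (tx, ty)) with a = s*cos(w) and b = s*sin(w), so that
    q is approximately (a*px - b*py + tx, b*px + a*py + ty).
    """
    n = len(src)
    if n < 2 or n != len(dst):
        raise ValueError("need at least two matched point pairs")
    # Centroids of both point sets.
    cx = sum(p[0] for p in src) / n
    cy = sum(p[1] for p in src) / n
    dx = sum(q[0] for q in dst) / n
    dy = sum(q[1] for q in dst) / n
    num_a = num_b = den = 0.0
    for (px, py), (qx, qy) in zip(src, dst):
        ux, uy = px - cx, py - cy  # centred source point
        vx, vy = qx - dx, qy - dy  # centred destination point
        num_a += ux * vx + uy * vy
        num_b += ux * vy - uy * vx
        den += ux * ux + uy * uy
    if den == 0:
        raise ValueError("source points are all identical")
    a, b = num_a / den, num_b / den
    # The translation maps the rotated, scaled source centroid onto the
    # destination centroid.
    tx = dx - (a * cx - b * cy)
    ty = dy - (b * cx + a * cy)
    return (a, b), (tx, ty)
```

The scale follows as s = sqrt(a^2 + b^2). In the weighted line-based formulation the same four unknowns appear, but the normal equations must be assembled and solved numerically.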

Given the model of uncertainty of the constituent lines
in a structure, the covariances of the output affine parameters
can be expressed as follows [Strang, 1986]:

$\Lambda_{r_z,t} = M_{tot}^{-1}$    (4)

where Λ_{r_z,t} is the 4x4 covariance matrix of the affine
parameters r_z and t.

2.1  Tracking  Shallow  Structures 

The  affine  motion  constraint  is  used  in  a  dynamic  model 
to  predict  and  track  shallow  structures  over  time.  Track¬ 
ing  requires: 

1.  A  dynamic  model  of  the  motion  of  a  structure. 
2. A match measure to choose good matches for a
structure in every newly acquired frame. The constraints
on the search for the potential matches are
provided by the dynamic model.

3.  A  mechanism  for  fusing  the  current  estimate  of  the 
affine  motion  and  the  3D  location  parameters  of  a 
structure  with  those  obtained  from  the  newly  ac¬ 
quired  data. 

The affine motion parameters derived in Equation 3 provide
a dynamic model of prediction of the motion of
a shallow structure in the image plane. Kalman filtering
is used for prediction and recursive estimation, and
the Mahalanobis distance [Mahalanobis, 1936] is used
for matching the predictions with potential matches in
a newly acquired frame [Sawhney, 1991].
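
The matching step can be illustrated with a small gating sketch. It is a simplification: measurements are taken to be 2D points rather than line tokens, the predicted covariance S is assumed to come from the Kalman filter, and the default gate of 9.21 (roughly the 99% point of a chi-square distribution with two degrees of freedom) is an assumption rather than a value from the paper.

```python
def mahalanobis_sq(z, z_pred, S):
    """Squared Mahalanobis distance of a 2D measurement from its prediction."""
    dx = z[0] - z_pred[0]
    dy = z[1] - z_pred[1]
    a, b = S[0][0], S[0][1]
    c, d = S[1][0], S[1][1]
    det = a * d - b * c
    if det <= 0:
        raise ValueError("covariance must be positive definite")
    # Uses the closed-form inverse of a 2x2 matrix: [[d, -b], [-c, a]] / det.
    return (d * dx * dx - (b + c) * dx * dy + a * dy * dy) / det


def best_match(z_pred, S, candidates, gate=9.21):
    """Return the candidate closest to the prediction, or None if all fall outside the gate."""
    best, best_d2 = None, gate
    for z in candidates:
        d2 = mahalanobis_sq(z, z_pred, S)
        if d2 < best_d2:
            best, best_d2 = z, d2
    return best
```

Scaling the residual by the predicted covariance means a candidate is judged against the uncertainty of the prediction, so a large offset along a poorly predicted direction can still be accepted while the same offset along a well predicted direction is rejected.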

2.2  Shallow  Structure  Identification 
and  Reconstruction 

In  [Sawhney,  1992b]  affine  tracking  is  embedded  in  an 
algorithm  to  automatically  identify  shallow  structures. 
The  essential  idea  is  that  if  a  hypothesised  structure 
can  be  consistently  tracked  and  its  3D  depth  over  time 
is  consistent  with  a  shallow  structure  model,  then  the 
structure  is  identified  as  shallow;  otherwise  it  is  labeled 
non-shallow.  The  depth  of  structures  identified  as  shal¬ 
low  is  computed  from  the  scale  parameter  in  the  affine 
transformation  of  Equation  3.  It  is  represented  in  the 
coordinate  system  of  the  first  frame  in  the  sequence. 

3  Pose  Determination 

Using the depths of the shallow structures recovered by
the affine-based algorithm, a partial model of the environment
can be built. This model shares the coordinate
system of the first frame. Given correspondences between
model and image tokens in subsequent image frames, the
pose parameters (rotation and translation) that relate the subsequent
frames’ coordinate systems to the model coordinate system
can be computed. In an earlier paper [Kumar, 1989]
least-squares techniques for pose determination were developed.
These techniques are optimal with respect to
Gaussian noise in the input image measurements. In this
section, the least-squares techniques are extended to also
handle Gaussian noise in the 3D model. The techniques
presented in this section assume point correspondences
but are easily modified for line correspondences.

The rigid body transformation from the world coordinate
system to the camera coordinate system can be represented
as a rotation (R) followed by a translation (T).
A point p in world coordinates gets mapped to the point
p_c in camera coordinates as:

$p_c = R(p) + T$    (5)

Using equation (5) and assuming perspective projection,
the pose constraint equations for the ith point p_i in a set
of "m" points can be written in the following manner:

$\frac{1}{p_{czi}} C_{xi} \cdot (R(p_i) + T) = 0$    (6)

$\frac{1}{p_{czi}} C_{yi} \cdot (R(p_i) + T) = 0$    (7)

$C_{xi} = (s_x, 0, -X_i)$    (8)

$C_{yi} = (0, s_y, -Y_i)$    (9)

$p_{czi} = (R(p_i) + T)_z$    (10)

where (X_i, Y_i) is the image projection of the point and
(s_x, s_y) is the focal length in pixels along each axis.

The non-linear system of constraint equations for the
pose parameters R and T is solved using the Gauss-Newton
technique [Strang, 1986]. Given a current estimate
R, T, the constraint equations (6, 7) are linearized
about the estimate:

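$$\frac{1}{p_{czi}} \left( C_{xi} \cdot \Delta T + b_{xi} \cdot \delta\omega \right) = -\frac{1}{p_{czi}} C_{xi} \cdot p_{ci} + \eta_x \tag{11}$$

$$\frac{1}{p_{czi}} \left( C_{yi} \cdot \Delta T + b_{yi} \cdot \delta\omega \right) = -\frac{1}{p_{czi}} C_{yi} \cdot p_{ci} + \eta_y \tag{12}$$

where $b_{xi} = R(p_i) \times C_{xi}$ and $b_{yi} = R(p_i) \times C_{yi}$. The above
equations relate the pose increments $\delta\omega$ (rotation) and
$\Delta T$ (translation) to the computed measurement errors
using the current pose estimate. The noise terms in the
two equations, $\eta_x$ and $\eta_y$, are functions of both the 3D
model noise $\Delta p_i$ and the image noise $\Delta X$, $\Delta Y$:

$$\eta_x = \Delta X + \frac{1}{p_{czi}} C_{xi} \cdot (R(\Delta p_i)) \tag{13}$$

$$\eta_y = \Delta Y + \frac{1}{p_{czi}} C_{yi} \cdot (R(\Delta p_i)) \tag{14}$$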

Therefore for the ith point, two such equations (11 and
12) can be written and for a set of m points, a total
of 2m equations are obtained. At each iteration in
the minimization process, the linear system of equations
is solved using equation (23) to find the best increment
vector¹. This increment is added to the current pose
estimate and the process repeated until there is
convergence.
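The iteration described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: it assumes an identity noise covariance (so ordinary least squares stands in for the weighted solve of equation (23)), it assumes the constraint vectors have the form (s_x, 0, -X_i) and (0, s_y, -Y_i), and the names `refine_pose`, `project`, `skew` and `rotation_from_vector` are hypothetical.

```python
import numpy as np

def skew(w):
    """Cross-product matrix, so that skew(w) @ v equals np.cross(w, v)."""
    return np.array([[0.0, -w[2], w[1]],
                     [w[2], 0.0, -w[0]],
                     [-w[1], w[0], 0.0]])

def rotation_from_vector(w):
    """Rodrigues formula: rotate by |w| radians about the axis w / |w|."""
    theta = np.linalg.norm(w)
    if theta < 1e-12:
        return np.eye(3)
    K = skew(w / theta)
    return np.eye(3) + np.sin(theta) * K + (1.0 - np.cos(theta)) * (K @ K)

def project(points, R, T, focal):
    """Perspective projection of world points; used here for synthetic data."""
    pc = points @ R.T + T
    return np.column_stack([focal[0] * pc[:, 0] / pc[:, 2],
                            focal[1] * pc[:, 1] / pc[:, 2]])

def refine_pose(points, image_xy, focal, R, T, iterations=20):
    """Repeatedly solve the linearized constraints (11) and (12).

    Assumption: all 2m equations are weighted equally (V = identity).
    """
    sx, sy = focal
    for _ in range(iterations):
        rows, rhs = [], []
        for p, (X, Y) in zip(points, image_xy):
            Rp = R @ p
            pc = Rp + T                          # equation (5)
            for C in (np.array([sx, 0.0, -X]), np.array([0.0, sy, -Y])):
                b = np.cross(Rp, C)              # b = R(p) x C
                rows.append(np.concatenate([C, b]) / pc[2])
                rhs.append(-(C @ pc) / pc[2])
        delta, *_ = np.linalg.lstsq(np.array(rows), np.array(rhs), rcond=None)
        T = T + delta[:3]                        # translation increment
        R = rotation_from_vector(delta[3:]) @ R  # rotation increment
    return R, T
```

With noise-free synthetic projections and a starting pose near the true one, the loop is expected to recover the pose; a full implementation would replace the `lstsq` call with the covariance-weighted solve from the Appendix.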

If the correct estimate of pose were known, the measurement
noise terms $\eta_x$ and $\eta_y$ would be equal to the sum of
the measurement error of the image point location and
the projection of the error in the model point along the
image x-axis and y-axis respectively. The measurements
of the image point locations are assumed to be corrupted
with identical, independent, zero-mean Gaussian noise.
The 3D model points are also assumed to be corrupted
by zero-mean independent Gaussian noise. Thus the
covariance matrix V corresponding to the noise in the
linear system of equations (22) in the Appendix is a band
matrix in which the non-zero entries are 2x2 matrices
about the diagonal. The output covariance matrix for
the pose rotation and translation parameters is given by
equation (24) evaluated at the final pose estimate.

¹The appendix reviews some salient information on solving
over-constrained linear equations.

4  Induced  Stereo 

In  this  section,  we  present  techniques  for  computing  3D 
estimates  of  new  points  in  the  world  coordinate  sys¬ 
tem  from  their  tracked  image  locations  over  a  multi¬ 
frame  sequence.  The  mathematics  for  both  extending 
the  model  and  refining  the  initial  modeled  points  is  pre¬ 
sented.  Computed  with  the  estimate  of  each  new  model 
point  is  an  estimate  of  the  covariance  of  its  error.  These 
covariances  are  functions  of  the  input  image  measure¬ 
ment  covariances  and  the  initial  3D  model  point  covari¬ 
ances. 


Figure 1: Model Extension and Refinement.

The matching of image features to the partial model is
obtained by the tracking method described in Section
2. Given these correspondences, pose estimation is
performed for each frame using the method presented in the
previous section. Image tokens corresponding to new
features are also tracked over a sequence of frames using the
computed optic flow between pairs of successive frames
[Williams, 1989]. Typically corner points (defined by the
intersection of two image lines) are tracked although any
image feature which can be reliably tracked may be used.
The 3D estimate of the corner point is obtained by the
pseudo-intersection of all the image projection rays for a
tracked image point. A nice property of the system
described here is that the pose estimation process provides
the world coordinate frame as a stable coordinate frame
in which 3D measurements from a sequence of frames
can be combined. Independent measurements can be
made to relate the coordinate system of each frame in
the sequence to the world coordinate frame.

Points are located by the pseudo-intersection process in
two steps. In the first step, a 3D error function is
minimized to find an initial estimate of the point's location.
This step, however, does not yield the optimal estimate
since the various error terms are not weighted by the
input covariances. In the second step, an image-based
error function is optimized in which the error terms are
inversely weighted by a combination of the input
covariances of the pose estimate and the image measurements.

Let $r_i$ be the unit vector corresponding to the image
projection ray for an image point in the ith frame. The pose
estimation for this frame is given by the rotation $R_i$ and
translation $T_i$. Since the image projection rays do not
intersect at a unique point², the 3D pseudo-intersection
point p is obtained by minimizing an error function E:

$$E = \sum_{i=1}^{n} \left\| (R_i(p) + T_i) \times r_i \right\|^2 \tag{15}$$

which is the sum of squares of the perpendicular
distances from the pseudo-intersection point p to the image
projection rays. Differentiating E with respect to the
unknown variable p leads to a set of linear equations,
which are then solved to give the initial estimate for p.
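The first step can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: each ray is taken to pass through the camera center of its frame, the squared perpendicular distance is written with the projector I - r r^T, and the function name `pseudo_intersection` is hypothetical. Setting the gradient of E to zero gives the 3x3 linear system solved below.

```python
import numpy as np

def pseudo_intersection(rays, rotations, translations):
    """Least-squares point closest to a set of image projection rays.

    rays         : direction of each ray in its own camera frame
    rotations    : R_i mapping world coordinates into frame i
    translations : T_i for frame i
    Minimizes sum_i || (I - r_i r_i^T) (R_i p + T_i) ||^2 over p.
    """
    A = np.zeros((3, 3))
    b = np.zeros(3)
    for r, R, T in zip(rays, rotations, translations):
        r = np.asarray(r, dtype=float)
        r = r / np.linalg.norm(r)
        M = np.eye(3) - np.outer(r, r)   # removes the component along the ray
        A += R.T @ M @ R
        b -= R.T @ M @ T
    return np.linalg.solve(A, b)
```

At least two non-parallel rays are needed for the system to be invertible. None of the terms is weighted by its covariance, which is why the text treats the result only as an initial estimate.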

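In the second step, the pose constraint equations 6 and
7 are used to formulate image-based error equations for
the X and Y projections of the model points.

$$\frac{1}{p_{czi}} C_{xi} \cdot R_i(p) = -\frac{1}{p_{czi}} C_{xi} \cdot T_i + \zeta_x \tag{16}$$

$$\frac{1}{p_{czi}} C_{yi} \cdot R_i(p) = -\frac{1}{p_{czi}} C_{yi} \cdot T_i + \zeta_y \tag{17}$$

where $\zeta_x$ and $\zeta_y$ are noise terms that are functions
of both noise in pose ($\Delta T_i$ and $\delta\omega_i$) and image noise
($\Delta X$, $\Delta Y$):

$$\zeta_x = \Delta X + \frac{1}{p_{czi}} C_{xi} \cdot \Delta T_i + \frac{1}{p_{czi}} b_{xi} \cdot \delta\omega_i \tag{18}$$

$$\zeta_y = \Delta Y + \frac{1}{p_{czi}} C_{yi} \cdot \Delta T_i + \frac{1}{p_{czi}} b_{yi} \cdot \delta\omega_i \tag{19}$$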

In this case the 3D model point p is the unknown
variable. The denominator $p_{czi}$ in the equations (16 and 17)
corresponds to the depth of the point and is a function
of the unknown variable p. Therefore, for each frame
over which the point is tracked, two non-linear constraint
equations (16 and 17) are obtained³. An iterative
procedure is employed to solve the system of non-linear
equations. At each iteration, the denominator $p_{czi}$ is held
constant using the previous estimate of p and the
resulting linear system of equations is solved. The
iterative procedure is repeated until there is convergence.

²Due to noise both in image measurements and pose
estimates.

³A minimum of two frames is needed to solve the system
of equations.




In practice, we have found one iteration is sufficient for
robust results. The input covariance matrix V required
for the normal equations is obtained from the expressions
derived above for the noise terms $\zeta_x$, $\zeta_y$. The output
covariance of the 3D point estimate can also be computed.
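A sketch of this second step is given below, under simplifying assumptions that are not in the paper: all equations are weighted equally (V = identity) instead of using the covariances of the noise terms, the constraint vectors are taken as (s_x, 0, -X) and (0, s_y, -Y), and the name `refine_point` is hypothetical. The depth in each denominator is frozen at the previous estimate of p, which makes every pass a linear least-squares problem.

```python
import numpy as np

def refine_point(p0, image_xy, rotations, translations, focal, iterations=1):
    """Image-based refinement of a 3D point from its tracked projections.

    p0           : initial estimate (for example from the 3D pseudo-intersection)
    image_xy     : (X, Y) image location of the point in each frame
    rotations    : R_i for each frame; translations : T_i for each frame
    focal        : (s_x, s_y) in pixels
    """
    sx, sy = focal
    p = np.asarray(p0, dtype=float)
    for _ in range(iterations):
        rows, rhs = [], []
        for (X, Y), R, T in zip(image_xy, rotations, translations):
            depth = (R @ p + T)[2]            # held constant within this pass
            for C in (np.array([sx, 0.0, -X]), np.array([0.0, sy, -Y])):
                rows.append((C @ R) / depth)  # left side of (16) / (17)
                rhs.append(-(C @ T) / depth)  # right side without the noise term
        p, *_ = np.linalg.lstsq(np.array(rows), np.array(rhs), rcond=None)
    return p
```

Dividing each row by the depth expresses the residual in pixels, so the unweighted solve already treats image errors in different frames on a comparable scale.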

In the batch method, information from all frames is used
simultaneously to estimate the 3D locations of tracked
image points. However, it may be desired to sequentially
update the location of new points after every pair (or a
larger set) of frames. In the sequential or quasi-batch
mode, equations (6 and 7) are again used to estimate
the 3D location of image points tracked over the current
set of frames. These new estimates must be fused with
the previous estimates to obtain the current optimal
estimate. The covariance matrices associated with each
estimate are used to fuse the two estimates and provide
a new uncertainty matrix using the standard Kalman
filtering equations.

Let the estimate of the point's 3D location and its
covariance at frame $t_1$ be $\hat{p}(t_1)$ and $\Lambda_p(t_1)$ respectively. A
new 3D location measurement q with uncertainty
(covariance matrix $\Lambda_q$) is computed from a batch of n
image frames. The fused location estimate $\hat{p}(t_n)$ and
updated covariance matrix $\Lambda_p(t_n)$ at frame $t_n$ are given
by:

$$\hat{p}(t_n) = \Lambda_p(t_n) \left( \Lambda_p(t_1)^{-1} \hat{p}(t_1) + \Lambda_q^{-1} q \right) \tag{20}$$

$$\Lambda_p(t_n) = \left( \Lambda_p(t_1)^{-1} + \Lambda_q^{-1} \right)^{-1} \tag{21}$$

This same method is used for model refinement. Initial
model points have associated with them their input
covariance matrices. When the model is tracked over a new
batch of frames, 3D measurements can also be made for
the model points by the above pseudo-intersection
procedure. These new measurements are fused with the old
estimate using the above equations.
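The fusion step amounts to adding inverse covariances and forming a covariance-weighted combination of the two estimates. A minimal sketch follows; the function and argument names are illustrative and not taken from the paper.

```python
import numpy as np

def fuse_estimates(p_old, cov_old, q, cov_q):
    """Fuse a previous 3D estimate with a new measurement.

    p_old, cov_old : previous location estimate and its 3x3 covariance
    q, cov_q       : new location measurement and its 3x3 covariance
    Returns the fused estimate and its covariance.
    """
    info_old = np.linalg.inv(cov_old)
    info_q = np.linalg.inv(cov_q)
    cov_new = np.linalg.inv(info_old + info_q)          # updated covariance
    p_new = cov_new @ (info_old @ p_old + info_q @ q)   # fused location
    return p_new, cov_new
```

When both inputs have the same covariance, the fused location is their midpoint and the covariance is halved. A less certain measurement pulls the estimate proportionally less.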

5  Experimental  Results  and  Discussion 

We now present results on two multi-frame sequences. In
both cases, similar results are presented using an initial
model built from points on the recovered shallow
structures. The image sequences were captured with a SONY
B/W AVC-D1 camera, with an approximate FOV of 24
degrees and digitized to 256-by-242 pixels. In all
experiments the image center was assumed to be at the center
of the image frame and the effective focal length was
calculated from the manufacturer's specification sheets.
Since we have shown in [Kumar, 1990] that errors in the
image center do not significantly affect the location of
new points in a world coordinate system (for a small
field of view imaging system), calibration for the image
center has not been done.

The A211 (10 frames) and COMP (6 frames) sequences
were generated by taking images from a camera mounted
on a mobile robot moving roughly parallel to the optical
axis. Figure 2 shows the first frames in the A211
and COMP sequences. Between consecutive frames, the
robot was translated approximately 0.38 and 1.4 feet
respectively for the A211 and COMP image sequences.
The depth of some salient structures in each sequence
was measured with a tape measure.


Figure 2(a) A211 and 2(b) COMP Images.

In both sequences, image lines were extracted for each
frame using Boldt's [Boldt, et al, 1989] line grouping
system. The tracking algorithm was applied to the image
sequences to identify the shallow structures in the scene.
Line triples were automatically selected to hypothesize
aggregate structures. Each of these was tested for affine
trackability, resulting in its labeling as a shallow or a
non-shallow structure [Sawhney, 1992a]. Figure 3 shows
in bold lines the structures identified as shallow by the
algorithm for the two sequences.




The numbered points marked by crosses in Figure 2 lying
on the recovered shallow structures were used as the
initial model points for the A211 and COMP image
sequences respectively. These points are defined by the
intersection of some of the pairs of lines belonging to
shallow structures. The 3D model locations were
constructed by back projecting the points in the first image's
coordinate frame.


Figure 3. Shallow structures identified in the
A211 and COMP image sequences.

The model extension and refinement algorithm was run
in a sequential mode. Tables 1 and 2 show the result
of locating new points (circled and numbered in Figure 2)
and refining the initial model points (marked by
crosses in the figures). The ground truth available for
both experiments was only the depths (as opposed to
3D location) of the points in the first image's coordinate
frame. Thus the results shown in Tables 1 and 2 compare
the measured depth value (ground truth) with the
recovered depth value. Column 2 in the tables shows
the measured depth of the point in the first image
coordinate frame. Columns 3 and 4 show the error and
percentage error in depth, respectively, for the initial
model points as acquired by the affine-based tracking
algorithm. Columns 5 and 6 show the output error and
percentage error in depth (after model refinement and
extension) respectively.

For the new points, it is assumed that no initial model
was available; therefore columns 3 and 4 for these points
are blank. Note that these points also belong to the
reconstructed shallow structures. However, their
reconstructed locations were not used as a part of the
initial partial model. Instead, these points were used to
demonstrate model extension because the ground truth
was available only for these structures. In the tables,
the percentage error in depth is computed with respect
to the depth in the first image's coordinate frame.


Table 1: Absolute and Percentage 3D location errors
for points in A211 sequence (see Fig. 2.)

                      INPUT                 OUTPUT
Pt.    Depth    Abs. Err.   % Err.    Abs. Err.   % Err.
No.    (ft.)    (ft.)                 (ft.)

Initial Points
 1     13.4     0.24        1.80%     0.24        1.78%
 2     14.6     0.19        1.31%     0.20        1.34%
 3     19.0     0.74        3.88%     0.66        3.46%
 4     19.0     0.16        0.86%     0.11        0.60%
 5     20.4     0.13        0.62%     0.17        0.86%
 6     20.4     0.39        1.90%     0.32        1.60%
 7     20.4     0.49        2.38%     0.46        2.25%

New Points
 8     13.4     -           -         0.11        0.79%
 9     13.4     -           -         0.00        0.01%
10     14.6     -           -         0.53        3.65%
11     19.0     -           -         0.73        3.86%
12     19.0     -           -         0.54        2.82%
13     19.0     -           -         0.11        0.59%
14     19.0     -           -         0.07        0.34%
15     20.4     -           -         0.23        1.13%
16     20.4     -           -         0.27        1.32%
17     20.4     -           -         0.12        0.57%
18     20.4     -           -         0.34        1.65%
19     20.4     -           -         0.62        3.02%
20     20.4     -           -         0.59        2.92%

Average depth error of new points: 1.63%

The average input error in depths of the seven initial
model points in the A211 sequence (as recovered by the
affine-based tracking algorithm) was 0.4 feet (1.8S % error).
At the end of the ten frames, the average error of
the 7 initial points was 0.37 feet (1.76 %). The thirteen
new points were located to an average accuracy of 0.4
feet (1.63 %).

Table 2: Absolute and Percentage 3D location errors
for points in COMP sequence (see Fig. 2.)

                      INPUT                 OUTPUT
Pt.    Depth    Abs. Err.   % Err.    Abs. Err.   % Err.
No.    (ft.)    (ft.)                 (ft.)

Initial Points
 1     29.3     -0.23       0.80%     -0.11       0.36%
 2     31.3      0.26       0.84%      0.17       0.55%
 3     34.2     -0.10       0.29%     -0.07       0.20%
 4     25.7     -0.26       1.03%     -0.23       0.88%
 5     35.8      1.59       4.43%      1.54       4.31%
 6     28.7      0.39       1.36%      0.39       1.35%
 7     43.2     -1.65       3.82%     -1.63       3.76%
 8     43.2      1.18       2.73%      1.15       2.66%
 9     28.7      1.46       5.08%      1.41       4.91%

New Points
10     29.3      -          -          0.25       0.86%
11     29.3      -          -         -0.35       1.19%
12     31.3      -          -          0.51       1.63%
13     31.3      -          -          0.28       0.89%
14     34.2      -          -          0.93       2.70%
15     34.2      -          -          1.31       3.82%
16     25.7      -          -         -0.02       0.07%
17     25.7      -          -          0.03       0.11%
18     35.8      -          -          1.05       2.93%
19     35.8      -          -          0.50       1.40%
20     28.7      -          -         -0.11       0.39%
21     28.7      -          -          0.08       0.29%
22     43.2      -          -          0.46       1.07%
23     43.2      -          -          1.77       4.10%
24     43.2      -          -         -0.45       1.04%
25     43.2      -          -          0.13       0.30%
26     28.7      -          -          0.80       2.77%
27     28.7      -          -          0.25       0.88%

Average depth error of new points: 1.46%

The average input error in depths of the nine initial
model points in the COMP sequence (as recovered by
the affine-based tracking algorithm) was 1.01 feet (2.27
% error). At the end of the six frames, the average
error of the 9 initial points was 0.98 feet (2.11 %). The
eighteen new points were located to an average accuracy
of 0.69 feet (1.46 %). For this experiment the measured
depth values are only approximate to about 0.5 feet for
some points. This is especially true for points lying on
the left side wall (points 1, 2, 3 etc. in Figure 5).

In both experiments, the model extension process was
fairly accurate in locating new points. However, there
was only slight improvement in the initial model as a
result of the model refinement process. The robust
recovery of the location of new 3D points depends on the
camera motion. Optimal angles for triangulation are
achieved when there is significant translation parallel to
the image plane. In the A211 and COMP sequences, the
translation of the camera is mostly along the optical axis.
Thus, the FOE (focus of expansion) lies on the image
plane. Points close to the FOE have smaller disparity
and their depths cannot be reliably estimated.
Consequently, these results imply that we may be at the limit
of recoverable accuracy.

Finally, the accuracy of the model extension process
depends on the initial accuracy of the model points. If the
initial model points have a large amount of noise, then
the poses determined for any batch of frames will be
highly correlated. In this case, the 3D location estimates
of new points will be correlated both across all points and
also all frames. To fully account for this cross-correlation,
covariance matrices equal to the size of number of points
times number of frames will have to be inverted. In our
case, it is assumed that the initial points do not have
significant noise and hence the cross-correlations can be
ignored. But for larger amounts of noise, it may not be
possible to ignore these effects [Oliensis, 1991].

Appendix 

Some facts from linear system estimation theory are
reviewed. An unknown parameter vector x with p
elements is related to a set of noisy observations y by
the following equation:

$$A x = y + \eta \tag{22}$$

where $\eta$ is zero-mean Gaussian noise with covariance
matrix V. Assume that this set of equations is an
over-constrained system. Then the Best Linear Unbiased
Estimate (BLUE) [Strang, 1986] of the unknown vector x
and the covariance matrix P of the output parameters
are given by:

$$\hat{x} = (A^T V^{-1} A)^{-1} A^T V^{-1} y \tag{23}$$

$$P = (A^T V^{-1} A)^{-1} \tag{24}$$
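For concreteness, the estimate and its covariance can be computed directly with NumPy. This is a small sketch with an illustrative function name; it forms explicit inverses for clarity, which is adequate for small systems but is not how a numerically careful implementation would proceed.

```python
import numpy as np

def blue_estimate(A, y, V):
    """Best Linear Unbiased Estimate for A x = y + noise, noise covariance V.

    Returns the estimate of x and the covariance matrix P of that estimate.
    """
    V_inv = np.linalg.inv(V)
    P = np.linalg.inv(A.T @ V_inv @ A)   # output covariance
    x_hat = P @ A.T @ V_inv @ y          # weighted least-squares solution
    return x_hat, P

# Two direct observations of one scalar; the second is three times noisier.
A = np.array([[1.0], [1.0]])
y = np.array([1.0, 3.0])
V = np.diag([1.0, 3.0])
x_hat, P = blue_estimate(A, y, V)
```

In this example the estimate lies closer to the first observation because that observation has the smaller variance.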
Abstract

We  summarize  some  recent  algorithms  devel¬ 
oped  for  qualitative  navigation  which  are  com¬ 
pletely  independent  of  range  estimates  to  land¬ 
marks.  We  introduce  several  distinctions  that 
reflect  more  realistic  application  of  qualitative 
navigation algorithms to real robots. These
involve the extent to which landmarks
can be identified from very different points of
view (called the distinctiveness of landmarks);
whether or not a compass is allowed; and
distinctions between different types of compasses.

1  Introduction 

Qualitative  Navigation  [Kuipers  and  Byun,  1987; 
Kuipers,  1978;  Levitt  and  Lawton,  1990]  concerns  spa¬ 
tial  learning  and  path  planning  in  the  absence  of  a  single 
global  coordinate  system  for  describing  locations  and  the 
positions of landmarks. It is based on a multi-level
representation of space, which, at its most abstract level, is
based on topological properties which allow a robot to
describe a location using the directions of visually salient
patterns (with no associated range measurements) and
then navigate using cues such as the occlusions that
occur between landmarks. An advantage is that the
robot  can  use  landmarks  for  which  exact  positions  can 
not  be  determined.  Thus,  if  a  robot  sees  a  building  in 
the  distance,  it  may  not  know  or  be  able  to  recognize  the 
structure  as  a  building  or  determine  its  exact  position  in 
space  but  it  can  still  incorporate  this  to  form  an  effec¬ 
tive  spatial  memory.  This  is  actually  quite  intuitive:  it  is 
doubtful  that  animals  navigate  by  detecting  landmarks, 
determining  ranges  to  them,  and  then  storing  everything 
in a single frame of reference [Gallistel, 1990]. It also
removes the effects of incremental errors due to drift.

Our  work  [Levitt  and  Lawton,  1990]  in  qualitative 
navigation  developed  while  trying  to  produce  basic  nav¬ 
igation  and  recognition  capabilities  in  an  autonomous 
land  vehicle.  Initially  we  worked  with  a  terrain  rep¬ 
resentation  based  upon  an  a  priori  terrain  grid,  which 
describes  terrain  in  terms  of  a  regular  grid  of  features 

'This  research  is  supported  by  the  Advanced  Research 
Projects  Agency  of  the  Department  of  Defense  and  is  mon¬ 
itored  by  the  U.  S.  Army  Topographic  Engineering  Center 
under  contract  No.  DACA76-92-C-0016 


referenced  with  respect  to  a  single  global  coordinate  sys¬ 
tem.  We  discovered  several  problems  with  such  spatial 
representations. The grids would describe large patches
of terrain by a set of numbers which corresponded to
terrain features such as elevation and vegetation type.
Unfortunately,  the  world  consists  of  objects  which  are 
difficult  to  summarize  by  a  single  set  of  numbers.  It 
is  difficult  to  establish  the  exact  three-dimensional  po¬ 
sition  of  a  distant  landmark,  especially  when  using  pas¬ 
sive  sensing.  Thus,  it  is  difficult  to  know  where  to  attach 
landmarks  to  the  underlying  terrain  representation  when 
it  uses  a  single,  global  coordinate  system.  Robots  also 
have  limited  recognition  capabilities  in  complex  outdoor 
environments. They can see distinctive things in the
world, and yet not know what or where they are. In
fact, there are no assurances that robots can see the
same object as being the same object from very different
points of view.

Qualitative  Navigation  deals  with  these  problems  via 
a  multi-level  representation  of  spatial  memory.  The  dif¬ 
ferent  levels  are  distinguished  by  what  constitutes  a  land¬ 
mark  and  the  connectedness  of  spatial  memory  which 
refers  to  how,  given  one  location,  it  is  possible  to  de¬ 
termine  the  position  of  another  location.  At  the  sim¬ 
plest  level  of  spatial  representation  (the  Sensorimotor 
level)  a  landmark  consists  of  a  perceptual  event  which 
can  be  used  for  sensory  feedback  to  control  guidance. 
The  next  level  (the  Topological  level)  is  based  upon 
noting and tracking stable perceptual events around the
robot, but without associating any range information to
these. This level is topological in the sense that there
is  no  metric  information  associated  with  landmarks.  A 
place  is  described  by  the  set  of  visual  patterns  surround¬ 
ing  the  robot.  This  description  of  a  place  is  called  a 
viewframe.  Movement  from  place  to  place  is  deter¬ 
mined  when  there  is  some  change  in  the  order  of  these 
patterns. The next level allows the association of
potentially inexact range information with the visual
patterns (Local Coordinate Systems). At this level,
viewframes can have associated range estimates with the
detected visual patterns and the localization of one place
to another is inexact. The final level (Global
Coordinate System) assumes that we have exact
three-dimensional information for all landmarks. In [Levitt
and Lawton, 1990], we found that by working at the
level of a viewframe based representation, the problems
faced when working with a single global coordinate
system were drastically simplified.

In  this  paper  we  describe  qualitative  navigation  algo¬ 
rithms  which  work  completely  at  the  topological  level, 
dealing  with  landmarks  for  which  there  are  no  range  es¬ 
timates.  In  addition,  we  introduce  several  distinctions 
for  qualitative  navigation  algorithms.  One  distinction 
concerns landmarks. We consider two basic types:
distinct landmarks which can always be recognized as
the same from wherever they are seen and nondistinct
landmarks which may not be recognized as being the
same when seen from different points of view. We
assume that once landmarks are seen, they can be tracked
over time until they disappear. The other distinction
involves  whether  or  not  the  navigation  algorithms  use  a 
compass  to  yield  a  fixed  direction.  We  also  distinguish 
two  different  types  of  compass.  The  direction  associated 
with  a  local  compass  can  change  from  place  to  place, 
but  at  a  given  place,  it  will  always  point  in  the  same 
direction. An example is a compass which is affected by
fixed  magnetic  influences  at  different  locations.  The  lo¬ 
cal  compass  can  also  be  a  very  strong  landmark  which  is 
visible  from  a  wide  set  of  views.  A  global  compass  will 
always  point  in  the  same  direction  regardless  of  where 
the  robot  is  located.  We  can  express  these  distinctions 
as  a  table  corresponding  to  the  different  types  of  topo¬ 
logical  navigation  algorithms  we  have  developed: 


Topological Qualitative Navigation Algorithms

                         Compass      No Compass
distinct landmarks       Very Good    Good
nondistinct landmarks    Good         Difficult!

For example, consider qualitative navigation without
a compass and with identical, nondistinct landmarks. As one
might  expect,  this  is  very  difficult  and  depends  criti¬ 
cally  on  matching  viewframes  based  exclusively  upon 
the  angular  orientations  of  landmarks.  More  practical 
algorithms  are  those  which  are  based  upon  the  use  of 
a  local  compass  and  a  limited  number  of  distinct  land¬ 
marks.  This  corresponds  to  a  freely  navigating  robot 
which  can  build  maps  and  navigate  using  simple  visual 
features,  such  as  colored  regions  aligned  with  gravity,  as 
landmarks. 

In  the  remainder  of  this  paper,  we  describe  the  basic 
memory  organization  used  for  qualitative  navigation  and 
then  present  different  navigation  algorithms. 

2  Organization  of  Spatial  Memory  and 
Navigation  Behaviors 

2.1  Landmarks 

We distinguish between types of landmarks to reflect
different recognition capabilities in robots. A distinct
landmark is one which can be recognized as being the same
from all points of view. Distinct landmarks require
considerable recognition capabilities for a robot owing to the
variable appearance of landmarks from different points
of view. A nondistinct landmark is one which may not
be recognized as being the same from different points of
view. We assume that once a nondistinct landmark is
seen, it can be tracked over time until it disappears. A
nondistinct landmark is not necessarily described as a
particular object in the world, but can be described as a
simple visual pattern, such as a colored region of a
particular shape or a set of edges aligned with gravity. Such
descriptions of landmarks will tend not to be unique.

A  general  finding  of  the  algorithms  we  describe  here 
is  that  the  more  distinct  landmarks  there  are,  the  more 
easily  a  robot  can  find  shortcuts  and  novel  paths  between 
locations. The more nondistinct landmarks there are, the
more determining position depends on recognizing the
distribution of landmarks surrounding a robot. In this case, the
robot will tend to stay close to established paths that it
determines during explorations. It is possible for a robot
to determine novel paths between locations with
nondistinct landmarks, but it requires significant exploration
to determine that a landmark is the same from many
different points of view.

2.2  Viewframes 

A  viewframe  contains  the  set  of  visible  landmarks  sur¬ 
rounding  a  robot  at  a  given  location  with  their  corre¬ 
sponding  orientations  and  other  attributes  describing  the 
individual landmarks (such as color, visible height,
contrast, etc.). Viewframes are a one-dimensional sequence
of landmarks (the direction of gravity is used to reduce
the two-dimensional images surrounding the robot to a
one-dimensional sequence). An example viewframe V is
shown  in  Figure  1.  This  viewframe  uses  compass  infor¬ 
mation  and  is  then  represented  as 


[Viewframe Identifier: V
 Landmarks:
     [[lid_A, Attributes_A], α_A]
     [[lid_B, Attributes_B], α_B]
     [[lid_C, Attributes_C], α_C]
 Robot's heading: α_R]


When  a  viewframe  is  extracted  without  a  compass, 
there’s  no  associated  0-axis  to  describe  a  fixed  direction. 
The  relative  orientation  of  landmarks  is  then  represented 
by  the  angle  difference  between  successive  landmarks. 
The  same  viewframe  in  Figure  1  is  represented  (shown 
in  Figure  2)  as 


[Viewframe Identifier: V
 Landmarks:
     [[lid_A, Attributes_A]
     [[lid_B, Attributes_B], θ_B]
     [[lid_C, Attributes_C], θ_C]
 Robot's heading: B)]

For the viewframe in Figure 1 and Figure 2, lid_A, lid_B,
lid_C are the local identifiers for visible landmarks A, B,
C. The local identifier is a name or abstraction of
the attributes of a landmark that is tied to a specific
viewframe. Note that a landmark with the same local
identifier in different viewframes can have different
image attributes depending upon the viewframe it is
contained in. A distinct landmark which can be recognized
as being the same from very different points of view has a
unique local identifier with respect to all viewframes, which
is called a global identifier. When a robot is exploring
the environment, distinct landmarks will always be
associated with a unique local identifier in all the viewframes
which contain it. Nondistinct landmarks will have the
same local identifiers in connected viewframes so long
as the landmark is visible (or after landmark-unification,
see below). When a landmark reappears or is
disoccluded, it will have a new associated local identifier.
This is similar to what can happen when an animal walks
on two different paths without realizing that there is a
common landmark between them. For each nondistinct
landmark, there can be more than one local identifier for
it in different viewframes.

When  viewframes  consist  largely  or  totally  of  nondis¬ 
tinct  landmarks,  being  able  to  access  or  recognize  a  par¬ 
ticular  viewframe  is  difficult  (for  example,  a  large  red  re¬ 
gion  can  be  a  landmark  in  several  different  viewframes). 
For  this  reason,  we  also  associate  keys  with  viewframes 
that  are  used  for  recognizing  viewframes  by  a  hashing  op¬ 
eration.  There  are  a  large  number  of  different  keys  such 
as  the  average  height  of  landmarks,  the  average  angle  be¬ 
tween  landmarks,  the  number  of  landmarks,  number  of 
highest  landmarks,  number  of  landmarks  for  particular 
colors,  variance  of  contrasts,  variance  of  heights,  variance 
of angles between landmarks, ratios of landmarks having
different attributes, etc. These keys help to distinguish
and match viewframes. If there is a local compass, many
more types of keys are possible because it is possible to
order the landmarks in the viewframe and compute keys
based upon position in the viewframe. Each key has a
limited number of values. Two viewframes are said to be
hash-matched if they have the same key value for each
key.

Keys are also useful for the efficiency of accessing viewframes.
Suppose we have 10 keys and each key has 10 different values; we then
have 10^10 equivalence classes. We build a hash table, making an entry
for each possible value combination of all keys. To find a viewframe
to match V, we first compute the key values of V for all keys, then
use the combination of those values as an index into the hash table to
find the viewframe in the database. Since the number of keys is O(1),
the time to compute the key values is O(1) and the time to search in
the hash table is O(1); therefore, the time complexity of finding a
viewframe to match V is reduced to O(1).
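
The key-based lookup described above can be sketched as follows. This is only an illustration under stated assumptions: the three keys chosen (number of landmarks, average height, variance of the angles between landmarks), their value ranges, and the names `bucket`, `key_vector` and `ViewframeIndex` are hypothetical and are not taken from the system described here. Each key is quantized to a small number of values, and the tuple of key values indexes a hash table.

```python
from collections import defaultdict
from statistics import mean, pvariance


def bucket(value, lo, hi, n_values=10):
    """Quantize value from [lo, hi] into one of n_values integer key values."""
    frac = (value - lo) / (hi - lo)
    return min(n_values - 1, max(0, int(frac * n_values)))


def key_vector(landmarks):
    """landmarks: non-empty list of (height, bearing_in_degrees) pairs.

    Returns a tuple of three quantized keys; the ranges are illustrative.
    """
    heights = [h for h, _ in landmarks]
    bearings = sorted(b % 360.0 for _, b in landmarks)
    n = len(bearings)
    # Angles between successive landmarks, wrapping around the full circle.
    gaps = [(bearings[(i + 1) % n] - bearings[i]) % 360.0 for i in range(n)]
    return (
        min(n, 9),                              # number of landmarks
        bucket(mean(heights), 0.0, 10.0),       # average height
        bucket(pvariance(gaps), 0.0, 10000.0),  # variance of angles between landmarks
    )


class ViewframeIndex:
    """Hash table from key-value combinations to stored viewframe names."""

    def __init__(self):
        self._table = defaultdict(list)

    def add(self, name, landmarks):
        self._table[key_vector(landmarks)].append(name)

    def hash_matches(self, landmarks):
        # One key computation plus one dictionary lookup.
        return list(self._table.get(key_vector(landmarks), []))
```

Because small measurement differences usually fall into the same bucket, a viewframe extracted near a stored one tends to produce the same key tuple and is retrieved with a single lookup; values that straddle a bucket boundary would need extra handling that this sketch omits.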


Figure 1: Viewframe Representation with a Compass

Figure 2: Viewframe Without a Compass

2.3 Viewframe Extraction and Filtering

The extraction of a viewframe involves identifying landmarks
surrounding the robot. These are then stored in different types of
viewframes depending upon whether or not there is a compass and on the
distinctiveness of the landmarks. We have also found it useful to
compare a newly extracted viewframe to the previously extracted
viewframe to determine if the newly extracted viewframe is different
or novel enough to merit storing it in spatial memory. This process is
called viewframe filtering and has the effect of reducing the number
of very similar or redundant viewframes that are stored in memory.
Filtering is done by keeping track of changes in the values associated
with the different keys. There is a threshold associated with
allowable changes in the value for each key. If this is exceeded, then
the viewframe is stored in spatial memory. For example, if the number
of landmarks changes drastically, it is then necessary to extract a
viewframe into spatial memory. It may also be useful to have a
function which weights the changes in the different keys to determine
whether a viewframe is novel enough to be extracted.

2.4 Viewframe Matching

Viewframe matching is the process which determines the similarity of
two viewframes. We use a two-level matching process. The first level
finds similar viewframes by hashing and then uses the number of
landmarks with common local identifiers in both viewframes as a
measure of similarity called connectivity. First level connectivity
between two viewframes is defined as:

\mathrm{con}(V_i, V_j) = \left| \{\, l : l \in V_i \text{ and } l \in V_j \,\} \right| \qquad (1)

where l ranges over local identifiers.

Second level viewframe matching compares the orientation (angle)
difference between landmarks. For this level of matching, different
thresholds for the maximum orientation difference for corresponding
local identifiers in the viewframes are used.
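
The two matching levels can be illustrated with a minimal sketch. It assumes a viewframe is represented as a dictionary from local identifier to orientation in degrees (with a compass, so orientations are comparable), and the 15-degree threshold is an arbitrary placeholder; the function names are hypothetical.

```python
def connectivity(vf_a, vf_b):
    """First level: number of local identifiers common to both viewframes."""
    return len(vf_a.keys() & vf_b.keys())


def angle_diff(a, b):
    """Smallest absolute difference between two orientations, in degrees."""
    d = abs(a - b) % 360.0
    return min(d, 360.0 - d)


def second_level_match(vf_a, vf_b, max_diff_deg=15.0):
    """Second level: every common landmark must lie within max_diff_deg
    of its orientation in the other viewframe."""
    common = vf_a.keys() & vf_b.keys()
    if not common:
        return False
    return all(angle_diff(vf_a[lid], vf_b[lid]) <= max_diff_deg for lid in common)
```

Tightening or loosening `max_diff_deg` corresponds to the different thresholds mentioned above.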


2.5  Navigation  Behaviors 

The navigation algorithms are based on a set of simple visual tracking
behaviors. Viewframe centering is when the robot is positioned at a
landmark and walks in the direction of the center of a viewframe which
contains that landmark. Without a compass, viewframe centering
involves moving so that the robot optimizes key similarity with
respect to the viewframe. Viewframe centering is simpler with a local
compass which is valid within the extracted viewframe, since the
relative direction of a landmark and the viewframe center is known.
Viewframe back-matching (also called landmark unification) involves
recognizing that landmarks in different viewframes are the same and
unifying their local identifiers. This happens when a robot visits the
same place along separate paths. Landmark circling is when a robot
circles around a known landmark. It is a way of searching for
surrounding landmarks when no nearby landmarks are distinguished or
visible. The robot can spiral towards or away from the landmark (until
the landmark is no longer visible). Landmark targeting is walking
towards a visible landmark. LPB crossing is when a robot crosses a
Landmark Pair Boundary (LPB) defined by two landmarks. The crossing
can occur on either side of or through the center of the LPB between
the two landmarks. LPB alignment is when the robot travels along the
LPB defined by two landmarks. Random walking randomly selects a
visible landmark, walks to it, and then repeats. An alternative
version walks straight for some distance, changes direction, and then
repeats. In novelty walking, a robot walks to optimize the changes in
the keys used for viewframe filtering. The effect is to go someplace
that is as different as possible from where the robot currently is.

2.6  Spatial  Memory 

Spatial  memory  consists  of  three  inter-related  databases: 
the  viewframe  database  (V-DB),  the  path  database 
(P-DB)  and  the  landmark  database  (L-DB)  (see  Fig¬ 
ure  3).  The  landmark  database  contains  descriptions 
of  landmarks  that  a  robot  has  seen.  It  is  possible  for 
the  same  physical  landmark  to  occur  several  different 
times  in  the  landmark  database  because  it  may  not  have 
been  identified  as  being  the  same  from  different  views. 
The  viewframe  database  contains  viewframes  which  de¬ 
scribe  the  visible  landmarks  surrounding  a  robot  at  a 
given  location  (this  is  described  in  more  detail  shortly). 
The  path  database  consists  of  connected  sequences  of 
viewframes  which  a  robot  determines  while  exploring  the 
environment.

Figure 3: Memory Architecture

Figure 5: Simulator for Indoor Robot with displayed viewframe

Database storage algorithm:

Step 1 Extract VF and filter it against the previously extracted VF.

Step 2 Compare VF with the other viewframes in V-DB by some viewframe
matching mechanism.

Step 3 If VF is NOT matched, add VF to V-DB and add a pointer to VF
into each landmark entry with the same local identifier in L-DB; else
return the pointer to VF in V-DB.

Step 4 Add a pointer to VF into the current path's viewframe sequence
in P-DB.
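
The four storage steps can be sketched as below. This is a simplified illustration, not the system's implementation: a viewframe is reduced to a frozenset of local identifiers, "pointers" are list indices into V-DB, and the filtering and matching mechanisms are supplied by the caller as the hypothetical callables `is_novel` and `matches`.

```python
class SpatialMemory:
    """Toy V-DB / L-DB / P-DB with the four-step storage procedure."""

    def __init__(self, matches, is_novel):
        self.matches = matches    # matches(vf_a, vf_b) -> bool
        self.is_novel = is_novel  # is_novel(vf, previous_vf) -> bool
        self.v_db = []            # stored viewframes
        self.l_db = {}            # local identifier -> indices into v_db
        self.p_db = {}            # path id -> sequence of indices into v_db
        self._previous = None

    def store(self, vf, path_id):
        # Step 1: filter against the previously extracted viewframe.
        if self._previous is not None and not self.is_novel(vf, self._previous):
            return None
        self._previous = vf
        # Step 2: compare with the viewframes already in V-DB.
        index = next(
            (i for i, stored in enumerate(self.v_db) if self.matches(vf, stored)),
            None,
        )
        # Step 3: add to V-DB and L-DB only if no match was found.
        if index is None:
            index = len(self.v_db)
            self.v_db.append(vf)
            for lid in vf:
                self.l_db.setdefault(lid, []).append(index)
        # Step 4: append the pointer to the current path's sequence in P-DB.
        self.p_db.setdefault(path_id, []).append(index)
        return index
```

The linear scan in Step 2 is only for brevity; the hash-based lookup of Section 2.2 would replace it in practice.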

2.7  Qualitative  Navigation  Simulator 

We have been exploring different qualitative navigation schemes using
the simulators shown in Figure 4 and Figure 5 (the latter for
exploring indoor navigation). Each contains 4 subwindows. The
upper-left is a Unix shell; the lower-left has controls for setting
such things as the density of landmarks, the range of visibility, and
the number of globally distinct landmarks, and for selecting different
navigation modes and so forth; and the upper-right shows the 360
degrees of view from the robot at a given location. The lower-right
shows a top-down view of the navigation world. In Figure 4, the circle
shows the current viewframe containing the landmarks displayed in the
upper-right subwindow, and the line in the circle shows the robot's
heading. Distinct landmarks are numbered; nondistinct landmarks are
not numbered, but can appear with different colors and intensities. In
the simulator in Figure 5, we assume that limitations on sight are
only caused by occlusion. The current viewframe is shown as the set of
radiating lines from the robot's current position to each of the
visible landmarks. The current viewframe is also displayed as a
sequence of landmarks in the upper right-hand window.

3  Navigation  Using  A  Local  Compass 
With  A  Variable  Percentage  Of 
Distinct  Landmarks 

This algorithm is intended to work with a variable number of distinct
landmarks, ranging from completely nondistinct landmarks to completely
distinct landmarks. The nondistinct case would correspond to walking
through a world full of identical landmarks with a compass. When the
number of distinct landmarks increases, the efficiency of the path
planning improves.

Navigation using this algorithm is shown in figures 7, 8, and 9. The
robot initially walks along two separate paths which form a Ψ-like
shape, shown by the solid thin lines. The robot is first at the
upper-middle part of the Ψ and walks to the middle-lower part. It is
then relocated to the upper-left corner and walks to the upper-right
corner along a curved path. As it walks along these paths, it keeps
track of landmarks and stores extracted viewframes in the different
system databases. It then has the task of going from the upper-right
corner of the Ψ to the upper-left corner of the Ψ. To do this, the
robot can either follow the long curved path that it originally
followed or find a short-cut directly between them. The key result is
that as the number of distinct landmarks increases, the robot is able
to find increasingly direct paths between locations. With more
nondistinct landmarks, navigation involves staying close to paths that
have been previously followed. Shortcuts are possible when common
landmarks between paths are found.

The algorithm utilizes viewframe centering and viewframe
back-matching. In viewframe centering a robot walks from a landmark
towards the center of a viewframe which contains that landmark. If the
robot is at the landmark with local identifier L and orientation angle
α in V, viewframe centering is to walk at orientation angle α + π
towards the center of V. Two viewframes are said to be connected or
adjacent if they have at least one local identifier in common.
Navigation then involves finding a sequence of connected viewframes
(V_0, ..., V_j, ..., V_n) with overlapping landmarks which are
traversed by successive viewframe centering.
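
The α + π rule can be checked with a small sketch. The coordinate-based `step` helper is a hypothetical stand-in for robot motion, used only to show that walking at heading α + π from the landmark leads back to the viewframe center; angles are in radians.

```python
import math


def centering_heading(alpha):
    """Heading that leads from a landmark back to the viewframe center,
    given the landmark's orientation alpha as recorded in the viewframe."""
    return (alpha + math.pi) % (2.0 * math.pi)


def step(position, heading, distance):
    """Move distance units from position along heading."""
    x, y = position
    return (x + distance * math.cos(heading), y + distance * math.sin(heading))
```

For example, a landmark recorded at α = 60° and lying 5 units from the center is left at heading 240°, and 5 units along that heading return the robot to the center. The robot itself does not know the distance, which is why the algorithm below keeps extracting viewframes while it walks.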

Viewframe back-matching (also referred to as landmark unification) is
used to determine that landmarks having different local identifiers in
different viewframes actually are the same physical landmark. They can
then be used to navigate from one viewframe to another and form the
basis of finding shortcuts when such common landmarks are recognized.
During viewframe centering to V_j, if the robot cannot find a distinct
landmark in common between V_j and V_m (m > j and m ≤ n), it attempts
viewframe back-matching to update local identifiers in the viewframe
database. This is illustrated by Figure 6. The robot is currently at
V_B and has previously extracted V_A with local identifiers L_1, L_2,
L_3 associated with the nondistinct landmarks. V_A and V_B have local
identifier L_1 in common. The robot first goes to landmark L_1, then
walks towards the center of V_A. While it is walking towards the
center of V_A, it continues to extract viewframes and perform second
level viewframe matching (based upon the angles and orientations of
landmarks) with respect to V_A. When it extracts a viewframe at C
(which is nearby A) with new local identifiers L_4 for L_2 and L_5 for
L_3, the robot matches V_C to V_A, updating the viewframe database by
substituting L_3 for L_5 and L_2 for L_4.

Figure 6: Viewframe Back-Matching with a Compass

Figure 7: 100% Nondistinct Landmarks. Path Planning (Solid Thick Path)
from Upper-Right Corner to Upper-Left Corner of the Ψ

Figure 8: 75% Nondistinct Landmarks. Path Planning (Solid Thick Path)
from Upper-Right Corner to Upper-Left Corner of the Ψ

Figure 9: 50% to 0% Nondistinct Landmarks. Path Planning (Solid Thick
Path) from Upper-Right Corner to Upper-Left Corner of the Ψ

The algorithm has the following steps:

Goal: A landmark with local identifier lcid (tied to a specific
viewframe) to go to.

Step 0 If lcid is in the current viewframe, go directly to it and the
algorithm terminates.

Step 1 Create a virtual viewframe V_n containing only the goal lcid
with an unspecified orientation. Construct an N by N weight matrix W
(N is the current total number of viewframes plus one). For each pair
of viewframes (V_i, V_j), including the virtual viewframe V_n and the
current starting viewframe V_0, compute con(V_i, V_j) by equation (1);
if it is greater than a certain threshold (we select 0), we assign the
weight matrix entry W(i,j) = 1; otherwise W(i,j) = ∞. With the weight
matrix, we find a shortest sequence of connected viewframes V_0, V_1,
..., V_n by applying a shortest path algorithm [Cormen et al., 1990].
Alternatively, if the total number of viewframes N is too great, we
use a breadth-first tree search [Cormen et al., 1990] from V_0 to find
adjacent viewframes, such that a viewframe cannot appear twice in a
path of the tree.

Step 2 If a sequence of connected viewframes is not found, stop.
Otherwise the robot performs viewframe centering and viewframe
back-matching through V_0, V_1, ..., V_n. It walks to the landmark
with a common local identifier in both V_0 and V_1, where the choice
of a distinct landmark has priority.

Step 2.1 If the robot is currently at landmark P of viewframe V_i (i
is max), it viewframe centers towards the center of V_i, testing
whether the current viewframe V_c is adjacent to some V_m, i.e. m ∈
[i+1, n] and m is the maximum such that con(V_c, V_m) > 0. If m is
found, which means a distinct landmark is found, it changes direction
and walks to the landmark in both V_c and V_m. Otherwise, it performs
back-matching to V_i: if no ambiguity occurs and the best match is
found, the robot updates the local identifiers, i.e. it uses the local
identifiers in V_i to replace the corresponding local identifiers with
the same orientations in V_c as well as those local identifiers in
V-DB; it then walks to a landmark with a common local identifier in
both V_i and V_{i+1}.

Step 2.2 Repeat Step 2.1 until the goal is achieved, or until failure
due to ambiguity.

When all the landmarks are distinct, viewframe back-matching is
unnecessary.
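
The planning part of Step 1 can be sketched as a breadth-first search over viewframes, which returns a shortest sequence because every finite weight is 1. The representation (a dictionary from viewframe name to a set of local identifiers) and the `GOAL` label for the virtual viewframe V_n are assumptions made for illustration.

```python
from collections import deque

GOAL = "<goal>"  # hypothetical name for the virtual viewframe V_n


def plan_viewframe_sequence(viewframes, start, goal_lid):
    """Return a shortest list of connected viewframe names from start to the
    virtual goal viewframe, or None if no such sequence exists.

    Two viewframes are adjacent when con > 0, i.e. they share at least one
    local identifier.
    """
    frames = dict(viewframes)
    frames[GOAL] = {goal_lid}
    parent = {start: None}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        if current == GOAL:
            path = []
            while current is not None:
                path.append(current)
                current = parent[current]
            return path[::-1]
        for name, lids in frames.items():
            # A viewframe is entered at most once, so it cannot appear twice
            # in a path of the search tree.
            if name not in parent and frames[current] & lids:
                parent[name] = current
                queue.append(name)
    return None
```

When the goal identifier is already in the starting viewframe the result is simply the start followed by the virtual viewframe, which corresponds to Step 0.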

4  Navigation  Without  a  Compass  for 
Distinct  Landmarks  (No  LPBs) 

This algorithm assumes distinct landmarks, no compass, and no landmark
pair boundaries (LPBs). LPB-based navigation is described in the next
section.

The algorithm relies on landmark circling to compensate for the lack
of a compass. Figure 10 shows an example of navigation using this
algorithm with the viewframes and paths from the previous figures. The
exploration paths Ψ (solid thin lines) are generated in the same
manner as in figures 7, 8, 9. The path determined by the robot is
shown as a solid thick line from the upper-right corner of the Ψ to
the upper-left corner of the Ψ. The result shows that the path is
slightly longer than that with a compass (in Figure 9). The processing
example in figures 11, 12, 13, 14 shows some of the interesting
characteristics of this algorithm. In Figure 11, the robot moves to
Landmark 89 by an exploratory behavior to generate a path. We then
want the robot to walk back and forth between Landmarks 89 and 64 to
find increasingly direct paths. Initially, because of limitations on
its range of vision, the robot cannot find most of the landmarks along
its original path. So it begins to circle around the current landmark
to search for landmarks from the path it traversed. This continues
until it returns to its origin. The circling behavior for finding
landmarks is responsible for the indirect-looking paths. As the robot
traverses back and forth between landmarks 64 and 89, it is able to
use the viewframes it stored from previous trips to determine a more
direct path. The robot will determine different paths between the two
landmarks depending upon the direction in which it travels. This is
because the robot cannot see the same landmarks when traveling in the
different directions, due to limitations on the allowable viewing
distance. The further the robot can see, the more direct and similar
the paths found under this algorithm become.

The  algorithm  has  the  following  steps: 

Goal: A landmark with identifier id to go to.

Step 0 If id is in the current viewframe, go directly to it and the
algorithm terminates.

Step 1 Create a virtual viewframe V_n containing only the goal id with
arbitrary orientation. Construct an N by N weight matrix W (N is the
current total number of viewframes plus one). For each pair of
viewframes (V_i, V_j), including the virtual viewframe V_n and the
current starting viewframe V_0, compute con(V_i, V_j) by equation (1);
if it is greater than a certain threshold (we select 0), we assign the
weight matrix entry W(i,j) = 1; otherwise W(i,j) = ∞. With the weight
matrix, we find a shortest sequence of connected viewframes V_0, V_1,
..., V_n by applying a shortest path algorithm [Cormen et al., 1990].
Alternatively, if the total number of viewframes N is too great, we
use a breadth-first tree search [Cormen et al., 1990] from V_0 to find
adjacent viewframes, such that a viewframe cannot appear twice in a
path of the tree.

Step 2 If a sequence of connected viewframes is not found, stop.
Otherwise the robot performs viewframe centering and viewframe
back-matching through V_0, V_1, ..., V_n. It walks to the common
distinct landmark in both V_0 and V_1.

Step 2.1 If the robot is currently at landmark P of viewframe V_i (i
is max), it landmark circles V_i, i.e. it walks away from P until P is
at its visual range limit, then it circles around P. During the walk,
it tests whether the current viewframe V_c is adjacent to some V_m,
i.e. m ∈ [i+1, n] and m is the maximum such that con(V_c, V_m) > 0. If
m is found, which means a distinct landmark is found, it changes
direction and walks to the landmark in both V_c and V_m.

Step 2.2 Repeat Step 2.1 until the goal is achieved.

5  LPB  Based  Navigation  Without  A 
Compass  For  Distinct  Landmarks 

This navigation algorithm assumes distinct landmarks, no compass, and
the use of LPBs. In [Levitt and Lawton, 1990], the robot uses a global
map in its spatial memory to indicate each landmark's estimated
direction and distance for path planning. Instead of assuming that the
robot knows the estimated direction and distance of each landmark in
spatial memory, we assume that the robot only knows the directions of
a few selected landmarks called known landmarks.

Figure 10: Path Planning from Landmark 68 to 28 (thick path)

Figure 11: Exploration Path Generated by Random Walk to Landmark 89

Each pair of the known landmarks forms an LPB
(Landmark-Pair-Boundary) vector, or an LPB. LPBs are used to demarcate
visually distinct areas by noting which sides of the LPBs surrounding
a region the robot is on. This algorithm uses LPB regions instead of
viewframes as the basic descriptions of locations. For an LPB vector γ
and a location A, we use γ(A) to indicate which side of γ the location
A is on. γ(A) takes the values 0, 1, 2 to distinguish the different
sides. In Figure 15, known landmarks K_1, K_2 form LPB γ_{k1,k2}. At
A, θ_A > π (measured from K_1 to K_2 counterclockwise), so
γ_{k1,k2}(A) = 1; at B, θ_B < π, so γ_{k1,k2}(B) = 0. C is on the LPB,
so γ_{k1,k2}(C) = 2. The two landmarks which define an LPB break the
LPB into 3 distinct LPB segments.
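
The side test can be sketched directly from the two observed bearings. The function names and the tolerance are assumptions; `bearing` computes a bearing from coordinates only so that the example can be checked, whereas a robot would measure the two directions visually.

```python
import math


def lpb_side(bearing_k1, bearing_k2, tol=1e-6):
    """Side of the LPB through K1 and K2, from the observer's bearings (radians).

    theta is the counterclockwise angle from K1 to K2 as seen by the observer:
    theta < pi gives 0, theta > pi gives 1, and an observer on the LPB
    (theta equal to 0 or pi) gives 2.
    """
    theta = (bearing_k2 - bearing_k1) % (2.0 * math.pi)
    on_line = min(theta, abs(theta - math.pi), 2.0 * math.pi - theta) <= tol
    if on_line:
        return 2
    return 1 if theta > math.pi else 0


def bearing(observer, landmark):
    """Bearing of landmark from observer, for testing with coordinates."""
    return math.atan2(landmark[1] - observer[1], landmark[0] - observer[0])
```

An observer between the two landmarks sees them π apart, and one on the line beyond either landmark sees them in the same direction; both cases return 2, matching the three LPB segments mentioned above.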

Figure 12: 3rd Time from Landmark 89 to 64 (thick path)

Figure 13: Stable Path (thick path) from Landmark 64 to 89 after 2nd
time

Suppose we have n known landmarks forming a total of N = C(n, 2) LPB
vectors γ_1, γ_2, ..., γ_N. For a set of LPB vectors γ_{k_1},
γ_{k_2}, ..., γ_{k_m}, we define the LPB projection for a location A
as

\mathrm{LPB}_{prj}(A) = \gamma_{k_1}(A) \cdot \gamma_{k_2}(A) \cdots \gamma_{k_m}(A) \qquad (2)

where · corresponds to string concatenation of the values 0, 1, or 2.
An LPB region string is the LPB projection using the whole set of LPB
vectors determined by all the known landmarks. This creates a net of
distinct LPB regions.
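
Equation (2) amounts to concatenating one side value per LPB vector. The sketch below assumes, purely for illustration, that landmark and observer coordinates are available, and uses the sign of a cross product as a stand-in for the visual side test; the ordering of pairs produced by `itertools.combinations` is likewise an arbitrary choice.

```python
from itertools import combinations


def lpb_side(k1, k2, p, eps=1e-9):
    """0 if the counterclockwise angle from K1 to K2 seen from p is below pi,
    1 if it is above pi, 2 if p lies on the line through K1 and K2."""
    ux, uy = k1[0] - p[0], k1[1] - p[1]
    vx, vy = k2[0] - p[0], k2[1] - p[1]
    cross = ux * vy - uy * vx  # proportional to the sine of that angle
    if abs(cross) <= eps:
        return 2
    return 0 if cross > 0 else 1


def lpb_region_string(known_landmarks, p):
    """Concatenate the side values over all C(n, 2) LPB vectors."""
    return "".join(
        str(lpb_side(a, b, p)) for a, b in combinations(known_landmarks, 2)
    )
```

Two locations receive the same string exactly when no LPB separates them, so the string can serve as a name for the region.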

Path planning involves finding a sequence of LPB segments from the
graph formed by all the LPB segments defined by the known landmarks.
The robot walks along each LPB and tests both sides of it to see if it
is adjacent to the goal region. This requires at most O(n^2) LPB
vectors to be visited. However, we can improve on this. There are a
total of (n - 1) LPB vectors crossing one known landmark, which
partition the area into at most 2(n - 1) distinct sections. Each
section is expressed in terms of the LPB projection (onto those LPB
vectors) of any location from that section. The basic idea is that the
robot goes to the known landmark and walks along parts of 2 LPB
segments, which are the borders of the section having the same LPB
projection (onto those vectors) as that of the goal LPB region. The
robot then has at most O(n) LPB vectors to visit. In Figure 16, the
values in parentheses show the distinct LPB projections (onto the LPB
vectors through the known landmark) for sections I through VI; to
visit region A (section I), the robot only needs to visit parts of the
2 LPB vectors bordering section I.

Figure 14: Stable Path (thick path) from Landmark 89 to 64 after 4th
time

Figure 15: LPB (Landmark-Pair-Boundary) Representation

The  algorithm  has  the  following  steps: 

Goal: A given LPB region, with corresponding LPB region string L_g, to
go to.

Step 0 If any component of L_g is equal to 2 (the goal is on an LPB
vector), the robot first goes to any known landmark on that LPB. It
then walks along it in one direction until the goal region is reached;
if so, the algorithm is finished.

Step 1 Initialize the segments of each LPB vector as unvisited.
Perform masking on each known landmark K visited before, i.e., mark
the segments of LPB vectors crossing K as visited if they are not
borders of the section having the same LPB projection (onto these LPB
vectors) as that of L_g. Also initialize the stack SP for the known
landmarks as empty.

Step 2 Test the stack SP.

If SP is empty, select any known landmark K which is one end of an
unvisited LPB segment, push(K) into SP, and go to Step 2; if no such K
is found, stop.

Else K = pop(SP); if K is not one end of any unvisited LPB segment, go
to Step 2.

Step 3 The robot walks to known landmark K, testing whether the goal
region is reached; if so, the algorithm is finished.

Step 4 If K has not been visited before, the robot performs masking on
K.

Step 5 If K is one end of an unvisited LPB segment S, mark S as
visited and push(K) into SP. Else go to Step 2.

Step 6 The robot walks along segment S, testing whether the goal
region L_g is found, until one of the following conditions is
satisfied:

• if the goal is found, the robot achieves the goal and the algorithm
is finished.

• if a contradiction to the goal region happens, i.e., the robot was
originally on the same side of an LPB as the goal and is later on a
different side, mark the segment which the robot is heading towards as
visited and go to Step 2.

• if the robot arrives at another known landmark K_n, push(K_n) into
SP and go to Step 2.

Figure 16: LPB Distinct Section Partitions through One Known Landmark

Figure 17 shows an example of navigation using this algorithm with the
viewframes and paths from the previous figures. The known landmarks
are circled, the LPB vectors are drawn as dash-dotted lines, and the
exploration paths Ψ (solid thin lines) are generated in the same
manner as in figures 7, 8, 9. The path determined by the robot is
shown as a solid thick line from the LPB region near the upper-right
corner of the Ψ to the goal region near the upper-left corner of the
Ψ. The processing time of this algorithm is O(n), where n is the
number of known landmarks. In addition, experiments have shown that
the algorithm degrades gracefully as the number of known landmarks is
decreased.

Figure 17: Example of Navigation Using LPBs

An interesting finding is that, if we apply masking on all known
landmarks as stated in Step 1 of the algorithm, the LPB candidates
(i.e. LPB vectors whose LPB masks are not 111) form a flow towards the
destination region. Figure 18 shows how the LPBs partition the area
into small regions. Figure 19 shows the flow towards the goal region
near the upper-left corner of the Ψ of Figure 17.

6  Navigation  Not  Using  a  Compass 
with  a  Variable  Percentage  of 
Distinct  Landmarks 

We are currently exploring different alternatives for this case. The
characteristic behavior appears similar to navigation using a compass
with nondistinct landmarks (hugging previously explored paths without
taking shortcuts), except that it is much more sensitive to the
allowable viewing distance. One approach for this case is to perform
navigation using LPBs defined by landmarks with local ids. A
difficulty is that one or both of the landmarks defining an LPB can
disappear as the robot walks away from it, so the LPBs connecting
viewframes may not be stable. It may be possible to use a measure of
the reliability of LPBs between viewframes as a criterion for
extracting viewframes.

Figure 18: LPBs Partition the Area into Small Regions

Figure 19: LPB Flow Towards the Goal Region Near the Upper-Left Corner
of the Ψ of Figure 17

Another approach we are investigating, for the case of a low
percentage of distinct landmarks, involves modifying viewframe
back-matching in the algorithm from section 3 to satisfy the
constraint of not using a compass. In Figure 20, the robot is at A,
seeing nondistinct landmarks L_1, L_2, L_3 which are also in vf_O. In
order to go near the center O of vf_O to back-match vf_O, the robot
first comes to one of L_1, L_2, L_3, say L_3; it then walks along an
arc, maintaining the angle θ_2 = ∠L_2OL_3 during the walk, and tests
whether the angle ∠L_1BL_2 at its current position B equals θ_1; if
so, we conclude that the robot is close to O. The robot must always
see all 3 landmarks before it comes near O. (Note that the angles θ_1,
θ_2 are calculated counter-clockwise, so there is only one such
position.) This type of walking is used for viewframe back-matching
without a compass.
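
The stopping test of this walk can be sketched as a comparison of counter-clockwise subtended angles. This is an illustration under assumptions: θ_1 is taken to be the angle ∠L_1OL_2 recorded in the stored viewframe, coordinates are used only to compute the angles that the robot would otherwise measure visually, and the tolerance and function names are hypothetical.

```python
import math


def ccw_angle(observer, first, second):
    """Counter-clockwise angle (radians) from first to second, seen from observer."""
    a = math.atan2(first[1] - observer[1], first[0] - observer[0])
    b = math.atan2(second[1] - observer[1], second[0] - observer[0])
    return (b - a) % (2.0 * math.pi)


def angles_agree(x, y, tol):
    """True if two angles differ by at most tol, allowing for wrap-around."""
    d = abs(x - y) % (2.0 * math.pi)
    return min(d, 2.0 * math.pi - d) <= tol


def near_center(position, l1, l2, l3, theta1, theta2, tol=math.radians(3.0)):
    """True when the angles between successive landmarks seen from position
    agree with the angles theta1 and theta2 stored for the viewframe center."""
    return (angles_agree(ccw_angle(position, l1, l2), theta1, tol)
            and angles_agree(ccw_angle(position, l2, l3), theta2, tol))
```

Measuring both angles counter-clockwise is what rules out the mirror-image position on the other side of the landmarks.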

Figure 20: Viewframe Back-Matching Without a Compass

The following algorithm is intended for the case of high percentage of
nondistinct landmarks without a compass. It has the following steps:


Goal:  A  landmark  with  local  identifier  Icid  (tied  to  a 
specific  viewframe)  to  go  to. 

Step  0  If  Icid  is  in  current  viewframe,  go  directly  to  it 
and  the  algorithm  terminates. 

Step 1 Create a virtual viewframe Vn containing only
the goal Icid with arbitrary orientation. Construct
an N by N weight matrix W (N is the current total
number of viewframes plus one). For each pair of
viewframes (Vi, Vj), including the virtual viewframe
Vn and the current starting viewframe V0, if there are
at least 3 (for i ≠ n and j ≠ n) or 1 (for i = n
or j = n) landmarks with common local ids in
both viewframes, we assign the weight matrix entry
W(i,j) = 1; otherwise W(i,j) = ∞. With the
weight matrix, we find a shortest sequence of connected
viewframes V0, V1, ..., Vn by applying a shortest
path algorithm [Cormen et al., 1990]. Alternatively,
if the total number of viewframes N is too great, we
use a breadth-first tree search [Cormen et al., 1990]
from V0 to find adjacent viewframes, such that a
viewframe cannot appear twice in a path of the tree.

Step 2 If a sequence of connected viewframes is
not found, stop. Otherwise the robot performs
viewframe centering and viewframe back-matching
through V0, V1, ..., Vn. It walks to the landmark
with a common local identifier in both V0 and V1,
where the choice of a distinct landmark has priority.

Step 2.1 If the robot is currently at landmark P
of viewframe Vi (i is max), it finds 2 other
landmarks with common local identifiers both
in the current viewframe Vc and vfi, and walks
towards the center of vfi by using viewframe
back-matching without a compass, as explained in
Figure 20. During the walk, it tests whether the current
viewframe Vc has at least 3 (1 for m = n)
landmarks with common local identifiers with
vfm, i.e. m ∈ [i + 1, n] and m is max; if m is
found, which means 3 distinct landmarks are
found, then it changes direction to walk to
the landmark in both Vc and Vm. Otherwise,
it performs back-matching to Vi; if no ambiguity
occurs and the best match is found, the
robot updates the local identifiers, i.e. it uses
local identifiers in Vi to replace the corresponding
local identifiers with the same orientations in
Vc as well as those local identifiers in V-DB;
and then walks to a landmark with a common
local identifier both in Vi and Vi+1.

Step  2.2  Repeat  Step  2.1  until  the  goal  is 
achieved,  or  failure  due  to  ambiguity. 
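
The planning part of Step 1 can be sketched as follows. This is only an illustrative sketch, not the implementation used in the experiments: it assumes each viewframe is represented as a set of local identifiers, and the function names (connected, viewframe_path) are hypothetical. Because every finite entry of W equals 1, a breadth-first search from the starting viewframe returns a sequence with the fewest viewframes.

```python
from collections import deque


def connected(frames, i, j, goal_index):
    """True when W(i, j) = 1, i.e. the two viewframes share enough local ids."""
    # One shared id suffices when either frame is the virtual goal frame Vn;
    # otherwise at least three shared ids are required.
    needed = 1 if goal_index in (i, j) else 3
    return len(frames[i] & frames[j]) >= needed


def viewframe_path(frames, goal_id, start=0):
    """Return viewframe indices from `start` to the virtual goal frame, or None."""
    frames = list(frames) + [{goal_id}]  # append the virtual viewframe Vn
    n = len(frames) - 1
    parent = {start: None}               # doubles as the visited set
    queue = deque([start])
    while queue:
        i = queue.popleft()
        if i == n:
            path = []
            while i is not None:
                path.append(i)
                i = parent[i]
            return path[::-1]
        for j in range(len(frames)):
            if j not in parent and connected(frames, i, j, n):
                parent[j] = i
                queue.append(j)
    return None                          # no connected sequence exists (Step 2 stops)
```

The last index of the returned path is always the virtual viewframe, so the real viewframes to traverse are all but the final entry.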

7  Summary  and  Future  Work 

We have described different range-free qualitative navigation
algorithms. The data structures we have used,
especially for the case of nondistinct landmarks, are
compatible with the types of features that could be extracted
as landmarks with basic image processing techniques
on a robot with a 360 degree field of view. We
have also performed experiments to understand path planning
feasibility and efficiency for these algorithms.
One measure of path planning efficiency is the ratio of
the straight-line distance between two locations to
the actual distance walked by a robot to go
from one location to the other. By this measure of efficiency,
the compass-based algorithms improve as the number
of viewframes, the visual range, and the number of distinct
landmarks increase. The non-compass-based algorithms
also depend on the allowable visual range. The efficiency of
the LPB, non-compass-based algorithms increases as the
number of known landmarks increases.

Our  current  work  is  focusing  on  navigation  using 
LPBs  formed  from  nondistinct  landmarks,  viewframe  fil¬ 
tering  techniques,  and  different  approaches  to  organizing 
spatial  memory,  such  as  a  hierarchical  representation  of 
viewframes,  along  the  lines  discussed  in  [Kuipers,  1978]. 

Abstract

We  describe  an  architecture  for  an  interactive 
model  based  vision  system  and  its  application 
for  vehicle  tracking.  A  human  specifies  a  lim¬ 
ited  amount  of  information  which  establishes 
a  context  for  autonomous  interpretation  of  im¬ 
ages  obtained  by  a  telerobot.  Object  models 
are  described  by  constraints  specifying  neces¬ 
sary  geometrical  properties  and  relationships 
between  objects.  The  use  of  constraints  al¬ 
lows  for  flexible  object  instantiation.  A  user 
can  indicate  a  vehicle  and  this  directs  percep¬ 
tual  processing  routines  to  determine  the  cor¬ 
responding  local  surface  orientation  and  roads, 
or  he  can  instantiate  a  road  segment  to  di¬ 
rect  the  extraction  and  tracking  of  vehicles. 

We  conclude  with  a  processing  example  based 
upon  implemented  components  and  a  brief  dis¬ 
cussion  of  future  work. 

1  Introduction 

Efforts  to  develop  intelligent  and  autonomous  systems 
for  operation  in  complex,  natural  domains  have  been 
largely  unsuccessful  to  date,  in  spite  of  continued  ad¬ 
vances  in  the  underlying  technologies.  There  remain  un¬ 
resolved  and  fundamental  difficulties  in  terms  of  the  nec¬ 
essary  computational  power,  the  required  complexity  of 
perceptual  systems  which  can  operate  in  outdoor  envi¬ 
ronments,  and  the  corresponding  complexity  of  planning 
and  reasoning  systems.  A  recent  framework  addresses 
many  of  these  problems  by  stressing  the  importance  of 
telerobotic  and  interactive  systems.  This  is  a  realistic  ap¬ 
proach  to  fielding  advanced  technology  in  the  short  term, 
and  also  provides  a  long  term  framework  for  developing 
autonomous  systems.  An  interactive,  semi-autonomous 
system  can  significantly  amplify  the  capabilities  of  a  hu¬ 
man,  and  also  yields  an  evolutionary  approach  as  au¬ 
tonomous  system  capabilities  are  developed  and  begin 
to  replace  human  controlled  functions. 

The approach described here is to develop a model-based
vision system that a human can interactively control.
The human uses this to rapidly interpret sensory
information  from  a  potentially  distributed  team  of  teler¬ 
obots.  The  resulting  interpretation  is  a  model  of  the 
world  that  the  telerobots  can  refine,  use  to  control  their 
behavior,  or  report  back  to  a  human.  In  this  way,  the  hu¬ 
man  directs  the  telerobots  by  initializing  and  constrain¬ 
ing  their  processing.  Communication  between  the  robot 
and  the  human  can  then  take  place  in  the  context  of 
a  shared  model  of  the  world  which  makes  possible  in¬ 
frequent,  semantically  meaningful,  and  low  bandwidth 
communication. 

The  particular  system  we  present  is  for  tracking  vehi¬ 
cles  in  outdoor  scenes.  A  human  can  manipulate  models 
of  objects  such  as  terrain  surface  patches,  roads,  and  ve¬ 
hicles  to  interpret  imagery  from  a  telerobot.  Once  an  in¬ 
terpretation  is  in  place,  the  telerobot  can  autonomously 
refine  and  extend  the  interpretations,  detect  and  track 
vehicles,  and  report  back  to  a  human  about  unusual  oc¬ 
currences  or  behavior  that  cannot  be  accounted  for.  For 
example,  a  human  will  indicate  that  a  particular  area 
is  a  road.  The  vision  system  will  then  track  movement 
along  the  road  and  fit  a  constraint-based  description  of  a 
vehicle  to  this  movement.  The  system  could  determine 
that  a  vehicle  has  just  gone  off  the  road  (or  that  it  is 
behaving  inconsistently  with  respect  to  the  model  of  a 
vehicle). 

We  begin  by  reviewing  the  basic  architecture  guiding 
the  development  of  the  interactive  model  based  vision 
system,  and  then  detail  some  of  its  components  involving 
object  models  and  perceptual  processing  that  have  been 
implemented. 

2  System  Architecture 

The  underlying  architecture  is  shown  in  Figure  1.  It 
is  built  around  three  major  data  bases  that  a  human 
can  access  and  manipulate  through  a  user  interface. 
The  basic  task  of  the  human  is  to  access  models  of  the 
various  types  of  objects  stored  in  the  Object  Model 
Data Base along with information describing maps,
landmarks,  and  previous  interpretations  in  the  Long 
Term  Data  Base  to  build  an  interpretation  of  the  cur¬ 
rent  scene  which  is  stored  in  the  World  Model  Data 
Base.  For  example,  the  human  is  presented  with  im¬ 
ages  from  cameras  on  the  telerobot.  He  can  use  a  priori 

Figure 1: System architecture (Objects Database, World Database, Long-Term Database)


maps  to  align  landmarks  and  terrain  features  from  these 
maps  with  the  images.  He  can  also  access  the  three- 
dimensional  and  physically  based  models  of  objects  and 
position  them  with  respect  to  the  world  model.  As  he 
does this, the models are projected back against the images
obtained from the telerobots for interactive control
and to initiate processing.

The  Object  Model  Data  Base  contains  generic 
models  of  objects,  relationships,  and  events  for  ter¬ 
restrial  scenes.  This  involves  objects  such  as  terrain 
patches, roads, vehicles, and gravity. We distinguish between
two different types of objects: Primitives, which
correspond to basic entities and relationships used to
describe and represent Terrestrial Objects, which
correspond to the conventional objects found in the world
such as roads and cars. Primitive Objects describe characteristics
such as shape constraints, material composition,
and relationships between parts. The representation
of objects for an interactive vision system is more
complex than, though related in many ways to, those used
in CAD/CAM and geometric modeling packages, because
the objects will be manipulated for autonomous processing
and reasoning. Thus, in addition to describing
shape, the model of a car needs to include that
a  car  is  acted  on  by  gravity  and  will  have  a  preferred 
type  of  orientation  and  attachment  with  respect  to  the 
ground  surface.  Object  models  are  described  by  sets  of 
constraints  [Horning,  1981;  Lawton,  1980;  Leler,  1988; 
Mundy et al., 1989] which must be satisfied. A simple
constraint  is  that  the  value  of  some  parameter  associated 
with  an  object  model  is  bounded.  More  complicated  con¬ 
straints  deal  with  relations  between  objects.  The  human 
will  in  general  specify  a  limited  amount  of  information 


for  an  object,  and  the  system  will  use  the  constraints 
and  associated  processing  actions  to  then  refine  the  in¬ 
stantiation  of  an  object. 

The  World  Model  Data  Base  describes  the  three 
dimensional  world  of  objects  and  situations  surround¬ 
ing  the  telerobots.  It  is  initially  formed  by  the  human 
accessing  models  in  the  object  data  base  and  instanti¬ 
ating  them.  There  are  three  types  of  controllers  associ¬ 
ated  with  the  World  Model  Data  Base.  The  Constraint 
Controller  checks  for  consistency  in  the  world  model. 
The  constraint  controller  uses  the  constraints  which  de¬ 
fine  an  object  or  relationship  to  refine  an  instantiation  or 
to  find  a  violation  or  inconsistency  and  ask  the  human  for 
help.  The  Perceptual  Processing  Controller  deals 
with  the  extraction  of  information  from  images  and  sen¬ 
sors  on  the  telerobot.  The  constraints  in  an  object  model 
specify  the  types  of  processing  that  are  necessary  to  ob¬ 
tain  this  information.  When  the  human  indicates  that 
a  road  is  located  somewhere,  this  constrains  the  type  of 
tracking  and  feature  extraction  processes  that  are  used. 
The  corresponding  image  areas  are  isolated  and  the  type 
of  segmentation  or  tracking  procedure  corresponding  to 
the  material  class  and  distance  of  the  object  is  applied. 
The  Graphics  Controller  deals  with  interactive  scene 
measurements  and  the  presentation  of  the  world  model 
to  the  user.  Thus  when  he  accesses  a  model  of  a  vehi¬ 
cle,  he  is  presented  with  a  cartoonish  three-dimensional 
vehicle  template  which  is  back  projected  onto  the  image 
being  interpreted. 

The  user  interface  is  currently  based  upon  windows 
for  displaying  imagery  and  graphical  overlays,  and  text- 
based  browsers  for  inspecting  entities  in  the  data  base 
in detail. This basic level of interface can be quite tedious
to work with and its future role will be to serve
as  a  debugging  tool.  An  intermediate,  near-term  sys¬ 
tem  interface  will  use  a  more  natural  set  of  tools  such  as 
three-dimensional  hand/finger  position  sensors  and  voice 
input.  Using  these,  the  human  will  actually  have  a  sense 
of  reaching  into  the  data  base  of  models,  grabbing  some¬ 
thing,  and  then  placing  it  into  the  world  model.  In  the 
eventual  system,  the  world  model  and  the  sensor  input 
from  the  different  telerobots  could  be  presented  to  the 
human  as  a  virtual  reality  in  which  the  human  can  be 
embedded  in  the  world  model  itself. 

3  Object  Models 

We  have  currently  developed  models  for  objects  corre¬ 
sponding  to  gravity,  the  immediate  ground  plane  sur¬ 
rounding  the  camera  from  which  an  image  is  obtained, 
terrain  patches,  and  a  generic  vehicle  along  with  con¬ 
straints  describing  relations  for  attachment,  alignment, 
and  coincidence.  There  is  a  simple  mechanism  to  in¬ 
voke  the  instantiation  of  models  based  upon  other  mod¬ 
els  that  have  been  instantiated  and  the  results  of  percep¬ 
tual  processing.  A  more  powerful  constraint  propagation 
mechanism  would  determine  consistency  of  relationships 
between  objects  in  the  world  model  data  base. 

Each  object  model  has  a  two-dimensional  image  reg¬ 
istered  display  that  can  be  interactively  manipulated. 
When  this  is  instantiated  it  sets  up  perspective  con¬ 
straints  with  respect  to  the  three  dimensional  object 
models  which  will  have  unspecified  parameter  values. 
For  example,  gravity  appears  as  a  two-dimensional  vec¬ 
tor  field  that  can  be  interactively  aligned  with  image 
features.  We  use  different  types  of  road  models  for  two 
and  three  dimensions.  The  two  dimensional  road  model 
is  a  sequence  of  connected  parallel  line  segments  for 
the  road  boundaries  and/or  the  center-line  of  the  road. 
This is used to indicate and mask image areas which
are  adjacent  to  the  road.  The  three  dimensional  road 
model  is  a  connected  sequence  of  segments  with  three- 
dimensional  coordinates  and  associated  road  width  infor¬ 
mation  with  constraints  on  allowable  orientations  with 
respect  to  gravity  and  adjacent  terrain  patches.  The 
end-points  of  the  linear  segments  in  the  two-dimensional 
display of a road model can constrain the three-dimensional
parameters for positioning the road model. Different
material properties can be associated with the roads,
but  this  currently  isn’t  used  by  the  segmentation  and 
feature  extraction  procedure. 

The  generic  vehicle  model  is  an  oriented  box  with  an 
indication  of  where  the  track/wheel  area  of  the  vehicle 
is,  where  the  engine  is  positioned,  and  where  the  cab 
area is. The scale and relative position of these are parameterized
and can be specialized for different types of
vehicles.  There  are  scale  and  orientation  constraints  on 
all  of  these  components  as  well  as  for  relative  position  to 
ground  surfaces  and  gravity  (see  Figure  2). 

4  Perceptual  Processing 

Image  processing  and  tracking  procedures  are  organized 
in  terms  of  the  type  of  information  they  depend  on  and 
can  extract.  One  type  of  tracker  depends  on  a  two  or 


Figure 2: Perspective view of the three-dimensional vehicle model

three  dimensional  road  model  and  can  yield  information 
to  instantiate  a  vehicle  model.  An  instantiated  vehicle 
model  constrains  the  extraction  of  features.  These  fea¬ 
tures  satisfy  the  requirements  of  another  type  of  tracker 
that  can  determine  a  scaled  three  dimensional  trajectory 
for  extracted  image  points.  The  information  determined 
by  this  tracker  can  in  turn  be  used  to  determine  a  three 
dimensional  road  model,  and  also  refine  the  attributes  of 
an  instantiated  vehicle  model.  As  a  result,  the  flow  of  in¬ 
formation  and  processing  varies  based  upon  the  state  of 
the current interpretation. The current processing routines
consist of three types of trackers and restricted
segmentation  and  interest  operators  which  are  applied 
when  a  vehicle  model  is  instantiated. 

4.1 Difference Tracker

The  difference  tracker  operates  with  respect  to  an  in¬ 
stantiated  two  or  three  dimensional  road  model.  It  de¬ 
termines  regions  above  the  indicated  road  areas  which 
are changing over time and are also moving in a consistent
direction (not necessarily along the road). It determines
information to instantiate a vehicle model by finding
the front and back (or only the back or the front) of
a  vehicle.  If  a  three  dimensional  road  model  has  been  in¬ 
stantiated,  it  can  further  constrain  the  dimensions  of  the 
generic  vehicle  model  instantiation.  It  also  restricts  the 
extraction  of  features  for  the  local  translational  tracker 
(Section  4.2)  which  can  in  turn  recover  the  direction  of 
motion  of  the  vehicle,  whether  it  is  turning,  and  the  cor¬ 
responding  direction  of  motion  relative  to  the  road. 

The  first  step  in  the  difference  tracker  is  to  reduce 
the  image  noise  by  convolving  consecutive  images  in  a 
motion  sequence  with  a  low-pass  filter.  If  no  models  are 
present,  the  entire  image  must  be  convolved  with  this 
filter.  However,  given  a  two-dimensional  road  model, 
the  filter  is  only  convolved  with  pixels  that  are  above 
the  road.  The  road  model  shown  in  Figure  3  is  used  to 
constrain  the  smoothing  process. 

Once  the  images  have  been  smoothed,  the  algorithm 
begins  to  search  for  areas  of  motion  that  lie  near  the 
road.  This  is  accomplished  through  image  subtraction. 
Pixels  from  temporally  consecutive  images  that  are  sit¬ 
uated  near  the  road  model  are  subtracted.  If  the  result 
of  this  subtraction  is  greater  than  a  threshold,  the  envi¬ 
ronmental  object  corresponding  to  this  pixel  position  is 


assumed  to  have  undergone  motion.  This  pixel  is  marked 
as  a  motion  pixel,  and  a  region  growing  process  begins. 

An  object  traveling  along  the  road  may  extend  some 
distance from the road (e.g. the object could be very
close  to  the  camera,  in  which  case  it  would  appear  to 
be  quite  large).  The  search  for  all  areas  of  motion  as¬ 
sociated  with  an  object  is  accomplished  through  region 
growing.  Once  a  pixel  near  the  road  has  been  identified 
as  a  motion  pixel,  its  neighbors  are  also  examined  using 
the  subtraction  technique  discussed  above.  If  any  of  the 
neighboring  pixels  contain  motion,  their  neighbors  are 
also  examined.  This  recursive  procedure  continues  until 
no  more  motion  pixels  can  be  found.  An  example  of  this 
extraction  of  the  areas  of  motion  is  shown  in  Figure  4. 
Once  the  areas  containing  motion  have  been  identified, 
the  centroid  of  these  areas  is  located.  Over  time  a  two 
dimensional  trajectory  can  be  constructed. 
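
The differencing and region growing steps described above can be sketched as follows. This is a minimal sketch under several assumptions that are not taken from the system itself: the two images are already smoothed and are given as lists of rows of intensities, seeds are restricted to pixels flagged in a boolean road mask, neighbors are 4-connected, and an explicit stack replaces the recursive formulation. The function name motion_regions is hypothetical.

```python
def motion_regions(prev, curr, road_mask, threshold):
    """Return a list of (pixels, centroid) for motion regions seeded on the road mask."""
    rows, cols = len(curr), len(curr[0])

    def moved(r, c):
        # A pixel is a motion pixel when the frame difference exceeds the threshold.
        return abs(curr[r][c] - prev[r][c]) > threshold

    seen = set()
    regions = []
    for r in range(rows):
        for c in range(cols):
            if not road_mask[r][c] or (r, c) in seen or not moved(r, c):
                continue
            # Grow the region from this seed; growth may extend beyond the mask,
            # since an object on the road can extend some distance from it.
            stack, pixels = [(r, c)], []
            seen.add((r, c))
            while stack:
                y, x = stack.pop()
                pixels.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and (ny, nx) not in seen and moved(ny, nx)):
                        seen.add((ny, nx))
                        stack.append((ny, nx))
            cy = sum(p[0] for p in pixels) / len(pixels)
            cx = sum(p[1] for p in pixels) / len(pixels)
            regions.append((pixels, (cy, cx)))
    return regions
```

Collecting the centroid returned for successive image pairs gives the two-dimensional trajectory mentioned above.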

4.2 Local Translation Tracker

Moving  vehicles  can  often  be  treated  as  rigid  objects 
which  are  translating  over  short  periods  of  time.  For 
example,  as  a  vehicle  goes  around  a  curve,  because  of 
turning  radii  constraints,  the  axis  of  rotation  is  often  far 
away  from  the  vehicle  itself  and  the  vehicle  motion  can  be 
treated  as  a  sequence  of  small  translations  corresponding 
to  tangents  of  the  curve  of  motion.  The  local  translation 
based  tracker  determines  the  direction  of  motion  of  a  set 
of  extracted  image  points  over  time,  and  fits  their  motion 
to  an  estimate  of  the  current  direction  of  motion  of  the 
corresponding  vehicle  in  three  dimensions.  Essentially, 
it  determines  the  direction  of  motion  of  a  set  of  environ¬ 
mental  points  over  time.  The  effect  of  this  tracker  can 
be  visualized  as  a  unit  sphere  with  an  axis  correspond¬ 
ing  to  the  current  direction  of  motion.  As  the  vehicle 
and  the  corresponding  set  of  points  move,  the  position 
of the axis changes with respect to the sphere. This processing
works well with temporal filters since there are
constraints on how quickly a vehicle can change its direction
of motion. This can also be used to determine
if  a  vehicle  is  rotating  with  respect  to  an  axis  contained 
within  the  vehicle.  This  is  indicated  by  areas  of  the  im¬ 
age  which  show  differences  over  time,  but  for  which  no 
clear  axis  of  translation  can  be  determined. 

This  tracking  algorithm  is  based  on  the  strong  geo¬ 
metric  constraints  on  image  motion  in  the  case  of  trans¬ 
lational  motion  (radial  motion  of  image  features  from  a 
focus  of  expansion,  determined  by  the  intersection  of  the 
direction of translation with the imaging surface) [Lawton,
1982]. The algorithm evaluates an error measure
which associates, with a potential axis of translation, the
quality of feature displacements along the corresponding
radial  flow  paths.  This  error  measure  is  evaluated  by 
searching  over  a  unit  sphere  which  describes  all  poten¬ 
tial  directions  of  translation.  It  is  possible  to  determine 
the  direction  of  translation  to  within  a  few  degrees  in 
small  image  areas,  using  only  a  few  features. 

If  there  is  an  instantiated  three-dimensional  road 
model  and  a  rough  estimate  of  the  position  of  the  ve¬ 
hicle  along  the  road  has  been  established,  the  tangent 
information  associated  with  the  road  model  can  be  used 
to  initialize  the  search  for  the  axis  of  translation.  If  there 


is  an  instantiated  vehicle  model,  it  restricts  the  features 
that  the  local  translational  tracker  uses. 

4.2.1  Feature  Extraction 

The  local  translation-based  tracker  requires  features 
which  can  be  matched  in  successive  images.  The  type  of 
features  we  use  are  conventional  masks  of  image  pixels, 
extracted from distinct areas of the image. In the examples
shown in this paper, the masks are 5 x 5 pixel arrays.
We use normalized correlation [Ballard and Brown,
1982] to determine similarity of extracted features. This
is  used  in  measuring  feature  distinctiveness  and  for  eval¬ 
uating  the  matches  of  extracted  features  along  the  radial 
flow  determined  by  a  possible  axis  of  translation.  Since 
the  radial  flow  lines  do  not  necessarily  pass  through  the 
center  of  the  image  pixel  arrays,  we  use  bilinear  interpo¬ 
lation  for  matching  features. 

The  distinctiveness  of  a  feature  is  1  minus  the  best 
correlation  value  obtained  when  the  feature  is  correlated 
with  its  immediately  neighboring  areas.  Good  features 
are  selected  by  finding  the  local  maxima  in  the  values 
of  the  distinctiveness  measure  over  an  image.  We  con¬ 
strain  the  neighborhoods  over  which  the  features  are  se¬ 
lected  to  areas  that  contain  large  intensity  discontinu¬ 
ities,  determined  by  extracting  zero-crossings.  The  area 
of feature extraction is further constrained by the output
of  the  difference  tracker  or  an  instantiated  vehicle  and 
road  model.  The  distinctiveness  measure  is  then  applied 
only  to  these  restricted  areas  in  an  image.  This  gener¬ 
ally  results  in  the  extraction  of  areas  of  high  curvature 
along  the  zero-crossing  contours.  In  addition,  as  a  vehi¬ 
cle  is  tracked  over  a  sequence  of  images,  this  processing 
is  continually  reapplied  to  find  features  in  addition  to 
those  that  have  matched  successfully.  These  can  corre¬ 
spond  to  new  features  due  to  occlusions  or  changes  in 
observable  detail  as  a  vehicle  moves  in  depth. 
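
The distinctiveness measure described above can be illustrated with a short sketch. It assumes the plain (not mean-subtracted) form of normalized correlation, an image stored as a list of rows, and a comparison against the eight patches shifted by one pixel; these choices and the function names are assumptions made for illustration only. The caller is assumed to keep (r, c) at least half + reach pixels away from the image border.

```python
import math


def normalized_correlation(a, b):
    """Normalized correlation of two equal-length lists of pixel values."""
    denom = math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))
    if denom == 0.0:
        # At least one patch is all zeros: identical patches match fully,
        # otherwise treat them as uncorrelated (an assumption of this sketch).
        return 1.0 if a == b else 0.0
    return sum(x * y for x, y in zip(a, b)) / denom


def patch(image, r, c, half):
    """Flatten the (2*half+1) x (2*half+1) mask centered at (r, c)."""
    return [image[y][x]
            for y in range(r - half, r + half + 1)
            for x in range(c - half, c + half + 1)]


def distinctiveness(image, r, c, half=2, reach=1):
    """1 minus the best correlation with the immediately neighboring masks."""
    center = patch(image, r, c, half)
    best = max(normalized_correlation(center, patch(image, r + dr, c + dc, half))
               for dr in range(-reach, reach + 1)
               for dc in range(-reach, reach + 1)
               if (dr, dc) != (0, 0))
    return 1.0 - best
```

With half=2 the masks are the 5 x 5 arrays used in the examples; a uniform image region scores 0, while an isolated bright point scores close to 1.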

4.2.2  Determining  the  Direction  of  Translation 

Features  in  image  sequences  will  move  along  radial 
lines  defined  by  the  focus  of  expansion  (FOE)  during 
translational  motion.  The  FOE  is  determined  by  inter¬ 
secting  the  direction  of  translation  with  the  imaging  sur¬ 
face  (where  the  direction  of  translation  emanates  from 
the  focal  point  of  the  camera).  Using  this  geometric  re¬ 
lationship,  the  displacement  paths  of  all  image  features 
can  be  determined  for  a  potential  direction  of  transla¬ 
tion.  To  evaluate  a  potential  direction  of  translation, 
we  search  for  each  feature  along  the  appropriate  image 
displacement  paths.  The  error  measure  used  to  evaluate 
this  potential  direction  of  translation  is  determined  by 
summing  the  best  matches  for  each  of  the  features. 

To  search  for  the  direction  of  translation  we  use  a  unit 
sphere  centered  at  the  focal  point  of  the  camera.  Any 
vector  which  has  its  initial  point  at  the  camera’s  focal 
point  and  its  terminal  point  resting  on  the  surface  of 
the  sphere  is  a  potential  direction  of  translation.  The 
search  procedure  is  defined  with  respect  to  this  sphere 
instead  of  the  potential  positions  of  the  FOE  in  the  image 
plane. This is because the sphere is a bounded surface,
which makes uniform global sampling of the error measure
feasible. When the image plane is used directly, the
resolution in the position of the translational direction
varies.
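
The varying resolution of the image plane can be seen in a small numerical sketch. It assumes a pinhole camera with the optical axis along z and an image plane at distance focal_length from the focal point; the function names are hypothetical. Equal angular steps in the direction of translation produce increasingly large steps in the position of the FOE as the direction approaches the image plane.

```python
import math


def focus_of_expansion(t, focal_length=1.0):
    """Intersect the translation direction t = (tx, ty, tz) with the image plane."""
    tx, ty, tz = t
    if abs(tz) < 1e-12:
        return None  # translation parallel to the image plane: FOE at infinity
    return (focal_length * tx / tz, focal_length * ty / tz)


def direction_from_angle(theta_deg):
    """Unit direction in the x-z plane, theta_deg away from the optical axis."""
    th = math.radians(theta_deg)
    return (math.sin(th), 0.0, math.cos(th))


def foe_shift(theta_a_deg, theta_b_deg):
    """Distance moved by the FOE when the direction turns from theta_a to theta_b."""
    xa, _ = focus_of_expansion(direction_from_angle(theta_a_deg))
    xb, _ = focus_of_expansion(direction_from_angle(theta_b_deg))
    return abs(xb - xa)
```

For example, foe_shift(70, 80) is more than ten times foe_shift(0, 10), although both correspond to the same 10 degree change in direction; on the unit sphere the two steps have equal size.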

The initial search process consists of two phases: an
initial  global  sampling  of  the  sphere,  followed  by  a  local 
search  for  the  maximum  value.  The  local  search  begins 
at  the  position  of  the  maximum  value  as  determined  by 
the  global  sampling.  The  local  search  process  recursively 
searches  the  area  of  the  current  maximum.  The  step  size 
of  the  local  search  processes  is  reduced  until  it  is  at  the 
desired  resolution  for  the  determination  of  the  direction 
of  translation.  Figure  6  shows  a  sequence  of  tessellated 
spheres  along  with  their  potential  directions  of  transla¬ 
tion.  Once  a  direction  of  motion  has  been  established,  it 
will  tend  to  change  smoothly  and  we  can  then  use  gra¬ 
dient  based  techniques  to  track  the  axis  of  translation 
for  successive  images.  In  addition,  if  there  is  an  oriented 
vehicle  model  or  a  road  model  segment,  the  search  for 
the  translational  axis  is  constrained  to  limited  areas  of 
the  sphere. 
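
The two-phase search can be sketched as follows. This is a simplified illustration rather than the procedure used in the system: it samples the sphere on an azimuth/elevation grid, which is not a uniform tessellation, and it treats the measure as a score to be maximized, supplied by the caller as a function of a unit direction. All names and default parameters are assumptions.

```python
import math


def unit(az, el):
    """Unit vector for azimuth az and elevation el (radians)."""
    return (math.cos(el) * math.cos(az), math.cos(el) * math.sin(az), math.sin(el))


def search_direction(score, coarse_steps=12, resolution=math.radians(0.5)):
    """Global sampling of the sphere followed by a local search with shrinking steps."""
    step = math.pi / coarse_steps
    # Phase 1: evaluate the score on a coarse grid covering the whole sphere.
    candidates = [(-math.pi + i * step, -math.pi / 2 + j * step)
                  for i in range(2 * coarse_steps)
                  for j in range(coarse_steps + 1)]
    best = max(candidates, key=lambda ae: score(unit(*ae)))
    # Phase 2: search around the current maximum, halving the step size
    # until the desired angular resolution is reached.
    while step > resolution:
        step /= 2
        az, el = best
        neighbours = [(az + da * step,
                       max(-math.pi / 2, min(math.pi / 2, el + de * step)))
                      for da in (-1, 0, 1) for de in (-1, 0, 1)]
        best = max(neighbours, key=lambda ae: score(unit(*ae)))
    return unit(*best)
```

A constraint from an oriented vehicle model or a road segment would correspond to restricting the candidates in the first phase to a limited area of the sphere.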

4.3 Planar Tracker

Often  the  motion  of  a  vehicle  is  restricted  to  a  plane 
determined  by  the  local  road  or  surface  orientation.  In 
this  case,  the  geometry  of  planar  perspective  makes  it 
possible  to  associate  three  dimensional  information  with 
extracted  image  features  if  they  are  contained  in  the  area 
of  the  planar  patch.  In  addition,  the  directions  of  motion 
are  constrained  to  be  parallel  to  this  plane,  so  the  pos¬ 
sible  directions  of  motion  for  the  local  translation-based 
tracker  are  restricted  to  a  circle  on  the  unit  sphere  whose 
orientation  is  parallel  to  the  plane  of  motion.  This  sim¬ 
plifies  initialization  and  also  tracking  of  the  axis  of  trans¬ 
lation  over  time. 

There is another useful constraint associated with planar
motion that may not be immediately apparent. In
this case, an environmental displacement vector v must
be perpendicular to the normal of the plane of motion. The
vector v also lies in the plane determined by its corresponding
image displacement and the focal point of the camera. The
direction of environmental motion can be determined by
intersecting these planes. This is useful for tracking planar
motion without the constraints supplied by a road
model.
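
The intersection of the two planes can be written with two cross products. The sketch below assumes a pinhole camera with the focal point at the origin and the image plane at z = focal_length; the function names are hypothetical. The normal of the plane through the focal point and the image displacement is the cross product of the two viewing rays, and the motion direction is perpendicular to both that normal and the normal of the plane of motion. The result is determined only up to sign.

```python
import math


def cross(a, b):
    """Cross product of two 3-vectors."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def planar_motion_direction(plane_normal, p1, p2, focal_length=1.0):
    """Unit direction (up to sign) of 3-D motion for the image displacement p1 -> p2."""
    ray1 = (p1[0], p1[1], focal_length)
    ray2 = (p2[0], p2[1], focal_length)
    # Normal of the plane containing the focal point and the image displacement.
    view_plane_normal = cross(ray1, ray2)
    # The motion direction lies in both planes, so it is perpendicular to both normals.
    v = cross(plane_normal, view_plane_normal)
    norm = math.sqrt(sum(c * c for c in v))
    if norm < 1e-12:
        return None  # no displacement, or the two planes coincide
    return tuple(c / norm for c in v)
```

For a point moving from (1, -1, 5) to (2, -1, 4) on a plane with normal (0, 1, 0), the image points are (0.2, -0.2) and (0.5, -0.25), and the recovered direction is parallel to the true displacement (1, 0, -1).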

4.4  Feature  Extraction  from  a  Model 

When the vehicle model is instantiated, it constrains segmentation
and feature extraction procedures to a limited
image area. In addition to the feature and zero-crossing
extraction described above, we use histogram-based segmentation
to determine potential vehicle features.

An  instantiated  vehicle  model  can  also  constrain  the 
places  to  search  for  detailed  features  corresponding  to 
portions  of  the  vehicle  which  can  be  tracked.  A  particu¬ 
lar  problem  we  have  found  is  that  it  is  necessary  to  have 
a  large  image  area  to  get  clear  views  of  the  features  to  be 
matched  to  the  model.  Images  of  the  vehicle  will  need  to 
be  larger  to  begin  finding  detailed  features  such  as  head¬ 
lights,  bumpers,  and  so  forth.  Such  images  could  per¬ 
haps  be  obtained  by  using  one  of  the  trackers  to  direct 
a  zoom  camera  to  follow  a  moving  vehicle.  Currently 
we  use  the  interest  operator  described  for  the  transla¬ 


tional  tracker  to  match  extracted  features  to  a  vehicle 
model  for  each  successive  image.  If  extracted  features 
are  near  previously  extracted  features  that  have  success¬ 
fully  matched  they  are  discarded.  Otherwise,  they  are 
associated  with  the  instantiated  vehicle  model. 

5  User  Interface  and  Model 
Instantiation 

An  important  facility  in  the  user  interface  is  a  conven¬ 
tional  depth  buffer  used  for  hidden  surface  removal  which 
has  been  modified  to  have  pointers,  ordered  by  depth,  to 
all  the  objects  in  the  world  that  project  onto  a  given  pixel 
in  the  image.  Thus,  when  the  human  “touches”  a  pixel 
in  the  image  from  the  telerobot,  he  can  access  all  the 
objects  in  the  world  model  that  project  onto  that  pixel. 
We  call  this  an  augmented  depth  buffer. 
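
One possible data structure for an augmented depth buffer is sketched below. It is an illustration under assumptions, not a description of the implemented interface: each pixel holds a list of (depth, object) entries kept sorted by depth, and the class and method names are hypothetical.

```python
import bisect
import itertools


class AugmentedDepthBuffer:
    """Per-pixel lists of world-model objects, ordered from nearest to farthest."""

    def __init__(self, width, height):
        self.cells = [[[] for _ in range(width)] for _ in range(height)]
        # Running counter used to break depth ties without comparing objects.
        self._order = itertools.count()

    def add(self, x, y, depth, obj):
        """Record that obj projects onto pixel (x, y) at the given depth."""
        bisect.insort(self.cells[y][x], (depth, next(self._order), obj))

    def visible(self, x, y):
        """Nearest object at the pixel, as in a conventional depth buffer."""
        cell = self.cells[y][x]
        return cell[0][2] if cell else None

    def pick(self, x, y):
        """All objects projecting onto the pixel, ordered by depth."""
        return [obj for _, _, obj in self.cells[y][x]]
```

Touching a pixel then corresponds to a call such as pick(x, y), which returns every instantiated object along that ray of projection rather than only the nearest one.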

The  user  interface  enables  the  human  to  place  objects 
into  the  world  model  in  several  ways.  He  can  access  the 
objects  and  manipulate  them  via  their  three  dimensional 
attributes  with  respect  to  a  coordinate  system  linked  to 
the  world  model.  This  looks  like  back  projecting  a  three- 
dimensional  cartoon  of  the  object  onto  the  image.  When 
it  has  been  positioned  as  desired,  the  different  compo¬ 
nents  of  the  object  can  be  placed  in  the  augmented  depth 
buffer  associated  with  the  image.  In  this  way,  the  pro¬ 
jected  attributes  of  the  instantiated  object  can  access  the 
actual  image  or  the  results  of  image  processing  routines. 
The  user  can  burn-in  attributes  when  he  instantiates  an 
object. Burning-in means that the attributes cannot be
changed.  This  often  involves  constraining  a  particular 
feature  to  lie  along  a  given  ray  of  projection.  Another 
technique is for the user to directly draw the specified
object on the sensory input and then indicate its attributes.
An example of this is interactively segmenting
an image into different types of terrain patches and
pointing  out  that  different  edges  correspond  to  terrain 
feature  discontinuity. 

6  Processing  Example 

An  example  of  this  processing  is  shown  in  Figures  3-6. 
Figure  4  shows  a  sequence  of  images  obtained  with  a 
video camera viewing a road scene. In Figure 3 a human
has interactively positioned a generic vehicle model with
respect  to  the  road  and  has  begun  to  “drive”  the  model 
vehicle  through  three  dimensions  while  using  the  back- 
projection  of  the  vehicle  as  a  three  dimensional  cursor. 
Note  the  center  segments  of  the  road  being  laid  down 
behind  the  vehicle.  This  establishes  a  two-dimensional 
road  mask  and  also  an  initial  set  of  connected  three- 
dimensional  road  segments  to  constrain  later  processing. 
Figure  4  shows  connected  regions  of  image  differences 
moving  in  a  consistent  direction  with  respect  to  the  user 
instantiated  road  model.  These  correspond  to  the  front 
and  back  of  a  vehicle.  Since  orientation  is  known  along 
the  road  and  the  road  model  has  been  scaled  relative  to 
the  generic  road  model,  it  is  possible  to  use  these  areas 
to instantiate a three dimensional vehicle model. Figure
5 shows interesting points which have been extracted
in the corresponding areas determined by the vehicle
model. These features are then used by the translational
tracker to refine the estimate of vehicle and road orientation.
The determined successive directions of translation
are shown in Figure 6. More and more features are associated
with the vehicle model over time.

Figure 4: Areas of motion and vehicle position found through differencing

7  Future  work 

Our current work involves:

• Extending the number and complexity of the models
that are used, along with a more general constraint
propagation mechanism.

•  Extending  the  user  interface  to  use  a  wide  range  of 
interactive  devices  such  as  a  data  glove  and  other 
three-dimensional  positioning  devices. 

•  Integrating  the  local  translational  tracker  with  a 
Kalman  Filter  for  processing  over  time. 

•  Using  multiple  cameras  from  different  points  of 
view  with  respect  to  the  same  scene. 
Figure 6: Translational motion spheres corresponding to the
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Abstract 

In this paper we describe a purely topological
method for navigation in a large unstructured
environment that contains featureless objects,
using qualitative non-metric information such as
“isolated” landmarks and “trajectories”, which
we define. The map-maker and the navigator are
implemented using an IBM 7575 SCARA robot
arm, PIPE, and two cameras. The navigational
environment consists of a flat plane with identical
spherical objects populated randomly but
densely on it. First, the map-maker model observes
the environment, and given a starting position
and a goal position, it generates a “custom
map” that describes in a non-metric language
how to get from the starting position to the goal
position efficiently and reliably. The accuracy
and the cost of the directional instructions are
analyzed, then demonstrated by the navigator
by following the commands in the custom map.
Several non-intuitive and ill-specified aspects of
navigation in this manner are then discussed.

1  Introduction 

Navigation in a large unstructured environment
requires different information and tools than
navigation in a small structured environment.
To guide a navigator in such an environment,
the direction giver has to produce a set of
directional instructions that contains
sufficient information for accurate navigation
but does not overburden the
navigator with too much information [Streeter et
al., 1985]. In this paper, we explore the effectiveness
of navigation using “isolated” landmarks
and “trajectories”, both of which make use of
qualitative rather than quantitative information.

2  Definitions 

2.1  The  world,  the  map-maker,  and 
the  navigator 

There  are  two  major  modules  in  this  project, 
namely,  the  map-maker  and  the  navigator.  They 
both  operate  on  the  navigational  world. 

The  navigable  world 

The navigational environment that we are interested
in is a three dimensional world, although
the current implementation of the navigator (due
to its restricted degrees of freedom) makes
the effective environment two dimensional. The
navigational terrain itself is a flat surface that is
visually uniform all over. The objects
scattered over this flat surface are spherical,
such as marbles, and uniform in size. There are
no restrictions on where the objects are placed,
except that no object may be placed
on top of another. The two assumptions
about the objects - that they are uniform in
size, and that they are placed randomly - emphasize
the spatial and topological problems of
doing vision and navigation “in the large”. That
is, firstly, an object can no longer be described
by its intrinsic attributes, such as shape, size, or
color. Therefore, in order to describe an object,
the geometrical relationship of the target object
to its neighboring objects must be considered.
Secondly, the directions of the movements of the
navigator have no external reference to rely on.
Our goal is to be able to throw a number of marbles
on a table and have our robot successfully
navigate through this random world. To a large
extent, we have succeeded, but by using disks.
                 Map-maker               Navigator
Objective        Generate a custom map   Navigate using a custom map
Visibility       Infinite                Limited to current view window
Metric ability   Yes                     Only within current view window
Intelligence     Omniscient              Limited to interpreting the custom map
Memory           Large                   None
Computing power  Fast                    Slow

Table 1: Capabilities and limits of the map-maker and the navigator


The  map-maker 

The  purpose  of  the  map-maker  is  to  generate  a 
“custom  map”,  consisting  of  directional  instruc¬ 
tions,  so  that  the  navigator  will  be  able  to  find 
its  way  to  the  destination  by  executing  each  of 
the  instructions  in  a  sequence.  The  map-maker 
is  assumed  to  be  omniscient  and  error  free.  It 
sees  the  whole  environment  and  knows  the  exact 
position of each object that exists. The map-
maker  also  knows  the  capabilities  and  the  lim¬ 
its  of  the  navigator.  Therefore,  the  custom  map 
contains  directional  instructions  that  the  navi¬ 
gator  can  handle.  The  communication  between 
the  map-maker  and  the  navigator  is  done  off¬ 
line  (one-way,  one-time).  This  means  that  the 
custom map is given to the navigator at the beginning
of the journey and the navigator’s only
source  of  directional  information  is  the  custom 
map. 

The  navigator 

The  navigator’s  capabilities  are  much  more  lim¬ 
ited.  Its  view  window  size  is  very  small  rela¬ 
tive  to  the  environment.  It  has  limited  metric 
measurement  capabilities  only  within  its  current 
view window. However, the accuracy of its
metric measurements is assumed to be low. For
example, the navigator can move the window position
so that a visible reference object is positioned
near one of the four corners of the view
window, but it does not have the accuracy to pinpoint
the coordinates of a visible object. Therefore,
it is not possible for the map-maker to give
the navigator the (x, y) coordinates of a landmark
as part of the directional instructions. The
navigator  is  not  intelligent  enough  to  decide  on 
its  own  what  it  should  do,  and  is  thus  totally  de¬ 
pendent  on  the  custom  map  that  it  is  given.  It 
does  not  have  any  memory.  Therefore,  the  cus¬ 
tom  map  can  be  considered  as  the  navigator’s 
intelligence. Table 1 summarizes the assumptions
that we make about the map-maker and the
navigator.

Figure 1: Randomly populated environment of
disks on a table

2.2  Parkways  and  trajectories 

First,  the  map-maker  captures  the  entire  world 
by  taking  the  image  of  the  world  from  a  vertically 
high  location.  The  positions  of  the  populated  ob¬ 
jects  are  recorded.  Figure  1  shows  an  example 
of  a  randomly  populated  world  with  100  objects. 
The  map-maker  is  assumed  to  know  in  advance 
the  capabilities  of  the  navigator,  such  as  view 
window  size  and  degrees  of  freedom.  The  map- 
maker  further  abstracts  the  world  into  a  graph 
data  structure,  with  vertices  and  edges.  There 
are  many  ways  to  decide  whether  or  not  two  ver¬ 
tices  (objects)  in  this  graph  are  connected  by  an 
edge.  One  way  is  to  define  that  two  objects  are 
“connected”  if  the  two  objects  can  be  viewed  in 
the same view window of the navigator. By applying
the connected component algorithm using
the above definition of “connectedness”, we can
then generate discrete sets of connected components.
We define a parkway to be such a connected
component. Figure 2 shows parkways
formed on a world of 100 objects using a window
of size 10 x 10 in a world of size 100 x 100. The
size of the view window relative to the world is
shown in the lower left corner of the figure. The
arcs between points indicate that the objects are
connected, that is, both are simultaneously visible
in the same window. To find the shortest path
between two objects in the same parkway, we
apply Dijkstra’s shortest path algorithm.

Figure 2: Parkways, the paths that allow step
by step movements of the navigator
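
The parkway construction can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: objects are treated as points, the view window is taken to be an axis-aligned square of side `window`, so two objects fit in one window exactly when both coordinate differences are at most `window`, and edges are weighted by Euclidean distance. The function names are hypothetical.

```python
import heapq
from math import hypot


def build_graph(points, window):
    """Connect two objects when one window x window view can contain both."""
    graph = {i: [] for i in range(len(points))}
    for i, (xi, yi) in enumerate(points):
        for j in range(i + 1, len(points)):
            xj, yj = points[j]
            if abs(xi - xj) <= window and abs(yi - yj) <= window:
                d = hypot(xi - xj, yi - yj)  # assumed edge cost
                graph[i].append((j, d))
                graph[j].append((i, d))
    return graph


def parkways(graph):
    """Connected components of the visibility graph, as sorted index lists."""
    seen, components = set(), []
    for start in graph:
        if start in seen:
            continue
        stack, component = [start], []
        seen.add(start)
        while stack:
            node = stack.pop()
            component.append(node)
            for neighbor, _ in graph[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    stack.append(neighbor)
        components.append(sorted(component))
    return components


def shortest_path(graph, source, target):
    """Dijkstra's algorithm; returns (length, path) or (inf, []) if unreachable."""
    best = {source: 0.0}
    previous = {}
    queue = [(0.0, source)]
    while queue:
        dist, node = heapq.heappop(queue)
        if node == target:
            path = [node]
            while node in previous:
                node = previous[node]
                path.append(node)
            return dist, path[::-1]
        if dist > best.get(node, float("inf")):
            continue  # stale queue entry
        for neighbor, weight in graph[node]:
            candidate = dist + weight
            if candidate < best.get(neighbor, float("inf")):
                best[neighbor] = candidate
                previous[neighbor] = node
                heapq.heappush(queue, (candidate, neighbor))
    return float("inf"), []
```

Two objects in different parkways come back as unreachable, which is the case the trajectory method is meant to handle.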


To  find  a  path  between  two  objects  that  are  in 
mutually  separate  parkways,  we  need  to  be  able 
to  devise  methods  to  transfer  between  parkways. 
At  some  point  of  the  traversal,  the  navigator  has 
to  leave  a  parkway  and  get  to  the  other  park¬ 
way  without  getting  lost.  In  our  method,  we 
“slide”  the  view  window  in  the  direction  formed 
by  two  objects  within  the  view  window  until  a 
new  object  is  reached.  The  inter-parkway  paths 
generated  by  this  method  are  defined  as  “tra¬ 
jectories”.  The  overall  shortest  path  is  then  the 
appropriate  combination  of  parkway  paths  and 
the  trajectory  paths.  Figure  3  shows  an  example 
of  trajectories  computed  on  our  random  world. 


Figure  3:  Trajectories 


The small boxes are the navigator’s view windows.
The straight lines are the trajectories of
the windows in the direction of the sliding movement.
One end of a line is attached to the window
at the position at which the new object is
expected to appear. The other end of the line is
the object that this sliding window is seeking.

2.3  Description  language 

The map-maker needs to generate a custom map
that describes how the navigator may follow the
landmarks along the computed shortest path. In
this section, we explore the issues in designing
the language for the custom map. At any given
instant, the navigator will have only a small portion
of the world in its view, which may contain
several objects. The map-maker has to be able to
describe what the navigator sees in order for the
navigator to be able to distinguish a particular
object to use as the next reference point. This
is a hard problem because of our assumption that
the navigable environment is comprised
of point-like objects that are randomly placed.
If the robot had arbitrarily fine accuracy, infinite
memory, and an extremely fast processor, it could
keep the bitmap image of each possible view, or
at least the coordinates of each landmark within
each possible window. But realistically, we need
some invariants, such as colors or shapes, to describe
the immediate environment. Since our
world is a monochrome point-like world, we need
a qualitative language that can describe the geometrical
relations of the visible objects. The
level of detail in the description process would
depend on the intelligence and the mobility
of the navigator; our exploration here deliberately
emphasizes the topological aspects of navigation.


Some vocabulary follows. An object is defined
to be “obvious” if it is the only object
visible other than the reference landmark.
Two or more objects are defined to be “confusable”
if it does not matter which of them is
chosen as the next reference point. The term
“new” objects refers to the objects that have newly
come into the view window as a result of the robot’s
movement. The term “isolated” point refers to
a single isolated object found by applying a
clustering algorithm. The term “isolated pair”
refers to a single pair of isolated objects found by
a clustering algorithm. Figure 4 illustrates
these methods. All have been or are being
implemented, but we will only talk about one
of these vocabulary terms in this paper. Figure
5 shows an example of navigation using
the description language. The custom map entries
corresponding to each movement are shown to
the right of the figure. Each entry is of the form
(D, [T], M), where D is a description language
vocabulary term that identifies a landmark, [T] is an
optional term that indicates a trajectory, and M
is the corner designator that indicates at which of the
4 corners (SE, SW, NE, NW) the chosen landmark
is to be positioned. The symbol * is the
wildcard that matches any D or M.
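
One possible in-memory form of the (D, [T], M) entries is sketched below. The `Entry` type, the parsing functions, and the textual layout with parentheses and commas are assumptions for illustration; the text does not specify the format of the custom map file.

```python
import re
from typing import NamedTuple

CORNERS = {"SE", "SW", "NE", "NW"}
WILDCARD = "*"


class Entry(NamedTuple):
    descriptor: str   # D: e.g. "isolated", "obvious", or "*"
    trajectory: bool  # [T]: True when the optional trajectory term is present
    corner: str       # M: "SE", "SW", "NE", "NW", or "*"


def parse_entry(text):
    """Parse 'D, M' or 'D, trajectory, M' (parentheses optional) into an Entry."""
    fields = [f.strip() for f in text.strip().strip("()").split(",")]
    if len(fields) == 2:
        descriptor, corner = fields
        trajectory = False
    elif len(fields) == 3 and fields[1] == "trajectory":
        descriptor, _, corner = fields
        trajectory = True
    else:
        raise ValueError(f"malformed custom map entry: {text!r}")
    if corner not in CORNERS | {WILDCARD}:
        raise ValueError(f"unknown corner designator: {corner!r}")
    return Entry(descriptor, trajectory, corner)


def parse_custom_map(text):
    """Parse a parenthesised list of entries into a list of Entry tuples."""
    return [parse_entry(body) for body in re.findall(r"\(([^()]*)\)", text)]


def matches(pattern, value):
    """The wildcard matches any descriptor or corner designator."""
    return pattern == WILDCARD or pattern == value
```

The navigator would then step through the list in order, using `matches` wherever an entry leaves D or M unconstrained.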


3  Isolated  landmark  following  and 
trajectory  traversal 


[Figure 4 panels: linear ordering, spiral based ordering, obvious point, new objects]

Figure 4: Examples of description language, only
one of which (isolated point) this paper analyzes
and implements


[Custom map entries shown beside Figure 5:]

6. (isolated, *)
4. (obvious, SE)
3. (obvious, SE)
2. (isolated, trajectory, SW)
1. ([illegible], SW)

Figure 5: An example of navigation using a custom
map, showing isolated point, obvious point,
trajectory, and isolated pair


In this paper, we present one navigational technique
using the isolated point descriptor and the
trajectory method. Parkway traversal is done
solely by following isolated landmarks. Parkway
crossing is done solely by using the trajectory method.
As stated, other descriptors and methods
have been or are being developed [Park, 1993].


3.1  Calculating  the  isolated  point 

Our  algorithm  to  compute  the  most  isolated 
point  in  a  scene  consisting  of  n  point-like  objects 
uses the concept of mutual neighborhoods [Gowda
and Krishna, 1978]. (Several other definitions of
“isolated”  led  to  unstable  performance  or  costly 
computations.)  The  algorithm  is  as  follows: 
Figure 6: Finding the most isolated point in a
window according to the mnv algorithm

     a   b   c   d
a    0   5   4   6
b    5   0   2   3
c    4   2   0   4
d    6   3   4   0

Table 2: The mnv matrix

1. For each pair of points in the scene compute
the mnv (mutual neighborhood value). The
mnv of two points a and b is the sum of two
numbers, representing the order of how close
b is to a and the order of how close a is to b
relative to all the other objects. For example,
if b is the 2nd closest point to a, and a is the 3rd
closest point to b, then the mnv is 5. The
result is stored in an n x n matrix where n is
the number of visible objects. The objects
in figure 6 have the mnv matrix of table 2.

2. For each column of the mnv matrix, find the
smallest value greater than 0. This value is
the mnv between this particular object
(the one that corresponds to this particular column)
and its “closest” neighbor. Call this value
the “c-value” of this particular object. So in
our example, the c-values for a, b, c, and d
are 4, 2, 2, and 3 respectively.

3. The object (column) that has the largest c-value
is the most isolated point in the scene. In
our example, a has the largest c-value (4) and
therefore it is the most isolated point. Note
that the c-value is a small integer of value at
most n.

Using  the  isolated  point  concept  as  the  “con¬ 
nected”  definition,  we  can  form  an  “isolated” 
parkway  of  paths  between  isolated  landmarks. 
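
The three steps of the isolated point algorithm can be sketched directly. The function names are hypothetical, Euclidean distance is assumed for the closeness ordering, and ties in the c-value are resolved here by taking the first object, which is an arbitrary choice for the sketch. At least two points are required.

```python
from math import hypot


def mnv_matrix(points):
    """Step 1: mutual neighborhood values for 2-D points given as (x, y)."""
    n = len(points)
    # rank[i][j] = k means that j is the k-th closest object to i
    rank = [[0] * n for _ in range(n)]
    for i, (xi, yi) in enumerate(points):
        others = sorted(
            (j for j in range(n) if j != i),
            key=lambda j: hypot(points[j][0] - xi, points[j][1] - yi),
        )
        for order, j in enumerate(others, start=1):
            rank[i][j] = order
    return [[rank[i][j] + rank[j][i] for j in range(n)] for i in range(n)]


def c_values(mnv):
    """Step 2: smallest off-diagonal (non-zero) entry in each column."""
    n = len(mnv)
    return [min(mnv[i][j] for i in range(n) if i != j) for j in range(n)]


def most_isolated(mnv):
    """Step 3: index of the object with the largest c-value (first on ties)."""
    values = c_values(mnv)
    return max(range(len(values)), key=values.__getitem__)


# The mnv matrix of table 2, with rows and columns ordered a, b, c, d.
TABLE_2 = [
    [0, 5, 4, 6],
    [5, 0, 2, 3],
    [4, 2, 0, 4],
    [6, 3, 4, 0],
]
```

Applied to `TABLE_2`, `c_values` gives 4, 2, 2, 3 and `most_isolated` selects index 0, that is, object a, in agreement with the worked example in the text.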

3.2  Trajectory  traversal 

The basic idea in the trajectory method is to “slide”
the view window along the direction formed by
two objects within the view window and to see
which  object  gets  encountered  by  the  window 
first.  Since  the  method  we  are  using  to  iden¬ 
tify  a  landmark  in  a  view  window  is  by  selecting 
the  isolated  landmark,  the  two  objects  we  use 
are  the  reference  point  and  the  isolated  point. 
The directions formed by these two points, the
positive and the negative directions, lead to at
most two trajectory goal points. Note that we
are not establishing any absolute coordinates for
the navigator’s movement; the navigator’s
movement is based on local orientations
formed by landmark pairs. Trajectories are implemented
by geometrically subdividing
the navigation plane based on the navigator’s
view window position and the direction of
the trajectory movement, and then computing
which of the outstanding objects is closest. In
figure 7, we see a view window centered around
two objects, P1(x1, y1) and P2(x2, y2). The direction
formed by these two points, P1 and P2,
is defined by the slope m of the line that passes
through them. The trajectory goal
point is then the closest object to this window
along the directions (left and right) defined by
m, bounded by the two lines LTop and LBot.
Therefore, for each pair of landmarks, there can
be up to two different trajectories. When the
navigator is moving right, two classes of objects
are considered. The first class is the objects
that are enclosed by the open polygon bounded
by LTop, LRight, and the “north” part of the
view window. The second class is the objects
that are enclosed by the open polygon bounded
by LRight, LBot, and the “east” part of the view
window. From the objects that are in these two
classes, the one which is closest to the view window
frame is the trajectory goal object. When
the navigator is moving left, two classes of objects
are considered, similarly with respect to
LLeft.

Figure 8: The map-maker and the navigator
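
A rough sketch of finding a trajectory goal is given below. Note that it does not reproduce the geometric subdivision with LTop, LBot, LRight, and LLeft described above; instead it computes directly, for each object outside the window, the travel distance at which a sliding axis-aligned window would first contain it, and returns the earliest one. The function names, the square window, and the treatment of objects as points are assumptions.

```python
from math import hypot, inf


def _axis_interval(p, lo, size, u):
    """Travel distances t for which lo + t*u <= p <= lo + size + t*u."""
    if u == 0:
        return (-inf, inf) if lo <= p <= lo + size else (inf, -inf)
    t1, t2 = (p - lo - size) / u, (p - lo) / u
    return (min(t1, t2), max(t1, t2))


def trajectory_goal(points, window_origin, size, p1, p2, sign=+1):
    """First object met by the window sliding along the line through p1 and p2.

    window_origin is the lower-left corner of the size x size window.
    sign=+1 slides from p1 towards p2, sign=-1 slides the other way.
    Returns (index, travel_distance), or None if no object is ever reached.
    """
    dx, dy = p2[0] - p1[0], p2[1] - p1[1]
    norm = hypot(dx, dy)
    if norm == 0:
        raise ValueError("the two reference objects must be distinct")
    ux, uy = sign * dx / norm, sign * dy / norm
    x0, y0 = window_origin
    best = None
    for index, (px, py) in enumerate(points):
        if x0 <= px <= x0 + size and y0 <= py <= y0 + size:
            continue  # already visible, so not a new object
        tx = _axis_interval(px, x0, size, ux)
        ty = _axis_interval(py, y0, size, uy)
        enter, leave = max(tx[0], ty[0]), min(tx[1], ty[1])
        if enter <= leave and enter > 0:
            if best is None or enter < best[1]:
                best = (index, enter)
    return best
```

Calling the function once with `sign=+1` and once with `sign=-1` yields the at most two trajectory goal points associated with a landmark pair.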

3.3  Implementation 

We  have  implemented  our  map-maker  and  nav¬ 
igator  using  an  IBM  7575  SCARA  arm,  two 
CCD  cameras,  PIPE,  which  is  a  high  speed  real¬ 
time  image  processor,  and  a  SUN-4  workstation 
for  high  level  control  of  the  navigator  and  as 
the  map-maker.  Figure  8  shows  the  configura¬ 
tion  of  the  map-maker  and  the  navigator.  The 
map-maker  is  comprised  of  one  CCD  camera  lo¬ 
cated  at  a  position  that  can  capture  the  whole 
workspace  of  the  navigator.  This  camera  is  at¬ 
tached  to  the  PIPE  which  grabs  the  image  and 
sends  it  to  the  SUN-4  workstation  which  runs  the 
“map-making”  program  based  on  the  centroid 
information of the scattered objects. The omniscience
assumption of the map-maker requires
that the image captured by the global camera
be used correctly to generate the position of each
object. To account for the image distortions due
to translation, rotation, and perspective, we use
a simple geometric calibration matrix A that
transforms the homogeneous world coordinates
(X, Y, Z, 1) into the homogeneous camera coordinates
(U, V, 1) [Fu et al., 1987].
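
The role of the calibration matrix can be illustrated with a small sketch, assuming that A is a 3 x 4 matrix applied to (X, Y, Z, 1) and that the result is normalized by its third component to give (U, V, 1). The numerical matrix used in the usage note is a made-up ideal pinhole example, not a calibration from the paper.

```python
def project(A, world_point):
    """Map world (X, Y, Z) to image (U, V) with a 3 x 4 homogeneous matrix A."""
    X, Y, Z = world_point
    homogeneous = (X, Y, Z, 1.0)
    # Each row of A gives one component of the unnormalized image point.
    u, v, w = (sum(a * h for a, h in zip(row, homogeneous)) for row in A)
    if w == 0:
        raise ZeroDivisionError("point projects to infinity")
    return u / w, v / w


# Hypothetical calibration: ideal pinhole camera with focal length 2.
A_EXAMPLE = [
    [2.0, 0.0, 0.0, 0.0],
    [0.0, 2.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
]
```

With `A_EXAMPLE`, the world point (1, 2, 4) maps to the image point (0.5, 1.0). In practice the entries of A would be estimated from known correspondences between world and image points.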


Figure 11 shows the user interface program of the
map-maker, running on the X Window System.
When the start and goal objects are chosen, the
map-maker computes the optimal path in terms
of “the most isolated point” in each view window
of the navigator. On the screen, the open dot
at the lower right corner is the starting position
and the open dot on the left side of the screen
is the goal position. Each of the small rectangles
represents the navigator’s projected view along
its path. For example, the rectangle in the lower
right corner of the screen that contains the starting
point is the initial view of the navigator. The
line segment in each of these boxes connects the
landmarks which are the most isolated points in
the robot’s path. Finally, the map-maker generates
a file called “custommap” which contains
the  list  of  directional  instructions  for  the  navi¬ 
gator.  The  navigator  is  comprised  of  a  second 
camera  attached  to  the  IBM  robot  arm.  This 
camera  is  also  connected  to  the  PIPE  for  image 
processing  of  each  scene  as  the  navigator  moves 
along.  For  each  directional  instruction  in  the 
custom  map,  the  derivation  of  the  most  isolated 
point  and  the  amount  of  movement  of  the  robot 
for  the  corresponding  instruction  is  computed  by 
the  SUN-4  workstation.  It  then  sends  out  low 
level  instructions  to  the  IBM  arm  controller  for 
the  actual  movement. 

4  Error  modeling 

4.1  Reliability  of  isolated  landmarks 

The navigation using “custommap” was tested
for various types of populations and start/goal
positions. We discovered that the navigator
tends to fail in subareas of the environment
where the objects are densely populated. To explain
this phenomenon, we did a statistical experiment,
since the nonlinear definition of “isolated”
defies analytic solution. We started with
a randomly populated window with n objects.
To simulate the inherent error of the navigator’s
position estimation of the visible objects, the position
information, (x, y), of each of the n objects
was perturbed according to a 2 dimensional Gaussian
probability distribution. The output of this
perturbation is distorted position information,
$(x + \epsilon_x,\ y + \epsilon_y)$. Then we applied the isolated
point algorithm to the view window with
the distorted positional information to see if the
algorithm still identifies the correct landmark.
The reliability of the isolated point in a view
window with n objects is measured by the probability
of identifying the correct isolated point. This
is approximated by the ratio C/N, where N is
the number of tries and C is the number of times
the algorithm identifies the correct isolated
point. We ran this test with N = 1000, with
n = 3, 5, 7, 9, 11, 13. Figure 9 shows the results.

Figure 9: Reliability of isolated landmark decreases
as the number of neighboring objects increases
The data points near the top line indicate the
reliability measures of the isolated point when
the standard deviations, $\sigma_x$ and $\sigma_y$, of the position
estimations are equal to 1% of the window width
W (of window size W x W). The data points
near the middle and bottom lines correspond to the
reliability measures of the isolated point when the
standard deviations are 2% and 3% of the window
width, respectively. As we can see, the reliability
of an isolated point decreases as the number
of neighboring points increases. Note also that
as the positional error increases (the standard deviation
of the error function increases), the reliability
decreases. After experimentation, we modeled
the reliability of the isolated point with the following
equation, which states that the reliability is
inversely proportional to the number of neighboring
objects.

]l  —  jlj  4-  eg  -i-  d  (1) 

n 

This equation was fitted to the data points using
the Mathematica package [Wolfram, 1988]. The
result of the curve fitting is also shown in figure
9.
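
The statistical experiment described in this section can be approximated with a short Monte Carlo sketch. Everything specific here is an assumption: objects are drawn uniformly in a square window, the same standard deviation is used for x and y, the isolated point is computed with a compact version of the mnv algorithm, and ties are broken by taking the first object. The sketch is meant to show the procedure, not to reproduce the numbers of figure 9.

```python
import random
from math import hypot


def most_isolated(points):
    """Index of the most isolated point by the mnv rule (first on ties)."""
    n = len(points)
    rank = [[0] * n for _ in range(n)]
    for i, (xi, yi) in enumerate(points):
        others = sorted(
            (j for j in range(n) if j != i),
            key=lambda j: hypot(points[j][0] - xi, points[j][1] - yi),
        )
        for order, j in enumerate(others, start=1):
            rank[i][j] = order
    c = [min(rank[i][j] + rank[j][i] for i in range(n) if i != j)
         for j in range(n)]
    return max(range(n), key=c.__getitem__)


def reliability(n, sigma_fraction, trials=1000, width=1.0, seed=0):
    """Estimate C/N: how often noise leaves the isolated point unchanged.

    sigma_fraction is the standard deviation of the positional error as a
    fraction of the window width (0.01, 0.02, 0.03 in the text).
    """
    rng = random.Random(seed)
    sigma = sigma_fraction * width
    correct = 0
    for _ in range(trials):
        points = [(rng.uniform(0, width), rng.uniform(0, width))
                  for _ in range(n)]
        noisy = [(x + rng.gauss(0, sigma), y + rng.gauss(0, sigma))
                 for x, y in points]
        if most_isolated(noisy) == most_isolated(points):
            correct += 1
    return correct / trials
```

Evaluating `reliability(n, s)` for n = 3, 5, ..., 13 and s = 0.01, 0.02, 0.03 gives one estimate per data point of the kind plotted in figure 9.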


Figure 10 is a visualization of these reliability measures
in the original world of figure 1. Each “isolated
point” candidate has a box around it. The size of
each box is defined as W x R, where W x W
is the view window size of the navigator, and R
is the reliability of the isolated landmark. Basically,
the larger the box is, the more reliable the
object is as an isolated landmark. Therefore,
during the derivation of the optimal path, objects
with larger reliability windows are favored.
Figure 11 shows the generated path that favors
“reliable” isolated landmarks. Notice that the navigator’s
path has detoured to avoid the highly
cluttered area.

Figure 10: Reliability of each isolated landmark,
illustrated graphically; larger boxes indicate
higher reliability

Figure 11: The user interface program for the
map-maker, in which the world of dots has been
captured by the camera and displayed. The command
buttons are shown on the right. Also
shown is the computed optimal path from a user-
selected source to a user-selected goal.
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4.2  Reliability  of  trajectories 

The  reliability  of  the  trajectory  method  depends 
on  whether  or  not  the  trajectory  goal  is  accu¬ 
rately  achieved  by  the  navigator’s  movement. 
Note  that  the  error  involved  in  achieving  a  wrong 
object  during  a  trajectory  movement  is  strictly 
one dimensional, as opposed to the two dimensional
area involved in the error of “the most
isolated point” method. In the isolated point
method,  the  neighboring  objects  surrounding  the 
actual  isolated  point  contribute  to  the  error. 
However,  the  trajectory  landmark  needs  to  be 
distinct  in  only  one  direction  -  the  direction  of 
the  navigator’s  trajectory  movement. 


The errors in trajectory traversal can come from
sensor error in determining the positional information
of objects, or from translational or rotational
error of the view window during navigation. The
first of these three sources produces a static error,
meaning that the error does not change with the
traveling distance of the navigator. But the errors
due to the second and the third sources are
dynamic, meaning that the error grows as the
trajectory distance increases.


First, let us examine the static error. Let $X_i$
and $X_j$ denote the distances from the view window
frame to objects $P_i$ and $P_j$, respectively.
Assume that $X_i$ and $X_j$ follow some probability
distributions with density functions $f(x_i)$ and
$f(x_j)$ respectively. Since $X_i$ and $X_j$ are independent,
the joint density function is given by

$$f(x_i, x_j) = f(x_i)\,f(x_j)$$

Then, the probability that object $P_j$ will be
reached before object $P_i$ is given by:

$$P(X_i > X_j) = \iint_{x_i > x_j} f(x_i, x_j)\,dx_i\,dx_j$$

For simplicity in this paper, let us assume $f(x_i)$
and $f(x_j)$ are uniform density functions
(other error models have also been analyzed)
as in figure 12.

$$X_i \sim U(a_1, a_2) \quad \text{where } a_2 - a_1 = c$$

$$X_j \sim U(b_1, b_2) \quad \text{where } b_2 - b_1 = c$$


[Figure 12 plot: distributions of Xi and Xj, two uniform densities of height 1/c, with interval endpoints in the order a1, b1, a2, b2 along the horizontal axis]

Figure 12: Uniform joint probability distribution

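Then the probability that B ($P_j$) will be reached before
A ($P_i$) is given by $P(X_i > X_j)$, which is the fraction
of the box lying under the line $x_j = x_i$ in figure 12.
For $a_1 \le b_1 \le a_2$:

$$P(X_i > X_j) = \iint_{x_i > x_j} f(x_i)\,f(x_j)\,dx_i\,dx_j = \frac{(a_2 - b_1)^2}{2c^2} \qquad (2)$$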
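
Under the uniform model, and assuming the interval ordering a1 <= b1 suggested by figure 12, the region where x_i > x_j is a right triangle with legs of length a2 - b1 inside a c x c box of constant density 1/c^2, so the probability works out to (a2 - b1)^2 / (2 c^2) when the intervals overlap and to 0 otherwise. The sketch below computes this closed form and checks it against a simple simulation; the function names are hypothetical.

```python
import random


def prob_reach_j_first(a1, b1, c):
    """P(X_i > X_j) for X_i ~ U(a1, a1 + c) and X_j ~ U(b1, b1 + c), b1 >= a1."""
    if c <= 0:
        raise ValueError("interval width c must be positive")
    if b1 < a1:
        raise ValueError("this sketch assumes b1 >= a1")
    overlap = max(0.0, a1 + c - b1)  # a2 - b1, clipped at zero
    return overlap ** 2 / (2 * c ** 2)


def estimate_reach_j_first(a1, b1, c, samples=100_000, seed=0):
    """Monte Carlo estimate of the same probability, for comparison."""
    rng = random.Random(seed)
    hits = sum(
        rng.uniform(a1, a1 + c) > rng.uniform(b1, b1 + c)
        for _ in range(samples)
    )
    return hits / samples
```

Two sanity checks follow from the formula: identical intervals (b1 = a1) give 1/2, and disjoint intervals (b1 >= a2) give 0, meaning the wrong object is never reached first.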

As stated earlier, errors due to the second and
the third sources are dynamic, meaning that the
error will grow as the trajectory distance increases.
To model this, we use a variable-size
$\sigma$ that is a function of distance. For example,
if object $P_j$ is 10 times farther away from the
window than object $P_i$, then the distribution of
$P_j$ should be 10 times more dispersed than that
of $P_i$. The question is what the “initial” value
of $\sigma$ is. Let us assume that during the trajectory
movement, as the navigator travels a distance
equal to the window size (W), the dynamic
error (in the case of a Gaussian distribution, $\sigma$) is
bounded by the size of the blob (radius r). This
means that when the navigator travels a distance
equal to its window size, the positional error can
be as big as the blob size. Let the ratio between
the blob radius and the window size be 1/5 and
the distance traveled be d. Then,

$$\sigma = \frac{d\,r}{W} = \frac{d}{5}$$

Combined navigation using parkways and trajectories,
both weighted by reliability, is shown
in figure 13. The corresponding custom map for
the generated path is: ((isolated, SE), (isolated,
trajectory, SE), (isolated, SE), (isolated, trajectory,
NE), (isolated, NW), (isolated, *)).

Figure 13: Combined path of Parkway traversal
and Trajectory traversal

5  Discussion 

5.1  Tie  breaking  heuristics 

Sometimes, particularly in sparse areas, there can be more than one
winner in the isolated landmark method. In this case, two or more
landmarks in the navigator’s view window will seem equally isolated to
the navigator, i.e., they have the same mnv c-values described
previously. To break the tie, we use a heuristic that chooses the
isolated landmark that is farthest away from the current reference
point. This method is based on the observation that the navigator
tends to travel towards the goal (and away from the starting location)
at any instant of navigation. Therefore, by selecting the farthest
landmark from the current reference point, the navigator usually moves
closer to the goal. This is in contrast to a more usual search
heuristic, which would select the landmark closest to the goal. The
reason we use the former is that the global position of the goal (or
even of the two competing landmarks) is unknown to the topologically
driven navigator.
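The tie-breaking rule can be sketched in a few lines. This is only an
illustration: the function name, the tuple representation of positions,
and the use of coordinates local to the view window are assumptions,
not details given in the paper.

```python
import math


def break_tie(reference, tied_landmarks):
    """Pick the tied landmark farthest from the current reference point.

    `reference` and each landmark are (x, y) positions measured inside
    the navigator's view window (a hypothetical representation).
    """
    return max(tied_landmarks, key=lambda p: math.dist(reference, p))
```

Only distances inside the window are needed, which matches the point
made above that the global position of the goal is not available to the
navigator.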


5.2  Definition  of  optimal  path 

Currently, we have cost functions that estimate the distance of a
navigation path, D, and the reliability of a navigation path, R. One of
the mapmaker’s responsibilities is to generate a path that either
minimizes D or maximizes R. Unfortunately, in some environments, these
two cost estimates are in direct conflict with each other. For example,
if the shortest-distance path passes through a highly cluttered area,
the reliability would be low. Conversely, a reliable path that avoids
highly cluttered areas may force the navigator to detour around the
shortest path. Here, we suggest a third function C that is a compromise
between D and R, defined as $C = \log D - \log R$. Using C as our cost
estimate for travel, we can apply Dijkstra’s shortest path algorithm to
derive a path that minimizes the D/R ratio. The generated path will
tend to favor short paths and sparsely populated areas. Note that R is
a function of not only n (population) but also $\sigma$ (position
estimate error), as in equations 1 and 2. If $\sigma$ is small, the
path will resemble the D-path, and if $\sigma$ is large, the path will
resemble the R-path. This means that if the navigator’s metric ability
within its window is good, the optimal path will be the metrically
shortest path. On the other hand, if the navigator’s metric ability is
poor, the optimal path will be the one that least confuses the
navigator. In figure 13, note that the generated path neither passes
through highly cluttered areas nor detours too much from the
metrically shortest path.
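A small sketch of the search with the combined cost is given below. It
assumes that D and R are available per arc, that every arc has
$D \ge 1$ and $0 < R \le 1$ so that $C = \log D - \log R$ is never
negative (Dijkstra's algorithm requires non-negative costs), and that
the graph is stored as an adjacency dictionary. These are illustrative
choices, not details taken from the paper. Summing C along a path
gives the logarithm of the product of the per-arc D/R ratios.

```python
import heapq
import math


def best_path(graph, start, goal):
    """Dijkstra search with arc cost C = log D - log R.

    graph: {node: [(neighbor, D, R), ...]} with D >= 1 and 0 < R <= 1
    (hypothetical layout). Returns (path, total_cost).
    """
    dist = {start: 0.0}
    prev = {}
    heap = [(0.0, start)]
    done = set()
    while heap:
        cost, u = heapq.heappop(heap)
        if u in done:
            continue  # stale heap entry
        done.add(u)
        if u == goal:
            break
        for v, d, r in graph.get(u, []):
            new_cost = cost + math.log(d) - math.log(r)
            if new_cost < dist.get(v, math.inf):
                dist[v] = new_cost
                prev[v] = u
                heapq.heappush(heap, (new_cost, v))
    if goal not in dist:
        return None, math.inf
    path = [goal]
    while path[-1] != start:
        path.append(prev[path[-1]])
    return path[::-1], dist[goal]


# Two routes from S to G: a short one through clutter (low R) and a
# longer one through a sparse area (R = 1).
example = {
    "S": [("A", 2.0, 0.25), ("B", 4.0, 1.0)],
    "A": [("G", 2.0, 0.25)],
    "B": [("G", 4.0, 1.0)],
}
```

In the example graph the route through B is twice as long but wins
because the arcs through A are unreliable; raising their reliability to
1 makes the shorter route the winner again.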

5.3  Context-based  landmarks 

The shortest path in a parkway network does not guarantee the shortest
travel path for the navigator because the cost (distance traveled) of
achieving a landmark depends on the current “context” of the landmark.
For example, consider a path generated within a parkway. Each arc of
the path represents the distance between a reference point and a
subsequent landmark. However, depending on which corner of the view
window (SE, SW, NE, NW) the new landmark is placed in, the actual
traveled distance of the navigator may or may not be equal to the arc
length. In an extreme situation, we can visualize a parkway path that
zigzags while the actual traveling movement of the navigator is linear
and the traveled distance is much smaller (see figure 14).

Figure 14: Zigzag arrows indicate the physical distances between
landmarks, whereas the actual movement of the navigator is almost
linear

This is analogous to a driver following a series of landmarks, such as
buildings, in which the physical distances between landmarks are not
equal to the odometer reading. Using our navigator model, each landmark
has 4 different contexts, corresponding to the SE, SW, NE, and NW
corners of the navigator’s view window. In order to implement
context-based landmark following, the data structures that represent
parkways, trajectories, and the cost matrix must be modified. Instead
of n nodes in the parkway network, we now have 4n nodes, namely, the 4
ways each landmark can be “seen” by the navigator. The size of the cost
matrix grows by a factor of $4^2$, but it stays very sparse: out of
$16n^2$ cells, at most 12n are used. Therefore, a sparse matrix
representation is needed for storage efficiency. The time complexity of
the search algorithm also increases, by a factor of $4^2$, since
Dijkstra’s algorithm takes $O(n^2)$.
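One possible way to hold the context-expanded network is sketched
below. The class and function names are hypothetical, and a dictionary
keyed by (source, destination) pairs is only one of several sparse
representations that would serve; the paper does not say which one was
used.

```python
CORNERS = ("SE", "SW", "NE", "NW")


def expand_contexts(landmarks):
    """One node per (landmark, corner) pair: n landmarks give 4n nodes."""
    return [(lm, corner) for lm in landmarks for corner in CORNERS]


class SparseCost:
    """Cost matrix that stores only the arcs that actually exist."""

    def __init__(self):
        self._cells = {}

    def set(self, src, dst, cost):
        self._cells[(src, dst)] = cost

    def get(self, src, dst, default=float("inf")):
        # Missing cells behave like "no arc" (infinite cost).
        return self._cells.get((src, dst), default)

    def __len__(self):
        return len(self._cells)
```

For three landmarks this gives 12 nodes and 144 potential cells, while
the storage grows only with the number of arcs that are set.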

6  Future  work 

Future work will include a more elaborate description language that
can fully represent the navigational scenes; statistical analysis to
help decide which “vocabulary” of the description language to use for
each situation; error recovery schemes for the navigator to avoid,
detect, and correct erroneous situations; refinement of the reliability
of the directional instructions; and an increase in the navigator’s
degrees of freedom for a more general modeling of real-world
navigation.

Abstract

We  address  the  problem  of  automatically  learn¬ 
ing  object  models  for  recognition  and  pose  estimation. 
In  contrast  to  the  traditional  approach,  we  formulate 
the  recognition  problem  as  one  of  matching  appear¬ 
ance  rather  than  shape.  The  appearance  of  an  ob¬ 
ject  in  a  two-dimensional  image  depends  on  its  shape, 
reflectance  properties,  pose  in  the  scene,  and  the  il¬ 
lumination  conditions.  While  shape  and  reflectance 
are  intrinsic  properties  of  an  object  and  are  constant, 
pose  and  illumination  vary  from  scene  to  scene.  We 
present  a  new  compact  representation  of  object  ap¬ 
pearance  that  is  parametrized  by  pose  and  illumina¬ 
tion.  For  each  object  of  interest,  a  large  set  of  images 
is  obtained  by  automatically  varying  pose  and  illumi¬ 
nation.  This  large  image  set  is  compressed  to  obtain 
a  low-dimensional  subspace,  called  the  eigenspace,  in 
which  the  object  is  represented  as  a  hypersurface. 
Given  an  unknown  input  image,  the  recognition  sys¬ 
tem  projects  the  image  onto  the  eigenspace.  The  ob¬ 
ject  is  recognized  based  on  the  hypersurface  it  lies 
on.  The  exact  position  of  the  projection  on  the  hy¬ 
persurface  determines  the  object’s  pose  in  the  image. 
We  have  conducted  experiments  using  several  objects 
with  complex  appearance  characteristics.  These  re¬ 
sults  suggest  the  proposed  appearance  representation 
to  be  a  valuable  tool  for  a  variety  of  machine  vision 
applications. 

1  Introduction 

One of the primary goals of an intelligent vision system is to
recognize objects in an image and compute their pose in the
three-dimensional scene. Such a recognition system has wide
applications ranging from autonomous navigation to visual inspection.*
For a vision system to be able to recognize objects, it must have
models of the objects stored in its memory. In the past, vision
research has emphasized the use of geometric (shape) models
[Besl and Jain 85] [Chin and Dyer 86] for recognition. In the case of
manufactured objects, these models are sometimes available and are
referred to as computer aided design (CAD) models. Most objects of
interest, however, do not come with CAD models. Typically, a vision
programmer is forced to select an appropriate representation for object
geometry, develop object models using this representation, and then
manually input this information into the system. This procedure is
cumbersome and impractical when dealing with large sets of objects, or
objects with complicated geometric properties. It is clear that
recognition systems of the future must be capable of acquiring object
models without human assistance. In other words, recognition systems
must be able to automatically learn the objects of interest.

*This research was supported in part by DARPA Contract No.
DACA 76-92-C-0007 and in part by the David and Lucile Packard
Fellowship.

Visual  learning  is  clearly  a  well-developed  and  vi¬ 
tal  component  of  biological  vision  systems.  If  a  human 
is  handed  an  object  and  asked  to  visually  memorize  it, 
he  or  she  would  rotate  the  object  and  study  its  appear¬ 
ance  from  different  directions.  While  little  is  known 
about  the  exact  representations  and  techniques  used 
by  the  human  mind  to  learn  objects,  it  is  clear  that  the 
overall  appearance  of  the  object  plays  a  critical  role  in 
its  perception.  In  contrast  to  biological  systems,  ma¬ 
chine  vision  systems  today  have  little  or  no  learning 
capabilities.  Hence,  visual  learning  is  now  emerging 
as a topic of research interest [Poggio and Girosi 90]
[Ullman  and  Basri  91]  [Ikeuchi  and  Suehiro  92].  The 
goal  of  this  paper  is  to  advance  this  important  but 
relatively  unexplored  area  of  machine  vision. 

Here, we present a technique for automatically learning object models
from images. The appearance of an object is the combined effect of its
shape, reflectance properties, pose in the scene, and the illumination
conditions. Recognizing objects from brightness images is therefore
more a problem of appearance matching than of shape matching. This
observation lies at the core of our work. While shape and reflectance
are intrinsic properties of the object that do not vary, pose and
illumination vary from scene to scene. We approach the visual learning
problem as one of acquiring a compact model of the object’s appearance
under different illumination directions and object poses. The object is
“shown” to the image sensor in several orientations and illumination
directions. This can be accomplished using, for example, two robot
manipulators: one to rotate the object while the other varies the
illumination direction. The result is a very large set of object
images. Since all images in the set are of the same object, any two
consecutive images are correlated to a large degree. The problem then
is to compress this large image set into a low-dimensional
representation of object appearance.

A well-known image compression or coding technique is based on the
Karhunen-Loeve transform [Oja 83] [Fukunaga 90]. This method computes
the eigenvectors of an image set. The eigenvectors form an orthogonal
basis for the representation of individual images in the image set.
Though a large number of eigenvectors may be required for very accurate
reconstruction of an object image, only a few eigenvectors are
generally sufficient to capture the significant appearance
characteristics of an object. These eigenvectors constitute the
dimensions of what we refer to as the eigenspace for the image set.
From the perspective of machine vision, the eigenspace has a very
attractive property. When it is composed of all the eigenvectors of an
image set, it is optimal in a correlation sense: if any two images from
the set are projected onto the eigenspace, the distance between the
corresponding points in eigenspace is a measure of the similarity of
the images in the $l^2$ norm. In machine vision, the Karhunen-Loeve
method has been applied primarily to two problems: handwritten
character recognition [Murase et al. 81] and human face recognition
[Sirovich and Kirby 87], [Turk and Pentland 91]. These applications lie
within the domain of pattern classification and do not address the
problem of learning or using complete parametrized models of the
objects of interest.

In this paper, we develop a continuous and compact representation of
object appearance that is parametrized by two variables, namely, object
pose and illumination. This new representation is referred to as the
parametric eigenspace. First, an image set of the object is obtained by
varying pose and illumination in small increments. The image set is
then normalized in brightness and scale to achieve invariance to image
magnification and the intensity of illumination. The eigenspace for the
image set is obtained by computing the most prominent eigenvectors of
the image set. Next, all images in the object’s image set (the learning
samples) are projected onto the eigenspace to obtain a set of points.
These points lie on a hypersurface that is parametrized by object pose
and illumination. The hypersurface is computed from the discrete points
using the cubic spline interpolation technique. It is important to note
that this parametric representation of an object is obtained without
prior knowledge of the object’s shape and reflectance properties. It is
generated using just a sample of the object.

Each  object  is  represented  as  a  parametric  hy¬ 
persurface  in  two  different  eigenspaces;  the  univer¬ 
sal  eigenspace  and  the  object’s  own  eigenspace.  The 
universal  eigenspace  is  computed  by  using  the  im¬ 
age  sets  of  all  objects  of  interest  to  the  recognition 
system,  and  the  object  eigenspace  is  computed  using 
only  images  of  the  object.  We  show  that  the  universal 
eigenspace  is  best  suited  for  discriminating  between 
objects,  whereas  the  object  eigenspace  is  better  tuned 
for  pose  estimation.  Object  recognition  and  pose  esti¬ 
mation  can  be  summarized  as  follows.  Given  an  image 
consisting  of  an  object  of  interest,  we  assume  that  the 
object  is  not  occluded  by  other  objects  and  can  be  seg¬ 
mented  from  the  remaining  scene.  The  segmented  im¬ 
age  region  is  normalized  in  scale  and  brightness,  such 
that  it  has  the  same  size  and  brightness  range  as  the 
images  used  in  the  learning  stage.  This  normalized  im¬ 
age  is  first  projected  onto  the  universal  eigenspace  to 
identify  the  object.  After  the  object  is  recognized,  the 
image  is  projected  onto  the  object  eigenspace  and  the 
location  of  the  projection  on  the  object’s  parametrized 
hypersurface  determines  its  pose  in  the  scene. 

We have conducted several experiments to demonstrate the power of the
parametric eigenspace representation. The fundamental contributions of
this paper can be summarized as follows. (a) The parametric eigenspace
is presented as a new representation of object appearance. (b) Using
this representation, object models are automatically learned from
appearance by varying pose and illumination. (c) Both learning and
recognition are accomplished without prior knowledge of the object’s
shape and reflectance.

2  Visual  Learning  of  Objects 

In this section, we discuss the learning of object models using the
parametric eigenspace representation. First, we discuss the acquisition
of object image sets. The eigenspaces are computed using the image sets
and each object is represented as a parametric hypersurface. Throughout
this section, we will use a sample object to describe the learning
process. In the next section, we discuss the recognition and pose
estimation of objects using the parametric eigenspace representation.

2.1  Normalized  Image  Sets 


While  constructing  image  sets  we  need  to  ensure  that 
all  images  of  the  object  are  of  the  same  size.  Each 
digitized  image  is  first  segmented  (using  a  threshold) 
into  an  object  region  and  a  background  region.  The 
background  is  assigned  a  zero  brightness  value  and  the 
object  region  is  re-sampled  such  that  the  larger  of  its 
two  dimensions  fits  the  image  size  we  have  selected 
for the image set representation. We now have a scale-normalized
image. This image is written as a vector $\hat{\mathbf{x}}$ by reading
pixel brightness values from the image in a raster scan manner:

$$\hat{\mathbf{x}} = [\hat{x}_1, \hat{x}_2, \ldots, \hat{x}_N]^T \qquad (1)$$


The appearance of an object depends on its shape and reflectance
properties. These are intrinsic properties of the object that do not
vary. The appearance of the object also depends on the pose of the
object and the illumination conditions. Unlike the intrinsic
properties, object pose and illumination are expected to vary from
scene to scene. If the illumination conditions of the environment are
constant, the appearance of the object is affected only by its pose.
Here, we assume that the object is illuminated by the ambient lighting
of the environment as well as one additional distant light source whose
direction may vary. Hence, all possible appearances of the object can
be captured by varying object pose and the light source direction with
respect to the viewing direction of the sensor. We will denote each
image as $\mathbf{x}_{r,l}^{(p)}$, where r is the rotation or pose
parameter, l represents the illumination direction, and p is the object
number. The complete image set obtained for an object is referred to as
the object image set and can be expressed as:

Here, R and L are the total number of discrete poses
and  illumination  directions,  respectively,  used  to  ob¬ 
tain  the  image  set.  If  a  total  of  P  objects  are  to  be 
learned  by  the  recognition  system,  we  can  define  the 
universal  image  set  as  the  union  of  all  the  object 
image sets:

We assume that the imaging sensor used for learning and recognizing
objects has a linear response, i.e. image brightness is proportional to
scene radiance. We would like our recognition system to be unaffected
by variations in the intensity of illumination or the aperture of the
imaging system. This can be achieved by normalizing each of the images
in the object and universal sets such that its average brightness is
zero and the brightness variance is unity. This brightness
normalization transforms each measured image $\hat{\mathbf{x}}$ to a
normalized image $\mathbf{x}$:

$$\mathbf{x} = [x_1, x_2, \ldots, x_N]^T \qquad (4)$$

where:

The above described normalizations with respect to
scale  and  brightness  give  us  normalized  object  image 
sets  and  a  normalized  universal  image  set.  In  the  fol¬ 
lowing  discussion,  we  will  simply  refer  to  these  as  the 
object  and  universal  image  sets. 
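The raster-scan vectorization and the brightness normalization
described above can be sketched as follows. This is a minimal
illustration using only the standard library. The function names are
hypothetical, the image is assumed to be a list of pixel rows, and the
population variance (division by N) is an assumption; the sketch
implements the stated goal of zero average brightness and unit
brightness variance rather than any particular formula from the paper.

```python
import math


def to_vector(image):
    """Read a 2-D image (list of rows) in raster-scan order into a list."""
    return [float(v) for row in image for v in row]


def normalize_brightness(x_hat):
    """Shift and scale a measured image vector to zero mean, unit variance."""
    n = len(x_hat)
    mean = sum(x_hat) / n
    # Population variance (divide by N); this choice is an assumption.
    var = sum((v - mean) ** 2 for v in x_hat) / n
    if var == 0.0:
        raise ValueError("constant image: brightness variance is zero")
    sigma = math.sqrt(var)
    return [(v - mean) / sigma for v in x_hat]
```

Because the mean is removed and the result is divided by the standard
deviation, multiplying every pixel by a positive constant (a brighter
source or a wider aperture with a linear sensor) leaves the normalized
vector unchanged.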

The image sets can be obtained in several ways.
If  the  geometrical  model  and  reflectance  properties  of 
an  object  are  known,  its  images  for  different  pose  and 
illumination  directions  can  be  synthesized  using  well- 
known  rendering  algorithms.  In  this  paper,  we  do  not 
assume  that  object  geometry  and  reflectance  are  given. 
Instead,  we  assume  that  we  have  a  sample  of  each  ob¬ 
ject  that  can  be  used  for  learning.  One  approach  then 
is  to  use  two  robot  manipulators;  one  grasps  the  ob¬ 
ject  and  shows  it  to  the  sensor  in  different  poses  while 
the  other  has  a  light  source  mounted  on  it  and  is  used 
to  vary  the  illumination  direction.  In  our  experiments, 
we  have  used  a  turntable  to  rotate  the  object  in  a  sin¬ 
gle  plane  (see  Fig.  1).  This  gives  us  pose  variations 
about  a  single  axis.  A  robot  manipulator  is  used  to 
vary  the  illumination  direction.  If  the  recognition  sys¬ 
tem  is  to  be  used  in  an  environment  where  the  illumi¬ 
nation  (due  to  one  or  several  sources)  is  not  expected 
to  change,  the  image  set  can  be  obtained  by  varying 
just object pose.

Figure 1: Setup used for automatic acquisition of object image sets.
The object is placed on a motorized turntable.


2.2  Computing  Eigenspaces 

Consecutive  images  in  an  object  image  set  tend  to  be 
correlated  to  a  large  degree  since  pose  and  illumination 
variations  between  consecutive  images  are  small.  Our 
first  step  is  to  take  advantage  of  this  correlation  and 
compress large image sets into low-dimensional representations
that capture the gross appearance characteristics of objects.
A suitable compression technique is the
Karhunen-Loeve expansion [Fukunaga 90]
where  the  eigenvectors  of  the  image  set  are  computed 
and  used  as  orthogonal  basis  functions  for  representing 
individual  images.  Though,  in  general,  all  the  eigen¬ 
vectors  of  an  image  set  are  required  for  the  perfect 
reconstruction  of  an  object  image,  only  a  few  are  suf¬ 
ficient  for  the  representation  of  objects  for  recognition 
purposes.  We  compute  two  types  of  eigenspaces;  the 
universal  eigenspace  that  is  obtained  from  the  univer¬ 
sal  image  set,  and  object  eigenspaces  computed  from 
individual  object  image  sets. 

To  compute  the  universal  eigenspace,  we  first  sub¬ 
tract  the  average  c  of  all  images  in  the  universal  set 
from  each  image.  This  ensures  that  the  eigenvector 
with  the  largest  eigenvalue  represents  the  dimension 
in  eigenspace  in  which  the  variance  of  images  is  max¬ 
imum  in  the  correlation  sense.  In  other  words,  it  is 
the  most  important  dimension  of  the  eigenspace.  A 
new  image  set  is  obtained  by  subtracting  the  average 
image  c  from  each  image  in  the  universal  set: 

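$$X = \{\, \mathbf{x}_{1,1}^{(1)} - \mathbf{c},\ \mathbf{x}_{2,1}^{(1)} - \mathbf{c},\ \ldots,\ \mathbf{x}_{R,L}^{(P)} - \mathbf{c} \,\} \qquad (6)$$

The image matrix X is $N \times M$, where M = RLP is the total number
of images in the universal set, and N is the number of pixels in each
image. To compute eigenvectors of the image set we define the
covariance matrix as:

$$Q = X X^T \qquad (7)$$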

The covariance matrix is $N \times N$, clearly a very large matrix
since a large number of pixels constitute an image. The eigenvectors
$\mathbf{e}_i$ and the corresponding eigenvalues $\lambda_i$ of Q are
to be determined by solving the well-known eigenvector decomposition
problem:

$$\lambda_i \mathbf{e}_i = Q\, \mathbf{e}_i \qquad (8)$$

All  N  eigenvectors  of  the  universal  set  together  con¬ 
stitute  a  complete  eigenspace.  Any  two  images  from 
the  universal  image  set,  when  projected  onto  the 
eigenspace,  give  two  discrete  points.  The  distance  be¬ 
tween  these  points  is  a  measure  of  the  difference  be¬ 
tween  the  two  images  in  the  correlation  sense.  Since 
the  universal  eigenspace  is  computed  using  images  of 
all  objects,  it  is  the  ideal  space  for  discriminating  be¬ 
tween  images  of  different  objects. 

Determining the eigenvalues and eigenvectors of a large matrix such as
Q is a non-trivial problem. It is computationally very intensive and
traditional techniques used for computing eigenvectors of small
matrices are impractical. Since we are interested only in a small
number (k) of eigenvectors, and not the complete set of N eigenvectors,
efficient algorithms can be used. In our implementation, we have used
the spatial temporal adaptive (STA) algorithm proposed by Murase and
Lindenbaum [Murase and Lindenbaum 92]. This algorithm was recently
demonstrated to be substantially more efficient than previous
algorithms. Using the STA algorithm the k most prominent eigenvectors
of the universal image set are computed. The result is a set of
eigenvalues $\{\lambda_i \mid i = 1, 2, \ldots, k\}$ where
$\lambda_1 \geq \lambda_2 \geq \ldots \geq \lambda_k$, and a
corresponding set of eigenvectors
$\{\mathbf{e}_i \mid i = 1, 2, \ldots, k\}$. Note that each eigenvector
is of size N, i.e. the size of an image. These k eigenvectors
constitute the universal eigenspace; it is an approximation to the
complete eigenspace with N dimensions. We have found from our
experiments that fewer than ten dimensions of the eigenspace are
generally sufficient for the purposes of visual learning and
recognition (i.e. k < 10). Later, we describe how objects in an unknown
input image are recognized using the universal eigenspace.
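To make the computation concrete, the following sketch finds the k
largest eigenvalues and eigenvectors of $Q = X X^T$ by plain power
iteration with deflation. It is not the STA algorithm used in our
implementation; it is a simple stand-in chosen for clarity, and it is
practical only for small examples. All function names are illustrative.
The images are assumed to be given as a list of equal-length vectors
(the columns of X), and the product $Q\mathbf{v}$ is evaluated as
$X(X^T\mathbf{v})$ so that the $N \times N$ matrix is never formed.

```python
import math
import random


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def subtract_mean(images):
    """Return the average image c and the images with c subtracted."""
    n, m = len(images[0]), len(images)
    c = [sum(img[i] for img in images) / m for i in range(n)]
    return c, [[v - ci for v, ci in zip(img, c)] for img in images]


def apply_q(images, v):
    """Compute Q v = X (X^T v) without forming the N x N matrix Q."""
    out = [0.0] * len(v)
    for x in images:
        w = dot(x, v)
        for i, xi in enumerate(x):
            out[i] += w * xi
    return out


def top_eigenvectors(images, k, iters=500, seed=0):
    """Power iteration with deflation (a stand-in, not the STA algorithm)."""
    rng = random.Random(seed)
    n = len(images[0])
    values, vectors = [], []
    for _ in range(k):
        v = [rng.uniform(-1.0, 1.0) for _ in range(n)]
        for _ in range(iters):
            # Deflation: remove components along eigenvectors already found.
            for e in vectors:
                w = dot(v, e)
                v = [vi - w * ei for vi, ei in zip(v, e)]
            v = apply_q(images, v)
            norm = math.sqrt(dot(v, v))
            if norm == 0.0:
                raise ValueError("fewer than k non-zero eigenvalues")
            v = [vi / norm for vi in v]
        values.append(dot(v, apply_q(images, v)))  # Rayleigh quotient
        vectors.append(v)
    return values, vectors
```

In use, `subtract_mean` would be applied to the normalized image set
first, and the centered images passed to `top_eigenvectors` with a
small k.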

Once an object has been recognized, we are interested in finding its
pose in the image. The accuracy of pose estimation depends on the
ability of the recognition system to discriminate between different
images of the same object. Hence, pose estimation is best done in an
eigenspace that is tuned to the appearance of a single object. To this
end, we compute an object eigenspace from each of the object image
sets. The procedure for computing an object eigenspace is similar to
that used for the universal eigenspace. In this case, the average
$\mathbf{c}^{(p)}$ of all images of object p is computed and subtracted
from each of the object images. The resulting images are used to
compute the covariance matrix $Q^{(p)}$. The eigenspace for the object
p is obtained by solving the system:

$$\lambda_i^{(p)} \mathbf{e}_i^{(p)} = Q^{(p)}\, \mathbf{e}_i^{(p)} \qquad (9)$$

Once again, we compute only a small number (k < 10) of the largest
eigenvalues $\{\lambda_i^{(p)} \mid i = 1, 2, \ldots, k\}$ where
$\lambda_1^{(p)} \geq \lambda_2^{(p)} \geq \ldots \geq \lambda_k^{(p)}$,
and a corresponding set of eigenvectors
$\{\mathbf{e}_i^{(p)} \mid i = 1, 2, \ldots, k\}$. An object eigenspace
is computed for each object of interest to the recognition system.

2.3  Parametric  Eigenspace  Representation 

We  now  represent  each  object  as  a  hypersurface  in 
the  universal  eigenspace  as  well  as  its  own  eigenspace. 
This  new  representation  of  appearance  lies  at  the  core 
of  our  approach  to  visual  learning  and  recognition. 
Each  appearance  hypersurface  is  parametrized  by  two 
parameters;  object  rotation  and  illumination  direc¬ 
tion. 

A parametric hypersurface for the object p is constructed in the
universal eigenspace as follows. Each image $\mathbf{x}_{r,l}^{(p)}$
(learning sample) in the object image set is projected onto the
eigenspace by first subtracting the average image $\mathbf{c}$ from it
and finding the dot product of the result with each of the eigenvectors
(dimensions) of the universal eigenspace. The result is a point
$\mathbf{g}_{r,l}^{(p)}$ in the eigenspace:

$$\mathbf{g}_{r,l}^{(p)} = [\mathbf{e}_1, \mathbf{e}_2, \ldots, \mathbf{e}_k]^T (\mathbf{x}_{r,l}^{(p)} - \mathbf{c}) \qquad (10)$$

Once again the subscript r represents the rotation parameter and l is
the illumination direction. By projecting all the learning samples in
this manner, we obtain a set of discrete points in the universal
eigenspace. Since consecutive object images are strongly correlated,
their projections in eigenspace are close to one another. Hence, the
discrete points obtained by projecting all the learning samples can be
assumed to lie on a k-dimensional hypersurface that represents all
possible poses of the object under all possible illumination
directions. We interpolate the discrete points to obtain this
hypersurface. In our implementation, we have used a standard cubic
spline interpolation algorithm [Press et al. 88]. Since cubic splines
are well-known we will not describe them here. The resulting
hypersurface can be expressed as:

$$\mathbf{g}^{(p)}(\theta_1, \theta_2) \qquad (11)$$

where $\theta_1$ and $\theta_2$ are the continuous rotation and
illumination parameters. The above hypersurface is a compact
representation of the object’s appearance.
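The projection step that produces the discrete points amounts to k dot
products per image. A minimal sketch follows; the function names and
the dictionary keyed by (r, l) are illustrative assumptions, and the
spline interpolation itself is not shown.

```python
def project(image, mean, eigenvectors):
    """Coordinates of an image in eigenspace: subtract the average
    image, then take one dot product per eigenvector."""
    centered = [a - b for a, b in zip(image, mean)]
    return [sum(e_i * d_i for e_i, d_i in zip(e, centered))
            for e in eigenvectors]


def project_set(samples, mean, eigenvectors):
    """Project every learning sample; `samples` maps (r, l) to an image
    vector (a hypothetical layout) and the result maps (r, l) to a point."""
    return {key: project(img, mean, eigenvectors)
            for key, img in samples.items()}
```

The same `project` function serves for the universal eigenspace (with
the universal average and eigenvectors) and for an object eigenspace
(with that object's average and eigenvectors).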

In a similar manner, a hypersurface is also constructed in the object’s
eigenspace by projecting the learning samples onto this space:

$$\mathbf{f}_{r,l}^{(p)} = [\mathbf{e}_1^{(p)}, \mathbf{e}_2^{(p)}, \ldots, \mathbf{e}_k^{(p)}]^T (\mathbf{x}_{r,l}^{(p)} - \mathbf{c}^{(p)}) \qquad (12)$$

where $\mathbf{c}^{(p)}$ is the average of all images in the object
image set. Using cubic splines, the discrete points
$\mathbf{f}_{r,l}^{(p)}$ are interpolated to obtain the hypersurface:

$$\mathbf{f}^{(p)}(\theta_1, \theta_2) \qquad (13)$$

Once again, $\theta_1$ and $\theta_2$ are the rotation and illumination
parameters, respectively. This continuous parameterization enables us
to find poses of the object that are not included in the learning
samples. It also enables us to compute accurate pose estimates under
illumination directions that lie in between the discrete illumination
directions used in the learning stage. Fig. 2 shows the parametrized
eigenspace representation of the object shown in Fig. 1. The figure
shows only three of the most significant dimensions of the eigenspace
since it is difficult to display and visualize higher dimensional
spaces. The object representation in this case is a curve, rather than
a surface, since the object image set was obtained using a single
illumination direction while the object was rotated about a single
axis. The discrete points on the curve correspond to projections of the
learning samples in the object image set. The continuous curve passing
through the points is parametrized by the rotation parameter
$\theta_1$ and is obtained using the cubic spline algorithm.

3  Recognition  and  Pose  Estimation 

Consider an image of a scene that includes one or more of the objects
we have learned using the parametric eigenspace representation. We
assume that the objects are not occluded by other objects in the scene
when viewed from the sensor direction, and that the image regions
corresponding to objects have been segmented away from the scene image.
First, each segmented image region is normalized with respect to scale
and brightness as described in section 2.1. This ensures that (a) the
input image has the same dimensions as the eigenvectors (dimensions) of
the parametric eigenspace, (b) the recognition system is invariant to
object magnification, and (c) the recognition system is invariant to
fluctuations in the intensity of illumination.

Figure 2: Parametric eigenspace representation of the object shown in
Fig. 1. Only the three most prominent dimensions of the eigenspace are
displayed here. The points shown correspond to projections of the
learning samples. Here, illumination is constant and therefore we
obtain a curve with a single parameter (rotation) rather than a
surface.

As stated earlier in the paper, the universal
eigenspace is best tuned to discriminate between different
objects. Hence, we first project the normalized
input image y to the universal eigenspace. First, the
average c of the universal image set is subtracted from
y and the dot product of the resulting vector is computed
with each of the eigenvectors that constitute
the universal space. The k coefficients obtained are
the coordinates of a point z in the eigenspace:

$$z = [e_1, e_2, \ldots, e_k]^T (y - c) = [z_1, z_2, \ldots, z_k]^T \qquad (14)$$

The recognition problem then is to find the object p
whose hypersurface the point z lies on. Due to factors
such as image noise, aberrations in the imaging
system, and digitization effects, z may not lie exactly
on an object hypersurface. Hence, we find the object
p that gives the minimum distance $d_1^{(p)}$ between its
hypersurface $g^{(p)}(\theta_1, \theta_2)$ and the point z:

$$d_1^{(p)} = \min_{\theta_1, \theta_2} \| z - g^{(p)}(\theta_1, \theta_2) \| \qquad (15)$$

If $d_1^{(p)}$ is within some pre-determined threshold value,
we conclude that the input image is of the object p. If
not, we assume that the input image is not of any of the
objects used in the learning stage. It is important to
note that the hypersurface representation of objects in
eigenspace results in more reliable recognition than if
the object is represented as just a cluster of points
in eigenspace. The hypersurfaces of different
objects can intersect each other or even be intertwined,
in which case using nearest cluster algorithms could
easily lead to incorrect recognition results.
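A minimal sketch of the minimum-distance rule in eq. (15), assuming each hypersurface is available as a dense list of sampled points. The two concentric circles, the labels "A" and "B", and the threshold value are invented for illustration. The circles share a centroid, so a nearest-cluster-center rule could not separate them, while the distance to the curve can.

```python
import math

def distance_to_surface(z, surface_samples):
    """Smallest Euclidean distance from z to a densely sampled hypersurface."""
    return min(math.dist(z, g) for g in surface_samples)

def recognize(z, surfaces, threshold):
    """Return (label, d1); label is None when even the best d1 exceeds threshold."""
    label, d1 = min(
        ((name, distance_to_surface(z, samples)) for name, samples in surfaces.items()),
        key=lambda pair: pair[1],
    )
    return (label if d1 <= threshold else None), d1

def circle(radius, n=360):
    """Toy one-parameter 'hypersurface': a circle sampled at n rotation angles."""
    return [(radius * math.cos(2 * math.pi * i / n),
             radius * math.sin(2 * math.pi * i / n)) for i in range(n)]

surfaces = {"A": circle(1.0), "B": circle(2.0)}      # same centroid, different curves
label, d1 = recognize((1.9, 0.1), surfaces, threshold=0.5)
unknown, _ = recognize((10.0, 10.0), surfaces, threshold=0.5)
```

Here the first point is attributed to "B" because it lies close to that curve, and the second point is rejected because it is far from every learned curve.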

Once the object in the input image y is recognized,
we project y to the eigenspace of the object.
This eigenspace is tuned to variations in the appearance
of a single object and hence is ideal for pose
estimation. Mapping the input image to the object
eigenspace gives the k-dimensional point:

$$z^{(p)} = [e_1^{(p)}, e_2^{(p)}, \ldots, e_k^{(p)}]^T (y - c^{(p)}) = [z_1^{(p)}, z_2^{(p)}, \ldots, z_k^{(p)}]^T \qquad (16)$$

The pose estimation problem may be stated as follows:
Find the rotation parameter $\theta_1$ and the illumination
parameter $\theta_2$ that minimize the distance $d_2^{(p)}$ between
the point $z^{(p)}$ and the hypersurface $f^{(p)}$ of the object p:

$$d_2^{(p)} = \min_{\theta_1, \theta_2} \| z^{(p)} - f^{(p)}(\theta_1, \theta_2) \| \qquad (17)$$

The $\theta_1$ value obtained represents the pose of the object
in the input image. Fig. 3(a) shows an input image of
the object whose parametric eigenspace was shown in
Fig. 2. This input image is not one of the images in
the learning set used to compute the object eigenspace.
In Fig. 3(b), the input image is mapped to the object
eigenspace and is seen to lie on the parametric curve
of the object. The location of the point on the curve
determines the object’s pose in the image. Note that
the recognition and pose estimation stages are computationally
very efficient, each requiring only the projection
of an input image onto a low-dimensional (generally
less than 10) eigenspace. Customized hardware
can therefore be used to achieve real-time (frame-rate)
recognition and pose estimation.
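The minimization in eq. (17) can be sketched as a brute-force search over a finely sampled rotation parameter. This is only an illustration under stated assumptions: the paper does not say how the minimum is found, the single-parameter `toy_curve` is a made-up stand-in for a learned curve, and the noise values are arbitrary.

```python
import math

def estimate_pose(z_p, curve, steps=3600):
    """Return (theta in degrees, d2) minimizing ||z_p - curve(theta)|| on a grid."""
    best_theta, best_d = 0.0, math.inf
    for i in range(steps):
        theta = 360.0 * i / steps
        d = math.dist(z_p, curve(theta))
        if d < best_d:
            best_theta, best_d = theta, d
    return best_theta, best_d

def toy_curve(theta_deg):
    """Hypothetical parametric curve in a 3-dimensional eigenspace."""
    t = math.radians(theta_deg)
    return (math.cos(t), math.sin(t), 0.3 * math.cos(2 * t))

clean = toy_curve(37.3)                                   # pose not on a coarse learning grid
noisy = tuple(v + e for v, e in zip(clean, (0.01, -0.01, 0.005)))

theta_clean, _ = estimate_pose(clean, toy_curve)
theta_noisy, d2 = estimate_pose(noisy, toy_curve)
```

With a noise-free point the search returns the grid angle nearest the true pose; with a small perturbation the estimate shifts by a fraction of a degree and d2 stays on the order of the perturbation.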

4  Experimentation 

We have conducted several experiments using complex
objects to verify the effectiveness of the parametric
eigenspace representation. This section summarizes
some of our results. Fig. 1 in section 2 shows the
set-up used to conduct the experiments reported here.
The object is placed on a motorized turntable and its
pose is varied about a single axis, namely, the axis of
rotation of the turntable. The turntable position is
controlled through software and can be varied with an
accuracy of about 0.1 degrees. Most objects have a
finite number of stable configurations when placed on
a planar surface. For such objects, the turntable is
adequate as it can be used to vary pose for each of the
object’s stable configurations.

Figure 3: (a) An input image. (b) The input image is
mapped to a point in the object eigenspace. The location
of the point on the parametric curve determines
the pose of the object in the input image.

We assume that the object is illuminated by the
ambient lighting conditions of the environment that
are not expected to change between the learning and
recognition stages. This ambient illumination is of
relatively low intensity. The main source of brightness
is an additional light source whose direction can vary.
In most of our experiments, the source direction was
varied manually. We are currently using a 6 degree-of-freedom
robot manipulator (see Fig. 1) with a light
source mounted on its end-effector. This enables us to
vary the illumination direction via software. Images
of the object are sensed using a 512x480 pixel CCD
camera and are digitized using an Analogics frame-grabber
board.

Table 1 summarizes the number of objects, light
source directions, and poses used to acquire the image
sets used in the experiments. For the learning stage, a
total of 4 objects were used. These objects (cars) are
shown in Fig. 4(a). For each object we have used 5
different light source directions, and 90 poses for each
source direction. This gives us a total of 1800 images
in the universal image set and 450 images in each object
image set. Each of these images is automatically
normalized in scale and brightness as described in section 2.
Each normalized image is 128x128 pixels in
size. The universal and object image sets were used
to compute the universal and object eigenspaces. The
parametric eigenspace representations of the four objects
in their own eigenspaces are shown in Fig. 4(b).

Table 1: Image sets obtained for the learning and
recognition stages. The 1080 test images used for
recognition are different from the 1800 images used
for learning.

Learning Samples             Test Samples for Recognition
---------------------------  ----------------------------
4 Objects                    4 Objects
5 Light Source Directions    3 Light Source Directions
90 Poses                     90 Poses
1800 Images                  1080 Images

A large number of images were also obtained to
test the recognition and pose estimation algorithms.
All of these images are different from the ones used
in the learning stage. A total of 1080 input (test) images
were obtained. The illumination directions and
object poses used to obtain the test images are different
from the ones used to obtain the object image sets
for learning. In fact, the test images correspond to
poses and illumination directions that lie in between
the ones used for learning. Each input image is first
normalized in scale and brightness and then projected
onto the universal eigenspace. The object in the image
is identified by finding the hypersurface that is closest
to the input point in the universal eigenspace. Unlike
the learning process, recognition is computationally
simple and can be accomplished on a Sun SPARC 2
workstation in less than 0.2 second.

To evaluate the recognition results, we define the
recognition rate as the percentage of input images for
which the object in the image is correctly recognized.
Fig. 5(a) illustrates the sensitivity of the recognition
rate to the number of dimensions of the universal
eigenspace. Clearly, the discriminating power of the
universal eigenspace is expected to increase with the
number of dimensions. For the objects used, the recognition
rate is poor if fewer than 4 dimensions are used
but approaches unity as the number of dimensions approaches
10. In general, however, the number of dimensions
needed for robust recognition is expected to
increase with the number of objects learned by the
system. It also depends on the appearance characteristics
of the objects used. From our experience, 10
dimensions are sufficient for representing objects with
fairly complex appearance characteristics such as the
ones shown in Fig. 4.

Finally, we present experimental results related
to pose estimation. Once the object is recognized, the
input image is projected onto the object’s eigenspace
and its pose is computed by finding the closest point
on the parametric hypersurface. Once again we use all
1080 input images of the 4 objects. Since these images
were obtained using the controlled turntable, the actual
object pose in each image is known. Fig. 5(b) and
(c) show histograms of the errors (in degrees) in the
poses computed for the 1080 images. The error histogram
in Fig. 5(b) is for the case where 450 learning
samples (90 poses and 5 source directions) were used to
compute the object eigenspace. The eigenspace used
has 8 dimensions. The histogram in Fig. 5(c) is for the
case where 90 learning samples (18 poses and 5 source
directions) were used. The pose estimation results in
both cases were found to be remarkably accurate. In
the first case, the average of the absolute pose error
computed using all 1080 images is found to be 0.5 degrees,
while in the second case the average error is
found to be 1.2 degrees.
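The error measure reported above, the average of the absolute pose error, can be sketched as follows. The wrap-around handling at 360 degrees is an assumption added here (the paper does not describe it), and the sample angles are invented.

```python
def angular_error(estimated, actual):
    """Absolute difference between two angles in degrees, at most 180."""
    diff = (estimated - actual) % 360.0
    return min(diff, 360.0 - diff)

def mean_absolute_pose_error(estimates, truths):
    """Average of the absolute pose errors over all test images."""
    errors = [angular_error(e, t) for e, t in zip(estimates, truths)]
    return sum(errors) / len(errors)

# Made-up turntable angles (ground truth) and estimated poses.
truths = [10.0, 1.0]
estimates = [10.5, 359.0]
mean_error = mean_absolute_pose_error(estimates, truths)
```

Without the wrap-around, an estimate of 359 degrees for a true pose of 1 degree would count as a 358 degree error rather than 2 degrees.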

5  Conclusion 

In this paper, we presented a new representation
for machine vision called the parametric eigenspace.
While representations previously used in computer vision
are based on object geometry, the proposed representation
describes object appearance. We presented
a method for automatically learning an object’s parametric
eigenspace. Such learning techniques are fundamental
to the advancement of visual perception. We
developed efficient object recognition and pose estimation
algorithms that are based on the parametric
eigenspace representation. The learning and recognition
algorithms were tested on objects with complex
shape and reflectance properties. A statistical analysis
of the errors in recognition and pose estimation
demonstrates the proposed approach to be very robust
to factors such as image noise and quantization. We
believe that the results presented in this paper are applicable
to a variety of vision problems. This is the
topic of our current investigation.
Figure 4: (a) The four objects used in the experiments.
(b) The parametric hypersurfaces in object eigenspace
computed for the four objects shown in (a). For display,
only the three most important dimensions of each
eigenspace are shown. The hypersurfaces are reduced
to surfaces in three-dimensional space.


Figure 5: (a) Recognition rate plotted as a function of
the number of universal eigenspace dimensions used to
represent the parametric hypersurfaces. The recognition
rates were computed using all 1080 input images
detailed in Table 1. (b) Histogram of the error (in
degrees) in computed object pose for the case where
90 poses are used in the learning stage. (c) Pose error
histogram for the case where 18 poses are used in the
learning stage. The average of the absolute error in
pose for the complete set of 1080 test images is 0.5 degrees in
the first case and 1.2 degrees in the second case.
Abstract

The Schema Learning System (SLS) automates the
construction of knowledge-directed recognition strategies
by learning object-specific strategies that integrate
many different visual procedures. Starting from
training images with ground truth object positions, an
object model, and a library of visual procedures, SLS
learns which procedures aid in the recognition of a particular
object, and which are unreliable, unnecessary
or too expensive. It then builds a strategy for controlling
selected procedures that will reliably recognise the
target at a minimum cost.

This paper does not present the algorithm, published
previously, by which SLS infers recognition strategies.
Instead, it focuses on SLS’s role as an integrator of
disparate visual procedures, and on an analysis, based
on a lemma by Valiant [1984], giving a probabilistic
upper bound on the likelihood that a strategy will fail
to recognise an object in a test image.

1  Introduction 

Most knowledge-directed vision systems are hand-crafted
to recognise a fixed set of objects within a
known context. Generally, the programmer or knowledge
engineer who constructs them begins with an intuitive
notion of how each object might be recognised,
a notion which is refined by trial-and-error. Eventually
the programmer finds a combination of features (e.g.
color, shape or context) and methods (e.g. geometric
model matching, minimum-distance classification or
generalised Hough transforms) that allow the objects
to be reliably identified within the domain. In this
way, separate, hand-crafted strategies are constructed
for all the objects in a domain.

Unfortunately, there is a growing consensus that human
engineering is not cost-effective for many applications.
Moreover, even when hand-crafted systems
can be constructed, there is no way to ensure their
validity, since their performance, in terms of accuracy
and reliability, is unknown, and there is no objective
test for comparing one strategy to another. Worst of
all, when the domain is changed, hand-crafted systems
often have to be rebuilt from scratch.

*This work was supported by Rome Labs under contract
F30602-91-C-0037 and by DARPA/TACOM under
contract DAAE07-91-C-R035.

If we look at the knowledge engineering process more
closely, we find that it is convenient to draw a distinction
between object models and the visual processes
used to extract them. In a typical application,
a knowledge engineer starts with a model of the object(s)
to be recognised. In some cases, the model is a
complete geometric description of the shape of an object,
as in a wire-frame or generalised-cylinder model.
In other cases, the model describes other features, such
as the color or texture of an object. In still other cases,
the model may be in the form of constraints.

Whatever the model, the knowledge engineer’s task
is to select visual procedures for matching the object
model to image data or, equivalently, reconstructing
the target object from the image data consistent with
the constraints of the object model. Visual procedures
are selected based on both the form of the object model
and knowledge of the domain, such as lighting conditions
and the likelihood of occlusion.

In general, recognition strategies require more than a
single visual procedure. Most visual procedures are designed
to solve particular subtasks, and several must
be sequenced together in order to reconstruct an object
model. For example, Figure 1 shows several alternative
methods for finding the pose of a roadsign,
each of which depends on one or more intermediate
representations of the data. To build a recognition
strategy, a knowledge engineer must find a series of
visual procedures that lead from the original image
data to the desired representation. In addition, most
visual procedures are prone to occasional failures, so
system designers must consider how much redundancy
to include in each strategy.

Despite the labor-intensive knowledge-engineering
process, many knowledge-directed systems have been
built and successfully demonstrated, e.g. [Draper, et.
al., 1989, McKeown, et. al., 1985, Hwang, et. al.,
1986]. In general, the success of these systems can be
traced to a “small world” assumption, in which the
objects in the domain are few, the constraints
on their descriptions are tight, and a complete
world model is at least a possibility. Consequently,
knowledge engineers are able to select the appropriate
visual procedures for the domain and object models,
and to devise effective control procedures for applying
them. However, as the scope of a system broadens
towards a domain-independent, general-purpose system,
system designers are faced with an unfortunate
dilemma: either they craft more and more special-purpose
strategies, an option that quickly becomes
infeasible in terms of labor, or they generalize their
strategies to recognize more and more objects under a
wider range of circumstances. In the latter case, constraints
on object descriptions become looser to account
for wider variability, the system can make fewer
assumptions about the types of image descriptions necessary
for matching, and the complexity of matching
increases substantially.

Figure 1: Several methods for determining the pose of a roadsign.

2  Knowledge  Base  Construction 

In general, the difficulty and expense of knowledge
base construction have relegated knowledge-directed vision
to the laboratory, where the domain can be restricted
to a few objects in a controlled context. Although
artificial intelligence researchers handle this
problem by extracting knowledge from experts, this
scenario does not apply to computer vision. Vision
researchers have therefore concentrated instead
on knowledge base specification. The SPAM project
developed a high-level language for describing objects
[McKeown, et. al., 1989] and a series of tools designed
to automate pieces of the knowledge acquisition
process [Harvey, et. al., 1992]. The UMass Schema
System divided both the interpretation process and
the knowledge base by object, increasing modularity
and making the system easier to modify [Draper, et.
al., 1989]. Work in Japan has involved both automatic
programming efforts and higher-level languages
for specifying image operations [Matsuyama 1989].

3  The  Schema  Learning  System  (SLS) 

The Schema Learning System (SLS) presents a different
solution to the knowledge base construction problem.
SLS is a system that automatically learns object-specific
or task-specific recognition strategies from object
models, training images and a library of visual
procedures, as shown in Figure 2. The idea behind
SLS is that a general-purpose vision system can be
constructed of hundreds of special-purpose recognition
strategies, each learned from experience, rather than
from a single, highly-general, and therefore highly inefficient,
recognition strategy.


As described in this paper, SLS’s task is to learn strategies
that recognize a particular object and recover its
three-dimensional position and orientation. (SLS can
also be used for two-dimensional recognition, or for
learning predicate strategies that determine if an object
is present or not.) It is trained on images in which
the position and orientation of the target object are
known. Its goal is to learn a control strategy for invoking
visual procedures from its library that minimizes
the expected cost of recognizing the target object
across the training images. As will be discussed
in the next section, it is also able to predict the expected
cost of each recognition strategy and to bound
the probability of failure.

3.1  The  SLS  Process  Model 

SLS is similar to a blackboard system in that recognition
is viewed as a process of applying visual procedures
(VPs) to hypotheses. Hypotheses are representations
of the image or 3D world such as points,
lines, regions or surfaces; visual procedures are algorithms
from the image understanding literature such
as line extraction or geometric model matching [Beveridge
and Riseman, 1992]. Recognition strategies take
the place of dynamic schedulers in traditional blackboard
systems, selecting which procedure(s) to apply
at each step.

Therefore, recognition can be described as a branching
sequence of VP invocations. The sequence begins
when data arrives, typically in the form of image
hypotheses¹. Visual procedures are applied to images,
producing low-level hypotheses such as points,
lines or regions. New VPs are then applied to these
low-level hypotheses, transforming them into more abstract
hypotheses. Still more VPs are applied to these
hypotheses in a repeating cycle, until eventually goal-level
hypotheses are created.

3.1.1  Transformation  Procedures  (TPs) 

Unlike most blackboard systems, however, SLS refines
its processing model by dividing visual procedures into
two classes, transformation procedures (TPs) and feature
measurement procedures (FMPs)². Transformation
procedures transform old hypotheses into new hypotheses
at a higher level of representation. Examples
include vanishing point analysis, which creates surface
orientation hypotheses from pencils of image lines, and
stereo line matching, which creates world (3D) line
hypotheses from pairs of image (2D) line hypotheses.
Feature measurement procedures, by way of comparison,
measure properties of hypotheses without changing
their underlying representations.

¹Typically, but not necessarily. Active strategies may
invoke procedures to acquire image data.

²We will still use the generic term visual procedures
(VPs) when referring to both TPs and FMPs.

Figure 2: Top-level view of SLS architecture


Although TPs are described as transformation operators,
the word ‘transformation’ should not be construed
as implying a one-to-one mapping between old
and new hypotheses. TPs can combine information
from multiple hypotheses and may generate an arbitrary
number of new hypotheses. Stereo line matching,
for example, combines two image (2D) line hypotheses
to generate a single world (3D) line hypothesis. In addition,
TPs do not consume their arguments, so multiple
TPs may be applied to a single hypothesis. Some
readers may therefore find it helpful to think of TPs
as procedures that generate new hypotheses from old
hypotheses, rather than as transformation operators.

3.1.2 Feature Measurement Procedures (FMPs)

Feature measurement procedures (FMPs) calculate
features of hypotheses, such as orientations of planar
surfaces and intensities of regions. During the
recognition process, many properties of a hypothesis
may remain uncalculated, so the set of known features
describing a hypothesis is referred to as its knowledge
state. Applying a FMP to one or more hypotheses
computes a feature of those hypotheses, advancing
them to new knowledge states. The number of knowledge
states is finite, since continuous features are divided
into discrete feature ranges.
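The distinction between TPs and FMPs can be sketched with a small data structure. This is an illustrative model only, not SLS code: the `Hypothesis` class, the helper names, the feature ranges and the toy "stereo" pairing are all assumptions made for the example.

```python
import bisect
import math
from dataclasses import dataclass, field

@dataclass
class Hypothesis:
    level: str                                      # representation, e.g. "2d-line"
    data: tuple
    knowledge: dict = field(default_factory=dict)   # knowledge state: feature -> range

def apply_fmp(hyp, feature, measure, boundaries, labels):
    """FMP: measure a feature and record its discrete range; level is unchanged."""
    value = measure(hyp.data)
    hyp.knowledge[feature] = labels[bisect.bisect_right(boundaries, value)]
    return hyp

def apply_tp(hyps, new_level, generate):
    """TP: generate new hypotheses from old ones; the inputs are not consumed."""
    return [Hypothesis(new_level, d) for d in generate([h.data for h in hyps])]

def line_length(line):
    (x0, y0), (x1, y1) = line
    return math.hypot(x1 - x0, y1 - y0)

left = Hypothesis("2d-line", ((0, 0), (18, 24)))
right = Hypothesis("2d-line", ((2, 0), (5, 4)))
for h in (left, right):
    apply_fmp(h, "length", line_length, [10.0, 50.0], ["short", "medium", "long"])

# Toy stand-in for stereo line matching: two 2D lines yield one 3D line hypothesis.
world = apply_tp([left, right], "3d-line", lambda pair: [tuple(pair)])
```

Because the continuous length is mapped to one of three ranges, the number of distinct knowledge states stays finite, and the two image-line hypotheses remain available to other procedures after the TP has run.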

3.1.3  Object  Models 

Many visual procedures, including various types of
graph matching, require data from an object model
to be compared with hypotheses extracted from the
image. Since the object model does not change from
image to image, SLS considers object model components
to be compile-time parameters of visual procedures.
When a strategy is trained, the object model,
supplied by the user, determines which visual procedures
are enabled (either because they do not require
model data, or because the necessary data is included
in the model), and therefore which procedures can be
used in the recognition strategy. For example, Beveridge’s
geometric model matcher [Beveridge and Riseman,
1992] matches 3D model lines to 2D data lines. If
the object model includes a wire-frame shape description,
the edges of the wire-frame become compile-time
parameters to the model matcher specifying the 3D
model lines to be matched. If, on the other hand,
a wire-frame description is not included as part of the
object model, then the geometric model matcher is not
enabled and cannot be used as part of a recognition
strategy.
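The enabling rule described above can be sketched as a simple filter. The procedure names, the model component keys and the dictionary layout are hypothetical; only the rule itself (a procedure is enabled when it needs no model data or when the needed data is in the model) comes from the text.

```python
def enabled_procedures(library, object_model):
    """Names of procedures whose required model components are all present."""
    return sorted(
        name for name, required in library.items()
        if required.issubset(object_model.keys())
    )

# Hypothetical library: each procedure lists the model components it needs.
library = {
    "line_extraction": set(),                  # needs no model data
    "geometric_model_matching": {"wireframe"},
    "color_classification": {"color"},
}

with_wireframe = {"wireframe": [((0, 0, 0), (1, 0, 0))], "color": "red"}
color_only = {"color": "red"}

all_enabled = enabled_procedures(library, with_wireframe)
without_matcher = enabled_procedures(library, color_only)
```

With the wire-frame present all three procedures are candidates for a strategy; without it the geometric model matcher drops out before training begins.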

3.2  Recognition  Graphs 

Interpretation strategies are represented in SLS as
generalised multi-level decision trees called recognition
graphs that direct both hypothesis formation and
hypothesis verification, as shown in Figure 3. The
premise behind the formalism is that object recognition
is a series of small verification tasks interleaved
with representational transformations. Recognition
begins with trying to verify hypotheses at a low level
of representation, separating to the extent possible hypotheses
that are reliable from those that are not. Verified
hypotheses (or at least, hypotheses that have not
been rejected) are then transformed to a higher level of
representation, where a new verification process takes
place. The cycle of verification followed by transformation
continues until 3D pose hypotheses are verified,
or until all hypotheses have been rejected.

The structure of the recognition graph reflects the
verification/transformation cycle. Each level of the recognition
graph is a decision tree that controls hypothesis
verification at one level of representation by invoking
feature measurement procedures (FMPs) to gather
support for or against each hypothesis. When a hypothesis
is determined to be reliable within the decision
tree, a transformation procedure (TP) transforms
it to another level of representation, where the process
repeats itself.

Figure 3: A recognition graph. Levels of the graph are decision trees that verify hypotheses using feature measurement
procedures (FMPs). Hypotheses that reach a subgoal are transformed to the next level of representation by a
transformation procedure (TP).

As defined in the field of operations research, decision
trees are a form of state-space representation composed
of alternating choice states and chance states.
When searching for a path from the start state to
a goal state, an agent is only allowed to choose
where to go next from a choice state. If the current
state is a chance state the next state is selected
probabilistically³. The search process is therefore similar
to using a game tree against a probabilistic opponent.

³Operations research terminology is based on trees
rather than spaces, so it refers to choice nodes and chance
nodes rather than choice states and chance states, and to
leaf nodes and root nodes rather than goal states and start
states.

In SLS, the choice states are hypothesis knowledge
states as represented by sets of hypothesis feature values.
The choice to be made at each knowledge state is
which FMP (if any) to execute next. Chance states in
the tree represent FMP applications, where the chance
concerns which value the FMP will return. Hypothesis
verification is an alternating cycle in which the control
strategy selects which FMP to invoke next (i.e., which
feature to compute), and the FMP probabilistically returns
a feature value. Thus hypotheses advance from
knowledge states to FMP application states and then
on to new knowledge states. The cycle continues for
each hypothesis until it reaches a subgoal state, indicating
that it has been verified and should be transformed
to a higher level of representation, or a failure
state, indicating that the hypothesis is unreliable and
should be rejected.
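The alternating cycle of choice states and chance states can be sketched for one level of a recognition graph. This is a simplified illustration, not the SLS representation: the policy table keyed by knowledge state, the feature names, and the convention that unlisted states are failure states are all assumptions.

```python
def verify(hypothesis, policy, fmps):
    """Walk one decision tree until a subgoal or failure state is reached.

    `policy` maps a knowledge state (a tuple of (feature, value) pairs) to the
    FMP to run next, or to "SUBGOAL". States missing from the table are treated
    as failure states (an assumption made for this sketch).
    """
    state = ()                                   # nothing measured yet
    while True:
        action = policy.get(state, "FAIL")       # choice state: pick the next FMP
        if action in ("SUBGOAL", "FAIL"):
            return action, state
        value = fmps[action](hypothesis)         # chance state: value depends on data
        state += ((action, value),)              # advance to a new knowledge state

# Hypothetical FMPs and policy for verifying stop-sign-like region hypotheses.
fmps = {"color": lambda h: h["color"], "shape": lambda h: h["shape"]}
policy = {
    (): "color",
    (("color", "red"),): "shape",
    (("color", "red"), ("shape", "octagon")): "SUBGOAL",
}

accepted = verify({"color": "red", "shape": "octagon"}, policy, fmps)
rejected = verify({"color": "green", "shape": "octagon"}, policy, fmps)
```

The green region is rejected after a single measurement, so the shape FMP is never run on it, while the red octagon reaches the subgoal state and would be handed to a TP.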

In general, SLS learns in advance which FMP to choose
at each knowledge state in order to avoid making run-time
control decisions. As a result, when SLS builds
a recognition graph it leaves just one option at each
choice node. Often, however, the readiness of a FMP
to be executed cannot be determined until run-time,
in which case SLS will leave several options at a choice
node, sorted in order of desirability⁴. At run-time the
system will choose the highest-ranking FMP that is
ready to be executed.
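The run-time rule for a choice node with several options can be sketched in a few lines. The FMP names and the readiness test (whether enough arguments are available) are hypothetical; the text only states that the highest-ranking ready FMP is chosen.

```python
def choose_fmp(ranked_fmps, is_ready):
    """Return the first FMP, in order of desirability, that is ready; else None."""
    return next((fmp for fmp in ranked_fmps if is_ready(fmp)), None)

# Options left at one choice node, most desirable first (made-up names).
ranked = ["relative_angle", "line_length", "contrast"]
arguments_needed = {"relative_angle": 2, "line_length": 1, "contrast": 1}
available_hypotheses = 1

chosen = choose_fmp(ranked, lambda f: arguments_needed[f] <= available_hypotheses)
nothing = choose_fmp(ranked, lambda f: False)
```

In this example the two-argument procedure is preferred but not ready, so the next option in the ranking is executed instead.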

*This is just one of many complications that arise from
multiple-argument visual procedures. In general, we will
describe SLS as if all VPs took just one argument in order
to keep the description brief; see Draper [1993] for a more
complete description.

3.5 Inference Algorithms

This paper will not repeat the inference algorithm by
which SLS infers recognition graphs from training images;
interested readers are referred to [Draper et al.,
1993, Draper 1993, Draper et al. 1992]. However, we
will  briefly  mention  some  of  its  most  important  charac¬ 
teristics.  First,  SLS’s  inference  algorithm  is  a  syntac¬ 
tic,  logic-based  algorithm  that  makes  no  assumptions 
about  the  visual  procedures  or  representations  it  is  ma¬ 
nipulating  other  than  the  declarative  knowledge  about 
the  type  of  hypotheses  that  each  takes  as  input  and 
produces as output. As a result, although Table 1
shows  some  of  the  visual  procedures  and  representa¬ 
tions  that  have  been  used  in  SLS  experiments  to  date, 
this  list  is  by  no  means  exclusive:  SLS  could  just  as 
easily  manipulate  generalized  cylinders  as  wire-frame 
models,  and  probably  any  algorithm  found  in  the  Im¬ 
age  Understanding  literature  could  be  included  in  a 
SLS  strategy. 

Second, SLS’s learning algorithm tries to minimize the
expected  cost  of  recognition,  subject  to  the  constraint 
that  a  strategy  must  recognize  every  object  instance 
in  the  training  set.  As  a  prerequisite  for  the  minimiza¬ 
tion  process,  SLS  estimates  the  expected  cost  of  every 
visual  procedure  and  the  likelihood  of  each  feature, 
based on information gained from the training images.

Finally,  SLS’s  inference  algorithm  is  strictly  a  gener¬ 
alization  algorithm.  SLS  starts  by  learning  a  strategy 
for  finding  the  target  object  in  the  first  training  im¬ 
age;  it  then  generalizes  this  strategy  to  account  for  the 
second  training  image,  and  the  third,  and  so  on,  until 
eventually  the  strategy  is  general  enough  to  find  the 
object  in  every  training  image.  As  it  generalizes,  each 
new strategy is guaranteed to be less likely to fail on a
new  image  than  the  strategy  it  was  generalized  from, 
and  this  implies  that  SLS’s  strategies  can  be  analyzed 
using  techniques  introduced  by  Valiant  [1984]  (see  be¬ 
low). 


4  Statistical  Properties  of  Learned 
Strategies 

As  discussed  in  the  introduction,  the  most  immedi¬ 
ate,  practical  problem  with  knowledge-directed  vision 
is  the  time  needed  to  construct  knowledge  bases,  a 
problem  that  SLS  solves  by  automatically  learning 
recognition  strategies.  Another  problem  with  hand¬ 
crafted  strategies,  however,  is  that  even  when  they  can 
be  constructed,  there  is  no  way  to  ensure  their  valid¬ 
ity,  since  their  performance,  in  terms  of  accuracy  and 
reliability,  is  unknown.  SLS  addresses  this  problem 
by  estimating,  for  each  recognition  strategy  it  learns, 

1) the expected cost of applying that strategy, and

2) a probabilistic bound on the likelihood of failure.

Given the recognition graph formalism, predicting the
expected  cost  is  trivial;  predicting  robustness,  on  the 
other  hand,  requires  more  complex  probabilistic  rea¬ 
soning. 


4.1  Predicting  Expected  Cost 

As  was  discussed  in  Section  3.2,  recognition  strate¬ 
gies  are  represented  as  multi-level  decision  trees  called 
recognition  graphs.  Each  level  of  a  graph  corresponds 
to  one  level  of  representation  (e.g.  points,  lines,  sur¬ 
faces),  and  the  decision  tree  at  that  level  controls  the 
order  in  which  features  are  measured.  Hypotheses  with 
features  that  SLS  has  learned  are  reliable  are  then 
transformed to the next level of representation, while
unreliable hypotheses are discarded.

Since SLS estimates the likelihood of each feature during
training, as well as the expected cost of each feature
measurement procedure (FMP), it is straightforward
to estimate the cost of verifying a
hypothesis at any level of representation, given a recognition
graph (the equations can be found in [Draper et al.
1992, Draper 1993]). Furthermore, since the cost
of transforming hypotheses from one level to another,
as well as the average number of hypotheses per level,
can also be estimated from the training data, the total
expected cost of recognition is easily obtained.
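As a rough illustration of why this estimate is straightforward, the sketch below computes the expected cost of verifying one hypothesis by recursing over a decision tree whose nodes carry an FMP cost and whose branches carry feature likelihoods. The tuple encoding and the name `expected_cost` are assumptions made for this example; the actual equations are those given in the references cited above.

```python
def expected_cost(node):
    """Expected cost of verifying a hypothesis starting at `node`.

    A leaf is None (subgoal or failure state, no further cost). An
    internal node is a tuple (fmp_cost, branches), where branches is a
    list of (probability, child) pairs, one per feature value.
    """
    if node is None:
        return 0.0
    fmp_cost, branches = node
    # Cost of running this FMP plus the probability-weighted cost of
    # whatever must be run afterwards.
    return fmp_cost + sum(p * expected_cost(child) for p, child in branches)


# Hypothetical tree: a cheap test (cost 2) that leads, 70% of the time,
# to a more expensive test (cost 5); otherwise verification stops.
tree = (2.0, [(0.7, (5.0, [(0.5, None), (0.5, None)])), (0.3, None)])
print(expected_cost(tree))
```

For this toy tree the expectation is 2 + 0.7 x 5 = 5.5 cost units per hypothesis.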

Experimentally, Draper [1993] describes three experiments
in which the expected cost of a strategy, over
twenty  test  images,  was  predicted  to  within  four  per¬ 
cent  of  its  actual  cost.  In  all  three  cases,  the  error  was 
a  slight  overestimate  of  the  cost,  due  to  differences  in 
the  paging  behavior  of  visual  procedures  during  train¬ 
ing  and  testing. 

4.2  Predicting  Robustness 

A  more  difficult  task  is  to  predict  the  robustness  of  a 
strategy.  Intuitively,  a  robust  strategy  is  one  that  reli¬ 
ably  recognizes  objects  in  test  images.  For  the  sake  of 
analysis, however, we will concentrate on the subproblem
of how robustly a strategy generates goal-level hypotheses
from images through chains of intermediate-level
hypotheses^. For example, if the goal is to locate
a building to within three feet of its actual position,
what is the probability that at least one correct hypothesis
will be generated when presented with a picture
of the building? The analysis has to take into
account the possibility of failure at any step in the
process, as well as the redundancy in many strategies.

^SLS’s learning algorithm is not a strict generalization
algorithm when goal-level verification is included in the
analysis.

Transformation Procedure  Input                        Output           Ref
------------------------  ---------------------------  ---------------  ----------------------------
Moravec Operator          Image or ROI                 2D Points        [Moravec 1981]
Line Extraction           Image or ROI                 2D Lines         [Boldt and Weiss 1989]
Region Segmentation       Image or ROI                 Region Set       [Beveridge et al. 1989]
Color Classification      Region Set                   Region (sub)Set  [Duda and Hart 1973]
Polygonal Approx.         Region                       2D Lines
Parabola Fitting          Region                       Parabola
Pencil Grouping           2D Lines                     Pencil           [Collins and Weiss 1990]
Vanishing Point Anal.     Pencil                       3D Orientation   [Collins and Weiss 1990]
Trihedral Grouping        2D Lines                     Trihedral Jets
Trihedral Angle Anal.     Trihedral Jets               3D Orientation   [Kanatani 1988]
Reprojection(1)           3D Orientation & Region      3D Plane
Reprojection(2)           3D Orientation & 2D Points   3D Points
Absolute Orientation      3D Points                    Pose             [Horn 1987]
Subgraph Isomorphism      2D Lines                     2D Lines         [Ullman 1976]
Point Resection           2D Points                    Pose             [USGS]
Geometric Model Match     2D Lines                     Pose             [Beveridge and Riseman 1992]

Table 1: The current library of transformation procedures (TPs) in SLS. SLS integrates procedures at a syntactic
level, based solely on knowledge of the types of hypotheses a procedure takes as input and produces as output.
As a result, new procedures can be easily added to its library.

4.3 Assumptions

Any analysis of an algorithm must make certain assumptions
about the data. In this case, the analysis
rests on three assumptions about the knowledge base
and training set:

1. Deterministic VPs. The behavior of a visual
procedure is fully determined by the properties of
its arguments.

2. Knowledge base sufficiency. Every object instance
can be recognized by some sequence of VPs
in  the  knowledge  base.  By  definition,  when  this 
assumption  is  violated  there  is  no  good  recogni¬ 
tion  strategy. 

3.  Randomly  selected  training  images.  Train¬ 
ing  images  are  drawn  at  random  from  the  same 
image  distribution  as  the  test  images.  Although 
often  violated  in  practice,  this  assumption  pro¬ 
vides  the  theoretical  basis  for  predicting  a  strat¬ 
egy’s  performance  on  test  images  from  its  perfor¬ 
mance  on  training  images. 

4.4  PAC  Analysis 

A method for formally analyzing algorithms that generalize
from positive examples was introduced by
Valiant as part of his work on probably almost correct
(PAC) learning [Valiant 1984]. Valiant proved (based
on earlier work by Chernoff) that the probability of
fewer than $S$ successes in $n$ independent Bernoulli trials,
each with probability $h^{-1}$ or greater, is less than
$h^{-1}$, where:

$$n \geq 2h(S + \ln h). \qquad (1)$$

As an example of how Equation 1 might be used,
Valiant considered the traditional problem of selecting
marbles from an urn. Assuming $S$ distinct colors
of marbles, the probability that the $(n + 1)$th marble
selected at random will be of a different color from all
of its $n$ predecessors is less than $h^{-1}$, by Equation 1.
(Alert readers may notice that the probability of seeing
a new color drops each time a new color is seen, but
that it is always at least as high as the final probability,
which is sufficient to satisfy the lemma. In effect,
the lemma overestimates the number of training samples
needed by assuming only that the probability of
seeing a new color on the first sample was at least as
high as the probability on the last sample. The lemma
applies because the probability of seeing a new color
decreases monotonically.) Significantly, the probability
bound $h^{-1}$ holds for any distribution of colors.
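To give a feel for the magnitudes involved, the following sketch evaluates the sample size in Equation 1, read here as the smallest integer $n$ with $n \geq 2h(S + \ln h)$. The function name `samples_needed` and the example values of $h$ and $S$ are illustrative assumptions, not figures from Valiant or from the SLS experiments.

```python
import math


def samples_needed(h, S):
    """Smallest integer n satisfying n >= 2h(S + ln h)."""
    if h < 1 or S < 0:
        raise ValueError("expects h >= 1 and S >= 0")
    return math.ceil(2 * h * (S + math.log(h)))


# With S = 5 colors, how many draws bound the chance of an unseen
# color on the next draw by 1/h?
for h in (10, 20, 100):
    print(h, samples_needed(h, 5))
```

The required sample size grows only slightly faster than linearly in $h$, since $\ln h$ enters additively with $S$.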

Valiant notes in his proof that $h^{-1}$ is used in two separate
probabilistic bounds. Qualitatively speaking, the
first (call it $h_1^{-1}$) addresses the possibility that the
randomly selected training samples may not be representative
and therefore may not include a frequently
occurring sample type. The second probability (call
it $h_2^{-1}$) reflects the observation that if some colors are
very rare, they will probably not be seen during training,
even though there is a finite probability that they
may occur during testing. It is these double probabilities
that give probably almost correct learning its
name: with probability $1 - h_1^{-1}$ the learned concept or
strategy accounts for all but $h_2^{-1}$ of the samples in the
underlying distribution, hence it will probably be almost
correct. Moreover, there is no reason why $h_1$ has
to be equal to $h_2$. Nonetheless, we will follow Valiant
in setting $h_1 = h_2$ and using Equation 1 (see [Kearns
1990] for a treatment that considers $h_1$ and $h_2$ separately).


4.5 Analyzing Strategies

Valiant used Equation 1 as the foundation for a computational
theory of machine learning [Valiant 1984], but
we will use it for a much more modest purpose, namely
estimating the robustness of recognition strategies. At
each step in its learning process, SLS has a strategy*
that is capable of recognizing every training instance
seen so far. When a new training instance is presented,
the current strategy is tested on it; if the strategy is already
capable of recognizing the new object instance,
then the current strategy is not changed; otherwise it
is generalized to account for the new example. The
situation is exactly analogous to Valiant’s example of
drawing marbles from an urn, where the current strategy
corresponds to the set of colors seen so far. As a
result we can place a lower bound on the robustness
of a strategy (i.e. an upper bound on the probability
that it will fail) by counting the number of training
instances and how often during training the strategy
had to be generalized, and applying Equation 1.
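The bookkeeping this requires is minimal. The sketch below uses the urn analogy directly: each training sample is reduced to a hypothetical label (its "color"), and a generalization is counted whenever a sample is not covered by what has been seen so far. The function name and the string-of-labels input are assumptions for illustration; in SLS the test is whether the current strategy recognizes the new instance.

```python
def count_generalizations(samples):
    """Return (n, S): samples seen, and how often a new kind appeared."""
    seen = set()
    generalizations = 0
    for sample in samples:
        if sample not in seen:    # current "strategy" fails on this sample
            seen.add(sample)      # generalize so that it is now covered
            generalizations += 1
    return len(samples), generalizations


n, S = count_generalizations("aabacbd")
print(n, S)
```

The pair (n, S) is all that the bound needs; the order in which generalizations occurred does not enter.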

*Actually, SLS maintains a set of potential strategies,
and does not select the optimal one until after the last
training image.

Unfortunately, Equation 1 is not in a convenient form
for determining the robustness of a strategy $h$ from
the number of training samples $n$ and the number of
failures during training $S$. Doing some algebra (and
substituting $m$ for $n/2$):

$$m = h(S + \ln h)$$
$$e^{m} = e^{Sh + h \ln h} = (e^{S} h)^{h}$$
$$(e^{m})^{e^{S}} = (e^{S} h)^{e^{S} h}$$

Substituting $c$ for $(e^{m})^{e^{S}}$ and $y$ for $(e^{S} h)$, we get an
equation of the form $c = y^{y}$. Solving for $y$:

$$\ln c = y \ln y \qquad (2)$$
$$\ln \ln c = \ln y + \ln \ln y \approx (1 + o(1)) \ln y$$
$$\ln y \approx \frac{\ln \ln c}{1 + o(1)}$$

Substituting for $\ln y$ from Equation 2 yields:

$$\frac{\ln c}{y} \approx \frac{1}{1 + o(1)} \ln \ln c$$
$$y \approx (1 + o(1)) \frac{\ln c}{\ln \ln c}$$

Resubstituting for $c$ and $y$ we get:

$$e^{S} h \approx (1 + o(1)) \frac{e^{S} \ln e^{m}}{\ln(e^{S} \ln e^{m})}$$
$$h \approx (1 + o(1)) \frac{m}{\ln e^{S} + \ln \ln e^{m}} = (1 + o(1)) \frac{m}{S + \ln m}$$

Implying that:

$$h \approx (1 + o(1)) \frac{n}{2(S + \ln n - \ln 2)} \qquad (3)$$

Equation 3 estimates a lower bound on the robustness
of a strategy from the size of its training set and the
number of generalizations during training, assuming
only that $n > 2(S + \ln n - \ln 2) > 0$. In particular, it
asserts that the probability of learning a strategy that
fails more often than $h^{-1}$ is less than $h^{-1}$.
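As a numerical check on the closed form $h \approx n / (2(S + \ln n - \ln 2))$, the sketch below compares it with a direct bisection solution of $n = 2h(S + \ln h)$. The function names and the example values ($n = 100$ training images, $S = 3$ generalizations) are hypothetical, and the $(1 + o(1))$ factor is simply dropped.

```python
import math


def robustness_estimate(n, S):
    """Closed form n / (2(S + ln n - ln 2)), without the (1 + o(1)) factor."""
    denom = 2.0 * (S + math.log(n) - math.log(2.0))
    if not n > denom > 0:
        raise ValueError("requires n > 2(S + ln n - ln 2) > 0")
    return n / denom


def robustness_numeric(n, S, iterations=200):
    """Solve n = 2h(S + ln h) for h in [1, n] by bisection."""
    def excess(h):
        return 2.0 * h * (S + math.log(h)) - n

    lo, hi = 1.0, float(n)
    if excess(lo) > 0 or excess(hi) < 0:
        raise ValueError("no solution with 1 <= h <= n")
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if excess(mid) > 0:
            hi = mid
        else:
            lo = mid
    return lo


n, S = 100, 3
h_est = robustness_estimate(n, S)
h_num = robustness_numeric(n, S)
print(f"estimate h = {h_est:.2f}, failure bound 1/h = {1 / h_est:.3f}")
print(f"numeric  h = {h_num:.2f}, failure bound 1/h = {1 / h_num:.3f}")
```

In this example the closed form yields a smaller $h$ than the numeric solution, so the failure bound it implies is the looser (more conservative) of the two.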

5 Conclusion

For  small  domains,  knowledge-directed  vision  systems 
can recognize objects accurately and efficiently, in part
because  knowledge  engineers  are  able  to  select  the  ap¬ 
propriate  procedures  and  devise  efficient  control  pro¬ 
cedures  for  applying  them.  Unfortunately,  as  problem 
domains become more complex, it becomes impossible
to  hand-craft  solutions  to  every  recognition  problem. 
SLS  presents  a  solution  to  this  knowledge  engineer¬ 
ing  problem  by  having  the  system  learn  specialized 
recognition  strategies  from  training  images  and  object 
models,  without  the  aid  of  a  human  knowledge  en¬ 
gineer.  SLS  selects  the  most  appropriate  recognition 
techniques  for  a  target  object  from  a  library  of  visual 
procedures, and builds a strategy that efficiently and
effectively  controls  their  application.  Just  as  impor¬ 
tant,  SLS  does  what  most  knowledge  engineers  were 
unable to do, namely analyze its strategies and predict
their expected cost and robustness.
Abstract

The recognition of repetitive movements characteristic
of walking people, galloping horses,
or flying birds is a routine function of the
human visual system. It has been demonstrated
that humans can recognize such activity
solely on the basis of motion information.
We present a novel computational approach
for detecting such activities in real image
sequences on the basis of the periodic nature
of their signatures. The approach suggests
a low-level, feature-based activity recognition
mechanism. This contrasts with earlier
model-based approaches for recognizing such
activities.

1  Introduction 

The  motion  recognition  ability  of  the  human  visual  sys¬ 
tem  is  remarkable.  People  are  able  to  distinguish  both 
highly  structured  motion,  such  as  that  produced  by 
walking,  running,  swimming  or  flying  birds,  and  more 
statistical  patterns  such  as  that  due  to  blowing  snow, 
flowing  water  or  fluttering  leaves.  More  subtle  move¬ 
ment  characteristics  can  be  distinguished  as  well.  For 
example,  human  observers  can  not  only  distinguish  walk¬ 
ing  from  other  activities,  but  can  also  recognize  a  friend 
walking  at  a  distance  by  his  or  her  gait.  Similar  dis¬ 
crimination  abilities  have  been  observed  in  non-human 
animals.  This  biological  use  of  motion  probably  reflects 
the  fact  that  for  certain  tasks,  visual  motion  provides 
more  effective  cues  than  other  modes  of  visual  percep¬ 
tion.  Motion  is  a  particularly  useful  cue  for  certain  types 
of  recognition  due  to  the  fact  that  it  is  relatively  easy  to 
extract  the  motion  field  independent  of  illumination  and 
shading  of  the  image. 

The  classic  demonstration  of  pure  motion  recognition 
by  humans  is  provided  by  Moving  Light  Display  experi¬ 
ments.  In  these  experiments,  reflective  pads  are  attached 
to  the  joints  of  a  person  and  his  or  her  movements  are 
filmed  against  a  black  background  so  that  only  the  light 
reflected  off  the  pads  is  visible.  When  people  are  shown 
these  images  they  dismiss  single  frames  as  meaningless 
dot  patterns,  but  they  recognize  the  sequential  presen¬ 
tation  of  them  as  walking,  running  or  jumping  etc.  Sub¬ 
jects  can  also  identify  the  actor’s  gender  and  even  iden¬ 


tify  the  actor  if  known  to  them  [Johansson,  1973].  For 
certain  recognition  tasks,  as  illustrated  above,  motion 
alone  is  sufficient. 

Duplication  of  some  of  these  motion  recognition  abil¬ 
ities  in  machine  systems  would  be  useful  in  a  number 
of  applications.  One  area  is  in  automated  surveillance. 
Motion  detection  via  image  differencing  can  be  used  for 
intruder  detection;  however  such  systems  are  subject  to 
false  alarms,  especially  in  outdoor  environments,  since 
the  system  is  triggered  by  anything  that  moves,  whether 
it  be  a  human,  a  dog,  or  a  tree  blown  by  the  wind. 
Motion recognition techniques, both of the discrete and
textural  variety  have  the  potential  to  disambiguate  the 
motions  of  different  origin.  Another  application  is  in 
industrial  monitoring.  Many  manufacturing  operations 
involve a long sequence of simple operations each performed
repeatedly and at high speed by a specialized
mechanism  at  a  particular  location.  It  should  be  possi¬ 
ble  to  set  up  one  or  more  fixed  cameras  that  cover  the 
area  of  interest,  and  to  characterize  the  allowed  motions 
in  each  region  of  the  image(s). 

A  useful  first  step  in  recognizing  motion  from  gray- 
level  image  sequences  is  to  classify  motions  according  to 
the  spatial  and  temporal  uniformity  they  exhibit.  We 
define  temporal  textures  to  be  the  motion  patterns  of 
indeterminate  spatial  and  temporal  extent,  activities  to 
be  motion  patterns  which  are  temporally  periodic  but 
are  limited  in  spatial  extent,  and  motion  events  to  be 
isolated  simple  motions  that  do  not  exhibit  any  tem¬ 
poral  or  spatial  repetition.  Examples  of  temporal  tex¬ 
tures  include  wind  blown  trees  or  grass,  turbulent  flow 
in  cloud  patterns,  ripples  on  water,  the  motion  of  a  flock 
of  birds  etc.  Examples  of  activities  are  walking,  running, 
or  flying  individual,  rotating  or  reciprocating  machinery, 
etc.  Examples  of  motion  events  are  isolated  instances  of 
opening  a  door,  starting  of  a  car,  throwing  a  ball  etc. 

It  turns  out  that  temporal  textures  can  be  effectively 
treated  with  statistical  techniques  analogous  to  those 
used  in  gray-level  texture  discrimination.  A  previous 
paper  [Polana  and  Nelson,  1992]  describes  this.  Activ¬ 
ities  and  motion  events,  on  the  other  hand,  are  more 
discretely  structured,  and  techniques  similar  to  those 
used  in  static  object  recognition  would  be  expected  to 
be  useful  in  their  classification.  Since  different  sorts  of 
techniques  must  be  used  to  distinguish  the  different  sorts 
of motion, it would be useful to have a method for making
a preliminary classification of the motions present in
an  image.  In  this  paper,  we  describe  a  robust  method 
for  detecting  and  localizing  periodic  activities,  including 
ones,  such  as  walking  or  flying,  that  involve  simultane¬ 
ous  translation  of  the  actor.  The  method  is  based  on 
frequency domain analysis of an image in which low-level
motion  information  has  been  used  to  isolate  and  track 
likely  locations  of  activity.  The  method  also  suggests 
a  way  of  using  low-level  structural  features  to  classify 
activities  once  they  have  been  detected. 

2  Related  Work 

Although motion plays an important role in biological
recognition tasks, motion recognition in general has
received little attention in the literature compared to the
volume of work on static object recognition. Most
computational work on motion, in fact, has been
concerned with various aspects of the structure-from-motion
problem. There is a large body of psychophysical literature
addressing the perception of motion, most of it
concerned  with  primitive  percepts.  A  modest  amount 
of this work addresses more complicated motion recognition
issues [Johansson, 1973, Cutting, 1981, Hoffman
and Flinchbaugh, 1982, Hildreth and Koch, 1987], but
the  models  and  descriptions  have  typically  not  been  im¬ 
plemented.  Various  computational  models  of  tempo¬ 
ral  structure,  have  been  proposed  (e.g.  [Chun,  1986, 
Feldman,  1988])  but  much  of  this  work  is  at  a  fairly  high 
level  of  abstraction,  and  has  not  actually  been  applied 
to  visual  motion  recognition  except  in  rather  artificial 
tests. 

Goddard  [1989]  considers  recognizing  event  sequences 
from  Moving  Light  Display  (MLD)  images.  His  work 
addresses  the  representation  of  motion  event  sequences 
and  their  recognition  assuming  certain  invariant  im¬ 
age  features.  His  input  consists  of  the  joint  angles 
and  angular  velocities  computed  from  the  motion  of 
the  dots  in  the  light  displays.  The  joint  angles  and 
angular  velocities  are  invariant  to  rotation  in  the  im¬ 
age  plane,  scale  and  translation.  A  challenging  part 
in  computing  these  invariants  is  to  recover  the  con¬ 
nectivity  of  the  individual  dots  (by  body  parts)  in  the 
MLD  images.  A  domain  independent  approach  to  this 
problem  is  given  by  Rashid.  Rashid  [Rashid,  1980, 
O’Rourke and Badler, 1980] considered the computational
interpretation of moving light displays, particularly
in the context of gait determination. This work emphasized
rather high-level symbolic models of temporal
sequences,  an  approach  made  possible  by  the  discrete  na¬ 
ture  of  the  moving  light  displays.  The  results  were  quite 
sensitive  to  discrete  errors  and  thus  highly  dependent 
on  the  ability  to  solve  the  correspondence  problem  and 
accurately  track  joint  and  limb  positions.  This  severely 
limits  the  general  applicability  of  the  method. 

Anderson et al. [Anderson et al., 1985] describe a
method of change detection for surveillance applications
based  on  the  spectral  energy  in  a  temporal  difference 
image.  This  was  not  generalized  to  other  motion  fea¬ 
tures  or  more  sophisticated  recognition.  Gould  and  Shah 
[1989]  represent  motion  characteristics  of  moving  objects 
by  recording  the  important  events  in  their  trajectory. 


They  propose  the  use  of  the  resulting  trajectory  primal 
sketch  in  a  motion  recognition  system.  Koller,  Heinze 
and Nagel [1991] developed a system that tracks moving
vehicles  and  characterizes  their  trajectory  segments  in 
terms  of  natural  language  concepts. 

A  few  studies  have  considered  highly  specific  aspects 
of  motion  recognition  computationally.  Pentland  [Pent- 
land  and  Mase,  1989]  considered  lip  reading,  and  imple¬ 
mented  a  system  that  could  recognize  spoken  digits  with 
70%-90%  accuracy  over  5  speakers.  The  system  required 
the  location  of  the  lips  to  be  entered  by  hand,  and  de¬ 
pended  on  an  explicitly  constructed  lip  model.  Some 
temporal  pattern  recognition  work  has  been  done  in  the 
context  of  speech  processing  [Juang  and  Rabiner,  1985, 
Tank and Hopfield, 1987, Elman, 1988]. But the applicability
of the techniques to motion recognition has not
been  considered. 

3  Activity  Detection 

Activities involve a regularly repeating sequence of motion
events. If we consider an image sequence as a spatio-temporal
solid with two spatial dimensions and one time
dimension,  then  repeated  activity  tends  to  give  rise  to 
periodic  or  semi-periodic  gray  level  signals  along  smooth 
curves  in  the  image  solid.  We  refer  to  these  curves  as 
reference  curves.  If  these  curves  could  be  identified  and 
samples  extracted  along  them  over  several  cycles,  then 
frequency  domain  techniques  could  be  used  in  order  to 
judge  the  degree  of  periodicity. 

Two  important  cases  are  stationary  activities,  and  ac¬ 
tivities  that  result  in  a  more  or  less  uniform  translation 
of  the  actor,  i.e.  locomotory  activities.  If  the  activity  is 
stationary,  the  reference  curves  are  lines  parallel  to  the 
temporal  dimension.  For  example,  a  circularly  rotating 
ring  gives  rise  to  a  temporally  periodic  signal  at  every 
pixel.  In  the  case  of  uniform  translation,  the  curves  are 
straight  lines  at  some  angle  that  depends  on  the  velocity. 
For  general  translation  and  perspective  projection,  the 
lines  associated  with  a  given  actor  form  a  bundle  with 
a  common  intersection,  the  vanishing  point.  For  many 
practical  situations,  however,  the  vanishing  point  is  far 
enough  removed  that  the  lines  can  be  considered  to  be 
effectively  parallel. 

Consider,  for  example  the  case  of  walking.  This  is 
an  example  of  a  non-stationary  activity;  that  is,  if  we 
attach  a  reference  point  to  the  person  walking,  that 
point  does  not  remain  at  one  location  in  the  image.  If 
the  person  is  walking  with  constant  velocity,  however, 
and is not too close to the camera, then the reference point
moves  across  the  image  on  a  path  composed  of  a  con¬ 
stant  velocity  component  modulated  by  whatever  pe¬ 
riodic  motion  the  reference  point  undergoes.  Thus,  if 
we  know  the  average  velocity  of  the  person  over  sev¬ 
eral  cycles,  we  can  compute  the  spatio-temporal  line  of 
motion  along  which  the  periodicity  can  be  observed.  If 
the person moves with average velocity $(u, v)$, the spatio-temporal
line of motion will be determined by the equations
$(x, y) = (u, v)\,t + (x_0, y_0)$, where $(x, y)$ is the
position of the object in space at time $t$ and $(x_0, y_0)$ is
the position at time zero. This applies to any object
undergoing  constant  velocity  locomotion. 
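A minimal sketch of how a gray-level signal might be read off along such a line, one sample per frame at position $(u, v)\,t + (x_0, y_0)$, is given below. The nested-list image representation, the nearest-pixel rounding, and the function name are assumptions made for illustration; a stationary activity is simply the special case $u = v = 0$.

```python
def sample_reference_line(frames, x0, y0, u, v):
    """Gray levels along (x, y) = (u, v) * t + (x0, y0), one per frame.

    frames is a list of images, each a list of rows of gray levels.
    Sampling stops when the line leaves the image.
    """
    signal = []
    for t, frame in enumerate(frames):
        x = int(round(x0 + u * t))   # nearest pixel (assumption)
        y = int(round(y0 + v * t))
        if not (0 <= y < len(frame) and 0 <= x < len(frame[0])):
            break
        signal.append(frame[y][x])
    return signal


# Synthetic sequence: a point moving one pixel per frame along row 2,
# bright on odd frames and dark on even frames.
frames = [[[0] * 6 for _ in range(3)] for _ in range(5)]
for t in range(5):
    frames[t][2][t] = 10 * (t % 2)

print(sample_reference_line(frames, 0, 2, 1, 0))
print(sample_reference_line(frames, 1, 2, 0, 0))
```

Sampled along the moving line, the point yields an alternating (periodic) signal; sampled at a fixed pixel, the same sequence shows only a single isolated blip.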
3.1 Periodicity Detection

From  Fourier  theory  we  know  that  any  periodic  signal 
can  be  decomposed  into  a  fundamental  and  harmonics. 
That  is,  we  can  consider  the  energy  of  a  periodic  signal  to 
be  concentrated  at  frequencies  which  are  integral  multi¬ 
ples  of  some  fundamental  frequency.  This  implies  that  if 
we  compute  the  discrete  Fourier  transform  of  a  sampled 
periodic  signal,  we  will  observe  peaks  at  the  fundamen¬ 
tal  frequency  and  its  harmonics.  Hence,  in  theory,  the 
periodicity  of  a  signal  can  be  detected  by  obtaining  its 
Fourier  transform  and  checking  whether  all  the  energy 
in  the  spectrum  is  contained  in  a  fundamental  frequency 
and  its  integral  multiples. 

Real-world signals, however, are seldom perfectly
periodic.  In  the  case  of  signals  arising  from  activity 
in  image  sequences,  disturbances  can  arise  from  errors 
in  the  uniform  translation  assumption,  varying  back¬ 
ground  and  lighting  behind  a  locomoting  actor,  and 
other  sources.  In  addition,  for  computational  purposes, 
we  need  to  truncate  the  signal  at  some  finite  length  which 
may  not  be  an  exact  integral  multiple  of  its  period.  Nev¬ 
ertheless,  the  frequency  defined  by  the  highest  amplitude 
often  represents  the  fundamental  frequency  of  the  signal. 
Hence  we  can  get  an  idea  of  the  periodicity  in  a  signal  by 
summing  the  energy  at  the  highest  amplitude  frequency 
and  its  multiples,  and  comparing  that  quantity  to  the 
energy  at  the  remaining  frequencies.  In  practice,  since 
peaks  in  a  Fourier  transform  tend  to  be  slightly  broad¬ 
ened  for  a  variety  of  reasons,  including  the  finite  length  of 
the sample, we define the periodicity measure $p_f$ of a signal
$f$ as a normalized difference of the sum of the power
spectrum values at the highest amplitude frequency and
its multiples, and the sum of the power spectrum values
at the frequencies halfway between. That is,

$$p_f = \frac{\sum_i F_{iw} - \sum_i F_{iw + w/2}}{\sum_i F_{iw} + \sum_i F_{iw + w/2}}$$

where $F$ is the energy spectrum of the signal $f$ and $w$ is
the frequency corresponding to the highest amplitude in
the energy spectrum.

The  measure  is  normalized  with  respect  to  the  total 
energy  at  the  frequencies  of  interest  so  that  it  is  one  for  a 
completely  periodic  signal  and  zero  for  a  flat  spectrum. 
In  general,  if  a  signal  consists  of  frequencies  other  than 
one  single  fundamental  and  its  multiples,  its  periodicity 
measure  will  be  low. 
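The measure can be sketched with a direct (non-FFT) Fourier sum: energy at the peak frequency $w$ and its multiples counts for periodicity, and energy at the frequencies halfway between counts against it. Several details here are assumptions rather than the authors' choices: the mean is removed first, the peak is searched over integer bins up to the Nyquist bin, the index $i$ starts at 1 for both sums, and half-way frequencies are evaluated at fractional bins.

```python
import cmath
import math


def power(signal, k):
    """Energy of `signal` at (possibly fractional) DFT bin k."""
    n = len(signal)
    s = sum(x * cmath.exp(-2j * math.pi * k * t / n)
            for t, x in enumerate(signal))
    return abs(s) ** 2


def periodicity(signal):
    """Return (w, p_f): the peak bin and the periodicity measure."""
    n = len(signal)
    mean = sum(signal) / n
    centered = [x - mean for x in signal]      # drop the DC term
    nyquist = n // 2
    w = max(range(1, nyquist + 1), key=lambda k: power(centered, k))
    on = off = 0.0
    i = 1
    while i * w <= nyquist:
        on += power(centered, i * w)           # peak and its multiples
        if i * w + w / 2 <= nyquist:
            off += power(centered, i * w + w / 2)   # halfway between
        i += 1
    total = on + off
    return w, ((on - off) / total if total > 0 else 0.0)


N = 32
pure = [math.sin(2 * math.pi * 4 * t / N) for t in range(N)]
mixed = [math.sin(2 * math.pi * 4 * t / N)
         + 0.9 * math.sin(2 * math.pi * 6 * t / N) for t in range(N)]
print(periodicity(pure))
print(periodicity(mixed))
```

The pure sinusoid scores close to one, while adding a second component that is not a multiple of the fundamental pulls the measure down sharply.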

Because the signal along any given reference curve in
the  image  solid  may  be  ambiguous,  we  need  a  way  of 
combining  periodicity  measures  of  a  number  of  signals 
from  reference  curves  associated  with  the  same  actor. 
The  simplest  idea  would  be  simply  to  sum  the  power 
spectra  of  the  various  signals,  and  apply  the  periodic¬ 
ity  measure  to  the  resultant  curve.  Unfortunately,  this 
does  not  work,  primarily  because,  although  there  is  a 
fair  amount  of  energy  at  the  fundamental  frequency,  and 
quite  a  few  signals  in  which  high  periodicity  is  present, 
there  are  also  a  lot  of  samples  where  the  periodicity  is 
not evident, or which appear periodic at some other frequency.
The net effect is that all this energy at other frequencies
can swamp the main signal if they are combined
additively. What does work is a form of non-maximum


suppression,  where  the  periodicity  measure  is  obtained 
for  each  power  spectrum  separately.  Each  frequency  w 
is then assigned a value equal to the sum of the periodicity
measures $p_f$ from all the signals whose highest
amplitude occurred at that frequency. The result is the
same as suppressing all but the maximum frequency in
each transform, weighting each by the periodicity measure
of the signal, and summing them. The maximum
value  of  this  combined  signal  is  taken  as  the  fundamental 
frequency,  and  the  associated  periodicity  measure  is  the 
average  of  the  periodicity  measures  of  the  contributing 
signals. 

Thus,  the  periodicity  measure  P  for  an  entire  image 
sequence  is  defined  as 

$$P = \max_w \left( P_w / n_w \right)$$

where $n_w$ and $P_w$ are the number of pixels at which
the  highest  amplitude  frequency  is  w  and  the  sum  of 
periodicity  measures  at  those  pixels  respectively. 
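
The combination step can be sketched as follows, assuming each signal has already been reduced to a pair (periodicity measure, dominant frequency). The sketch follows the prose description: the frequency with the largest summed measure is taken as the fundamental, and the reported value is the average measure of the signals that voted for it. The function name `combined_periodicity` is hypothetical.

```python
from collections import defaultdict

def combined_periodicity(measures):
    """Combine per-signal results by non-maximum suppression.

    measures: iterable of (p_f, w) pairs, one per reference-curve signal,
    where w is that signal's highest-amplitude frequency.
    Returns (P, w_fundamental).
    """
    sums = defaultdict(float)    # P_w: summed measures voting for frequency w
    counts = defaultdict(int)    # n_w: number of signals voting for w
    for p, w in measures:
        sums[w] += p
        counts[w] += 1
    if not sums:
        raise ValueError("no signals supplied")
    w_star = max(sums, key=sums.get)          # peak of the combined signal
    return sums[w_star] / counts[w_star], w_star
```

Because each signal contributes only at its own dominant frequency, energy that a weakly periodic signal carries elsewhere in its spectrum cannot swamp the fundamental.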

Finally,  in  order  to  apply  the  technique  to  real  data, 
we  need  a  way  of  extracting  reference  curves  and  the 
associated  signals  from  an  image  sequence.  In  the  fol¬ 
lowing,  we  assumed  that  any  activity  that  existed  in  the 
data  would  be  either  stationary,  or  locomotory  in  a  man¬ 
ner  that  produced  an  overall  translatory  motion.  We 
also  assumed  that  there  was  at  most  one  actor  in  the 
scene,  though  a  certain  amount  of  background  motion 
could  be  tolerated.  The  first  assumption  turns  out  not 
to  be  too  restrictive  -  a  large  number  of  natural  periodic 
activities  fit  into  one  of  the  two  categories.  The  second 
can  be  relaxed  with  some  additional  preprocessing.  The 
first  step  is  to  identify  locations  in  the  scene  where  move¬ 
ment  of  any  sort  is  occurring.  This  is  done  by  computing 
the  normal  flow  magnitude  at  each  pixel  between  each 
successive  pair  of  frames  using  a  spatio-temporal  deriva¬ 
tive  method.  Those  pixels  at  which  significant  motion  is 
present  are  marked,  and  the  centroid  of  the  marked  pix¬ 
els  computed  in  each  frame.  The  mean  velocity  (if  any) 
of  the  actor  is  then  computed  by  fitting  a  linear  tra¬ 
jectory  to  the  sequence  of  centroids.  This  is  where  the 
one-actor  assumption  comes  into  play.  If  several  actors 
were  present,  simple  clustering  techniques  could  be  used 
to isolate the regions in the scene corresponding to different
activities. The reference curves were taken to be the lines
in  the  spatiotemporal  solid  parallel  to  that  generated  by 
the  linear-fitted  trajectory  of  the  centroid.  Signals  were 
extracted  along  these  curves,  and  those  that  displayed 
significant  spread  over  a  period  of  at  least  half  as  long 
as  the  signal  were  selected  for  processing.  This  had  the 
effect  of  eliminating  the  need  to  process  regions  in  which 
no  motion  occurred,  as  well  as  regions  affected  only  by 
an  occasional  blip. 
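
The trajectory fit and signal extraction can be sketched as below. This is an illustration under stated assumptions rather than the procedure actually used: the motion masks are taken as given (for example, from thresholding the normal flow magnitude, which is not shown), the fit is an ordinary least-squares line per coordinate, and signals are sampled at the nearest pixel. The function names are hypothetical.

```python
import numpy as np

def fit_linear_trajectory(masks):
    """Fit a constant-velocity path to the centroids of marked pixels.

    masks: boolean array of shape (T, H, W); True marks significant motion.
    Returns ((vx, vy), (x0, y0)): velocity in pixels/frame, position at t = 0.
    """
    times, xs, ys = [], [], []
    for t, mask in enumerate(masks):
        rows, cols = np.nonzero(mask)
        if rows.size == 0:
            continue                     # no motion marked in this frame
        times.append(t)
        xs.append(cols.mean())
        ys.append(rows.mean())
    vx, x0 = np.polyfit(times, xs, 1)    # least-squares line x(t)
    vy, y0 = np.polyfit(times, ys, 1)    # least-squares line y(t)
    return (vx, vy), (x0, y0)

def reference_signal(frames, start, velocity):
    """Gray-level signal along a line parallel to the fitted trajectory.

    frames: array of shape (T, H, W); start: (x, y) position at t = 0.
    Samples falling outside the image are dropped.
    """
    n_frames, height, width = frames.shape
    t = np.arange(n_frames)
    x = np.rint(start[0] + velocity[0] * t).astype(int)
    y = np.rint(start[1] + velocity[1] * t).astype(int)
    inside = (x >= 0) & (x < width) & (y >= 0) & (y < height)
    return frames[t[inside], y[inside], x[inside]]
```

Calling `reference_signal` once per starting pixel yields the family of parallel reference curves; a stationary activity is simply the case where the fitted velocity is near zero.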

3.2  Experiments 

We  ran  experiments  on  four  different  activities,  and  a 
number  of  non-periodic  motions.  The  sequences  were 
first  recorded  on  video  and  then  digitized  later  with  suit¬ 
able  temporal  sampling  so  that  at  least  four  cycles  of  the 
activity were captured in 128 frames. Following is a description
of each activity and the conditions under which
it was digitized.
•  Walk: A person walking across a room viewed in
profile. Six sequences of 128 frames of size 128x128
pixels were obtained. Half the sequences showed
one person and the other half a second person.

•  Exercise:  A  person  performing  jumping  jacks.  Four 
sequences  of  128  frames  of  128x128  pixels,  two  each 
of  two  different  people. 

•  Swing:  A  person  swinging  viewed  from  the  side. 
Six  sequences  of  128  frames  of  128x128  pixels,  three 
each  of  two  different  people. 

•  Frog:  A  toy  frog  simulating  swimming  activity 
viewed  from  above.  Four  sequences  of  128  frames 
of  64x256  pixels. 

•  Nonperiodic:  Various  sequences  taken  from  televi¬ 
sion  shows  and  live  outdoor  shots:  splashing  wa¬ 
ter,  closeup  of  crowd  at  a  political  rally,  a  plane 
flying  overhead,  a  robot  hand  picking  up  and  ma¬ 
nipulating  objects  (2  sequences),  the  input  to  an 
eye  tracker  (eyeball  movements),  leaves  fluttering 
in  the  wind,  turbulent  flow  in  a  stream.  In  all,  8 
sequences  of  128  frames  of  128x128  pixels. 

The  swing  and  exercise  activities  were  shot  outdoors  and 
contained  background  motion  as  well.  Among  the  peri¬ 
odic  activities,  a  single  sequence  of  uniform  rotation  is 
included as well. Sample images of periodic activities
are shown in Figure 1.

The  periodicity  measures  computed  using  the  above 
algorithm  are  plotted  for  all  20  periodic  and  all  8  non¬ 
periodic  sequences  in  figure  2.  As  is  evident  from  the 
graphs and the projected scatter plot, the technique separates
complex periodic from non-periodic motion nicely.
The requirement that an empirically determined threshold
be used is not a great drawback in this case, nor
is it particularly surprising, since even the intuitive
notion  of  periodic  activity  falls  on  a  continuum.  Is  the 
motion  of  a  branch  waving  somewhat  irregularly  in  the 
wind  periodic  or  non-periodic?  Here,  we  classified  it  as 
non-periodic,  but  it  had  one  of  the  higher  periodicity 
measures,  as  might  be  expected. 

4  Discussion 

The periodicity detection method we described satisfies
several desirable invariances. It is invariant to image
illumination,  contrast,  translation,  rotation  and  scale.  It 
is  also  invariant  to  the  magnitude  of  locomotory  motion 
and  the  speed  of  the  activity.  It  is  also  fairly  robust  with 
respect  to  small  changes  in  viewing  angle.  The  period¬ 
icity  measure  does  not  depend  on  the  number  of  pixels 
involved  in  the  activity.  If  desired,  a  restriction  on  the 
minimum  number  of  pixels  can  be  imposed  so  that  only 
activities  of  a  minimum  size  can  be  recognized.  The 
swing  and  exercise  sequences  were  taken  outdoors  where 
there  is  a  small  amount  of  background  motion.  This 
comprises  not  only  moving  trees  and  plants,  but  also 
moving people and the occasional passing car. That periodicity
can be detected even in this case demonstrates
that  the  technique  is  reasonably  tolerant  of  background 
clutter  and  an  occasional  disturbance.  The  technique 
also  provides  a  method  for  localizing  activity  in  the  scene 


by  back-projecting  the  reference  curves  having  high  pe¬ 
riodicity  measures  into  the  image  solid. 

So  far  we  have  assumed  that  the  actors  giving  rise 
to the activity move with constant velocity along linear
paths. The case of nonlinearly moving objects can
be  handled  by  tracking  the  object  of  interest  given  a 
coarse  estimate  of  its  initial  location  and  velocity.  This 
would  generate  reference  curves  that  were  not  straight 
lines.  We  have  already  demonstrated  the  usefulness  of 
the  centroid  of  motion  for  computing  the  velocity  of  lin¬ 
early  moving  objects.  It  could  also  be  used  for  tracking 
the  actors  moving  on  more  complex  trajectories.  Use  of 
the  motion  centroid  can  be  unreliable  in  estimating  the 
centroid  of  the  object  if  the  shape  of  the  object  changes 
as  it  moves.  In  this  case  use  of  a  prediction  and  correc¬ 
tion  mechanism  using  past  values  over  a  sufficiently  long 
period  can  help. 

The  detection  scheme  also  assumes  that  there  is  only 
one  activity  in  the  scene  except  for  some  background 
clutter.  If  there  are  multiple  activities  in  the  scene,  this 
detection technique can still be applied provided the activities
can be spatially isolated so that they do not interfere
with each other. In this case they can be segmented using
the motion information and later tracked separately.
Even an occasional crossing of different activities can be
tolerated  as  long  as  the  regions  can  be  separated  again 
later. 

The  complexity  of  detection  is  proportional  to  the 
number  of  pixels  involved  in  the  activity.  About  half  the 
work  is  computing  the  fast  Fourier  transforms  at  each  of 
the  pixels.  Most  of  the  rest  of  the  time  is  occupied  by 
the  motion  detection  process.  The  detection  procedure 
currently  runs  on  an  SGI  machine  using  four  processors 
and it takes approximately 15 seconds to process a 128
frame  sequence  of  128x128  images. 

4.1  Recognition  of  Activities 

The  first  stage  in  recognizing  an  activity  is  to  detect  that 
an  activity  exists,  and  localize  it  in  the  scene.  This  paper 
has  described  a  technique  for  accomplishing  this.  Future 
work  will  utilize  information  computed  in  the  detection 
stage  for  recognition  and  classification  of  specific  activi¬ 
ties.  The  detection  scheme  utilizes  only  the  magnitude  of 
the  Fourier  transform  to  obtain  the  periodicity  measure. 
The  phase  of  the  Fourier  transform  is  also  computed  at 
each  location  in  the  image  and  we  propose  to  use  this  in¬ 
formation  along  with  other  low-level  information  in  the 
image, for recognition. For example, walking can be described
as a sequence of motion events regularly occurring
at each spatial location. The cycles of motion events
at different spatial locations in the image have a fixed
phase  difference.  These  phase  differences  are  valuable  in 
characterizing  the  activities. 

5  Conclusion 

We  have  described  a  method  of  activity  detection.  This 
technique  uses  a  periodicity  measure  on  gray-level  signals 
extracted  along  spatio-temporal  reference  curves.  We 
have  illustrated  the  technique  using  real-world  examples 
of  activities,  and  shown  that  it  robustly  detects  complex 
periodic  activities,  while  excluding  non-periodic  motion. 
Figure 2: Periodicity measure for periodic and nonperiodic sequences


We  proposed  a  technique  to  recognize  these  activities  us¬ 
ing  the  detection  scheme  described  here.  It  is  not  clear 
how  much  the  periodicity  alone  is  useful  for  recognition 
but  we  believe  the  phase  information  is  valuable  for  ac¬ 
tivity  recognition.  Future  work  will  concentrate  on  the 
development  of  robust  phase  features  that  can  be  used  in 
conjunction  with  previously  developed  motion  and  gray- 
level  features  to  classify  activities. 
Figure 1: Sample images from periodic sequences: walk, exercise, swing and rotation
Abstract

We  hypothesize  that  selective  perception  allows 
more  accurate  solutions  to  visual  tasks  to  be 
found  in  less  wall-clock  time  than  non-selective 
techniques.  The  best  way  to  assess  the  practi¬ 
cal  truth  of  this  hypothesis  is  by  studying,  de¬ 
signing  and  building  complete  vision  systems 
—  the  issues  are  fundamentally  systems  issues. 

On  the  other  hand,  special-case  systems  are  not 
convincing:  we  present  the  T-world  problem  as 
an  abstraction  of  an  interesting  class  of  real- 
world  vision  problems.  T-world  has  enough 
structure  to  support  basic  study  of  fundamen¬ 
tal  tradeoffs  inherent  in  selective  computer  per¬ 
ception.  Our  complete  system  is  called  TEA-1: 
it  is  a  purposive  and  sufficing  vision  system  that 
solves  a  version  of  the  T-world  problem.  TEA- 
1  is  a  fully  implemented  system,  and  extensive 
experiments  in  the  laboratory  and  simulation 
have  explored  the  key  factors  that  make  the 
selective  perception  approach  appealing,  ana¬ 
lyzing  how  each  factor  affects  the  overall  per¬ 
formance  when  solving  a  set  of  automatically 
generated  (in  simulation)  T-world  domains  and 
tasks. 

1  Selective  Perception 

1.1  General  Concepts 

Purposive  vision.  A  purposive  vision  system  works 
to achieve a goal (i.e., solve a visual task) in minimal
wall-clock  time.  Goal-directed  operation  can  make  the 
system  fast  by  limiting  the  amount  of  data  processed  and 
by  limiting  the  extent  to  which  that  data  is  processed. 

Sufficing  vision.  Sufficing  vision  is  the  use  of  (usu¬ 
ally  simple,  cheap,  and  general)  vision  modules  whose 
output  is  ambiguous  unless  considered  in  a  known  con¬ 
text. 

It  is  essentially  impossible  to  design  a  vision  module 
that  performs  well  in  all  possible  contexts.  However,  it 
is  quite  possible  that  through  partial  visual  analysis  or 
prior  knowledge,  the  vision  system  has  some  informa¬ 
tion  about  the  situation  in  the  scene  and  the  contexts 

*This material is based on work supported by DARPA
Contract  MDA972-92-J-1012.  The  Government  has  certain 
rights  in  this  material. 


in  which  objects  appear  or  are  expected  to  appear.  The 
idea  of  sufficing  vision  is  that  vision  modules  are  designed 
for  and  only  executed  in  such  contexts.  For  example, 
when  looking  for  the  carrots  at  a  dinner  table  it  may  be 
sufficient  to  look  for  a  big  blob  of  orange  and  then  check 
the  orange  things  are  roughly  elongated.  In  another  situ¬ 
ation  it  may  be  sufficient  simply  to  look  for  a  big  blob  of 
orange.  Two  things  are  required  in  order  for  sufficing  vi¬ 
sion  to  work.  First,  a  system  is  needed  that  establishes 
contexts and uses them to specify exactly what vision
modules  to  run.  Second,  a  large  repertoire  of  flexible 
vision  modules  is  needed. 

Historically,  vision  modules  have  been  engineered  to 
produce  relatively  high-level  outputs,  ones  humans  can 
reason about. Sufficing vision systems may be less transparent:
a human may find it difficult to understand or
specify  exactly  what  a  vision  module  in  a  sufficing  vi¬ 
sion  system  is  really  doing  and  why,  since  the  context 
may  not  be  known  to  the  human  and  the  significance  of 
the  extracted  information  within  that  context  may  not 
be  obvious.  While  in  general  it  may  be  difficult  to  design 
and  integrate  such  vision  modules,  there  are  two  causes 
for optimism. First, learning techniques may be able
to  tune  and  select  modules  in  the  context  of  the  whole 
system,  even  though  humans  can  not  easily  specify  the 
modules  explicitly  and  a  priori.  Second,  one  aspect  of 
the  sufficing  vision  idea  is  that  existing  (relatively  sim¬ 
ple)  vision  modules  may  be  more  useful  than  they  may 
seem,  when  intelligently  applied  and  interpreted  within 
specific  contexts. 

Selective  perception.  A  selective  perception  system 
is  purposive  and  sufficing. 

Control  of  selective  perception.  The  general 
problem  that  we  are  interested  in  is  to  control  (select 
actions  and  make  decisions  in)  a  computer  vision  system 
that  has  a  repertoire  of  actions  (sufficing  vision  modules) 
such  that  the  system  operates  in  a  purposive  manner. 

We make the following general assumptions: The computer
vision system has (high-level) knowledge about a
domain  and  a  set  of  tasks.  A  repertoire  of  actions  is 
available  that  provides  many  different  ways  to  obtain  im¬ 
perfect  information  about  a  scene.  Characteristics  (re¬ 
liability,  time  needed)  of  the  actions  are  known.  The 
computer  vision  system  has  a  pointable  vision  sensor, 
requiring  that  it  make  decisions  about  moving  the  vision 
sensor. 
The tasks the system must solve are in one sense very
simple.  If  a  complete  symbolic  description  of  the  scene 
were  available,  a  subset  of  that  information  directly  de¬ 
termines  the  solution  to  a  task.  The  problem  is  that  (a) 
the  system  must  try  to  retrieve  information  about  each 
member  of  the  subset  one  at  a  time,  (b)  the  scene  is 
complex  so  it  is  difficult  to  locate  the  desired  informa¬ 
tion,  and  (c)  only  imperfect  information  can  be  obtained. 
The visual actions are parameterized, providing information
of varying quality. The control problem, crudely
described as deciding “what to do next”, really consists
of several problems: what to look for next, where to look
next,  and  how  to  look  for  things. 

1.2  Hypotheses 

Our  goal  is  to  prove  or  at  least  support  the  following 
hypotheses. 

A  (high-level)  computer  vision  system  (with  limited 
resources  and  working  in  complex  environments)  that  is 
purposive  and  sufficing  is  better  than  one  that  is  not, 
meaning  that  it  solves  tasks  in  less  wall-clock  time. 

Bayesian  and  decision  theoretic  techniques  offer  a 
sound  formal  basis  for  control  of  a  purposive  and  suf¬ 
ficing  computer  vision  system. 

A  control  algorithm  based  on  a  carefully  designed  eval¬ 
uation  function  with  lookahead  can  implement  purposive 
and  sufficing  vision. 

The  performance  of  a  purposive  and  sufficing  vision 
system  depends  on  the  amount  of  “structure”  there  is  in 
a  scene,  meaning  the  various  ways  that  objects  may  be 
grouped  in  a  scene  and  the  spatial  relationships  between 
objects  and  groups. 

A  sufficing  computer  vision  system  requires  a  large 
repertoire  of  actions  whose  performance  characteristics 
are  controlled  by  a  set  of  parameters. 

1.3  Fundamental  Contributions 

We  believe  that  the  best  way  to  address  many  of  our  hy¬ 
potheses  is  by  studying,  designing  and  building  complete 
vision  systems  —  the  issues  are  fundamentally  systems 
issues.  The  contributions  of  our  work  so  far  are  summa¬ 
rized  below. 

We  define  the  T-world  problem,  a  simple  class  of  vi¬ 
sion  problems  that  still  contains  many  of  the  key  factors 
motivating  the  selective  perception  approach.  We  be¬ 
lieve  T-world  is  an  adequate  problem  for  easily  studying 
some  of  the  basic  issues  in  selective  computer  perception, 
and  that  the  T-world  problem  can  be  mapped  to  a  va¬ 
riety  of  real-world  applications  [Rimey,  1993].  We  have 
worked  with  the  abstract  T-world  problem  in  static  and 
dynamic  scene  simulations  and  a  real-world  application 
(dinner  table  scenes)  in  the  lab. 

We  present  the  TEA-1  system,  an  example  of  a  pur¬ 
posive  and  sufficing  vision  system  that  solves  a  version 
of  the  T-world  problem.  Analysis  and  experiments  are 
provided  that  explore  the  advantages  of  the  TEA-1  sys¬ 
tem’s  purposive  and  sufficing  behavior  on  the  T-world 
problem.  We  explore  the  key  factors  that  make  the  selec¬ 
tive  perception  approach  appealing,  analyzing  how  each 
factor  affects  the  overall  performance  when  solving  a  set 
of  automatically  generated  (in  simulation)  T-world  do¬ 


mains  and  tasks.  One  crucial  factor  is  the  amount  and 
nature  of  structure  in  the  organization  of  objects  in  the 
scene. 

TEA-1  shows  one  reasonable  way  to  design  a  computer 
vision  system  using  Bayes  nets  (aka  influence  diagrams) 
and  decision  theory.  Other  computer  vision  and  robot 
systems  have  been  built  that  use  decision  theoretic  tech¬ 
niques,  Bayes  nets,  Dempster-Shafer  or  similar  modeling 
techniques  (see  Section  5),  usually  in  the  lower-level  ca¬ 
pacity  of  supporting  sensor  fusion  for  single  object  recog¬ 
nition  or  building  environmental  descriptions.  TEA-1 
addresses  high-level  computer  vision,  is  one  of  only  a 
few  systems  that  consider  how  to  control  an  actively 
pointable  sensor,  and  is  the  first  to  emphasize  purposive 
and  sufficing  control  in  high-level  vision.  Since  TEA-1  is 
a  fully  implemented  system,  we  have  been  able  to  per¬ 
form  extensive  experiments  in  simulation  and  in  the  lab, 
and  to  analyze  factors  that  affect  selective  perception  — 
these  are  fundamentally  systems  issues  —  using  complete 
runs  on  a  large  number  of  scenes. 

There  are  many  approaches  to  control  in  a  selective 
perception  system  ranging  from  brute-force  and  heuris¬ 
tic  search  through  hand-crafted  evaluation  functions  to 
a  formal  planning  system.  Our  work  explores  this  spec¬ 
trum  of  choices,  studying  and  experimenting  with  some 
of  the  choices  and  exploring  the  issues  in  control  and 
how  each  choice  deals  with  those  issues. 

2  The  T-world  Problem 

This  section  defines  the  T-world  problem,  a  formaliza¬ 
tion  of  some  key  problem  characteristics  that  can  be  ex¬ 
ploited  by  a  selective  perception  system. 

A  scene.  A  scene  consists  of  many  objects  within 
a  large  two-dimensional  rectangular  area.  Each  object 
has  a  location  (for  its  centroid),  rectangular  dimensions, 
a  type,  and  a  set  of  properties.  Each  property  has  a 
set  of  possible  values.  There  may  be  any  number  of 
objects  in  the  scene.  Objects  may  overlap  each  other, 
but  this  does  not  affect  the  performance  of  a  visual  action 
(see  below).  The  objects  in  the  scene  may  be  organized 
into  a  set  of  mutually  exclusive  groups,  and  groups  may 
have  subgroups,  subsubgroups,  etc.  Subgroup  structure 
is  determined  by  the  domain  rules  (see  below). 

The  sensor.  The  sensor,  called  a  camera,  is  a  fixed- 
size  rectangular  window  that  is  in  the  plane  of  the  scene 
and  is  much  smaller  than  the  extent  of  the  scene.  The 
window may be moved (by a camera movement action,
see below) to any location in the scene, specified
by  the  coordinates  for  the  center  of  the  window  in  a 
two-dimensional  world  coordinate  system.  The  window 
defines the camera’s field of view. A fixed-size rectangular
window, called the “fovea”, is much smaller than the
field  of  view’s  size,  and  may  be  moved  around  inside  the 
field  of  view. 

A  low  spatial-resolution  image  that  covers  the  entire 
field  of  view  is  available  and  is  called  the  “peripheral 
image”. A high spatial-resolution image that covers the
fovea is available and is called the “foveal image”. (Alternatively,
a resolution pyramid of images may be used.)

A  domain.  A  “domain”  consists  of  a  set  of  scene 
types  and  a  set  of  probabilistic  rules  for  each  scene  type 
that specifies the number, type, location (and grouping
structure)  and  properties  of  objects  in  a  scene. 

A  task.  A  task  is  defined  as  determining  the  value 
of  a  task  variable,  which  is  a  (probabilistic)  function  of 
a  subset  of  the  number,  type,  location  and  properties  of 
objects  in  the  scene. 

Camera  movement  actions.  Given  a  specified  lo¬ 
cation  in  the  scene  (in  the  two-dimensional  world  coor¬ 
dinate  system)  a  “camera  movement  action”  moves  the 
camera  so  it  is  centered  on  the  specified  location.  This 
action  always  moves  the  camera  exactly  to  that  loca¬ 
tion.  The  time  to  execute  a  camera  movement  action  is 
a  function  of  the  distance  moved. 

Visual  actions.  Visual  actions  try  to  obtain  infor¬ 
mation  about  the  portion  of  the  scene  visible  inside  the 
field of view (i.e., from image data). There is a large
collection  of  visual  actions  designed  to  obtain  many  dif¬ 
ferent  types  of  information.  We  currently  use  two  types 
of  visual  action:  one  tries  to  detect  a  specific  type  of  ob¬ 
ject  in  an  image,  and  the  other  tries  to  obtain  the  value 
of  a  specified  property  of  a  specified  object  in  an  image. 

The  behavior  of  an  action  depends  on  whether  the 
target  object  is  truly  in  the  field  of  view  or  not,  the  true 
type,  location  and  properties  of  the  target  object  and 
of  all  the  other  objects  in  the  field  of  view.  An  action 
may  have  a  precondition  that  must  be  satisfied  before 
the  action  can  be  executed. 

The  performance  of  an  action  is  a  function  of  several 
parameters,  which  must  be  specified  for  each  action:  the 
image  resolution  (currently  either  foveal  or  peripheral 
resolution,  and  generally  a  level  in  a  resolution  pyramid), 
the  image  area  to  process,  and  the  length  of  time  to 
process  a  unit  of  image  data.  Note  that  several  actions 
may  have  the  same  purpose,  but  different  performance 
characteristics.  The  time  to  execute  a  visual  action  is  a 
function  of  the  specified  parameters. 

The  problem.  Given  a  scene  from  an  identified  T- 
world  domain  and  a  specified  T-world  task,  the  problem 
is  to  sequentially  collect  evidence  from  the  scene  to  sup¬ 
port  a  decision  about  the  answer  to  the  task,  with  a  de¬ 
sired  level  of  confidence,  so  that  the  total  wall-clock  time 
for  executing  the  actions  is  minimized.  Solving  the  prob¬ 
lem  involves  the  following  general  steps:  decide  what  ac¬ 
tion  to  execute  next,  execute  that  action,  incorporate 
the  results  from  that  action,  decide  on  the  answer  to  the 
task,  and  decide  whether  to  gather  more  evidence  or  to 
stop. 
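
The general steps above can be sketched as a sequential evidence-gathering loop. This is an illustrative simplification, not TEA-1's algorithm: the task variable is assumed to be a single binary hypothesis, each action is assumed to return a yes/no detection with known hit and false-alarm probabilities, and the next action is chosen by a placeholder round-robin rule. All names (`solve_task`, `p_pos_true`, `p_pos_false`, `observe`) are hypothetical.

```python
def bayes_update(belief, like_true, like_false):
    """Posterior P(hypothesis) after an observation with the given likelihoods."""
    numerator = belief * like_true
    denominator = numerator + (1.0 - belief) * like_false
    return numerator / denominator if denominator > 0 else belief

def solve_task(prior, actions, observe, confidence=0.95, max_steps=50):
    """Gather evidence until the answer is known with the desired confidence.

    actions: list of dicts with keys "cost", "p_pos_true", "p_pos_false"
             (time needed, P(fires | hypothesis true), P(fires | false)).
    observe: callable(action) -> bool that executes the action on the scene.
    Returns (answer, belief, elapsed_time).
    """
    belief, elapsed = prior, 0.0
    for step in range(max_steps):
        # Decide whether to gather more evidence or to stop.
        if max(belief, 1.0 - belief) >= confidence:
            break
        # Decide what action to execute next (placeholder: round-robin).
        action = actions[step % len(actions)]
        # Execute the action and account for its cost.
        fired = observe(action)
        elapsed += action["cost"]
        # Incorporate the result.
        if fired:
            belief = bayes_update(belief, action["p_pos_true"],
                                  action["p_pos_false"])
        else:
            belief = bayes_update(belief, 1.0 - action["p_pos_true"],
                                  1.0 - action["p_pos_false"])
    # Decide on the answer to the task.
    return belief >= 0.5, belief, elapsed
```

In a selective system the round-robin line is where the real work happens: it would be replaced by an evaluation of each candidate action's expected value against its cost.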

A  set  of  programs  has  been  written  that  allows  TEA-1 
to  analyze  scenes  and  solve  tasks  in  a  simulated  T-world 
(in  addition  to  a  version  that  runs  in  the  lab).  One  pro¬ 
gram  simulates  an  instance  of  T-world  (scene,  camera, 
actions,  etc.)  as  specified  by  a  database  of  rules  and 
models.  Another  program  automatically  generates  the 
database  files  that  specify  new  instances  of  T-world  do¬ 
mains,  and  scenes  and  tasks  for  each  domain.  The  same 
program  automatically  generates  the  knowledge  repre¬ 
sentation  structures  used  by  the  TEA-1  system. 


3  TEA-1:  A  Decision-Theoretic 
Solution  for  Control  of  Selective 
Perception  in  the  T-world  Problem 

TEA-1  is  an  implemented,  compact,  flexible,  selective 
computer  vision  system,  which  solves  a  version  of  the 
T-world  problem  and  has  a  solid  foundation  of  well- 
established  formalisms  —  Bayesian  statistics  and  deci¬ 
sion  theory.  TEA-1  uses  Bayes  nets  for  representation 
and a cost and benefit analysis extending over action sequences
to decide which visual or non-visual action to
perform next. We believe TEA-1's current design provides
a general software tool sufficient to study a variety
of  basic  issues  in  high-level  and  low-level  selective  anal¬ 
ysis  and  behavior  in  computer  perception. 

A  probabilistic  knowledge  representation  is  appropri¬ 
ate  for  a  selective  system,  and  Bayes  net  and  Dempster- 
Shafer  approaches  are  two  obvious  alternatives.  We 
choose  the  Bayes  net  approach  because  it  is  flexible  and 
easy  to  use,  and  works  well  for  the  variety  of  tasks  and 
domains  we  have  in  mind.  We  developed  a  version  of 
Bayes  nets,  called  a  composite  Bayes  net,  which  con¬ 
sists  of  domain-specific  knowledge  (including  geometri¬ 
cal)  and  a  specification  of  the  desired  task.  The  compos¬ 
ite  net  includes  a  new  application  of  Bayes  nets  to  repre¬ 
sent  relative  object  locations  and  geometric  relations.  A 
task  is  specified  by  a  net  that  makes  explicit  the  relation 
of  evidence  needed  to  accomplish  a  specific  perceptual 
task  to  the  components  of  the  domain-dependent  knowl¬ 
edge  representation. 

We  have  used  generic,  easily-tuned,  sufficing  vision 
utilities (histograms, Hough transforms) from our software
library. These sufficing algorithms are in general
simple  and  fragile;  in  a  known  context  they  are  simple 
and  robust.  Our  goal  is  to  be  able  to  use  intelligently 
whatever  visual  operators  are  at  hand.  The  control  sys¬ 
tem  can  apply  a  vision  module  in  a  very  specific  spatial 
or  semantic  context,  knowing  how  the  context  affects  the 
performance  of  that  module. 

TEA-1's design assumes all the details in the T-world
definition  in  Section  2.  TEA-1  programs  can  transpar¬ 
ently  run  either  with  a  T-world  simulator  providing  in¬ 
put  and  accepting  output  or  in  the  laboratory  (for  a 
dinner  table  domain). 

More  details  about  TEA-1  are  available  in  [Rimey  and 
Brown,  1992a,  1992b,  1992c,  1992d,  Rimey,  1993],  in¬ 
cluding  the  various  decision  making  algorithms,  example 
runs  of  the  system,  and  other  experimental  results. 

4  Factors  Affecting  the  Performance  of 
Selective  Perception 

We  are  currently  analyzing  the  relationship  of  several 
key factors to the overall performance of a selective perception
system, using T-world and TEA-1. Our approach is to (1) automatically
generate a large number of simulated T-world
domains,  scenes,  and  tasks;  (2)  run  TEA-1  on  the  gen¬ 
erated  scenes  and  tasks;  and  (3)  compute  the  average 
solution  time  over  all  scenes  for  each  task.  This  ap¬ 
proach  lets  us  show  how  each  factor  affects  the  average 
solution time. Factors falling in four categories are being
analyzed: control method, scene structure, systems
parameters, and parameterizable actions [Rimey, 1993].

Figure 1: Plots of belief over time for various action evaluation
functions. Belief that a table setting is not fancy
when it actually is, averaged over 10 scenes generated by
the T-world simulator. U5 is a constant evaluation, or
random choice of operation, and sophistication increases
from U4 to U0. Better performance means both that the
final belief value is lower and that it gets lower faster.

4.1  Control  Method 

Our  early  work  explored  a  variety  of  evaluation  functions 
(and  related  issues)  for  deciding  what  camera  movements 
to  make  and  what  visual  actions  to  execute.  For  exam¬ 
ple,  Figure  1  shows  the  performance  of  TEA-1  solving 
a  table  setting  task  with  six  different  action  evaluation 
functions,  holding  other  factors  constant.  We  have  also 
explored  several  different  evaluation  functions  for  mak¬ 
ing  camera  movement  decisions,  and  have  compared  the 
evaluation  function  approach  with  a  state-space  search  of 
all possible action sequences [Rimey and Brown, 1992a,
1992c]. 

Many fundamental questions remain, however, as to
the most effective control and cost/benefit evaluation
mechanisms. For example, the original TEA-1 design
used a Vrate = value/cost measure, though some calculations
(those for camera movements) used Vimm = value - cost.
Calibration between value and cost is necessary in either
method. Computing the value of an action as Vimm
emphasizes finding the best single action to perform now,
but maximizing Vrate ensures the fastest improvement
over time, which is also an important consideration. The
latest TEA-1 design maximizes a combination of Vrate
and Vimm (and action value is now based on the expected
value of sample information rather than on average
mutual information). Preliminary results using T-world and
TEA-1 regarding these questions are encouraging. We
are currently changing some further details of the
implementation to make it more consistent, so that tighter results can
be obtained for comparisons, and so that more extensive
comparisons can be made.
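
To make the difference between the two measures concrete, the following minimal sketch ranks a few candidate actions by Vrate = value/cost and by Vimm = value - cost. The action names and numbers are invented for illustration and are not taken from TEA-1; the code only shows that the two criteria can prefer different actions.

```python
# Hypothetical candidate actions: (name, expected value, cost).
# The numbers are illustrative only and do not come from TEA-1.
ACTIONS = [
    ("peripheral_scan", 2.0, 1.0),
    ("foveal_cup_detector", 9.0, 6.0),
    ("move_camera_then_detect", 12.0, 10.0),
]


def v_rate(value, cost):
    """Benefit per unit cost: favors the fastest improvement over time."""
    return value / cost


def v_imm(value, cost):
    """Net benefit: favors the best single action to perform now."""
    return value - cost


def best_action(actions, score):
    """Return the name of the action that maximizes the given score."""
    return max(actions, key=lambda a: score(a[1], a[2]))[0]


best_by_rate = best_action(ACTIONS, v_rate)
best_by_imm = best_action(ACTIONS, v_imm)
print(best_by_rate, best_by_imm)
```

With these made-up numbers the cheap action wins under Vrate while the more informative but costlier detector wins under Vimm, which is why some calibration between value and cost is needed before the two measures can be combined.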


Figure 2: Some examples of how a scene of nine objects
could be structured. The smallest squares are objects.
The other squares depict groups or subgroups. (a) No
grouping. (b) Three groups of three objects. (c) Two
rather dense groups, with five and four objects.
(d) A subgroup structure.


4.2  Scene  Structure 

Several  aspects  of  scene  structure  can  have  a  significant 
impact  on  performance,  mainly  because  geometric  rela¬ 
tions  define  contexts  in  a  scene,  which  help  locate  things 
in the scene and thus help obtain more accurate information
more cheaply. See [Rimey and Brown, 1992a,
1992b,  1992d]  for  an  example  in  a  dinner  table  domain 
that  shows  how  a  cup’s  expected  area  gets  narrower  and 
how  an  actual  cup  detection  action  performs  after  each 
of  the  table,  plate,  and  napkin  have  been  located  (in  that 
order). 

We are currently studying several specific scene-structure
factors via simulated T-world domains, such
as:  number  of  groups  and  number  of  subgroup  levels 
that  objects  are  organized  into,  average  number  of  geo¬ 
metric  relations  between  objects,  and  shape  (and  type) 
of  geometric  relation  distributions  between  objects.  For 
example,  Figure  2  shows  several  different  ways  that  a 
scene  of  nine  objects  could  be  structured. 

Another  factor  that  we  classify  under  scene  structure 
is  whether  it  is  inherently  easier  to  detect  and  obtain 
properties  of  some  objects  than  others.  For  example, 
certain  large  objects  (like  runways)  may  be  easier  to  find 
than  small  objects  (like  service  vehicles),  while  highly 
constraining the location (or properties) of smaller objects.
Wixson demonstrated that the efficiency of actively
searching for a specified target object in a room
can be improved by a factor of 2 to 8 when a related
intermediate object is located first [Wixson, 1992].
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The  T-world  problem  contains  significantly  more  op¬ 
portunity  for  and  kinds  of  scene  structure  than  Wixson’s 
object  search  problem,  which  contains  only  the  simple 
“look  near”  relation.  The  solution  to  a  T-world  problem 
can  involve  several  types  of  information  about  several  ob¬ 
jects,  rather  than  simply  detecting  one  object.  It  will  be 
interesting  to  compare  the  performance  gains  obtained 
by  using  relations  with  more  than  one  object  to  those 
obtained  by  using  relations  with  only  one  object. 

Some  of  the  task  complexity  factors  that  T-world  en¬ 
ables  us  to  analyze  experimentally  are:  the  minimum 
number  of  (independent)  scene  features  needed  to  solve 
the  task;  loosening  the  independence  restriction;  and 
whether  all  features  have  the  same  impact  on  the  task,  or 
some  features  contribute  more  information  to  the  task’s 
solution. 

4.3  Parameterizable  Actions 

Selective  perception  (and  qualitative  vision)  tightly  cou¬ 
ples  high-level  control  with  the  low-level  vision  modules, 
specifically  so  the  vision  modules  can  be  asked  to  provide 
only  the  minimal  information  needed  to  solve  the  current 
task.  How  much  and  what  type  of  flexibility  must  exist 
in  the  vision  modules,  and  how  does  the  control  system 
mesh with such a repertoire of very flexible vision modules
(e.g., with several parameters)? This question has
not  received  much  attention:  most  vision  modules  have 
been  designed  to  recover  as  complete  a  description  as 
possible,  to  support  traditional  vision  tasks  like  single 
object  recognition  or  scene  property  reconstruction. 

The  value  of  some  vision  algorithms  is  affected  by  pa¬ 
rameters  (depth  of  search,  spatial  resolution,  iteration 
count,  annealing  schedule,  etc.)  that  change  operation 
cost.  Our  decision-theoretic  formalism  can  be  extended 
to a continuous range of benefit/cost choices that will
allow  the  control  of  monotonic  or  anytime  algorithms 
whose  results  get  better  with  longer  running  times. 

4.4  System  Parameters 

The  system  parameters  category  includes:  performance 
model  of  a  visual  action  (a  table  of  probabilities),  rel¬ 
ative  costs  of  visual  and  non-visual  actions,  size  of  the 
camera’s field of view, size of the fovea, and relative speed of
computation (in the multiprocessor version of TEA-1).
Figure  3  shows  the  effect  of  varying  the  cost  of  a  camera 
movement,  which  is  generally  to  stretch  the  performance 
curve  over  time  [Rimey  and  Brown,  1992a].  (Wixson  dis¬ 
cusses  the  effect  of  some  similar  parameters  on  an  object 
search  task  in  [Wixson,  1992].) 

Varying  some  system  parameters  will  change  the  sys¬ 
tem’s  overall  pattern  of  behavior,  meaning  the  best  se¬ 
quence  of  actions  to  execute,  which  can  produce  inter¬ 
esting  effects  and  raises  interesting  issues.  For  example, 
making  camera  movements  more  expensive  means  that 
more  time  is  spent  analyzing  more  of  the  things  visible 
at  each  camera  fixation. 

4.4.1  Cycles 

The combination of cheap camera movements and
“anytime” visual actions [Dean and Wellman, 1991]
could encourage cyclic fixation and analysis patterns to
emerge. T-world circumvents knowledge engineering and
other practical difficulties in experiments with complex
scenes, so we can study these issues.

Figure 3: Performance for a dinner table task as the
camera movement cost is varied.

With  cheap  vision,  humans  may  not  use  their  innate 
powers  of  representation  and  memory  and  may  prefer 
just  to  update  short-term  memory.  This  strategy  seems 
to be found in humans [Ballard et al., 1993] in repetitive
sequential hand-eye tasks. On the other hand, human
eye fixations during even simple tasks clearly show
evidence  of  rational  sequential  control  ([Yarbus,  1967], 
and  see  Figure  4).  Further,  vision  is  expensive  when 
peripheral  vision  is  reduced,  when  there  are  distractors, 
noise,  low  contrast,  etc.  Humans  do  manage  their  vi¬ 
sual  resources,  even  for  static  scenes,  and  their  manage¬ 
ment  strategies  are  open  to  investigation  through  several 
avenues.  We  are  hopeful  that  we  can  relate  decision- 
theoretic  control  to  human  performance  by  using  mod¬ 
ern  eye-,  head-,  and  hand-tracking  technology  to  observe 
humans  performing  T-world  tasks. 

5  Related  Work 

Extensive  references  can  be  found  in  our  other  papers 
[Rimey  and  Brown,  1992a,  1992b,  1992c,  1992d,  Rimey, 
1993].  Our  work  on  task-based  vision  is  most  directly 
comparable  to  [Hutchinson  and  Kak,  1989],  which  put 
a  carefully  designed  version  of  model-based  hypothesis 
verification  vision  into  a  Dempster-Shafer  setting,  and 
the  related  [Wu  and  Cameron,  1990,  Hager,  1990].  Ad¬ 
ditional  key  computer  vision  and  robotics  applications 
of decision theory are [Levitt et al., 1989] and [Dean et
al., 1990].
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Figure  4:  Eye-fixations  in  visual  tasks,  (a)  Where  is  the 
knife?  (S  habitually  places  knife  to  left  of  plate),  (b) 
How  many  people  drink?  (S  uses  plates  somehow). 
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Abstract 

What  is  an  active  observer?  In  the  literature, 
it  is  considered  to  be  an  observer  capable  of  con¬ 
trolling  its  sensory  apparatus,  and  thus  its  image 
acquisition  process.  Activities  can  be  movement, 
tracking,  looming,  focusing,  etc.  when  the  sen¬ 
sory  modality  is  vision.  But  whatever  the  par¬ 
ticular  activity  may  be,  its  effect  is  to  alter  the 
visual  input  so  that  it  becomes  easier  to  extract 
from  it  the  visual  quantities  of  interest. 

If  we  examine  various  activities  more  closely, 
we find that they amount to the addition of a
motion  field  to  the  visual  input.  In  cases  such  as 
movement  and  tracking  (whether  smooth  pursuit 
or  saccade  tracking),  this  field  is  a  rigid  motion 
field  corresponding  to  a  3D  motion.  It  was  shown 
([Aloim88b]) that equipping the observer with
any  movement  capability  makes  various  visual  re¬ 
covery  problems  easier  and  gives  them  uniqueness 
properties  as  well. 

But  what  happens  at  the  algorithmic  level? 

Are  all  activities — rigid  motions  in  this  case — the 
same  as  regards  stability  issues?  In  other  words, 
can  the  active  observer  explore  the  space  of  all  ac¬ 
tivities  in  order  to  discover  the  one  that  provides 
the  most  stable  solution?  This  is  the  problem  we 
investigate  in  this  paper. 

1.  Introduction 

1.1  Overview 

From a theoretical point of view, an active observer
needs  to  perform  various  partial  recovery  tasks  in  or¬ 
der  to  take  the  appropriate  action,  and  the  question  we 
pose  is  “what  activity  (motion)  will  provide  the  most 
robust  solution  regarding  the  structure  of  the  scene  in 
view?”  While  addressing  this  problem,  we  present  as  a 
by-product  of  our  analysis  a  solution  that  unifies  shape 
from  shading,  shape  from  texture,  and  shape  from  mo¬ 
tion.  No  optical  flow  or  correspondence  are  used  in  our 
approach. 


We  need  to  emphasise  that  this  work  is  of  a  theoret¬ 
ical  nature.  By  no  means  do  we  suggest  that  an  active 
observer needs to perform a complete reconstruction of
the  scene  in  view.  On  the  contrary,  our  recent  stud¬ 
ies  indicate  that  this  is  not  necessary.  In  this  paper, 
we  simply  show  that  an  active  observer  which  needs  to 
recover  information  about  the  structure  of  the  scene  in 
view  can  employ  many  activities.  We  prove  that  one 
of  the  activities  in  this  set  will  provide  the  most  stable 
solution  (in  an  algorithmic  sense),  and  we  show  how  it 
can be found. This, in turn, suggests that it is fruitful
to study exploratory active perception, whatever the
sensory  modality  or  the  activity  may  be. 

1.2  Motivation  for  this  work 

One  of  the  main  topics  of  research  in  modern  com¬ 
puter  vision  is  the  “shape  from  x”  problem.  Following 
the  paradigm  introduced  by  David  Marr  ([Marr82]),  nu¬ 
merous  models  and  algorithms  have  been  proposed  that 
attempt to explain or mimic the behavior of modules observed
in the human or animal visual system and that
recover  the  geometry  of  an  observed  scene,  using  cues 
such  as  shading,  texture,  motion,  stereo,  etc.  The  goal 
of  Marr’s  paradigm  was  not  only  to  provide  models  to  ex¬ 
plain  animal  vision,  but  also  to  offer  a  methodology  and 
tools  for  designing  artificial  visual  systems  that  could  be 
used  in  robotics  tasks,  visual  navigation  tasks  in  partic¬ 
ular.  By  clearly  separating  the  visual  module  from  the 
motion  planning  module,  it  allows  them  to  be  studied 
and  developed  independently  from  each  other. 

The great appeal of the Marr paradigm resides in its
generality: regardless of the task to be performed (and,
therefore, regardless of the type of robot in question),
the  interface  between  sensing  and  planning  is  provided 
by  the  depth  map  computed  by  the  visual  module.  In 
this  respect,  the  static  (open-loop)  position-based,  look- 
and-move  controller  (using  the  classification  proposed 
by  [Sande83])  represented  in  Figure  1,  corresponding 
for example to the hand-eye system of Tsai and Lenz
([TsaiL89]), can be considered to be the general model
of  a  visual  reconstruction-based  system.  Although  a  dy¬ 
namic  (closed-loop)  version  of  this  model  is  theoretically 
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possible  (and  was  discussed  in  [Sande83]),  it  has  virtu¬ 
ally  never  been  realised,  mainly  due  to  the  prohibitive 
computational  cost  of  visual  reconstruction  (calibration) 
processes.  What  is  generally  presented  as  closed-loop 
control  is  therefore  a  sequence  of  error-reducing  runs  of 
an  open-loop  controller. 


Figure 1 — Position-based open-loop control


At  the  other  end  of  the  robotics  problem,  current 
work  on  robot  motion  planning,  even  when  assuming  the 
existence  of  a  visual  sensor  (whether  in  a  theoretical  anal¬ 
ysis,  as  in  [ChatiSS],  or  in  an  implemented  system  such 
as  the  ones  described  in  [Weiss84]  and  [Fedde89]),  com¬ 
pletely  ignores  the  sensing  process.  In  particular,  it  is 
always  assumed  that  the  visual  module  can  provide  reli¬ 
able  information,  regardless  of  the  activities  of  the  robot 
observer.¹

The  results  presented  in  [Aloim88b]  clearly  demon¬ 
strate  that  this  is  not  the  case;  some  activities  (motions) 
of  the  observer  can  be  shown  to  make  the  visual  algo¬ 
rithms  more  stable.  The  question  that  naturally  comes 
next  is:  Can  such  a  good  activity  be  determined  a  priori? 
This  question  is  immediately  followed  by:  How  can  this 
choice  of  a  “good”  activity  (in  terms  of  visual  computa¬ 
tions)  be  incorporated  into  the  motion  planning  process 
(in  terms  of  the  task)? 

1.3  Shape  from  x  modules 

The  goal  of  this  paper  is  not  to  demonstrate  that 
extraction  of  information  about  the  structure  of  the  vi¬ 
sual  environment  can  be  facilitated  by  the  employment 
of  an  active  observer.  This  has  been  demonstrated  in 
a  previous  paper  [Aloim88b].  We  simply  demonstrate 
that  among  the  infinite  class  of  activities  that  an  ob¬ 
server  can  employ,  there  exists  one  which  is  optimal  in 
the sense of stability, i.e., when the observer employs
this  activity,  information  about  shape  is  recovered  in  the 
most  robust  manner.  Since  information  about  shape  is 
hidden  in  shading,  texture,  and  motion,  in  a  sense  our 
work  unifies  the  following  shape  from  x  modules: 

•  shape from shading, which is concerned with the
smooth variations of light intensity over the image
([Gibso50], [Horn??]).

•  shape from texture, which is concerned with the
variation of distributions of image discontinuities or
elementary discontinuity patterns ([Gibso50]).

•  shape from motion (or structure from motion, as
it is more often called), which attempts to extract
depth information from the displacement of image
features ([TsaiH84]) or the modification of the image
intensity ([HornS81]) resulting from the motion of
the observer or of objects in the scene.

¹ Work dealing with uncertainty in motion planning is no
exception to this observation, since at one point or another, perfect
information about the scene or the motion is needed in all work
published so far.

1.4  The  active  observer 

The concept of Active Perception was introduced in
[Bajcs86]  and  further  analyzed  in  [Aloim88b].  An  active 
observer,  when  engaged  in  an  activity,  modifies  the  con¬ 
straints  underlying  a  given  phenomenon  (and  the  equa¬ 
tions  describing  them)  and  thus  creates  new  information 
that  helps  to  eliminate  ambiguities  and  make  the  solution 
easier  to  find  and,  often,  more  reliable  (that  is,  more  ro¬ 
bust).  It  was  shown  in  [Aloim88b]  that  classical,  difficult, 
or  even  ill-posed  vision  problems  can  be  made  simpler  if 
the observer accomplishes some activity chosen from the
space  of  possible  activities  such  as  movement,  tracking, 
focusing,  eye  convergence,  touch,  etc.  Active  vision  can 
also  be  seen  as  a  technique  for  the  integration  of  shape 
from  X  modules,  for  example,  the  shape  from  texture  and 
shape  from  shading  modules  ([Aloim89]). 

2.  Shape  Recovery 

2.1  Notation 

The  observer  considered  here  is  a  monocular  optic 
system (camera), which we represent by a classical pinhole
model (Figure 2).


Figure  2  —  Pinhole  model  of  the  observer 


The 3D space is referenced to a viewer-based coordinate
system $\mathcal{R} = (O, OX, OY, OZ)$, where $O$ and $OZ$
are the optical center and the optical axis of the camera
respectively, and $OZ$ intersects the image plane orthogonally
at $o$, with $d(Oo) = f$, the focal length of the
camera. $OX$ and $OY$ are defined so as to be parallel to
the axes of the image plane.
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Let $M = (X, Y, Z)^t$ be a point of the 3D world. $M$
projects onto the image plane as a point $m = (x, y, f)^t$.
Since we use a pinhole model, the following relation holds
between $M$ and $m$:

$$m = \frac{f}{Z} M. \qquad (1)$$

This allows us, if we consider $m$ as a function of $M$, to
compute the Jacobian matrix

$$\frac{\partial m}{\partial M} = \frac{1}{Z} D = \frac{1}{Z} \begin{pmatrix} f & 0 & -x \\ 0 & f & -y \\ 0 & 0 & 0 \end{pmatrix}. \qquad (2)$$


2.2  Motion  of  the  observer 

The observer is moving with a known rigid motion
composed of a translation $T = (t_1, t_2, t_3)^t$ and a rotation
$\omega = (\omega_1, \omega_2, \omega_3)^t$. A point $M$ in the observed scene
is therefore seen as moving with the apparent velocity
$V = -\omega \times M - T$. In order to simplify the expressions
to follow, we define $\Omega$ to be the antisymmetric matrix
associated with the linear operator $\omega \times \cdot$ :

$$\Omega = \begin{pmatrix} 0 & -\omega_3 & \omega_2 \\ \omega_3 & 0 & -\omega_1 \\ -\omega_2 & \omega_1 & 0 \end{pmatrix}.$$

Using these equations, we can now express the apparent
motion of world point $M$, as seen by the observer, as
$\dot{M} = -\Omega \cdot M - T$. The motion of $M$'s projection on the
image plane, $m$, defines the motion flow

$$\dot{m} = \frac{\partial m}{\partial M} \cdot \frac{dM}{dt} = -D \cdot \left( \frac{\Omega \cdot m}{f} + \frac{T}{Z} \right). \qquad (3)$$

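The relation between the pinhole projection, the rigid motion of the observer, and the resulting motion flow can be checked numerically. The sketch below is only an illustration under stated assumptions: it takes the Jacobian of the projection to be $D/Z$ with rows $(f, 0, -x)$, $(0, f, -y)$, $(0, 0, 0)$, and compares the analytic flow with a finite-difference estimate. The function names and the sample values are hypothetical.

```python
def cross(a, b):
    """Cross product of two 3-vectors given as tuples."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def project(M, f):
    """Pinhole projection m = (f / Z) * M."""
    X, Y, Z = M
    return (f * X / Z, f * Y / Z, f)


def apparent_velocity(M, T, w):
    """Apparent velocity of a scene point: V = -w x M - T."""
    wxM = cross(w, M)
    return tuple(-wxM[i] - T[i] for i in range(3))


def motion_flow(M, T, w, f):
    """Analytic flow (1/Z) * D * V, with D assumed as in the lead-in."""
    Z = M[2]
    x, y, _ = project(M, f)
    V = apparent_velocity(M, T, w)
    return ((f * V[0] - x * V[2]) / Z,
            (f * V[1] - y * V[2]) / Z,
            0.0)


def finite_difference_flow(M, T, w, f, dt=1e-6):
    """Numerical flow: move the point for a short time and re-project."""
    V = apparent_velocity(M, T, w)
    M_next = tuple(M[i] + dt * V[i] for i in range(3))
    m0, m1 = project(M, f), project(M_next, f)
    return tuple((m1[i] - m0[i]) / dt for i in range(3))


# Hypothetical scene point and observer motion.
M0 = (0.5, -0.3, 4.0)
T0 = (0.1, 0.2, 0.3)
W0 = (0.01, -0.02, 0.03)
analytic = motion_flow(M0, T0, W0, 1.0)
numeric = finite_difference_flow(M0, T0, W0, 1.0)
```

The two estimates should agree up to the finite-difference error, and the third component of the flow is zero because the image plane stays at distance $f$.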
2.3  The  optical  flow  equation 

For a given motion of the observer and two consecutive
images of a scene or surface, $I(t_0)$ and $I(t_0 + \Delta t)$,
with $\Delta t$ assumed to be small enough to justify a differential
approximation, we want to reconstruct the shape
of the surface, i.e. recover $Z(x, y)$.

An apparently simple idea would be to compute the
optical flow $\dot{m}$ and substitute it into (3) to obtain $Z$ for each
point in the image. The optical flow constraint traditionally
used to relate the unknown optic flow to the image
intensity data is the one proposed by [HornS81]:

$$\frac{dI}{dt} = \nabla I \cdot \dot{m} + \frac{\partial I}{\partial t} = 0. \qquad (4)$$

This equation merely states that the images of $M$ at
times $t$ and $t + dt$, $m(t)$ and $m(t + dt)$ respectively, can
be  considered  to  have  the  same  intensity.  Although  this 
simple  model  somewhat  lacks  realism — for  example,  it 
does  not  take  into  account  specularity  phenomena — it 
has  nevertheless  been  adopted  by  most  researchers  in 
this  domain,  partly  due  to  its  intuitiveness  and  to  its 
analytic  simplicity,  but  also  due  to  the  fact  that  more 
complex  equations  do  not  perform  significantly  better. 
We  can,  however,  formulate  the  following  remarks  before 
going  any  further  in  our  analysis: 


•  Equation (4) does not give the optical flow, but the
normal flow, i.e. the component of the flow along the
direction of $\nabla I$. Provided we add extra constraints
(smoothness), regularisation techniques can give us
a solution ([HornS81]); unfortunately, they do not
handle discontinuities well. Theories of discontinuous
regularization have been proposed ([ShulmSS]), but
have only been partially applied to the problem of
optical flow computation ([Shulm89]).

•  A direct exploitation of (4) requires the computation
of $\nabla I$, which is known to be an ill-posed problem
([PoggiSS]). Furthermore, it restricts the applicability
of the module to the case of smooth variations of
the intensity (i.e. shape from shading).

•  Even  assuming  the  flow  can  be  calculated,  the  crite¬ 
rion  optimized  by  the  depth  map  thus  obtained  re¬ 
mains  unnatural  (smoothness  of  the  flow)  or  unclear. 
What  does  it  imply  about  the  flow  when  we  minimize 
norms  of  its  derivatives?  What  does  the  depth  map 
corresponding  to  such  a  flow  look  like? 

It  seems  clear  that  an  important  flaw  in  flow-based  meth¬ 
ods  is  that  they  require  pointwise  calculations  of  deriva¬ 
tives  and  other  operators,  while  the  data  we  are  given 
(intensities) are locally inaccurate. On the other hand,
the  reliability  of  averages  computed  on  portions  of  the 
image  is  quite  satisfactory.  This  is  why  we  introduce 
linear  features  ([Amari86])  in  our  theory. 

2.4 Linear features

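A linear feature is a triple $LF_i = (E_i, \mu_i, \varphi_i)$ where

•  $E_i$ is an image window.

•  The measuring function $\mu_i$ is a differentiable function
defined over $E_i$.

•  $\varphi_i$ is defined as a moment over $E_i$:

$$\varphi_i = \iint_{E_i} \mu_i I \, ds = \iint_{E_i} \mu_i(x, y) \, I(x, y) \, dx \, dy.$$

It can be shown (see Appendix) that if we adopt (4)
as a model of intensity variation the time derivative of
$\varphi_i$ takes the following form:

$$\frac{d\varphi_i}{dt} = \iint_{E_i} I \, \nabla\mu_i \cdot \dot{m} \, ds. \qquad (5)$$

If we now replace $\dot{m}$ in (5) by the expression (3) for the
motion flow, we obtain the following equation, where $Z$
is the unknown:

$$\frac{d\varphi_i}{dt} = -\iint_{E_i} \frac{1}{Z} \, I \, \nabla\mu_i \cdot (D \cdot T) \, ds - \frac{1}{f} \iint_{E_i} I \, \nabla\mu_i \cdot (D \cdot \Omega \cdot m) \, ds. \qquad (6)$$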

2.5  Model-based  solution  of  the  equation 

The  right  hand  term  of  equation  (6)  is  known;  what 
we need now is a way to get $1/Z$ out of the integral
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sign.  This  can  be  accomplished  if  we  model  it  as  a  lin¬ 
ear  combination  of  differentiable  functions  (for  example, 
monomials,  Gabor  functions,  or  Fourier  components): 


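$$\frac{1}{Z} = \sum_{k=1}^{m} a_k \psi_k \qquad (7)$$

Replacing $1/Z$ by its model (7), where the $\psi_k$ denote the
chosen functions, makes the vector
of coefficients $A = (a_k)_{k=1,m}$ the unknown of the final
equation:

$$\sum_{k=1}^{m} a_k \iint_{E_i} I \, \psi_k \, \nabla\mu_i \cdot (D \cdot T) \, ds = -\frac{d\varphi_i}{dt} - \frac{1}{f} \iint_{E_i} I \, \nabla\mu_i \cdot (D \cdot \Omega \cdot m) \, ds, \qquad (8)$$

or, in a more compact form:

$$(S_i \cdot T)^t \cdot A = r_i,$$

where $S_i$ is an $m \times 3$ matrix and $r_i$ is a scalar.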


2.6  The  least  squares  solution 

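Each linear feature $LF_i$, $i = 1, \ldots, n$, provides a linear equation
similar to (8). A least squares minimization exploiting
this system of equations gives us the following equation:

$$\left[ \sum_{i=1}^{n} (S_i \cdot T) \cdot (S_i \cdot T)^t \right] \cdot A = \sum_{i=1}^{n} r_i \, S_i \cdot T. \qquad (9)$$

In what follows, $Q$ denotes the $m \times m$ matrix in brackets.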
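
As an illustration of how a system of this kind can be assembled and solved, the sketch below builds the normal equations from a list of $m \times 3$ matrices $S_i$, scalars $r_i$, and a translation $T$, then solves for the coefficient vector $A$ by Gaussian elimination. It is a minimal sketch, not the authors' implementation: the helper names and the small synthetic data set are hypothetical, and the $r_i$ are generated from a known $A$ so that the recovery can be checked.

```python
def mat_vec(S, T):
    """Product of an m x 3 matrix (list of rows) with a 3-vector."""
    return [sum(row[j] * T[j] for j in range(len(T))) for row in S]


def normal_equations(S_list, r_list, T):
    """Accumulate Q = sum (S_i T)(S_i T)^t and b = sum r_i (S_i T)."""
    m = len(S_list[0])
    Q = [[0.0] * m for _ in range(m)]
    b = [0.0] * m
    for S, r in zip(S_list, r_list):
        c = mat_vec(S, T)
        for p in range(m):
            b[p] += r * c[p]
            for q in range(m):
                Q[p][q] += c[p] * c[q]
    return Q, b


def solve(Q, b):
    """Solve Q x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(Q)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[r][k] -= factor * aug[col][k]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(aug[i][k] * x[k] for k in range(i + 1, n))
        x[i] = (aug[i][n] - s) / aug[i][i]
    return x


# Synthetic example with m = 2 coefficients and n = 3 linear features.
T = (1.0, 2.0, 0.5)
S_list = [
    [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],
    [[0.0, 1.0, 0.0], [1.0, 0.0, 0.0]],
    [[1.0, 1.0, 0.0], [0.0, 0.0, 2.0]],
]
A_true = [0.5, -0.25]
r_list = [sum(c * a for c, a in zip(mat_vec(S, T), A_true)) for S in S_list]

Q, b = normal_equations(S_list, r_list, T)
A_est = solve(Q, b)
```

Small pivots in the elimination step are exactly the numerical difficulty that a poorly chosen translation direction would cause, since $T$ enters every entry of the accumulated matrix.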


2.7 Comments on equation (9)

We do not need to compute any pointwise derivatives.
In particular, the term $\nabla I$ does not appear in our equations.
This allows discontinuities to occur in the image:
the method does not require any particular property of
the intensity map.

The only derivatives we need to calculate in order to
recover $Z$ are the $\dot{\varphi}_i \approx [\varphi_i(t + \Delta t) - \varphi_i(t)]/\Delta t$, where we
have to keep in mind that $\varphi_i$ is obtained by integration
over the image window $E_i$ and is thus more stable and
reliable than values at individual points.

We are not restricted to the use of the optical flow
equation  (4);  every  linear  feature  gives  us  a  linear  equa¬ 
tion.  Choosing  a  big  enough  linear  feature  vector,  we 
can  get  a  solution  by  Hough  transform  or  least  squares 
approximation. 

The  criterion  optimized  by  the  solution  is  observer- 
based; for a given shape model, the $a_k$'s we compute
are such that the expected intensity variation (which we
can predict, given $I(x, y, t_0)$ over $E_i$, and the motion
parameters) is closest to what is actually observed in the
image. 

Satisfactory results obtained by similar techniques
(using  a  simplified  version  of  (8))  have  been  reported  in 
earlier  papers  ([Aloim89],  [Aloim88a])  and  will  therefore 
not  be  reproduced  here,  since  reconstructing  the  depth 
map  is  not  the  main  focus  of  this  work. 


3.  Search  for  a  Best  Activity 

3.1  Optimisation  in  the  space  of  activities 

We  have  seen  that  an  active  observer  can  create  new 
information  and  will  be  able  to  solve  problems  which 
were  ill-posed  for  a  passive  observer.  Still,  common  sense 
tells  us  that  not  all  activities  are  equally  good;  some  will 
not  actually  help  eliminate  ambiguities.  If  we  want  to 
choose  a  “good”  activity,  we  need  to  answer  the  following 
questions: 

•  is  there  a  goodness  criterion  valid  in  the  space  of  all 
possible  activities? 

•  if  such  a  criterion  exists,  can  we  find  an  activity  which 
optimizes  it? 

If  the  problem  treated  here  were  purely  mathematical, 
one  could  think  of  the  accuracy  of  the  computed  shape 
model (i.e. its distance from the actual $Z(x, y)$) as a
good  estimate  of  the  activity.  Unfortunately,  we  are  deal¬ 
ing  with  the  real  world:  data  are  noisy  and  the  reliability 
of  the  camera  calibration  and  of  the  motion  parameters  is 
questionable.  Under  such  conditions,  the  only  criterion 
that  makes  sense  is  the  stability  of  the  solution  under 
perturbation  of  the  input  and  parameters:  we  want  to 
pick  an  activity  which  will  optimize  the  stability  of  the 
solution  calculated.  But  can  we? 

3.2  Stability  of  the  least  squares  solution 

We  first  comment  that  the  rotational  part  of  the  mo¬ 
tion  has  no  effect  on  the  stability  of  equation  (9),  only 
on  the  recomputed  depth  map,  which  is  not  the  main 
focus  of  the  work  presented  here.  Our  search  for  a  best 
activity  will  therefore  be  reduced  to  an  optimization  in 
the  two-dimensional  space  of  translation  directions. 

An  obvious  condition  that  T  has  to  satisfy  is  for  Q 
to be regular, i.e. $\mathrm{Det}\, Q \neq 0$. But this is not enough;
not  only  do  we  want  Q  to  be  regular,  we  want  it  to  be 
“as  regular  as  possible”.  In  other  words,  we  want  Q  to 
behave  well  during  the  numerical  solution  of  the  equation 
(for  example,  small  pivots  should  not  be  encountered  in 
a  triangulation  of  the  matrix).  This  imposes  conditions 
on  the  eigenvalues  of  Q.  If  Q  has  small  eigenvalues  (at 
this  stage,  “small”  is  purposely  vague,  meaning,  roughly 
“anything  that  will  bring  intermediate  results  close  to  the 
roundoff  error”),  it  will  not  be  singular,  but  it  will  be  ill- 
conditioned  and  the  computed  solution  will  be  unstable. 

Let $\lambda_1, \ldots, \lambda_m$ be $Q$'s eigenvalues listed in increasing
order (since $Q$ is obtained by least squares minimization,
the $\lambda_i$'s are real and positive). The condition number of
$Q$ is defined as the ratio

$$C = \frac{\lambda_m}{\lambda_1},$$

i.e.  the  ratio  of  the  largest  to  the  smallest  eigenvalue  of 
Q.  If  the  condition  number  is  infinite,  Q  is  singular;  if  C 
is  large,  Q  is  ill-conditioned.  The  best  T  we  could  choose 
would  be  one  which  would  minimize  the  condition  num¬ 
ber.  This  problem,  however,  is  difficult  since  we  cannot 
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give  an  analytic  expression  for  C.  Instead,  we  propose  to 
minimize  the  following  function: 

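$$G(Q) = \ln \frac{\mathrm{Det}\, Q}{(\mathrm{Tr}\, Q)^m} = \ln(\mathrm{Det}\, Q) - m \ln(\mathrm{Tr}\, Q)$$

$$= \ln(\lambda_1 \lambda_2 \cdots \lambda_m) - m \cdot \ln(\lambda_1 + \lambda_2 + \cdots + \lambda_m).$$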

This choice of $G$ may seem somewhat arbitrary and needs
to be justified here:

•  First, $G$ is well defined, since the determinant and the
trace of $Q$ are invariant under a diagonalization process,
and the fraction cannot be equal to zero (since
$\lambda_m \geq \lambda_1 > 0$), unless $Q$ is equal to the null matrix.

•  The logarithmic term is used here to simplify the
computation of the derivative, and does not affect the
location of the extrema. As opposed to the condition
number, which is dimensionless, $G$ is scaled by the
translational speed, $\|T\|$. We will therefore minimise
it on the space of translation directions, that is, on
the Gaussian sphere.

•  The ratio $\mathrm{Det}\, Q / (\mathrm{Tr}\, Q)^m$ has the advantage of reaching
its (theoretical) absolute maximum for $\lambda_1 = \lambda_m$ and
its absolute minimum for $\lambda_1 = 0$, the same two extreme
cases that bound the condition number. It does not require
a ranking of the eigenvalues; it can be computed directly
without diagonalizing $Q$; finally, we can differentiate it to
study its extrema, which is the problem we will address in the
next section.
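
As a quick numerical illustration (not part of the experiments reported here), the sketch below evaluates the condition number and the function $G$ directly from a list of eigenvalues, using $G = \ln(\lambda_1 \cdots \lambda_m) - m \ln(\lambda_1 + \cdots + \lambda_m)$. The function names and the sample eigenvalues are hypothetical. With $G$ written this way, equal eigenvalues give the largest value of $G$ and the smallest condition number.

```python
import math


def condition_number(eigs):
    """Ratio of the largest to the smallest eigenvalue."""
    return max(eigs) / min(eigs)


def g_criterion(eigs):
    """G = ln(product of eigenvalues) - m * ln(sum of eigenvalues)."""
    m = len(eigs)
    return sum(math.log(lam) for lam in eigs) - m * math.log(sum(eigs))


# Hypothetical eigenvalue sets: perfectly conditioned vs. nearly singular.
well_conditioned = [1.0, 1.0, 1.0]
ill_conditioned = [0.01, 1.0, 1.0]

for eigs in (well_conditioned, ill_conditioned):
    print(eigs, condition_number(eigs), g_criterion(eigs))
```

In practice $G$ would be computed from the determinant and trace of $Q$ rather than from its eigenvalues; the eigenvalue form is used here only because it makes the comparison with the condition number direct.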

3.3  Deriving  G 

We use the notation of [Roger80] with regard to matrix
derivatives. Since $\|T\| = 1$, we can represent $T$ as a
function of the spherical angle vector $u = (\theta, \phi)^t$ with
respect to which we perform the minimization:

$$T(u) = (\cos\phi \sin\theta, \; \sin\phi \sin\theta, \; \cos\theta)^t.$$

The function to differentiate is

$$G \circ Q(u) = G\left( \sum_{i=1}^{n} S_i \cdot T(u) \cdot T^t(u) \cdot S_i^t \right).$$

Applying the chain rule for matrix differentials, we get

$$\frac{\partial G(Q(u))}{\partial u} = \left[ \frac{\partial G(X)}{\partial X} \right]_{X=Q} * \left( \frac{\partial Q(u)}{\partial u} \right).$$

We chose $G$ in part for the simplicity of its derivatives:

$$\frac{\partial G(X)}{\partial X} = (X^{-1})^t - \frac{m}{\mathrm{Tr}\, X} I_m.$$

Inversion  of  Q  is  justified  since  we  expect  to  find  nonzero 
eigenvalues.  In  order  to  simplify  the  computations,  Q 
can  be  expressed  as  a  sum  of  matrices: 

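$$Q(u) = \sum_{i=1}^{n} Q_i(u) \quad \text{where} \quad Q_i = S_i \cdot T_2 \cdot S_i^t \quad \text{and} \quad T_2 = T \cdot T^t.$$

$\partial Q/\partial u$ is computed as the sum of the $(\partial Q_i/\partial u)_{i \in N}$, and

$$\frac{\partial}{\partial u} (S_i \cdot T_2 \cdot S_i^t) = (S_i \otimes I_2) \cdot \frac{\partial T_2}{\partial u} \cdot (S_i^t \otimes I_1) = (S_i \otimes I_2) \cdot \frac{\partial T_2}{\partial u} \cdot S_i^t.$$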


Finally,  we  obtain 


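$$\left[ (Q(u)^{-1})^t - \frac{m}{\mathrm{Tr}\, Q(u)} I_m \right] * \sum_{i=1}^{n} \left( (S_i \otimes I_2) \cdot \frac{\partial T_2}{\partial u} \cdot S_i^t \right) = 0, \qquad (10)$$

which is the condition for the existence of an extremum
(whether an extremum $u_0$ is a maximum or a minimum is
then settled by analysis of the eigenvalues of the Hessian
matrix at $u_0$).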


Figure 3 — One of the 640 x 480 images used for the experiments


3.4  Experimental  results:  Part  1 

In this series of experiments, the condition number of
the Q matrix was computed for a 640 x 480 image such
as the one shown in Figure 3. The plotted surface
corresponds to a scan of the translation direction over the
half Gaussian sphere, that is, over $[-\pi/2, \pi/2] \times [-\pi/2, \pi/2]$.
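
The scan described above can be sketched as follows. This is only an illustration under stated assumptions: the matrices S_i are random 2 x 3 placeholders (in the experiments they are derived from image measurements), Q is therefore 2 x 2 so that its condition number has a closed form, and the names `translation`, `q_matrix` and `scan` are hypothetical.

```python
import math
import random


def translation(theta, phi):
    """Unit translation direction T(u) for spherical angles (theta, phi)."""
    return (math.cos(theta) * math.sin(phi),
            math.sin(theta) * math.sin(phi),
            math.cos(phi))


def q_matrix(s_list, t):
    """Q = sum_i S_i T T^t S_i^t, with each S_i a 2x3 matrix."""
    q = [[0.0, 0.0], [0.0, 0.0]]
    for s in s_list:
        v = [sum(s[r][k] * t[k] for k in range(3)) for r in range(2)]
        for r in range(2):
            for c in range(2):
                q[r][c] += v[r] * v[c]
    return q


def condition_number(q):
    """Ratio of largest to smallest eigenvalue of a symmetric 2x2 matrix."""
    tr = q[0][0] + q[1][1]
    det = q[0][0] * q[1][1] - q[0][1] * q[1][0]
    disc = math.sqrt(max(tr * tr - 4.0 * det, 0.0))
    lam_min = (tr - disc) / 2.0
    return float("inf") if lam_min <= 0.0 else (tr + disc) / (2.0 * lam_min)


def scan(s_list, steps=19):
    """Condition number sampled on a grid over [-pi/2, pi/2] x [-pi/2, pi/2]."""
    grid = [-math.pi / 2 + k * math.pi / (steps - 1) for k in range(steps)]
    return {(th, ph): condition_number(q_matrix(s_list, translation(th, ph)))
            for th in grid for ph in grid}


rng = random.Random(0)  # placeholder data, not image measurements
s_list = [[[rng.uniform(-1, 1) for _ in range(3)] for _ in range(2)]
          for _ in range(8)]
surface = scan(s_list)
best = min(surface, key=surface.get)
print("best direction (theta, phi):", best, "condition number:", surface[best])
```

The dictionary `surface` plays the role of the plotted surface; its minima are the translation directions for which the system is best conditioned.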

In the case of the plot shown in Figure 4, the
measuring functions were defined over 128 x 128 image
windows, and were generated by all combinations of
the form $\cos(ax + by + c)$, with a and b chosen among
{0, 5, 20, 45, 100}, and $c \in \{0, -\pi/2\}$. The “reconstruction”
functions were defined to be the unit function
over 32 x 32 image windows, thus defining a coarser grid
over the image.
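
The family of measuring functions described above can be enumerated with a few lines of code. The sketch below is an assumption-laden illustration: the normalisation of pixel coordinates to [0, 1) and the helper name `linear_feature` (a discrete sum of intensity times measuring function over a window) are choices made here, not details given in the text.

```python
import itertools
import math


def measuring_functions(coeffs, phases):
    """Build every function (x, y) -> cos(a*x + b*y + c)."""
    funcs = []
    for a, b, c in itertools.product(coeffs, coeffs, phases):
        # Default arguments freeze a, b and c for each closure.
        funcs.append(lambda x, y, a=a, b=b, c=c: math.cos(a * x + b * y + c))
    return funcs


def linear_feature(window, mu):
    """Discrete sum of I(x, y) * mu(x, y) over a square window.

    Pixel coordinates are normalised to [0, 1); this scaling is an assumption.
    """
    n = len(window)
    return sum(window[r][c] * mu(c / n, r / n)
               for r in range(n) for c in range(n))


funcs = measuring_functions([0, 5, 20, 45, 100], [0.0, -math.pi / 2])
flat = [[1.0] * 4 for _ in range(4)]  # tiny constant test window
print(len(funcs), linear_feature(flat, funcs[0]))
```

With five values for each of a and b and two phases, the first data set yields 50 measuring functions; the eight-value coefficient set used for the second data set yields 128.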
Figure 4 — Condition number for data set 1

In the case of the plot shown in Figure 5, a coarser
64 x 64 reconstruction grid was used. The measuring
functions $\mu_i$ were still defined over 128 x 128 image
windows, generated by all combinations of the form
$\cos(ax + by + c)$. This time, however, a larger number
of measuring functions was used, thus capturing more
information about the image. Coefficients a and b were
picked in {0, 3, 5, 11, 20, 45, 60, 100}, and $c \in \{0, -\pi/2\}$.




Figure  5  —  Condition  number  for  data  set  2 

One of the most important conclusions we can draw
from the experiments performed so far is that
the resolution of the reconstruction grid and the
number of measuring functions, although they affect the general
shape of the plotted surface, do not seem to have any
significant influence on the locations of strong minima.
The validity of this conclusion is much easier to see when
the problem is reduced to the case of a robot constrained
to move in a plane, as we shall see in the next subsection.


3.5  Application  to  the  case  of  a  mobile  robot 

A simplified form of equation (10) can be obtained
in the case of a mobile robot, since the translation
displacements then take place in a plane. The translation
vector can thus be described by a single heading angle:
$T = \|T\| \cdot (\cos\theta, \sin\theta, 0)^t$. Equation (10) then simplifies
to the following scalar equation:

$$\mathrm{Tr}(UV) = 0, \qquad (11)$$

where 

and 

$$V = \sum_{i=1}^{n} \left(S_i\, \frac{dT_2}{d\theta}\, S_i^t\right).$$

Let us consider a scene such as the one shown in Figure 6.
The camera’s optical axis is parallel to the motion
surface, which means that the focus of expansion
corresponding to the translation is situated on the x axis (the
white horizontal line in the image).


Figure  6  —  Directions  of  translation  for  a  mobile  robot 


What G provides is an estimate of which activities are
informative for the robot, in the sense that they make its
visual algorithms more robust, and hence their results
more reliable. It should therefore be used as an
additional constraint by planning algorithms such as the
preimage backchaining algorithms ([Latom89]). An even
better use of G could be made by artificial potential field
algorithms ([Khati86]): in addition to the attraction force
exerted by the goal and the repulsion forces exerted by
the obstacles, the robot could be made to be influenced
by an “information” force, aiming at maximizing the
quality of the data collected by the robot’s visual
sensors. This extension of the original planning algorithms
would be feasible whether the visual data exploited by
the controller is a classical depth map or the free space
doors described in [Herve91].
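
The idea of an “information” force can be sketched as a third term in a standard potential field controller. The code below is a hypothetical illustration, not the formulation of [Khati86] nor an implementation from this work: the gains, the repulsion profile, and the choice of pointing the information force toward the sampled heading with the lowest condition number are all assumptions.

```python
import math


def attractive_force(pos, goal, k_att=1.0):
    """Linear pull toward the goal."""
    return (k_att * (goal[0] - pos[0]), k_att * (goal[1] - pos[1]))


def repulsive_force(pos, obstacles, k_rep=1.0, radius=2.0):
    """Push away from every point obstacle closer than `radius`."""
    fx = fy = 0.0
    for ox, oy in obstacles:
        dx, dy = pos[0] - ox, pos[1] - oy
        dist = math.hypot(dx, dy)
        if 0.0 < dist < radius:
            mag = k_rep * (1.0 / dist - 1.0 / radius) / (dist * dist)
            fx += mag * dx / dist
            fy += mag * dy / dist
    return (fx, fy)


def information_force(condition_by_heading, k_info=1.0):
    """Vector of length k_info along the heading (radians) whose sampled
    condition number is lowest (assumed design choice)."""
    best = min(condition_by_heading, key=condition_by_heading.get)
    return (k_info * math.cos(best), k_info * math.sin(best))


def commanded_heading(pos, goal, obstacles, condition_by_heading):
    """Heading of the sum of the three forces."""
    forces = [attractive_force(pos, goal),
              repulsive_force(pos, obstacles),
              information_force(condition_by_heading)]
    fx = sum(f[0] for f in forces)
    fy = sum(f[1] for f in forces)
    return math.atan2(fy, fx)


curve = {0.0: 50.0, math.pi / 2: 3.0, math.pi: 40.0}  # made-up samples
heading = commanded_heading((0.0, 0.0), (1.0, 0.0), [], curve)
print(math.degrees(heading))
```

In this toy case the goal pulls along the x axis while the best-conditioned heading is along the y axis, so the commanded heading is a compromise between the two.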
3.6  Experimental results: Part 2

As was the case with general 3D motion (over the
Gaussian sphere), linear features were defined by measuring
functions $\mu_i$ of the form $\cos(ax + by)$ or $\sin(ax + by)$
(with a and b belonging to {0, 3, 5, 11, 20, 45, 60, 100}),
whose domains were square windows over the image. The
“reconstruction” functions were again chosen to be
unit functions defined over square image windows.


Figure 7 — (a) Plot for data set 1. (b) Plot for data set 2


Table  1  —  Parameters  for  the  first  image 


The first series of experiments was performed with
the 640 x 480 image shown in Figure 3; the results are
presented in Figures 7, 8, and 9. Table 1 gives, for each
figure, the size of the linear feature window, the
parameters used to generate the measuring functions, and the
size of the reconstruction window.


Figure  8  —  (a)  Plot  for  data  set  3.  (b)  Plot  for  data  set  4 


Figure  9  —  (a)  Plot  for  data  set  5.  (b)  Plot  for  data  set  6 


The first conclusion we can draw from these six
parametric (polar) curves derived from our experiments is
that, for given choices of linear feature vector and of
reconstruction function modelling the depth map, there are
directions of motion (actions) for which the visual
algorithms perform much better than for others. Since the
curves plot the condition number of matrix Q, a point
close to the origin corresponds to a more stable solution
of the reconstruction equations; a displacement in that
direction therefore corresponds to an optimization of the
algorithmic stability of the visual module. Conversely,
the points most distant from the origin correspond to
“bad” actions of the observer, resulting in maximum
algorithmic instability of the visual module, and should
therefore be avoided.
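
Reading such a polar curve programmatically amounts to keeping the headings whose condition number is close to the minimum and steering toward the nearest of them. The following sketch assumes a synthetic curve and a hypothetical tolerance factor of 1.5; neither comes from the experiments.

```python
import math


def angular_distance(a_deg, b_deg):
    """Smallest absolute difference between two headings, in degrees."""
    diff = abs(a_deg - b_deg) % 360
    return min(diff, 360 - diff)


def good_headings(curve, tolerance=1.5):
    """Headings whose condition number is within `tolerance` times the minimum."""
    best = min(curve.values())
    return sorted(h for h, cond in curve.items() if cond <= tolerance * best)


def nearest_good_heading(curve, desired_deg, tolerance=1.5):
    """Good heading closest to the heading the task would prefer."""
    return min(good_headings(curve, tolerance),
               key=lambda h: angular_distance(h, desired_deg))


# Synthetic curve sampled every 10 degrees (illustrative only).
curve = {h: 1.0 + 10.0 * math.sin(math.radians(h)) ** 2
         for h in range(0, 360, 10)}
print(good_headings(curve))
print(nearest_good_heading(curve, 40))
```

A planner could use `nearest_good_heading` as a soft preference, trading the deviation from its desired heading against the gain in stability.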
The second conclusion suggested by our results is
that the locations of the best and worst directions of
displacement (the strong extrema of the condition number)
are not perturbed much by changes in the parameters of
the visual algorithms:

•  The resolution of the reconstruction grid, from 16 x
16 for Fig. 7(a) to 64 x 64 for Fig. 7(b). If these
initial results are confirmed by further experiments,
it would mean that the information we obtain at
low resolutions about best and worst actions could
extend to higher resolutions.

•  As Figure 9(b) clearly shows, increasing the
resolution of the reconstruction grid improves the details of
the condition number curve, but at the cost of
narrowing the “good” regions, now surrounded by sharp
peaks of bad performance. This may not be a very
surprising observation, judging by the difficulty
encountered in all shape from X problems that aim at
extracting dense depth maps.

•  Similarly, changes in the linear feature vector have
little effect on the location of strong extrema, whether
the changes affect the size of the linear features’
windows (compare Figures 7(b) and 8(a)) or the
measuring functions themselves (Figures 7(a) and 8(b)).

•  As with the resolution of the reconstruction grid, it
is possible to change the linear feature vector so as to
obtain a more detailed condition number curve, but
again at the cost of increasing the overall instability
and narrowing the “good” areas of the curve.


Figure  10  —  Another  example  of  640  X  480  input  image 


Similar results have been obtained with other images,
such as the 640 x 480 image in Figure 10. Table 2 gives, for
Figures 12, 11(a), and 11(b), the size of the linear feature
window, the parameters used to generate the measuring
functions, and the size of the reconstruction windows.


Table  2  —  Parameters  for  the  second  image 


Fig.    Linear feature windows   a and b in                   Reconstruction windows
12      128 x 128                {0,3,5,11,20,45,60,100}      64 x 64
11(a)   256 x 256                {0,3,5,11,20,45,60,100}      64 x 64
11(b)   128 x 128                {0,5,20,45,100}              32 x 32


Figure  11  —  (a)  Plot  for  data  set  1.  (b)  Plot  for  data  set  2 

4.  Discussion 

4.1  Motion  and  perception 

The first part of this paper presented an active
approach to the problem of visual reconstruction, which
provides us with a model-based depth map of the
observed scene. However, there are reasons to believe that
a depth map may not be necessary to accomplish
complex navigation tasks such as obstacle avoidance or visual
servoing. In fact, the process in which the visual
reconstruction community has been engaged over the last
decade may prove much more worthwhile than the goal it
has set for itself. On the way to the realization of shape
from X modules, we have learned about the
computational issues of the visual process and have accumulated
results and algorithms which all describe parts of the
vision problem, and can be directly exploited in the control
of robotic tasks.


Figure  12  —  Plot  for  data  set  3 

Traditionally,  it  has  been  assumed  that  the  role  of 
vision  was  to  provide  a  depth  map  to  a  planning  system 
which  would  then  decide  on  a  next  move  for  the  robot. 
This  scheme  implicitly  incorporates  elements  of  feedback 
(if  only  by  the  effects  of  the  motion  on  the  visual  input), 
but  it  separates  task  planning  from  action  and  sensing 
too  sharply.  Clearly,  the  boundary  between  them  is  much 
fuzzier,  and  we  would  like  sensing  to  take  a  greater  part 
in  the  decision  process. 

Even if we avoid reconstructing a depth map, the
algorithms we will be using in our systems will be based on
reconstruction algorithms, since in one form or another,
some three-dimensional information is needed, whether
it is the time to collision or the distance to a particular
feature of the workspace or of the configuration space. In
this context, it becomes useful, and even necessary, to
be able to determine a priori which action will result in
good behavior of the vision algorithms and which will
provoke instability in the computations.

The determination of an “optimal” motion which we
proposed in the previous section is a first step in this
direction. Naturally, we cannot expect the planning
module to apply this action regardless of the task it has to
accomplish, but it should treat as one of its constraints
the need to keep the direction of displacement in the
neighborhood of one of the “good” directions determined
for the visual process.


4.2  Limitations and drawbacks of the method

As is always the case when one attempts to extract
information from real images, ad hoc (or, if one prefers
a gentler term, heuristic) choices had to be made in the
work we presented here:

•  We have already discussed the validity and
limitations of the optical flow equation, whether used
pointwise or integrated over image windows. The method
we presented is not restricted to the use of the
equation proposed in [HornS81]. Any more sophisticated
constraint, however, involves the inconvenience of
requiring additional (unknown and variable)
parameters describing, for example, the reflective properties
of the objects in the scene.

•  As is generally the case when least squares
optimization is used, the main justification for the choice of
this criterion is that it keeps the mathematical
expressions simple. We have at this point no way of
estimating the independence of the variables.

•  Choosing the $\psi_i$ functions to represent the depth
poses more practical than theoretical problems, due
to the fact that the number of such functions
determines the size of the system of linear equations to
solve. For example, if not for the prohibitive
computational cost of such a choice, one could imagine
defining the $\psi_i$’s to be constant functions (giving the
value 1.0) over pixel-size windows.

•  Choosing the measuring functions $(\mu_i)_{i=1,n}$, however,
poses more serious theoretical problems: what types
of functions capture the most information about the
image, or about a class of images? Could this be
determined a priori? Here again, the goal is to make
matrix Q as well conditioned as possible, but one
would now be looking for a minimax solution: the
one giving the lowest condition number in the case of
the worst observer motion (the “best” displacement
may not be feasible in the context of the observer’s
task).
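
The minimax criterion mentioned in the last item above can be stated in a few lines. In this sketch each candidate set of measuring functions is summarised by the condition numbers it yields for a handful of observer motions; the candidate names and all numbers are invented for illustration.

```python
def minimax_choice(worst_case_table):
    """Pick the candidate whose worst (largest) condition number is lowest."""
    return min(worst_case_table, key=lambda name: max(worst_case_table[name]))


def best_case_choice(worst_case_table):
    """For contrast: pick the candidate with the lowest best-case value."""
    return min(worst_case_table, key=lambda name: min(worst_case_table[name]))


# Condition numbers per observer motion for three hypothetical feature sets.
candidates = {
    "coarse": [4.0, 6.0, 30.0],
    "mixed": [8.0, 9.0, 12.0],
    "dense": [2.0, 15.0, 80.0],
}
print(minimax_choice(candidates), best_case_choice(candidates))
```

The two criteria disagree here: the set with the single best motion is also the one that degrades most under an unfavorable motion, which is exactly the situation the minimax choice guards against.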

5.  Conclusion 

We have presented a theory about the extraction of
the shapes of observed objects. Our approach combines
the modules of shape from shading by fusing the
information relevant to each of them, without having to resort
to segmentation or explicit selection of one algorithm or
another for a given area of the image. By using linear
features (i.e. by considering variations of the intensity over
areas and not at isolated points), we avoid the pointwise
computation of unreliable operators, which is a flaw of
classical methods. Finally, we show that the observer
can determine a motion that optimizes the stability of
the equation to be solved, and therefore the reliability of
the solution.
APPENDIX  A  Derivation  of  Equation  (5)

We  are  trying  to  prove  that 

In  order  to  do  this,  we  have  to  express  the  conservation 
of  a  function  A  over  a  3D  domain  V. 


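Conservation Law: Let V be a compact subset of $\mathbb{R}^3$
with oriented boundary $\partial V$, subject to deformations
and displacements, and let A be a function defined on
V. Then the following applies:

$$\frac{d}{dt} \iiint_V A\, dv = \iiint_V \frac{\partial A}{\partial t}\, dv + \iint_{\partial V} A\, \mathbf{V} \cdot d\boldsymbol{\sigma}, \qquad (A.1)$$

where $\mathbf{V}$ is the velocity at the boundary point under
consideration and $d\boldsymbol{\sigma}$ is directed along the outer normal.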


N.B. The 2D equivalent of this law (i.e. where triple
and double integrals are replaced by double and simple
integrals respectively) is valid when the surface is planar
and all deformations/motions occur in this plane. It is
also valid for some pathological forms of A, one of which,
as will be shown here, happens to be our example ($\mu_i I$).

Ostrogradsky's divergence theorem tells us that

$$\iint_{\partial V} A\, \mathbf{V} \cdot d\boldsymbol{\sigma} = \iiint_V \nabla \cdot (A\, \mathbf{V})\, dv.$$

Applying this to (A.1) yields:

We now try to pick a V such that this volume equation
can be transformed into a surface equation for the
particular type of problem we are studying here. In order
to simplify the definition, we will call C(P, S) the cone of
vertex P generated by the surface S.

$\Sigma_i$ is the window on which the linear feature is computed.
We choose V to be the part of $C(O, \Sigma_i)$ lying between $\Sigma_i$
and a surface $\Sigma'_i$ defined as follows: for every surface
element $ds \subset \Sigma_i$, the part of $C(O, ds)$ included in V has
a volume equal to $1 \cdot ds$ (or $L \cdot ds$ for any given L, as long
as $\Sigma'_i$ lies between O and $\Sigma_i$).

If A is such that $A(X, Y, Z) = A'(x, y)$, then (A.2) can
be written as follows:


If the displacement of V is rigid ($\mathbf{V} = T + \omega \times M$) then
$\nabla \cdot \mathbf{V} = 0$ and the conservation law (A.1) simplifies to


In the case that interests us, $A'(x, y) = I(x, y) \cdot \mu_i(x, y)$
(the light intensity along a ray passing through the
optical center is considered constant) and $\varphi_i = \iint A'\, ds$.
The expression turns out to be remarkably simple:

$$\frac{dA'}{dt} = \frac{d}{dt}\left(I \cdot \mu_i\right) = \dot{I} \cdot \mu_i + I \cdot \dot{\mu}_i = I \cdot \dot{\mu}_i,$$

which, combined with (A.2), gives the expected result.
Abstract

Target retrieval tasks characterize an important
class of problems in mobile robotics. In such
problems, a robot searches through a cluttered
environment and identifies objects matching a
specified description. To speed search, the robot
should avoid using slow, high-accuracy sensing
in regions that are unlikely to contain the target.
We describe an approach in which search
focuses on probable regions found with inexpensive
sensing. Decision making is performed
by three modules: a high-level decision making
component based on Bayesian decision theory
provides the overall search strategy, a path planner
adds navigational refinements, and a low-level
controller executes the strategy while coping
with obstacles and unexpected events. We
describe an implementation of our approach and
a series of initial experiments.

1  Introduction 

Many robotics applications require a mobile
robot to fetch an instance of a particular type of
object in an unfamiliar environment. Such target
retrieval tasks represent an important class
of problems in mobile robotics. For example,
researchers in a lab may ask a robot assistant to
locate and retrieve a particular piece of equipment
while they attend to more important matters.
Similarly, workers at a construction site
might use robots to fetch tools and building
materials. To expedite search, the robot must
carefully deploy its sensors to deal with the
uncertainty in sensing.

under Contract No. F30602-91-C-0041, and by the National
Science Foundation in conjunction with the Advanced
Research Projects Agency of the Department of
Defense under Contract No. IRI-8905436.

In the target retrieval tasks considered in this
paper, we assume that the boundary of the
search area is known but that there is little if
any prior information concerning the interior of
the search area. In particular, there is no prior
information concerning the location of the target
or locations of other objects (possibly similar
in appearance to the target) that may impede
exploration or occlude the target. Due to
the computational cost of high-accuracy object
recognition routines, we wish to use such routines
sparingly; the robot searches those areas
that are believed with high probability to contain
the target, as determined by lower-cost, less
accurate sensing routines.

Our robot must also deal with the problem of
navigation. In particular, it must move from
one location to another efficiently while avoiding
obstacles along the way. This involves both
high-level path planning and low-level navigation
routines, which must be able to communicate
in a straightforward manner. The path
planner itself is required to build and maintain
a map that facilitates both planning and navigation.

In the next section, we describe a three-level
solution to the target retrieval problem. Then
we consider the decisions made by each level.
Finally, we describe our experiments in active
perception and present our conclusions.
2  Proposed  Solution

Our solution consists of three levels of control:
supervisory control, path planning, and low-level
control. Each level has its own spatial representation
that is consistent with the other two levels
yet is appropriate for its particular scale of
spatial reasoning.

The highest level of representation is the supervisory
control. The supervisor directs the robot
to square areas called regions believed with high
probability to contain the target object. This
belief is based on sensor information provided
by the robot itself, and may include data acquired
in the course of the robot’s low-level control
phase.

To reduce computational complexity, these regions
should be large. In our case the practical
limit of our sensors’ ability to detect objects
is five meters, so these regions are modeled
as square areas approximately four meters on a
side. These areas can be efficiently represented
by a coarse occupancy grid [Moravec and Elfes,
1985], with each grid element corresponding to
a region. The path planning module has some
freedom to direct the search because it is not
told exactly where in the region to go. The expected
time to find the target is reduced by first
choosing regions likely to contain it.
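
A minimal sketch of such a coarse grid is given below. It assumes, for illustration only, that regions are indexed by integer cell coordinates, that the belief is stored as a plain dictionary of probabilities, and that the cell side is exactly 4 m; the names are hypothetical and this is not the occupancy grid of [Moravec and Elfes, 1985].

```python
import math

REGION_SIZE_M = 4.0  # side of a square region, in meters (from the text)


def region_of(x, y, size=REGION_SIZE_M):
    """Map a position in meters to the index of the region containing it."""
    return (int(math.floor(x / size)), int(math.floor(y / size)))


def most_likely_region(belief):
    """Region with the highest probability of containing the target."""
    return max(belief, key=belief.get)


# Made-up belief over three regions.
belief = {(0, 0): 0.1, (1, 0): 0.6, (0, 1): 0.3}
print(region_of(5.0, 3.9), most_likely_region(belief))
```

Because the supervisor only names a region, the path planner remains free to choose where inside the 4 m cell the robot should go.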

The path planner directs the robot’s route to
the region given by the supervisory control. A
direct path may be impossible, risky, or inefficient
in certain circumstances. Path planning
could use a direct model of each obstacle’s
boundary, but this much detail is not required.
It is much more convenient to model important
locations in the world with nodes. Pathways
between locations are represented by edges in a
graph.

To use this model for path planning, we make
the assumption that traversing an edge requires
only local information. This gives us two advantages.
First, we are able to separate the low-level
control module from the rest of the system.
Thus, multiple low-level control routines may be
selected for different circumstances. Second, we
have abstracted away detail which would make
path planning expensive. The low-level control
module need only give the path planner the estimated
time cost of traversing the given pathway.

The low-level control traverses edges efficiently,
avoiding any obstacles in the way. Low-level
control requires its own spatial model. Our sensors
return obstacle information as the common
data structure that connects low-level control
with the path planner. Our robot uses a laser
light striper for ranging information, along with
near infrared sensors for obstacle avoidance and
navigation information.

Once the robot has arrived at a location, search
for the target object can proceed. A list of candidate
target objects is generated using a quick
scan with the light striper. The robot can then
move adjacent to the object and examine it with
its stereo camera pair. A stereo algorithm is
used to recover the target’s 3D structure. A set
of algebraic invariants is then computed by comparing
the recovered structure with that associated
with the object to be recognized, which is
stored in a database. If the computed invariants
match the ideal ones, the target is found and a
successful recognition is signaled [Subrahmonia
et al., 1992]. Otherwise, this step is repeated
for each object in the candidate list until the
target is found. If the target is not found in the
candidate list, the robot continues the search in
a new area as directed by the supervisor.
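
The control flow of this recognition step can be sketched as a loop over the candidate list. The sketch does not compute the algebraic invariants of [Subrahmonia et al., 1992]; it assumes they are already available as short lists of numbers, and the matching tolerance, names and values are hypothetical.

```python
def invariants_match(computed, stored, tolerance=0.05):
    """True when every computed invariant is within `tolerance` of the stored one."""
    return len(computed) == len(stored) and all(
        abs(c - s) <= tolerance for c, s in zip(computed, stored))


def find_target(candidates, stored, tolerance=0.05):
    """Examine candidates in order; return the first match, or None."""
    for name, computed in candidates:
        if invariants_match(computed, stored, tolerance):
            return name
    return None  # the supervisor would then direct the search elsewhere


stored = [0.82, 1.40, 0.13]  # invariants of the object to be recognized
candidates = [("box-1", [0.30, 1.10, 0.50]),
              ("box-2", [0.80, 1.42, 0.15])]
print(find_target(candidates, stored))
```

Returning `None` corresponds to the case in which no candidate matches and the robot continues the search in a new area.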

Once the robot has arrived at the region given
by the supervisory control, search for the target
object can proceed. A list of candidate target
objects is generated using a quick scan of the
robot’s sensors. The robot can then move adjacent
to the object and examine it with its stereo
camera pair.

2.1  Supervisory  Control 

It is the responsibility of the supervisor to generate
strategies for finding the target as soon
as possible. To expedite search, the supervisor
combines fast but error-prone sensing routines
with slower but more accurate sensing routines.
Each strategy corresponds to a sequence
of navigation and sensing actions. The supervisor
uses a temporal Bayesian network [Dean
and Kanazawa, 1989] to compute the value of
various information gathering strategies. From
a given initial situation, the robot selects a set
of reasonable action sequences from a library of
such sequences. The expected value of an action
sequence is measured in terms of the amount of
information it provides about the location of the
target.

From the previous history of fast inaccurate
sensing, the supervisor has a distribution of
places to visit and costs to arrive at those places.
It chooses the visitation sequence with the highest
expected value, and lets the path planner
execute the first visit. Subsequent actions in
the chosen sequence may not occur. The lower
levels of the system execute and report on the
success of the first visit. Success provides a list
of objects observed in the regional visit, with a
measure of their similarity to the target. This
list is added to the supervisory model, and used
to update the beliefs about the target’s location.
The supervisor then creates a new visitation sequence
if needed.
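
The belief update and the choice of the next visit can be illustrated with a much simpler model than the temporal Bayesian network used by the supervisor. The sketch below assumes a single static distribution over regions, a known detection probability for a thorough search, and a greedy probability-per-cost rule; all of these are simplifying assumptions introduced here.

```python
def update_after_failed_search(belief, region, p_detect):
    """Bayes update when `region` was searched and the target was not found.

    `p_detect` is the assumed probability of finding the target if it is there.
    """
    unnormalised = {r: p * ((1.0 - p_detect) if r == region else 1.0)
                    for r, p in belief.items()}
    total = sum(unnormalised.values())
    return {r: p / total for r, p in unnormalised.items()}


def next_region(belief, travel_cost):
    """Greedy choice: highest belief per unit travel cost."""
    return max(travel_cost, key=lambda r: belief[r] / travel_cost[r])


prior = {0: 0.25, 1: 0.25, 2: 0.25, 3: 0.25}
posterior = update_after_failed_search(prior, region=0, p_detect=0.9)
choice = next_region(posterior, {1: 4.0, 2: 2.0, 3: 8.0})
print(posterior, choice)
```

After an unsuccessful search, probability mass moves from the searched region to the others, and the cheapest of the equally likely remaining regions is visited next.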

2.2  Path  Planning 

A path consists of a sequence of nodes connected
by edges, which may be real, if they have
been traversed and are known to exist, or virtual,
if they are unknown quantities. The cost of
a path is some non-global function of its component
edges. Each edge has a weight associated
with it, determined by the underlying edge data,
and used by the path planner. For example, virtual
edges have greater weight than real edges.
Edges that cannot be traversed due to obstacles
are considered to have an infinite weight.

The path planner is required to quickly return
a path with low cost. We use a variant
of best-first search and a set of heuristics to
eliminate paths that are unnecessarily complicated.
When the supervisor directs the planner
to move to a given region, the planner selects
a destination node corresponding to a location
in that region if one already exists or creates a
new one otherwise.
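
As a hedged illustration of planning over real, virtual and blocked edges, the sketch below uses uniform-cost search (a best-first search ordered by accumulated cost). It is a stand-in, not the planner's actual variant or heuristics, and the penalty factor for virtual edges is an assumed value.

```python
import heapq
import math

VIRTUAL_PENALTY = 2.0  # assumed multiplier for edges never traversed


def edge_weight(length, kind):
    """Weight of an edge: infinite if blocked, penalised if virtual."""
    if kind == "blocked":
        return math.inf
    return length * (VIRTUAL_PENALTY if kind == "virtual" else 1.0)


def plan_path(edges, start, goal):
    """Return (cost, node list) of the cheapest path, or (inf, []) if none."""
    graph = {}
    for a, b, length, kind in edges:
        w = edge_weight(length, kind)
        graph.setdefault(a, []).append((b, w))
        graph.setdefault(b, []).append((a, w))
    frontier = [(0.0, start, [start])]
    visited = set()
    while frontier:
        cost, node, path = heapq.heappop(frontier)
        if node == goal:
            return cost, path
        if node in visited:
            continue
        visited.add(node)
        for nxt, w in graph.get(node, []):
            if nxt not in visited and math.isfinite(w):
                heapq.heappush(frontier, (cost + w, nxt, path + [nxt]))
    return math.inf, []


edges = [("A", "B", 1.0, "real"), ("B", "C", 1.0, "real"),
         ("A", "C", 1.5, "virtual"), ("A", "D", 0.5, "blocked")]
print(plan_path(edges, "A", "C"))
```

In this example the planner prefers two known edges over a shorter but untraversed shortcut, and it reports no path to a node reachable only through a blocked edge.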


Path traversal consists of traversing each of the
component edges of the path. At each node, the
robot must realign itself toward the next node
and proceed if there are no obstacles in the way.
The path planner has ensured, to the best
of its knowledge, that there are no obstacles in
any of the component edges, and it can consult the
occupancy grid for an estimate of the likelihood of there being
an obstacle in any given untraversed edge. The planner
assumes that the low-level control system is able
to deal with obstacles encountered in traveling
from one node to another. Additional details
regarding the path planner are in the longer version
of this paper [Camus et al., 1993].

2.3  Low-level  Control 

The purpose of the low-level control system is to
execute the path selected by the path planner.
Execution consists of visiting in sequence each
component node of the chosen path, avoiding
all obstacles along the way. Each pair of consecutive nodes
in the path is connected by an edge representing
a pathway that contains all the information
necessary to get from one node to another.

The low-level control system relies on dead reckoning
and the accuracy of a laser-light-striper
ranging system to reliably traverse edges in the
network of nodes. Accurate edge traversal is
essential to maintain registration with the map
(network of nodes) being constructed during the
search for the target.
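
Dead reckoning itself reduces to integrating commanded turns and distances into a pose estimate. The following sketch assumes an idealised turn-then-drive motion model with no wheel slip and no correction from the light striper; it is an illustration, not the controller used on the robot.

```python
import math


def dead_reckon(pose, turn, distance):
    """Update (x, y, heading) after turning by `turn` radians, then
    driving `distance` meters straight ahead."""
    x, y, heading = pose
    heading = (heading + turn) % (2.0 * math.pi)
    return (x + distance * math.cos(heading),
            y + distance * math.sin(heading),
            heading)


pose = (0.0, 0.0, 0.0)
for turn, distance in [(0.0, 1.0), (math.pi / 2, 2.0)]:
    pose = dead_reckon(pose, turn, distance)
print(pose)
```

Because errors in such an estimate accumulate from edge to edge, an external range measurement is needed to keep the robot registered with its map.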

3  Experiments 

In this section we describe a series of experiments
involving Gort, a mobile robot especially
designed for this research. Gort operates in
an enclosed atrium filled with boxes and oddly
shaped obstacles. A target shaped like the obstacles
is located in the area, and Gort searches
the area to locate the target. Gort uses stereo
vision both to identify candidate target objects
and to discriminate between the targets and obstacles.

3.1  Hardware 

Gort consists of a circular, four-wheeled base,
an on-board computer, four sensors, and a
power supply. The base, from Real World Interface,
includes its own controller and communicates
with the on-board computer over a standard
serial line. The base carries large batteries
that can power the robot for up to five hours.
Resting on the base is a metal cage that supports
the additional hardware of the robot.

The on-board computing equipment consists of
a VME card cage containing a 68030 processor
board with a hard drive. The processor board
runs a UNIX variant to which we can port software
directly from our workstation development
platforms. All the navigation and planning algorithms
run on the 68030 processor. However,
a tether connects the robot to a separate workstation
for handling the vision routines.

Two sensors are memory-mapped into the virtual
memory of the on-board computer: a ring
of infrared sensors and a laser light striper.
The infrared sensors are good proximity sensors,
providing more robust sensing of obstacles
than most standard acoustic sensors. The laser
light striper is a good medium-range sensor for
detecting and measuring the shape of obstacles.

Also mounted on the metal cage are two cameras,
angled for stereo vision. The cameras
are connected directly to high-bandwidth cables
that feed the vision information directly to
a remote computer for processing. The cables
also provide communication between the remote
computer and the on-board computer.

3.2  Target  Recognition 

A key component of the system is the acquisition
of probabilistic information for the temporal
Bayesian network planner. The ability of
the supervisor to deal with noisy data is essential
to the overall success of the system. We use
two sensor strategies to deal with this problem.
First, the laser light striper is used to detect objects
within the robot’s own region. The laser
can detect objects with high accuracy within
this range. A full sweep with the laser thus
constitutes a thorough search of the robot’s own
region.

Unfortunately the laser light striper’s effective
range is limited, so to detect potential targets
at long range we use vision. We use a pattern-matching
algorithm which seeks to identify potential
targets at up to 8 meters away. This
data is not as complete as that from the laser
light striper, and serves to inform the supervisory
control of potential targets in neighboring
regions.  The  supervisor  considers  the  informa¬ 
tion  from  both  these  sensors  in  deciding  where 
to  direct  the  robot  for  future  search. 
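
As a rough illustration of how evidence from the two sensors might be weighed, the sketch below applies Bayes' rule to the belief that a target lies in one region. It is a single-region simplification, not the temporal Bayesian network itself, and the detection and false-alarm rates are invented for illustration; the function name `bayes_update` and all numbers are assumptions.

```python
def bayes_update(prior, detected, p_detect, p_false_alarm):
    """Return P(target in region | reading) for one binary sensor reading."""
    if detected:
        like_present, like_absent = p_detect, p_false_alarm
    else:
        like_present, like_absent = 1.0 - p_detect, 1.0 - p_false_alarm
    evidence = like_present * prior + like_absent * (1.0 - prior)
    return like_present * prior / evidence


# Hypothetical sensor models: vision is long-range but noisy,
# the laser light striper is short-range but accurate.
VISION = dict(p_detect=0.70, p_false_alarm=0.20)
LASER = dict(p_detect=0.95, p_false_alarm=0.02)

prior = 0.25  # assumed initial belief that a neighboring region holds a target

# Vision reports a candidate target in the neighboring region.
after_vision = bayes_update(prior, detected=True, **VISION)

# The robot moves there and a full laser sweep finds nothing.
after_laser = bayes_update(after_vision, detected=False, **LASER)
```

Under these assumed rates, the noisy vision report raises the belief enough to justify a visit, while a negative laser sweep of the region drives it back down sharply.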

3.3  Results 

Our  primary  results  concern  the  navigation 
components:  the  path  planning  and  low-level 
controls.  Each  has  been  successfully  tested  in 
both  laboratory  and  target  environments. 

The  path  planner  has  successfully  dealt  with 
sensing  uncertainty  and  real  time  demands  for 
robot  control.  Simulated  results  have  shown  the 
path  planner  to  be  a  reliable  method  of  return¬ 
ing  efficient  paths  even  when  dealing  with  thou¬ 
sands  of  locations.  Physical  results  have  shown 
the  ability  of  the  path  planner  and  the  low-level 
control  to  coordinate  efficiently  through  a  field 
of  obstacles. 

The  low-level  obstacle  avoidance  algorithms 
have been successful in all experiments. Two
methods of using the light striper to detect obstacles
were originally developed. One method
makes  use  of  the  laser’s  wide-angle  beam  to  de¬ 
tect  objects  in  the  periphery.  Unfortunately  the 
laser  stripe  is  thin  and  less  visible  off-center, 
making  additional  checks  necessary  for  accurate 
sensing.  These  checks  can  increase  the  time  it 
takes  to  detect  obstacles. 

Thus,  we  have  chosen  the  second  method,  which 
is to rotate the robot base and sweep the center
of the laser beam (where it is strongest)
back  and  forth  across  the  search  area.  This 
approach  was  very  successful  in  detecting  the 
boundaries  of  flat  surfaces  and  boxes,  and  even 
high-curvature  objects  such  as  tree  planters. 
Unfortunately,  turning  the  robot  to  sense  better 
led  to  rotational  inaccuracy  in  the  robot  dead 
reckoning. 

The integration of the navigational modules has
been seamless. Once created, a node provides
a  convenient  way  around  an  existing  obstacle. 
Future  trips  can  make  use  of  the  node  in  the 
path  planning  stage  so  that  the  obstacle  avoid¬ 
ance  stage  can  be  skipped  for  these  previously 
traversed  paths. 

Preliminary  experiments  concerning  the  data 
acquisition  for  the  temporal  Bayesian  network 
Supervisor  have  been  successful  in  both  labo¬ 
ratory  and  target  environments.  The  Bayesian 
network  itself  has  also  been  successfully  simu¬ 
lated. 

4  Related  Work 

Dean and Wellman [1991] describe our basic
approach to planning using Bayesian networks.
Levitt et al. [1988] discuss issues involving
search in the context of object recognition.
Rimey [1992] considers the problems involved in
repositioning a robot head for recognition purposes.
Agosta [1991] discusses some of the more
subtle issues involved in quantifying relations
among visual features.

Basye  et  al.  [1992]  provide  an  overview  of  how 
the  Bayesian  approach  can  be  applied  to  prob¬ 
lems  in  robotics  with  additional  details  in  [Dean 
et al., 1990]. Kirman et al. [1991] discuss the
use of sensor abstractions to deal with the combinatorics
of using Bayesian networks.

Abstract

In  this  paper,  we  describe  a  method  of  incorporat¬ 
ing  the  sensor  planning  abilities  of  the  “MVP” 
Machine  Vision  Planning  system  to  function  in 
a  dynamic  robot  work-cell.  By  mounting  a  cam¬ 
era  on  a  manipulator,  it  is  possible  to  compute 
a  series  of  viewpoints  and  to  move  the  camera 
to  them  at  appropriate  times  so  that  there  is 
always  a  robust  view  suitable  for  monitoring  a 
robot  task.  The  dynamic  sensor  planning  system 
presented  here  achieves  this  by  analyzing  geomet¬ 
ric  models  of  the  environment  and  of  the  planned 
motions  of  the  robot,  as  well  as  optical  models  of 
the camera itself. It computes a series of viewpoints,
each of which provides a valid viewpoint
for  a  different  interval  of  the  planned  task.  Ex¬ 
perimental  results  monitoring  a  robot  operation 
are  presented,  and  directions  for  future  research 
are  discussed. 

1  Introduction 

Recently,  there  has  been  much  research  in  the 
field  of  sensor  planning  [Cowan  and  Bergman, 
1989,  Hutchinson  and  Kak,  1989,  Ikeuchi  and 
Kanade,  1989,  Tarabanis  et  al.,  1991a].  The  ba¬ 
sic  problem  is  that  in  setting  up  an  automated 
system  for  monitoring  some  process,  the  effec¬ 
tiveness  of  the  system  can  largely  be  determined 
by  the  locations,  types  and  configurations  of  the 
sensors  used.  To  manually  determine  these  pa¬ 
rameters  on  a  case  by  case  basis  may  not  be  cost 
effective  or  accurate,  and  the  resulting  system 
may  not  be  optimal  in  any  sense.  It  may  be 
better  to  have  an  automated  system  for  deter¬ 
mining  the  sensor  locations  and  parameters  for 
monitoring  a  given  task. 

To that end, many systems have been and are
being developed which, based on geometric models
of an environment and models of the sensors,
can  generate  sensor  locations  and  settings  which 
provide  a  robust  view  of  specific  features  so  that 
the  features  are  detectable,  recognizable,  mea¬ 
surable,  or  meet  some  other  task  constraints.  In 
general,  the  sensors  are  cameras  and  a  robust 
view  implies  that  the  camera  must  have  an  un¬ 
obstructed  view  of  the  entire  feature  set,  which 
must  lie  within  the  depth-of-field  of  the  camera 
and  must  be  magnified  to  a  given  specification. 
Sensor  planning  systems  can  then  generate  cam¬ 
era  locations,  orientations,  lens  settings  (focus- 
ring  adjustment,  focal  length,  aperture),  and  in 
some  cases  lighting  plans  to  insure  a  robust  view 
of  the  features. 

It  is  interesting  to  note  that  while  research  in 
robot  motion  planning  abounds,  research  in  sen¬ 
sor  planning  has  focused  on  sensor  planning  for 
static  scenes.  It  is  our  belief  that  an  intelligent 
robot  system  capable  of  planning  its  own  actions 
should  be  capable  of  planning  its  own  sensing 
strategies.  With  a  dynamic  sensor  planning  sys¬ 
tem,  this  goal  is  closer  to  a  reality.  Robots  in¬ 
volved  in  manufacturing  or  assembly  can  deter¬ 
mine  appropriate  sensor  locations.  Teleopera¬ 
tors  can  have  the  robot  system  guarantee  robust 
viewpoints  during  the  operation.  The  intelligent 
motion  plans  which  researchers  spend  so  much 
effort  computing  can  be  monitored  in  an  intelli¬ 
gent  fashion. 

To  that  end,  we  have  been  exploring  methods 
of  extending  the  sensor  planning  abilities  of  the 
“MVP”  Machine  Vision  Planning  [Tarabanis, 
1991,  Tarabanis  et  al.,  1991a]  system  to  func¬ 
tion  in  environments  where  objects  are  moving. 
In particular, we focus on sensor planning in a
dynamic robotic work cell environment.

In  previous  work,  we  described  a  technique 
for  sensor  planning  in  a  dynamic  environ¬ 
ment  [Abrams  and  Allen,  1991],  which  was  im¬ 
plemented  using  a  simulated  model  of  a  simple 
moving  object.  Here,  we  present  a  detailed  anal¬ 
ysis  of  the  dynamic  sensor  planning  problem  and 
improved  versions  of  the  original  algorithms.  In 
addition,  experimental  results  using  a  model  of 
a  dual-robot  work  cell  are  presented  in  which  we 
automatically  monitor  a  task  in  the  work  cell. 

2  Overview  of  Static  Planning 

A  complete  description  of  the  MVP  system  is 
beyond  the  scope  of  this  paper.  For  details, 
see [Tarabanis, 1991, Tarabanis et al., 1991a,
Tarabanis et al., 1991b]. In brief, MVP takes
a  constraint  based  description  of  the  vision  task 
requirements  and  synthesizes  what  has  been 
termed a generalized viewpoint, which is an
eight-dimensional vector incorporating sensor location,
orientation, and lens parameters including
aperture and effective focal length. The constraints
MVP considers in determining viewpoints
are depth-of-field, field-of-view, resolution,
and unoccluded visibility.

MVP contains analytical relationships for the
optical task constraints (resolution, focus,
field-of-view), and uses 3-D solid geometric models
of the environment to formulate visibility constraints.
(The geometric models are polyhedra,
both convex and concave.) The constraint equations
can be thought of as defining hypersurfaces
bounding feasible regions in the 8-dimensional
parameter space of the generalized viewpoint.
These constraints are combined in an optimization
setting to produce a generalized viewpoint
which meets all task constraints with as much
margin for error in sensor placement and setting
as possible (i.e., as far away from all hypersurfaces
as possible). Using CAD descriptions
of the object to be viewed and its environment,
MVP generates the visibility region for viewing
the desired features. This region is calculated
to be the total volume in space from which the
features are viewable without obstruction. This
volume is used in the optimization stage of MVP
for finding the best viewpoint.¹
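
To make the idea of a generalized viewpoint with maximal margin concrete, the following sketch represents the eight parameters as a record and scores a viewpoint by its smallest constraint value. The field names, the two toy constraints, and the enumeration of candidates are all assumptions made for illustration; MVP itself uses numerical optimization over the continuous parameter space, and its constraint functions are not reproduced here.

```python
from dataclasses import dataclass, astuple


@dataclass(frozen=True)
class GeneralizedViewpoint:
    # Hypothetical field names: 3 location, 2 orientation, 3 lens parameters.
    x: float
    y: float
    z: float
    pan: float
    tilt: float
    focus: float
    focal_length: float
    aperture: float


def margin(viewpoint, constraints):
    """Smallest constraint value; it is positive only if every constraint holds."""
    return min(g(viewpoint) for g in constraints)


def best_viewpoint(candidates, constraints):
    """Pick the candidate that is farthest from violating any constraint."""
    return max(candidates, key=lambda v: margin(v, constraints))


# Toy constraints (assumed): the camera must sit between 0.5 and 2.0 units
# above the target. Each function is >= 0 exactly when its constraint holds.
constraints = [lambda v: v.z - 0.5, lambda v: 2.0 - v.z]

candidates = [
    GeneralizedViewpoint(0, 0, z, 0, 0, 1.0, 25.0, 8.0) for z in (0.6, 1.25, 1.9)
]
best = best_viewpoint(candidates, constraints)
```

Here the middle candidate wins because it lies farthest from both constraint boundaries, which is the max-min behaviour the text describes. In practice the constraint values would need comparable scales before taking the minimum.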

¹ Here, and elsewhere in this paper, when we refer to a viewpoint
we are actually referring to the generalized viewpoint
mentioned earlier.


3 Motion in the Work Cell

There are two basic cases which must be dealt
with separately in the dynamic sensor planning
problem. First is the case where the target objects,
i.e. those features which must be viewed,
remain stationary and other objects, such as the
robot which is performing some operation on the
stationary part, move. This case can arise in
teleoperation and in many manufacturing tasks
(e.g. spray-painting, spot-welding, etc.). Second
is the case where the targets to be viewed
are moving. This can also arise in teleoperation
and in other manufacturing tasks (e.g.
pick-and-place, part insertion, etc.).

The main difference between these two cases is
that in the first case, if a viewpoint is found
to be valid at some point during the task, it is
guaranteed to be valid with respect to all optical
constraints at all times during the task.
This is because the functions defining the optical
constraints only depend on the target feature
locations and the sensor parameters, and
not on the positions or orientations of obstacles
in the environment. This fairly obvious but important
property allows us to ignore changes in
the optical constraints over time and focus only
on changes in the geometric parameters, i.e. the
visibility constraint.

The second case is more difficult because it requires
an examination of how changes in the position
and orientation of the target features affect
the optical parameters, particularly focus and
resolution. However, if the viewpoint is considered
in terms of a coordinate frame attached to
the feature set, the target can always be considered
stationary with the entire environment
considered as moving. The only limitation is
that the entire feature set must be moving as
a single rigid body, i.e. features cannot move
independently. While extremely important, independently
moving features are not yet handled
in this work, although they are being examined as
part of ongoing research.

To summarize, the exact problem we are dealing
with is one in which an accurately movable
camera is being used to monitor a task. In this
task, the actual target we are monitoring does
not move, but other objects in the environment,
such  as  a  robot  arm,  or  other  mechanical  parts, 
move  in  a  way  which  is  known  a  priori.  The 
problem  is  to  find  where  to  place  the  camera,  and 
when  and  where  to  move  the  camera,  so  that  at 
all  times  during  the  task,  we  have  a  good  view¬ 
point  for  monitoring  the  task. 


4  A  Naive  Approach 

At  a  first  glance,  it  may  seem  that  the  dynamic 
sensor planning problem can be solved trivially.
The  naive  algorithm  for  computing  a  series  of 
viewpoints  is  as  follows: 

1.  Compute  a  viewpoint  for  the  initial  state  of 
the  system,  considering  all  obstacles  in  the 
environment  as  they  are  before  any  motion 
takes  place. 

2. At every time interval Δt, test the current
viewpoint against the model of the changed
environment.

3. If, at some instant t_n, the viewpoint is found
to  be  invalid  due  to  the  movement  of  obsta¬ 
cles,  compute  a  new  viewpoint  based  on  the 
current  state  of  the  model,  and  go  back  to 
step  2. 
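
The three steps above can be sketched in a few lines. The toy world used here, in which a moving obstacle blocks one of four discrete camera slots at each time step, is purely an assumption for illustration, as are the names `naive_plan`, `blocked_slot`, `is_valid`, and `compute_viewpoint`.

```python
def naive_plan(times, is_valid, compute_viewpoint):
    """Recompute a viewpoint only when the current one fails a validity check."""
    viewpoint = compute_viewpoint(times[0])   # step 1: plan for the initial state
    schedule = [(times[0], viewpoint)]
    for t in times[1:]:                       # step 2: test at every time step
        if not is_valid(viewpoint, t):        # step 3: replan on failure
            viewpoint = compute_viewpoint(t)
            schedule.append((t, viewpoint))
    return schedule


# Toy world (assumed): the obstacle blocks slot t // 2 at time step t.
def blocked_slot(t):
    return t // 2


def is_valid(viewpoint, t):
    return viewpoint != blocked_slot(t)


def compute_viewpoint(t):
    # Static planner: lowest-numbered slot that is free right now.
    return min(s for s in range(4) if s != blocked_slot(t))


schedule = naive_plan(list(range(6)), is_valid, compute_viewpoint)
```

In this toy run the planner first picks slot 1, which the obstacle reaches two steps later, forcing a camera move. Slot 3 would have been valid for the whole task, but the naive method cannot see that because it never consults the known motion.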

There  are  several  problems  with  this  approach. 
First,  it  makes  no  attempt  to  reduce  the  number 
of  sensor  placements  required.  Second,  a  view¬ 
point  is  used  up  until  the  moment  it  becomes 
invalid,  or  at  least  up  until  the  point  at  which 
the  margin  for  error  becomes  very  small.  This 
defeats  the  purpose  of  MVP,  which  is  to  find  a 
viewpoint  which  has  as  large  a  margin  for  er¬ 
ror  as  possible.  Worse,  by  the  time  a  viewpoint 
is  deemed  unacceptable,  due  to  errors  in  sensor 
placement,  etc.,  the  viewpoint  may  have  been 
invalid  for  some  time. 

The  basic  problem  is  that  this  technique  does  not 
use  knowledge  of  the  motion  in  computing  view¬ 
points  which  will  be  valid  for  a  long  period  of 
time.  It  is  conceivable  that  a  new  viewpoint  will 
be needed at every Δt, since objects are moving
along unaccounted-for paths. A better approach,
such  as  the  one  presented  below,  uses  its  knowl¬ 
edge  of  how  objects  in  the  environment  move  to 
plan  better  viewpoints. 


5  Overview  of  Our  Approach 

The approach being taken is a Temporal Interval
Search method, which is based on the use
of  swept  volumes.  The  geometric  models  of  the 
moving  objects  are  swept  through  their  paths  to 
compute  the  regions  in  space  which,  during  some 
interval,  are  occupied  by  some  moving  object  in 
the  environment.  The  MVP  algorithms  are  then 
run  using  the  swept  volumes  for  the  occluding 
bodies  as  opposed  to  the  actual  models,  thus  re¬ 
ducing  the  dynamic  sensor  planning  problem  to 
a  static  problem.  If  no  viewpoint  is  found  consid¬ 
ering  these  swept  objects  over  a  time  interval,  a 
temporal  interval  search  is  performed  to  find  the 
largest time intervals which can be monitored by
a single viewpoint. This allows us to plan a series
of viewpoints and the times at which they
become feasible.

Given  that  we  have  an  object  O  whose  motion  is 
known over a time interval T, we define T(T, O)
to be the volume swept out by O during T. For
example, in 2 dimensions, if O is an axis-aligned
unit square moving one unit per second in the
positive x direction, and T is 3 seconds, T(T, O)
is a 1 × 4 rectangle. The key to using swept objects
for sensor planning (or, in fact, for any collision
avoidance problem) is that in planning around
an obstacle given by T(T, O), you guarantee that
you have avoided the actual obstacle O at any
instant in interval T. This observation was made
by  Cameron  in  [Cameron,  1984]  for  the  “clash 
detection”  (robot  collision  avoidance)  problem. 
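
The two-dimensional example can be checked with a small sketch. It handles only an axis-aligned box under constant-velocity translation; the class `Box2D` and function `sweep_translation` are hypothetical names introduced here. For motion along a coordinate axis the result is the exact swept region, while for diagonal motion it is only a conservative bounding box.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box2D:
    x_min: float
    y_min: float
    x_max: float
    y_max: float

    @property
    def width(self):
        return self.x_max - self.x_min

    @property
    def height(self):
        return self.y_max - self.y_min


def sweep_translation(box, velocity, duration):
    """Bounding box of the region covered by `box` while it translates
    at constant `velocity` (vx, vy) for `duration` time units."""
    dx = velocity[0] * duration
    dy = velocity[1] * duration
    return Box2D(
        min(box.x_min, box.x_min + dx),
        min(box.y_min, box.y_min + dy),
        max(box.x_max, box.x_max + dx),
        max(box.y_max, box.y_max + dy),
    )


unit_square = Box2D(0.0, 0.0, 1.0, 1.0)
swept = sweep_translation(unit_square, velocity=(1.0, 0.0), duration=3.0)
```

Any point outside `swept` is outside the square at every instant of the three-second interval, which is the guarantee the planner relies on.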

Let V represent the visibility volume for T(T, O).
V is the set of all points (in 3-space) which give
views of the target which have no obstructions
(due to O) for the entire time interval T. If V is
a  null  volume,  there  is  no  single  viewpoint  which 
would  be  valid  for  all  of  T.  Even  if  V  is  not  null, 
there  is  no  guarantee  that  there  are  viewpoints 
within  V  which  satisfy  the  optical  constraints  of 
MVP. 

A  possible  problem  when  using  swept  volumes 
for  collision  avoidance  type  problems  is  that 
sweeping  an  object  discards  all  information  re¬ 
garding  where  the  object  is  at  any  particular 
moment. We present a technique for recovering
sufficient temporal information to plan sensor
locations. If, using V as a visibility volume,
MVP is unable to find a viewpoint which meets
all constraints, we conclude that T is too large
an interval to plan a single viewpoint for, given
the motion of O. We have no information concerning
when any particular viewpoint becomes
invalid; we only know that we cannot find a
single viewpoint which is valid for the entire interval.
Recomputing T(T, O) for a shorter time
interval T will yield a smaller obstacle, a larger
V, and MVP may now be able to find a viewpoint.

We can now present the algorithm formally. Assume
we have a polygonal target τ which we wish
to monitor during the time interval T = [t_0, t_n].
During T, there is a set of known obstacles O_0
through O_m, which move in known paths. The
goal  is  to  plan  a  single  viewpoint  valid  for  the 
entire  interval,  if  such  a  point  exists,  or  to  de¬ 
termine  a  sequence  of  viewpoints  which,  when 
executed  at  the  appropriate  times,  allow  the  fea¬ 
tures  to  be  monitored  for  the  entire  interval. 

Temporal  Interval  Search 

1. Compute T(T, O_i) for each of the m obstacles.

2. Use MVP to compute a viewpoint using
T(T, O_0) through T(T, O_m) as well as all
stationary objects in the environment as the
set of potential occluding bodies.

3. If MVP can successfully find a viewpoint, use
this viewpoint for the entire time interval T.

4. If no such viewpoint is obtainable, divide the
time interval in half yielding T_1 = [t_0, t_{n/2}].
Go back to step 1 using interval T_1.

5. If the entire time interval T has been
planned, we are finished. If not, go to step 1
using the remaining portion of the original
interval T.

Note, this is not strictly a binary search. Step
4 above only looks at the first half of the time
interval, i.e. T_1 = [t_0, t_{n/2}]. The algorithm
searches for the endpoint of the first time interval
for which MVP can find one viewpoint. It does
this by examining [t_0, t_n], then [t_0, t_{n/2}], [t_0, t_{n/4}],
and so on. Once a single viewpoint is found for,
say, the interval [t_0, t_i], step 5 sees to it that the
interval [t_i, t_n] is examined. If no viewpoint is
found for this whole interval, [t_i, t_{i+(n-i)/2}] is
examined, and so on, until a single viewpoint is
found for, say, the interval [t_i, t_j]. This process
continues until a viewpoint has been found which
is valid until t_n. The critical times are the endpoints
of the intervals, i.e. the times at which
the  sensor  must  be  moved. 
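
A compact sketch of the search loop described above follows. The static planner is abstracted as a callback `find_viewpoint(start, end)` that returns a viewpoint valid over the whole interval, or `None`; this callback, the `min_length` cutoff, and the toy planner with two candidate viewpoints are assumptions made for illustration and stand in for the swept-volume computation and the MVP optimization.

```python
def temporal_interval_search(t_start, t_end, find_viewpoint, min_length=1e-3):
    """Return a list of (start, end, viewpoint) covering [t_start, t_end]."""
    plan = []
    start = t_start
    while start < t_end:
        end = t_end                                # try the whole remainder first
        viewpoint = find_viewpoint(start, end)
        while viewpoint is None:                   # step 4: keep the first half only
            end = start + (end - start) / 2.0
            if end - start < min_length:
                raise ValueError("no viewpoint found near t = %r" % (start,))
            viewpoint = find_viewpoint(start, end)
        plan.append((start, end, viewpoint))
        start = end                                # step 5: plan the remainder
    return plan


# Toy planner (assumed): each candidate viewpoint is occluded during one
# open time window and is usable on any interval that avoids that window.
BLOCKED = {"left": (5.0, 8.0), "right": (0.0, 3.0)}


def find_viewpoint(start, end):
    for name, (lo, hi) in BLOCKED.items():
        if not (start < hi and end > lo):          # interval misses the window
            return name
    return None


plan = temporal_interval_search(0, 8, find_viewpoint)
critical_times = [end for (_, end, _) in plan[:-1]]
```

In this toy case the first interval stops at t = 4 although the first viewpoint would remain valid until t = 5, which illustrates the remark that the procedure is not strictly a binary search: it accepts the first halved interval that works rather than the longest feasible one.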

The  computation  of  swept  volumes  is  central  to 
this  algorithm.  Depending  upon  the  format  in 
which  the  motion  is  known,  the  computation  of 
swept  volumes  may  not  be  expensive.  If  piece- 
wise  linear  translational  motion  is  all  that  is  al¬ 
lowed,  then  the  computation  of  swept  volumes  is 
certainly  tractable  [Weld  and  Leu,  1990].  How¬ 
ever,  if  more  general  types  of  motion  are  allowed, 
as  in  the  motions  which  would  be  executed  by 
a  typical  articulated  manipulator  (rotations  in 
particular),  the  exact  computation  of  swept  vol¬ 
umes  is  more  expensive,  but  not  impossible.  Un¬ 
fortunately,  sweeping  is  not  closed  over  the  set  of 
polyhedra  when  rotational  motion  is  permitted. 
An  articulated  robot  arm  moves  strictly  in  rota¬ 
tions  about  its  joint  axes,  so  the  resulting  swept 
volumes  are  not  polyhedral  (they  would  contain 
circular  arcs,  spherical  patches,  and  other  curved 
surfaces).  These  objects  would  not  be  useable 
in  MVP.  Korein  gives  an  algorithm  for  comput¬ 
ing  polyhedral  approximations  [Korein,  1985]  of 
the  swept  volumes  formed  by  the  motion  of  ar¬ 
ticulated  robot  links.  These  techniques  can  be 
used  to  simplify  the  computation  of  the  swept 
volumes. 

Strictly  speaking,  MVP  directly  computes  vol¬ 
umes  of  occlusion,  not  volumes  of  visibility.  In 
theory,  the  complement  of  a  volume  of  occlusion 
is a volume of visibility. In practice, the complement
of a volume of occlusion with respect to the
workspace of the manipulator placing the sensor
yields the usable visibility volume. In the current
dynamic sensor planning implementation,
instead of computing a swept volume and then
computing  its  occlusion  volume,  we  compute  a 
set  of  volumes  of  occlusion  at  discrete  points 
along  the  trajectory.  These  volumes  of  occlusion 
are  then  unioned  to  form  the  volume  of  occlusion 
for  the  entire  interval.  This  is  possible  because 
the  volume  of  occlusion  generated  by  the  union 
of  a  set  of  obstacles  (for  viewing  a  particular  tar¬ 
get)  is  equal  to  the  union  of  the  volumes  of  oc¬ 
clusion  generated  by  each  obstacle.  One  benefit 
of  this  approach  is  that  subdivisions  of  the  time 
interval  do  not  require  recomputing  new  swept 
volumes. Instead, the appropriate subset of the
instantaneous occlusion volumes can be unioned
to  approximate  the  volume  of  occlusion  for  any 
given  interval. 
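
The discrete-union idea can be illustrated with sets of voxel indices standing in for the polyhedral volumes. This is a sketch under that assumption only: the actual implementation works with polyhedral models, and the function name `union_occlusion` and the toy data are hypothetical.

```python
def union_occlusion(per_step_voxels, first, last):
    """Union of the instantaneous occlusion volumes for trajectory
    steps first..last (inclusive); each volume is a set of voxels."""
    volume = set()
    for voxels in per_step_voxels[first:last + 1]:
        volume |= voxels
    return volume


# Toy data (assumed): at step i the obstacle occludes two voxels in column i.
per_step = [{(i, 0, 0), (i, 1, 0)} for i in range(4)]

# Toy reachability volume of the camera-carrying manipulator.
reachable = {(x, y, 0) for x in range(4) for y in range(3)}

whole_task = union_occlusion(per_step, 0, 3)
first_half = union_occlusion(per_step, 0, 1)

# Usable visibility volume: reachable voxels that are never occluded.
visible_whole = reachable - whole_task
visible_first_half = reachable - first_half
```

Because the per-step volumes are computed once, examining a subinterval only requires a different slice of the same list, and the visible region can only grow as the interval shrinks.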

6  Realization  of  the  Viewpoints 

The  result  of  the  temporal  interval  search  will  be 
a  set  of  viewpoints  and  critical  times  at  which 
to  execute  them.  However,  an  explicit  represen¬ 
tation  of  time  is  not  required  for  the  temporal 
interval  search,  in  which  case  the  critical  times 
are  not  times  at  all  but,  rather,  critical  events. 
If,  for  example,  the  motions  of  a  robot  have  been 
planned as a series of joint-space moves, the critical
events would be joint angle values. If the
motion  was  planned  in  cartesian  space,  the  criti¬ 
cal  events  would  be  cartesian  positions.  Finally, 
if  the  robot  motion  was  planned  on  some  global 
time  scale  (perhaps  avoiding  other  moving  ob¬ 
stacles),  the  critical  events  would  be  actual  times 
on this scale. As long as at task execution time
there is a way to determine when the critical
events arise (e.g. by waiting for the robot to be
within some distance of the prescribed position),
the  viewpoints  can  be  realized. 
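
One way such event detection might look for cartesian critical events is sketched below. The function `realize_viewpoints`, the sampled position stream, and the tolerance are assumptions introduced for illustration; the text does not specify how the events are monitored at execution time.

```python
import math


def realize_viewpoints(robot_positions, critical_events, tolerance):
    """Return (sample_index, viewpoint) pairs marking when each camera
    move is triggered. `critical_events` is an ordered list of
    (cartesian_position, viewpoint) pairs."""
    moves = []
    pending = list(critical_events)
    for index, position in enumerate(robot_positions):
        # Trigger the next event once the robot is close enough to it.
        if pending and math.dist(position, pending[0][0]) <= tolerance:
            moves.append((index, pending.pop(0)[1]))
    return moves


# Toy trajectory of the task robot, sampled at execution time (assumed).
positions = [(0.0, 0, 0), (0.5, 0, 0), (0.98, 0, 0), (1.5, 0, 0), (2.01, 0, 0)]
events = [((1.0, 0, 0), "viewpoint_B"), ((2.0, 0, 0), "viewpoint_C")]

moves = realize_viewpoints(positions, events, tolerance=0.05)
```

The same loop would apply to joint-space events by comparing joint angle vectors instead of cartesian positions.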

7  Experimental  Results 

We  have  modelled  our  laboratory  environment 
using  a  CAD  system  (see  figure  1).  The  model 
includes  two  PUMA  560  robots  and  the  object 
to  be  monitored  during  the  task.  The  first  robot 
(I)  executes  tasks,  while  the  second  robot  (II) 
has  a  camera  mounted  on  it.  In  the  simulated 
experiment,  robot  I  passes  over  the  object  as  if 
it  were  performing  an  operation  on  it,  such  as 
spray-painting.  During  the  task,  robot  II  needs 
to  monitor  a  feature  inside  the  object.  A  CAD 
model  of  the  object  and  the  feature  is  shown 
in  figure  2.  The  target  (i.e.  the  feature  to  be 
viewed)  is  the  top  face  of  the  inner  cube. 

In  order  to  compute  viewpoints  for  monitoring 
robot  I’s  task,  we  need  to  compute  the  visibil¬ 
ity  volume  for  the  object  as  the  robot  moves  in 
the  vicinity  of  the  object,  i.e.  the  volume  from 
which  the  object  is  visible  during  the  entire  task. 
In  other  words,  we  need  to  compute  the  visibility 
volume for T(TaskInterval, RobotI). The visibility
volume is computed by first computing the
volume of occlusion, and subtracting it from the
reachable workspace of robot II, in order to prevent
the computation of a viewpoint which is either
unreachable or has an occluded view. The
volume  of  occlusion  is  approximated  using  the 
discrete  union  algorithm  described  earlier. 

In  the  experiment,  the  robot  model  is  stepped 
through  a  series  of  positions  along  its  planned 
trajectory.  At  each  step,  the  volume  of  occlu¬ 
sion  is  computed  as  in  the  static  sensor  plan¬ 
ning  problem.  The  individual  volumes  of  oc¬ 
clusion  are  unioned  together  to  form  the  vol¬ 
ume  of  occlusion  for  the  entire  trajectory.  In 
this way, we approximate the volume of occlusion
for T(TaskInterval, RobotI) without explicitly
computing T(TaskInterval, RobotI). In figure 4
we show a discrete approximation to the
volume swept out by Robot I during its task (i.e.
T(TaskInterval, RobotI)). The volume of occlusion
resulting from this motion is shown in figure 5.
The volume of occlusion resulting from
the  walls  of  the  part  (i.e.  due  to  self-occlusions) 
is  shown  in  figure  3.  These  two  volumes  were 
unioned  to  form  the  total  volume  of  occlusion. 

An  approximation  to  the  workspace  of  Robot  II, 
the camera-carrying robot (called the robot’s
reachability  volume)  was  generated.  The  total 
occlusion  volume  was  subtracted  from  this  reach¬ 
ability  volume  giving  the  reachable/visible  vol¬ 
ume.  This  volume,  which  contains  all  points  in 
space  where  the  robot  can  position  the  camera 
such  that  the  target  can  be  seen  without  occlu¬ 
sion,  was  used  in  the  optimization  stage  of  MVP 
in  order  to  compute  a  viewpoint. 

Since  MVP  was  unable  to  find  a  valid  view¬ 
point  for  the  entire  task,  the  temporal  interval 
search  was  used  to  find  subintervals  for  which 
we  can  find  valid  viewpoints.  Instead  of  recom¬ 
puting  the  swept  volumes  for  each  subinterval 
examined,  the  discrete  approximation  allows  us 
to  union  the  appropriate  subset  of  volumes  of 
occlusion.  The  subintervals  found  for  this  task 
are  shown  in  figures  6  and  7.  The  generated 
volumes  of  occlusion  due  to  the  robot’s  motion 
during  each  sub-interval  are  shown  in  figures  8 
and  9.  These  volumes  were  again  unioned  with 
the  self-occlusion  volume  and  subtracted  from 
the  reachability  volume  forming  the  volumes  of 
reachability/visibility shown in figures 10 and 11.
These volumes were used in the optimization,
and MVP was able to compute a viewpoint for
each interval. Simulated views from these viewpoints
are shown in figures 12 and 13.

8  Motion  Planning  and  Moving 
Sensors 

In  this  section  we  describe  some  alternate  ways 
of examining both the static and dynamic
sensor  planning  problems.  The  observations  and 
discussions  of  this  section  are  the  motivation  for 
additional  research  which  is  currently  being  car¬ 
ried  out. 

One can view the static sensor planning problem
as a configuration space problem. Using this
view, the sensor’s possible configurations are described
by the generalized viewpoint. The valid
configurations are bounded by the constraining
hypersurfaces in the 8-dimensional parameter
space of the generalized viewpoint. However,
given the highly nonlinear form of the sensor
constraint equations and the high dimensionality
of the generalized viewpoint, standard techniques
for searching configuration spaces appear to be
impractical. This is one of the reasons why MVP
takes a numerical optimization approach to
searching the sensor’s parameter space. However,
the configuration-space analogy will be useful
in motivating other ideas below.

Dynamic  sensor  planning  is  to  static  sensor  plan¬ 
ning  what  path-planning  with  stationary  ob¬ 
stacles  is  to  path-planning  with  moving  obsta¬ 
cles.  Erdmann  and  Lozano-Perez  [Erdmann  and 
Lozano-Perez,  1987]  proposed  a  configuration 
space-time  for  solving  such  problems  in  two  di¬ 
mensions.  They  presented  two  approaches,  one 
for  translating  polygons  and  one  for  two-link  ar¬ 
ticulated  planar  arms.  Their  approaches  focused 
on  the  efficient  construction  of  slices  of  configu¬ 
ration  space-time.  The  slices  were  chosen  so  as 
to  include  easily  computable  time-varying  con¬ 
straints,  simplifying  the  search  from  the  start 
configuration  to  the  goal  configuration. 

In dynamic sensor planning with stationary targets,
the only constraints in configuration space
which move are the boundaries of the visibility
volume. Even if the obstacles are only allowed
restricted classes of motion, their volumes
of occlusion not only move but warp, due to the
fact that the volume of occlusion between an object
and a target depends on the relative orientation
of the two. Thus, the constraints which
are moving in configuration space-time are
non-rigid. This makes it very difficult to determine a
convenient way of slicing a configuration
space-time.


Figure 1: CAD Model of the environment.

Another  way  of  viewing  the  dynamic  sensor  plan¬ 
ning  problem  is  to  segregate  the  positioning  of 
the  sensor  from  the  orienting  and  adjusting  of 
the  sensor.  This  allows  the  computation  of  a 
3-dimensional  region  from  which  all  constraints 
can  be  met  (i.e.  the  projection  into  3-space  of 
the  set  of  valid  8-dimensional  sensor  configura¬ 
tions).  The  moving  polyhedral  volumes  of  occlu¬ 
sion  generated  by  the  moving  obstacles  in  the  en¬ 
vironment  can  be  considered  as  obstacles  which 
the  sensor  must  avoid  while  moving  in  the  free- 
space.  This  reduces  the  sensor  planning  problem 
to  that  of  keeping  a  single  point  away  from  the 
boundaries  of  a  set  of  moving  polyhedra.  Then, 
after the sensor path through 3-space has been
planned,  the  other  5  (optical)  parameters  can  be 
planned  accordingly. 

This suffers from the same problem as the previous
approach, namely that the moving polyhedra
(the volumes of occlusion) are non-rigid bodies.
Although moving polyhedra have been modelled
and examined (e.g. [Canny, 1986, Cameron, 1984]),
non-rigidly moving bodies have not been examined
in detail. An examination of how the volumes of
occlusion change with respect to movements of the
obstacles should make these approaches more
useful and appears very promising for future
research.




Figure  2:  CAD  Model  of  the  part  to  be  viewed. 
The  target  itself  is  the  top  face  of  the  inner  cube. 


Figure  3:  Volume  of  occlusion  caused  by  other 
features  on  the  object  itself  (i.e.  self-occlusions). 


Figure  4:  Swept  Volume  showing  the  robot’s  mo¬ 
tion  over  the  entire  task. 

9  Conclusion 

In conclusion, we have successfully extended our
MVP system to plan sensor locations in a time-varying
environment. This is notable in that, to
the best of our knowledge, motion has not been
widely addressed in the sensor planning literature.
The use of swept volumes provides
a useful way to extend static planning problems
to dynamic domains. We have presented a convenient
way to recover enough temporal information
from swept volumes to use them in planning
tasks. Our immediate research plans are
to bring the results of this paper into our laboratory
and execute the task with the planned
viewpoints. Also, we will be examining the alternative
sweeping techniques presented to see if
they offer any performance improvements.

There are several open issues in dynamic sensor
planning. There is work to be done in computational
geometry to characterize the changes in
a volume of occlusion as the target and occluding
bodies move with respect to each other. A
similar characterization of how the optical constraints
vary with the target’s motion is also
important. Finally, it is hoped that these various
characterizations can be combined to plan a
continuous path through the sensor’s parameter
space, rather than computing a series of viewpoints
and critical times. This would complete
the analogy between sensor planning and configuration
space-time based motion planning, and
allow more useful solutions to be found to dynamic
sensor planning problems.


Figure 5: Volume of occlusion caused by the
robot’s motion during the entire task.


Figure 7: Second task interval.


Figure 8: Occlusion due to the robot’s motion
during the first task interval, shown with object.


Figure 9: Occlusion due to the robot’s motion
during the second task interval, with object.


Figure 10: Intersection of reachable and visible
volumes for first task interval


Figure 11: Intersection of reachable and visible
volumes for second task interval


Figure 12: Simulated view from first computed
viewpoint.


Figure 13: Simulated view from second computed
viewpoint.


Abstract

In  the  feature-based  motion  analysis  of  an 
image  sequence,  consistent  feature  extraction 
and  reliable  matching  are  crucial  factors  for 
the  motion  estimation.  Inconsistent  feature 
extraction  and  erroneous  matching  are  closely 
related  and  hard  to  detect  without  additional 
information.  In  this  paper,  we  address  the  is¬ 
sues  of  using  errorful  data  in  motion  estima¬ 
tion  and  of  using  feedback  to  improve  feature 
extraction  and  matching  in  incremental  anal¬ 
ysis  of  an  image  sequence.  Thus  we  use  3-D 
motion  estimation  as  an  aid  in  generating  the 
data  necessary  for  the  motion  estimation  sys¬ 
tem  itself.  Initial  noisy  correspondence  data 
are  continuously  refined  by  removing  those 
parts  that  do  not  fit  the  estimated  3-D  mo¬ 
tion  parameters.  The  feature  extraction  of  a 
tracked  region  is  guided  by  its  expected  prop¬ 
erties  which  are  obtained  from  the  correspond¬ 
ing  object  in  the  previous  frames.  The  motion 
parameters  and  the  environmental  depth  map 
are  continuously  updated  with  each  additional 
frame. Finally, the surface of the environment is
reconstructed  from  a  sparse  depth  map  at  cor¬ 
ners,  utilizing  the  relations  among  the  regions 
which  underlie  the  corners.  Test  results  for 
standard  real  image  sequences  are  presented. 

1  Introduction 

Recovery  of  relative  motion  between  the  camera  and  the 
environment  as  well  as  recovery  of  the  environmental 
structure  is  an  active  research  area  in  computer  vision. 
Much  of  the  work  has  viewed  this  task  purely  in  mathe¬ 
matical  terms  -  given  perfect  data,  how  can  we  extract 
3-D  information?  In  this  research,  we  concentrate  on 
using  3-D  motion  estimation  as  an  aid  in  generating  the 
data  necessary  for  the  motion  estimation  system  itself. 

'This  research  was  supported  by  the  Advanced  Research 
Projects  Agency  of  the  Department  of  Defense  and  was  mon¬ 
itored  by  the  Air  Force  Office  of  Scientific  Research  under 
Contract  No.  F49620-90-C-0078.  The  United  States  Govern¬ 
ment  is  authorized  to  reproduce  and  distribute  reprints  for 
governmental  purposes  notwithstanding  any  copyright  nota¬ 
tion  hereon. 


Thus  we  deal  with  the  issues  of  using  errorful  data  in 
motion  estimation  and  of  using  feedback  from  3-D  esti¬ 
mation  to  feature  extraction  and  feature  matching. 

Conventional  feature  based  motion  analysis  tech¬ 
niques  use  a  sequential  framework  -  feature  extraction, 
establishment  of  correspondence,  estimation  of  motion 
parameters  and  recovery  of  3-D  structure.  This  sequen¬ 
tial  processing  of  data  forces  the  overall  analysis  to  de¬ 
pend  on  the  integrity  of  the  earlier  stages.  The  reliability 
of  the  estimated  motion  parameters  depends  on  the  qual¬ 
ity  of  correspondences,  which  are  affected  by  the  consis¬ 
tency  of  feature  extraction.  Feature  extraction,  feature 
matching  and  motion  estimation  have  been  studied  ex¬ 
tensively  by  numerous  researchers,  each  as  a  separate 
research  topic. 

Establishment  of  correspondence  has  been  a  challeng¬ 
ing  problem  in  motion  analysis  of  real  image  sequences. 
Sethi [Sethi and Jain, 1987] suggests a tracking method
based on the smoothness of motion in the image plane. Their
work  assumes  that  the  number  of  extracted  points  re¬ 
mains  constant,  except  for  one  frame  where  some  points 
may  disappear  due  to  occlusion.  Cheng  [Cheng  and  Ag- 
garwal,  1990]  uses  a  2-stage  tracking  algorithm  for  point 
correspondence  in  multiple  frames.  The  rule-based  sec¬ 
ond  stage  inspects  the  last  four  frames  and  updates  pre¬ 
vious  matches  by  maximizing  the  smoothness  of  the  2-D 
motion. 

In  recent  work  on  the  integration  of  subsystems 
into  a  working  motion  analysis  system,  attempts  have 
been  made  to  use  feedback  from  motion.  Chan- 
drashekhar  [Chandrashekhar  and  Chellappa,  1991]  uses 
predicted  3-D  motion  for  feature  correspondence  with  a 
partially  known  structure.  Sawhney  [Sawhney  and  Han¬ 
son,  1992]  uses  a  predicted  mask  in  the  tracking  of  struc¬ 
ture  with  hypotheses.  The  main  use  of  motion  in  these 
works  is  to  reduce  the  search  space  in  matching  of  in¬ 
terest  points  [Chandrashekhar  and  Chellappa,  1991],  or 
in  grouping  of  linear  segments  [Sawhney  and  Hanson, 
1992]. 

Little work has been done on feeding 3-D motion
estimates back to feature extraction and matching, even
though, in the analysis of an image sequence, the consistency of
features extracted over the frames is a crucial factor for
a reliable correspondence. In this paper, we present a
feedback approach, where feature extraction, matching
and motion analysis are performed in cooperative man-
gions and corners speed up computation and increase
stability  of  the  matching.  Each  image  in  the  sequence  is 
segmented  into  regions  (global  segmentation),  that  are 
matched  between  adjacent  frames.  Then,  corners  com¬ 
puted  based  on  the  linear  segment  approximation  of  the 
contours  are  matched.  For  convenience,  we  use  RMS 
(region  matching  sequence)  and  CMS  (corner  matching 
sequence)  to  refer  to  a  sequence  of  matched  regions  and 
corners  over  multiple  frames. 

A recursive splitting technique [Ohlander et al., 1978],
which uses the statistics of image attributes (intensity
for black-and-white images), is used for the segmentation
of an image. The segmentation procedure locates
well-separated peaks in the histogram of the image value over
a  masked  area.  The  image  is  segmented  into  regions  with 
a  certain  range  of  values  of  the  attribute.  Segmented  re¬ 
gions  are  recursively  segmented  into  smaller  regions  until 
the  size  of  the  region  is  too  small  or  the  attribute  is  in¬ 
separable.  After  the  initial  frames  in  the  sequence  are 
segmented  this  way,  the  segmentation  of  regions  in  the 
following  frames  is  guided  by  the  expected  properties  in¬ 
duced  from  previous  matching  results. 
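
The recursive splitting idea can be illustrated with a minimal sketch. It is not the procedure of [Ohlander et al., 1978]: it ignores spatial connectivity, treats a region as a list of (pixel id, intensity) pairs, and uses a simple valley test whose parameters (`min_separation`, `valley_ratio`, `min_size`) are hypothetical choices made only for illustration.

```python
from collections import Counter


def find_valley(values, min_separation=10, valley_ratio=0.5):
    """Return an intensity threshold between two well-separated peaks, or None."""
    hist = Counter(values)
    levels = sorted(hist)
    # Most populated intensity level, and the most populated level far from it.
    first = max(levels, key=lambda v: hist[v])
    others = [v for v in levels if abs(v - first) >= min_separation]
    if not others:
        return None  # the attribute is inseparable
    second = max(others, key=lambda v: hist[v])
    lo, hi = sorted((first, second))
    between = list(range(lo + 1, hi))
    if not between:
        return None
    valley = min(between, key=lambda v: hist.get(v, 0))
    # Peaks count as well separated only if the valley is clearly lower than both.
    if hist.get(valley, 0) > valley_ratio * min(hist[lo], hist[hi]):
        return None
    return valley


def split_recursively(pixels, min_size=4, **kw):
    """pixels: list of (pixel_id, intensity). Returns a list of regions (lists of ids)."""
    whole = [[p for p, _ in pixels]]
    if len(pixels) < 2 * min_size:
        return whole  # region too small to split further
    threshold = find_valley([v for _, v in pixels], **kw)
    if threshold is None:
        return whole
    low = [(p, v) for p, v in pixels if v <= threshold]
    high = [(p, v) for p, v in pixels if v > threshold]
    if len(low) < min_size or len(high) < min_size:
        return whole
    return split_recursively(low, min_size, **kw) + split_recursively(high, min_size, **kw)
```

A region with two distinct intensity populations is split once and each half is then left alone, because a single-valued histogram has no second peak.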

Matching is performed in both the forward and the backward
direction to ensure a one-to-one match of features.
Relaxation-based symbolic matching [Faugeras
and Price, 1981] is used for both region and corner
matching. The matching system uses a feature-based
symbolic description for its input. For region matching,
the properties include average values of the image intensity,
size, location and simple shape measures. Relations
include adjacency, relative position and near-by. For corner
matching, the properties used include position and
angular data of the line segments.
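
The forward and backward check can be sketched as follows. This is only an illustration of the one-to-one constraint: it replaces the relaxation-based symbolic matcher with a plain nearest-neighbour search, and the `cost` function over feature properties is a hypothetical stand-in supplied by the caller.

```python
def one_to_one_matches(features_a, features_b, cost):
    """Keep only pairs (i, j) that choose each other in both directions."""
    if not features_a or not features_b:
        return []
    # Forward pass: best candidate in frame B for every feature of frame A.
    forward = {
        i: min(range(len(features_b)), key=lambda j: cost(features_a[i], features_b[j]))
        for i in range(len(features_a))
    }
    # Backward pass: best candidate in frame A for every feature of frame B.
    backward = {
        j: min(range(len(features_a)), key=lambda i: cost(features_a[i], features_b[j]))
        for j in range(len(features_b))
    }
    return [(i, j) for i, j in forward.items() if backward[j] == i]


def property_cost(p, q):
    """Hypothetical cost: summed absolute difference of (intensity, size)."""
    return sum(abs(x - y) for x, y in zip(p, q))
```

A feature in the second frame that is nobody's forward choice simply stays unmatched, which is how a newly appearing region would be handled in this sketch.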

The  3-D  motion  parameters  and  structure  of  the 
matched features are estimated using the Chronogeneous
analysis technique developed by Franzen [Franzen, 1992],
which  handles  uniform  acceleration  with  constant  trans¬ 
lation  and  rotation.  Each  point  need  not  be  visible  in 
all  frames,  but  should  be  visible  in  at  least  3  frames. 
The  accuracy  of  the  solution  depends  on  the  number 
of  frames  with  steady  improvement  as  the  number  of 
frames  increases.  However,  most  of  the  improvement  oc¬ 
curs  within  the  first  7  frames  with  a  slight  improvement 
after  frame  11.  Thus,  7  to  11  frames  provide  a  good 
compromise  between  computation  time  and  accuracy. 

3  Guidance  of  motion  in  feature 
extraction 

To  accomplish  consistent  feature  extraction  and  reliable 
correspondence,  the  processing  of  features  is  guided  by 
feedback  of  information  in  various  ways  as  follows,  where 
motion  provides  the  major  guidance  information. 


1. Guide in global segmentation. The global segmentation
of the current frame is guided by the
intensity distribution of the regions extracted from
the previous frame.

2. Guide in feature matching. Matching becomes
more stable and faster by limiting the search space
of matching along the predicted trajectory.

3. Refinement of noisy correspondences. Initial
noisy correspondence data are gradually refined and
linked by 3-D motion. Details are found in [Kim
and Price, 1992a].

4. Guide in local segmentation. In incremental
mode, the extraction of a tracked feature is guided
by its expected properties (size, intensity) induced
from the matched regions in previous frames.

3.1 Reference regions

Regions can be related in two ways, region matching
and corner linking. Matching of segmented regions between
adjacent frames becomes disconnected at those
frames where similar regions are not extracted or properly
matched. The regions in each disconnected RMS are
related. Regions in non-adjacent frames become related
by linking of the corners generated by the regions.

Figure 2 illustrates all the cases of related regions. An
object is segmented into regions 1, 2, 3, 5, 6 in frames 1,
2, 3, 5, 6. In frame 4, only part of the object is segmented
into region 4-a and another similar object is segmented
into region 4-b. Since region matching fails at frame
4, regions (1, 2, 3) and regions (5, 6) are two sets of
regions related by region matching. CMS-a and CMS-b
in frames (1, 2, 3) are linked with corners in region 4-a
and region 4-b, respectively. CMS-c in frames (5, 6) is
linked with a corner in region 4-a. From these relations,
regions (1, 2, 3, 4-a, 4-b, 5, 6) are all related. Since only
local properties of the contour around the corner are
used in the linking process, there can be non-negligible
variations in the properties of the related regions. For
frame 4, each of the two regions is considerably different
from the rest in the other frames.

We need a set of regions that is the most representative
of these related regions to be used as the reference
in the local feature extraction. First, the set of regions
with corners that passed the refinement process are selected.
Then they are clustered into sets of consistent
regions. The regions in the dominant set are selected as
the reference set. The criteria used in the clustering are:


Related by region matching: (Reg-1 Reg-2 Reg-3) (Reg-5 Reg-6)
Related by corner linking: (Reg-1 Reg-2 Reg-3 Reg-4-a Reg-4-b)
(Reg-4-a Reg-5 Reg-6)

Finally: (Reg-1 Reg-2 Reg-3 Reg-4-a Reg-4-b Reg-5 Reg-6) are
all related.


Figure 2: Related regions by corner linking
Criterion 1 Consistent intensity histogram: The
overlap range between the intensity peaks of two regions
is larger than α % of the minimum range of the two
peaks, where α is usually fO.

Criterion 2 Consistent contour shape: The size of
the difference of the masks of the two regions after proper
scaling and translation to compensate for the motion is
less than β % of the average area of the regions, where
β is usually 20.
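
As a rough sketch of how the two criteria and the choice of the dominant set could be coded, assume each region is a dictionary holding a `peak` (an intensity range) and a `mask` (a set of pixel coordinates that has already been scaled and translated to compensate for the motion). These data structures, the use of the symmetric difference as "the difference of the masks", and the single-link grouping are assumptions of this example; the text does not specify the clustering method. The thresholds are passed in explicitly.

```python
def peak_overlap_ok(peak_a, peak_b, alpha):
    """Criterion 1: peaks are (low, high) intensity ranges; alpha is a percentage."""
    overlap = min(peak_a[1], peak_b[1]) - max(peak_a[0], peak_b[0])
    min_range = min(peak_a[1] - peak_a[0], peak_b[1] - peak_b[0])
    return overlap > alpha / 100.0 * min_range


def contour_shape_ok(mask_a, mask_b, beta):
    """Criterion 2: masks are motion-compensated sets of (row, col) pixels."""
    difference = len(mask_a ^ mask_b)  # assumed: symmetric difference
    average_area = (len(mask_a) + len(mask_b)) / 2.0
    return difference < beta / 100.0 * average_area


def consistent(a, b, alpha, beta):
    return (peak_overlap_ok(a["peak"], b["peak"], alpha)
            and contour_shape_ok(a["mask"], b["mask"], beta))


def dominant_set(regions, alpha, beta):
    """Group pairwise-consistent regions (single link) and return the largest group."""
    parent = list(range(len(regions)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i in range(len(regions)):
        for j in range(i + 1, len(regions)):
            if consistent(regions[i], regions[j], alpha, beta):
                parent[find(i)] = find(j)
    groups = {}
    for i in range(len(regions)):
        groups.setdefault(find(i), []).append(i)
    return max(groups.values(), key=len) if groups else []
```

The function returns the indices of the regions in the largest group, which would then serve as the reference set.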

3.2  Intensive  local  segmentation 

The  reference  regions  focus  the  segmentation  on  both 
the  position  and  shape  of  the  corresponding  object.  The 
mask  that  covers  the  object  of  interest  is  predicted  from 
the  shapes  of  the  reference  regions.  The  peak  selection 
process  in  the  intensity  histogram  of  the  masked  area  of 
the  image  is  guided  by  the  histograms  of  the  reference 
regions. 

The  mask  for  the  local  segmentation  is  obtained  by 
scaling  and  translation  of  reference  regions  to  compen¬ 
sate  for  the  motion.  The  motion  of  a  region  is  repre¬ 
sented  by  that  of  its  corners.  When  there  are  several 
refined  CMSs,  each  CMS  produces  a  predicted  mask  for 
the  region  in  the  next  frame.  Consequently,  the  mask 
is  a  union  of  all  the  predicted  masks  from  the  refined 
corner  sequences  of  the  reference  regions. 
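
The construction of the mask can be sketched as below. The sketch assumes that a region is a set of (row, column) pixels, that scaling is applied about the region centroid, and that each refined CMS contributes one hypothetical (scale, shift) prediction; none of these representational details is stated in the text.

```python
def predict_mask(mask, scale, shift):
    """Scale a pixel set about its centroid, then translate it by shift=(d_row, d_col)."""
    if not mask:
        return set()
    n = len(mask)
    centre_r = sum(r for r, _ in mask) / n
    centre_c = sum(c for _, c in mask) / n
    d_row, d_col = shift
    # Forward mapping of pixels: for scale > 1 this leaves gaps, so a fuller
    # implementation would map destination pixels back to the source instead.
    return {
        (round(centre_r + scale * (r - centre_r) + d_row),
         round(centre_c + scale * (c - centre_c) + d_col))
        for r, c in mask
    }


def union_of_predictions(mask, predictions):
    """predictions: list of (scale, shift) pairs, one per refined corner sequence."""
    result = set()
    for scale, shift in predictions:
        result |= predict_mask(mask, scale, shift)
    return result
```

Because the result is a union, predictions that disagree enlarge the mask rather than cancel out, which is why the mask usually covers more than the desired region.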

As  the  mask  fits  the  desired  region  more  closely,  the 
probability  of  the  region  being  represented  by  the  dom¬ 
inant  peak  gets  larger.  Since  the  predicted  mask  is  the 
union  of  those  from  each  refined  corner  sequence,  usually 
it  covers  much  larger  area  than  the  desired  region  and 
thus  the  histogram  of  intensity  in  the  mask  area  usually 
consists  of  several  peaks,  where  the  desired  region  is  not 
necessarily  represented  by  the  dominant  peak.  In  local 
segmentation,  the  peak  is  selected  which  overlaps  with 
the  intensity  peak  in  the  union  of  the  histograms  of  the 
reference  regions.  Thus,  the  role  of  the  guidance  is  to 
pick  the  correct  peak  which,  otherwise,  may  be  hidden 
by  a  larger  peak  in  the  mask  area.  Since  the  intensity  his¬ 
tograms  of  the  reference  regions  consist  of  similar  peaks, 
the  union  of  their  histograms  usually  forms  a  smooth 
peak. When the region size is small, the union of the
histograms  with  equal  weighting  may  be  dominated  by 
an  irregular  intensity  distribution  of  a  large  region.  To 
reduce  such  effects,  all  the  reference  regions  are  given 
equal  weighting  in  the  union  of  the  intensity  histograms. 
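
A minimal sketch of the guided peak selection follows. It assumes that histograms are plain lists of bin counts, that "equal weighting" means each reference histogram is normalized to unit mass before averaging, and that a peak is described by a (low, high) bin range; the `fraction` parameter that delimits a peak is a hypothetical choice.

```python
def normalized(hist):
    total = float(sum(hist))
    return [h / total for h in hist] if total else [0.0] * len(hist)


def reference_histogram(histograms):
    """Equal-weight union: every reference region contributes a unit-mass histogram."""
    norm = [normalized(h) for h in histograms]
    return [sum(h[b] for h in norm) / len(norm) for b in range(len(norm[0]))]


def peak_range(hist, fraction=0.5):
    """Bin range around the highest bin where values stay above fraction * maximum."""
    top = max(range(len(hist)), key=lambda b: hist[b])
    level = fraction * hist[top]
    lo = hi = top
    while lo > 0 and hist[lo - 1] >= level:
        lo -= 1
    while hi < len(hist) - 1 and hist[hi + 1] >= level:
        hi += 1
    return lo, hi


def select_peak(candidate_peaks, reference_peak):
    """Pick the candidate range with the largest overlap with the reference peak."""
    def overlap(p):
        return min(p[1], reference_peak[1]) - max(p[0], reference_peak[0])

    best = max(candidate_peaks, key=overlap)
    return best if overlap(best) >= 0 else None
```

In this sketch a candidate is chosen by overlap alone, so a small peak that matches the reference wins over a larger peak elsewhere in the mask area.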

3.3 Merging of global segmentation and local
segmentation

We  applied  guided  local  segmentation  to  improve  the 
quality  of  the  globally  segmented  regions.  Comparison 
of  the  results  from  global  segmentation  and  guided  local 
segmentation leads to the following conclusions:

•  When  the  globally  segmented  regions  in  an  RMS 
are  good  in  most  frames,  then  guided  local  segmen¬ 
tation  provides  some  improvement.  If,  in  some  of 
the  frames,  global  segmentation  fails  to  extract  a 
consistent region or fails to generate the desired
region at all, then guided local segmentation
generates regions with more consistent contour shape, or
regions  which  have  not  been  extracted  in  the  global 


segmentation.  The  role  of  guided  segmentation,  in 
this  case,  is  to  fill  in  the  gap  of  the  original  RMS. 

•  When  the  size  of  an  object  is  small,  the  correspond¬ 
ing  regions  are  missing  in  some  of  the  frames  in 
global  segmentation.  The  resulting  RMS  includes 
noisy  mismatched  regions.  When  a  fine  set  of  refer¬ 
ence  regions  is  provided,  guided  local  segmentation 
is  successful  in  extracting  the  missing  regions.  An 
adaptive  minimum  size  of  regions  is  used  in  local 
segmentation,  which  is  a  function  of  the  sizes  of  the 
reference  regions. 

•  When  the  RMS  from  global  segmentation  is  highly 
noisy,  stable  reference  regions  are  hard  to  obtain, 
and  thus  the  improvement  from  guided  local  seg¬ 
mentation  is  weak. 

Thus,  the  gaps  can  be  filled  in  by  extracting  missing 
regions  and  a  noisy  region  can  be  replaced  by  a  locally 
segmented region. Global segmentation and guided local
segmentation are complementary. Guided local segmentation
is not used alone because it is focused on an object
which has been tracked and is not appropriate for
extracting a new region. In incremental analysis, each
additional frame is globally segmented and matched.
Motion parameters and structure are computed from the
matching  data  and  then  local  segmentation  is  performed 
for  each  RMS  that  underlies  those  CMSs  that  pass  the 
refinement  process. 

Since  the  performance  of  guided  local  segmentation 
is  affected  by  the  quality  of  the  reference  regions,  we 
are  conservative  in  using  the  results  from  guided  local 
segmentation.  A  region  is  replaced  when  it  is  far  more 
consistent with the reference regions. Criteria
1 and 2 of subsection 3.1 are used as the consistency measure,
with stricter thresholds (α and β are 50 and 10, respectively).
A  region  in  a  CMS  from  global  segmentation  is  replaced 
by  a  new  region  from  local  segmentation  only  when  the 
new  one  meets  the  criteria  and  the  old  one  does  not. 

4  Selection  from  multiple 
interpretations  of  motion 

From  a  set  of  correspondence  data,  multiple  solutions 
are  generated  each  associated  with  a  different  3-D  MI 
(motion  interpretation)  of  the  features  in  the  scene.  The 
motion  analysis  algorithm  used  in  our  work  [Franzen, 
1992]  is  based  on  a  search  technique  starting  from  multi¬ 
ple  initial  guesses.  The  number  of  solutions  is  affected  by 
the  quality  of  the  correspondence  data  with  3-4  solutions 
generated  from  reasonably  noisy  data. 

In  incremental  analysis,  several  solutions  are  gener¬ 
ated  with  each  new  frame.  Selection  of  the  correct  solu¬ 
tion  is  important  since  the  3-D  motion  associated  with 
it  guides  the  processing  of  the  next  frame.  The  fitting 
error,  which  is  a  sum  of  the  differences  between  the  given 
2-D  positions  and  the  reconstructed  2-D  positions  in  the 
image,  could  be  a  simple  measure  of  the  reliability  of  the 
solution.  However,  it  is  a  good  criterion  only  when  the 
quality of the input correspondence data is very good.
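
For concreteness, the fitting error can be written as a short function. The sketch assumes that the "difference" between two image positions is their Euclidean distance and that positions are given as (x, y) pairs in matching order; both are assumptions of this example.

```python
import math


def fitting_error(observed, reconstructed):
    """Sum of distances between given and reconstructed 2-D image positions."""
    return sum(math.dist(p, q) for p, q in zip(observed, reconstructed))


def lowest_error_solution(observed, solutions):
    """Index of the solution whose reprojection is closest to the observations."""
    return min(range(len(solutions)),
               key=lambda k: fitting_error(observed, solutions[k]))
```

Selecting by this measure alone rewards any solution that happens to fit the given 2-D positions, which is the weakness noted above for noisy correspondence data.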

The multiplicity of solutions from incremental analysis
provides a means to select a good solution by measuring
the confidence factor of the 3-D MI associated with each
solution. We believe that a solution which represents
the actual environment is more likely to have coinciding
solutions in future frames than a solution which is far
from the real situation but happens to fit the given 2-D
positions of the correspondence data. The motion
parameters from each solution produce a 3-D structure.
The confidence factor of a solution is computed from the
consistency of the 3-D structure over the sequence of the
frames.


Figure 3: Compatibility measure

Figure  3  illustrates  this  compatibility  measure.  First, 
the  confidence  factor  of  a  point  is  computed  as  the 
weighted  sum  of  its  compatibilities  with  other  points. 
The  compatibility  between  two  points  is  dependent  on 
the  consistency  of  the  relative  positions  and  instanta¬ 
neous  velocity  vectors  of  the  two  points.  Then,  the  con¬ 
fidence  factor  of  a  solution  is  a  weighted  sum  of  the  com¬ 
patibilities  of  the  points  in  the  feature  set. 

The following definition of compatibility
$C_p(m, n, MI(i), MI(j))$ measures the consistency of the
relative positions of points $m$ and $n$ between $MI(i)$ and
$MI(j)$ in 3-D space. $C_p(m, n, MI(i), MI(j))$ is scale
invariant since it compares the directions of the reconstructed
structure.

$$C_p(m, n, MI(i), MI(j)) = \frac{a \cos\theta_m + b \cos\theta_n + c \cos\theta_{mn}}{a + b + c} \qquad (1)$$

$O$: the origin in the object centered coordinate

$m_i, n_i, m_j, n_j$: the 3-D positions of features $m$, $n$ in
$MI(i)$ and $MI(j)$

$\theta_m, \theta_n$: angles of $(m_i, O, m_j)$ and $(n_i, O, n_j)$

$\theta_{mn}$: angle between the displacement $(m_i n_i)$ and
the displacement $(m_j n_j)$

$a, b, c$: weighting factors (0.2, 0.2, 0.6 is used)
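
Equation 1 can be evaluated directly from the reconstructed positions. The following sketch assumes that positions are 3-D tuples expressed in the object centered coordinate system, so that the origin $O$ is (0, 0, 0); the function and argument names are chosen for this example only.

```python
import math


def _cos_angle(u, v):
    """Cosine of the angle between two 3-D vectors."""
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(a * a for a in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("angle is undefined for a zero-length vector")
    return sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)


def point_compatibility(m_i, n_i, m_j, n_j, weights=(0.2, 0.2, 0.6)):
    """C_p of eq. 1 for features m, n reconstructed under MI(i) and MI(j)."""
    a, b, c = weights
    cos_m = _cos_angle(m_i, m_j)  # angle (m_i, O, m_j) at the origin
    cos_n = _cos_angle(n_i, n_j)  # angle (n_i, O, n_j) at the origin
    disp_i = [q - p for p, q in zip(m_i, n_i)]  # displacement m_i -> n_i
    disp_j = [q - p for p, q in zip(m_j, n_j)]  # displacement m_j -> n_j
    cos_mn = _cos_angle(disp_i, disp_j)
    return (a * cos_m + b * cos_n + c * cos_mn) / (a + b + c)
```

Only directions enter the computation, so a reconstruction that differs from another by a uniform positive scale has a compatibility of 1 with it, in line with the scale invariance noted above.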


The confidence factor of an MI is based on the
strength of compatibility with other MIs. The compatibility
between $MI(i)$ and $MI(j)$ is:

$$C_M(MI(i), MI(j)) = \frac{\cdots}{\|F\|^{2}} \qquad (2)$$

where $F$ is the set of common point features and $\|F\|$ is
its size.


5  3-D  Reconstruction 

Motion  and  structure  estimation  from  corner  correspon¬ 
dence  data  generates  a  sparse  depth  map  for  corners  in 
the  image.  Since  the  selected  corners  are  usually  from 
several  objects  in  the  image,  regular  surface  interpola¬ 
tion  techniques  cannot  be  applied  to  reconstruct  the  3-D 
surfaces.  We  build  a  dense  range  map  using  the  sparse 
depth map of corners and the properties (shape and size)
and  relations  of  the  underlying  region. 

Since  several  corners  are  generated  from  a  region,  the 
range  of  a  region  is  represented  by  the  distribution  of 
the  depth  values  of  its  corners.  If  the  depth  estimation 
is  fairly  accurate,  the  local  structure  of  a  region  could  be 
computed  from  the  depths  of  several  corners  along  the 
contour.  If  the  number  of  corners  is  large  enough,  then 
low-confidence  depth  values  can  be  eliminated  statisti¬ 
cally  by  disregarding  those  lying  at  the  extremities  of 
the  samples  as  done  in  [Smith  et  al.,  1992].  In  our  work, 
a  region  usually  contains  3  or  4  corners  that  are  matched 
over  the  sequences.  The  number  of  well-behaved  corners 
that  pass  through  the  refinement  process  is  even  smaller. 
The  criterion  used  in  our  work  is  the  relative  geometric 
stability  of  the  estimated  3-D  position  in  eq.  1.  We  se¬ 
lected  the  corner  that  maintains  the  largest  compatibility 
throughout  the  frames. 

We  assume  that  the  objects  of  interest  represented  by 
regions  have  shallow  structure,  where  the  thickness  (the 
difference  in  depth  within  the  whole  structure)  is  small 
compared  to  the  depth  of  the  structure  [Sawhney  and 
Hanson, 1992]. The depth of the region is represented
by that of the corner with the most stable depth value.
The 3-D structure for objects of interest is obtained as
follows. The shape and size of the region are determined
from its contour. We use 2 models, cone-type and box-type
objects, for the isolated regions which do not have any
descendants  in  the  segmentation.  The  reference  regions 
from  an  RMS  are  labelled  at  each  frame,  which  depends 
on  the  linear  approximation  of  the  contour  of  the  region. 
A  region  with  one  corner  at  the  top  and  two  corners 
in the base is labelled as cone-type. All other isolated
regions  are  labelled  as  box-type.  Consistent  labelling 
is  obtained  in  most  cases.  When  the  labelling  conflicts 
among  the  frames,  the  model  voted  by  the  majority  of 
the  reference  regions  is  used  as  the  label. 
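
The labelling and the majority vote can be sketched as follows. The test for "one corner at the top and two corners in the base" is a hypothetical reading of the rule: corners are (row, column) image positions with rows growing downward, and `tolerance` is an assumed pixel tolerance for deciding that two corners lie on the same base line.

```python
from collections import Counter


def label_region(corners, tolerance=2):
    """Label one frame's region as 'cone' or 'box' from its contour corners."""
    if len(corners) == 3:
        top, base_1, base_2 = sorted(corners)  # sorted by row first
        base_is_level = abs(base_1[0] - base_2[0]) <= tolerance
        top_is_above = top[0] < min(base_1[0], base_2[0]) - tolerance
        if base_is_level and top_is_above:
            return "cone"
    return "box"


def vote_label(frame_labels):
    """Model chosen by the majority of the reference regions."""
    # With a tie, Counter.most_common keeps the label that was seen first.
    return Counter(frame_labels).most_common(1)[0][0]
```

For example, labels gathered from the reference regions of one RMS could be combined with `vote_label([label_region(c) for c in corners_per_frame])`, where `corners_per_frame` is a hypothetical list of corner lists.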

As  stated  in  section  3,  the  shape  of  the  contour  of  a 
region  of  an  RMS  may  have  large  variation  from  frame  to 
frame.  Hence,  the  region  to  be  used  in  the  surface  recon¬ 
struction  cannot  be  taken  from  an  arbitrary  frame,  but 
should  be  from  the  reference  regions,  which  are  consis¬ 
tent  in  contour  shape  and  intensity.  The  contour  of  the 
selected  region  is  scaled  and  translated  to  compensate 
for  the  motion  since  the  frame  number  of  the  selected 
region  varies  from  RMS  to  RMS.  The  height  and  width 
of  the  region  is  obtained  from  the  bounding  rectangle  of 
the contour of the region. The thickness, which is not
available from the data, is assumed to be equal to the
width of the region. A region which has descendants is
labelled as a base region. A base region is assumed to
have a flat surface, and the orientation of the surface is
obtained from an interpolation using the depth values of
the  well-behaved  corners  that  lie  inside  the  base  region. 

6  Results 

The  motion  analysis  system  has  been  tested  for  standard 
sets of real image sequences. We present the results for
two image sequences provided by UMASS. The first one,
the Rocket field sequence [Dutta et al., 1989], is an outdoor
sequence taken by a camera mounted on a vehicle
on a terrain, whose motion is dominantly translation with
some rotational component. Interframe motion is almost
constant but has some minor variations.

The  second  one  is  the  Cone  sequence  [Sawhney  and 
Hanson,  1992].  The  sequence  consists  of  8  frames  and 
the  motion  is  pure  translation  along  the  line  of  sight. 
The  first  and  last  frames  used  in  the  analysis  are  shown 
in  figure  4. 

6.1  Guided  segmentation 

Figure  5  (a)  shows  a  region  tracked  from  frame  1  to  frame 
6  of  the  Rocket  field  sequence.  The  region  represents  the 
front  of  the  building  in  the  image.  In  global  segmenta¬ 
tion,  the  building  is  extracted  into  a  region  from  frame 
1  to  6  but  the  corresponding  region  is  not  available  in 
frame  7.  The  building  region  in  frame  7  is  extracted 
by  guided  local  segmentation.  In  this  case,  the  regions 
from  frame  1  through  6  are  so  consistent  in  intensity  and 
shape  that  all  of  them  become  the  reference  regions.  In 
figure  5  (b),  a  local  mask  around  the  predicted  position 
and the segmented region are shown; the segmented region is very
consistent with the corresponding regions in frames 1 through
6. 

Figure  6  (a)  (b)  show  the  image  and  intensity  his¬ 
togram  used  in  global  and  local  segmentation.  Three 
peaks  are  found  in  the  intensity  histogram  in  the  global 


(a)  Rocket  field  sequence 


(b)  Cone  sequence 


Figure  4:  First  and  last  frames  of  the  image  sequence 

segmentation  over  the  initial  mask  of  the  whole  image. 
But  none  of  them  represents  the  intensity  values  (17  - 
41) of the front side of the building. While the area
corresponding  to  each  peak  continues  to  be  segmented 
into  smaller  regions  recursively,  the  area  corresponding 
to  the  building  fails  to  be  extracted  into  a  region.  In  lo¬ 
cal  segmentation,  the  mask  is  represented  by  the  white 
solid  line.  The  area  of  the  building  is  represented  by 
the  dominant  peak  in  the  intensity  histogram  and  is  ex¬ 
tracted  into  a  region  as  shown  in  figure  5  (b).  Though 
the  desired  region  is  not  necessarily  represented  by  the 
dominant  peak,  the  guidance  from  previously  matched 
regions  assures  the  selection  of  the  correct  peak  if  it  ex¬ 
ists  in  the  histogram. 

6.2  Refinement  of  noisy  correspondence 

Figure  7  (a)  shows  the  refinement  of  a  CMS  from  the 
building  region.  The  extraction  of  the  right  upper  cor¬ 
ner  of  the  building  is  stable  throughout  the  sequence. 
The  irregular  position  of  the  corner  at  the  fourth  frame 
(numbered  as  3)  is  due  to  the  irregular  motion  of  the 
camera [Dutta et al., 1989]. After the refinement, the
corners  in  frames  (0  1  2  4  5  9)  are  selected  and  the 
corner  in  frame  3  is  not  used  in  the  motion  estimation. 
In [Kim and Price, 1992b], the gradual improvement of
the initial noisy correspondences for the Rocket sequence
is described in detail.

Figure  7  (b)  shows  the  refinement  of  a  CMS  for  a 
(a) Building region in frames 1 to 6 from global
segmentation


(b)  Building  region  in  frame  7  from  guided  local 
segmentation 


(a)  Whole  image  and  histogram  in  global  segmenta¬ 
tion 


(b)  Partial  image  and  histogram  in  local  segmentation 


Figure  5:  Guided  local  segmentation 

cone  in  the  Cone  sequence.  For  this  indoor  sequence, 
the  interframe  motion  of  the  camera  is  uniform.  Most 
feature matches are correct. The irregular motion at
the third frame (numbered 2) comes from an imperfect
segmentation at that frame. The erroneous frame 2 is
discarded in the refinement process.

6.3  Incremental  analysis 

We  tested  the  automated  selection  of  solutions  based  on 
eq.  2.  The  input  data  are  the  corner  correspondence 
data from the first 14 frames of the Rocket field sequence.
In each incremental step, the latest 7 frames


Figure  6:  Histograms  of  intensity 


are used, with 2 to 4 solutions generated at each step. At
each frame, the solution with the highest compatibility
is selected, where the compatibility is a linear sum of the
compatibilities with the selected solutions in the previous
frames. Table 1 shows the compatibility between the 3-D
structure from the selected solution at each frame and
the ground truth, which is computed from the motion of
the vehicle and the locations of 18 objects in the scene.
The number of tracked CMSs is much larger (393
CMSs), and only those corners are used in the compatibility
measure that are common both to the set of
tracked objects and to the set of objects with ground
truth. The compatibility value increases as the frame
number increases, and the improvement is slow after a
value of 0.89 (the maximum possible value is 1.0).

(a) A building in the Rocket sequence. Selected corners = (0 1 2 4 5 9)

(b) A cone in the Cone sequence. Selected corners = (0 1 3)

Figure 7: Refinement of noisy corner trajectories

Frame number    Compatibility    Number of common corners
7               0.795            4
8               0.831            4
9               0.849            6
10              0.883            7
11              0.888            7
12              0.887            7
13              0.889            7
14              0.892            7

Table 1: The compatibility of the best solution from each
frame with ground truth values for the Rocket sequence

6.4 3-D structure

Figures  9  (a),  (b)  show  the  reconstructed  trajectories 
and  the  top  view  of  the  objects  with  ground  truth  for  the 
Rocket  sequence.  The  top  views  in  the  incremental  anal¬ 
ysis  for  the  eighth,  tenth,  twelfth  and  fourteenth  frames 
are  shown  in  figures  9  (c),  (d),  (e),  (f).  When  compared 
with  the  ground  truth  top  view  in  figure  9  (b),  we  find 
that the order of depth is reversed for some of the objects
but  the  estimated  motion  of  the  camera  is  close  to  the 
real  motion. 


Figure  8  (a)  and  (b)  show  the  reconstructed  trajectories 
and  the  top  view  of  the  objects  for  the  Cone  sequence. 
The  estimated  motion  of  the  camera  shown  as  a  straight 
line  in  the  top  view  is  very  close  to  the  real  motion.  The 
estimated  positions  of  the  cones  and  the  trash  box  agree 
with  the  data  given  in  [Sawhney  and  Hanson,  1992].  Fig¬ 
ure  8  (c)  shows  a  rendered  view  of  the  reconstructed  3-D 
structure  of  the  cones  and  the  trash  box,  seen  from  a 
different  viewing  angle. 

7  Conclusion 

In  this  paper,  we  presented  an  approach  to  use  feed¬ 
back  in  the  domain  of  feature-based  motion  analysis  for 
a  monocular  image  sequence.  We  extended  the  scope 
of  the  guidance  of  the  feedback  of  3-D  motion  from  re¬ 
finement  of  features  to  feature  extraction.  The  incoming 
frames  are  analyzed  incrementally  with  several  solutions 
generated  at  each  frame.  We  devised  a  scheme  of  au¬ 
tomatic  selection  of  the  correct  solution  that  is  based 
on  the  relative  geometric  stability  of  the  estimated  3-D 
position. 

We  applied  this  approach  in  an  automated  motion 
analysis  system  that  is  built  on  hierarchical  feature  ex¬ 
traction  and  matching.  Standard  real  image  sequences 
are  used  as  the  test  set  of  our  system  and  the  results 
for two image sequences (one outdoor sequence and one
indoor  sequence)  are  presented. 

We  showed  that  the  initial  noisy  correspondence  data 
are  gradually  refined  and  that  the  extraction  of  a  tracked 
region  is  improved  when  guided  by  its  expected  prop¬ 
erties  obtained  from  the  corresponding  objects  in  the 
previous  frames.  3-D  surfaces  of  the  objects  are  recon¬ 
structed  using  the  sparse  depth  map  from  the  estimates 
of  3-D  point  positions  and  the  motion  parameters. 

Our  motion  analysis  system  is  based  on  very  common 
subsystems.  We  believe  that  the  idea  of  using  feedback 
to  improve  the  extraction  and  matching  of  features  can 
be applied to other motion analysis systems. A drawback
of our system is that the refinement of correspondence
assumes that most of the objects in the scene
belong to one major motion group, as is the case with
the egomotion of the camera in stationary environments.
When there are multiple motion groups, none of which
is dominant, the result of the refinement process is
unpredictable in the present implementation. This problem
can  be  solved  using  a  clustering  of  the  correspondences 
with  hypotheses  of  multiple  motion,  which  requires  a 
costly  computational  process. 
(a) Reconstructed front view

(b) Reconstructed top view where the camera
is numbered as -1.

(c) Rendered 3-D view

Figure 8: Reconstructed structure for the Cone sequence

Figure 9: The top view of estimated structure in incremental analysis for the Rocket sequence
Abstract

This  paper  is  concerned  with  three-dimensional  in¬ 
terpretation  of  image  sequences  showing  multiple 
objects  in  motion.  Each  object  exhibits  smooth 
motion  except  at  certain  time  instants  when  a  mo¬ 
tion  discontinuity  may  occur.  The  objects  are  as¬ 
sumed  to  contain  point  features  which  are  detected 
as  the  images  are  acquired.  The  problem  of  esti¬ 
mating  initial  feature  trajectories,  in  the  first  two 
frames,  is  that  of  feature  matching.  As  more  images 
are  acquired,  existing  trajectories  are  extended. 
Both initial detection and extension of trajectories
are done by enforcing pertinent constraints from
among the following: similarity of image plane arrangement
of neighboring features, smoothness of
three-dimensional motion, and smoothness of image
plane motion. As trajectories are estimated, they
are  segmented  into  subsets  each  corresponding  to 
a  different  object.  Both  detection  and  segmenta¬ 
tion  are  formulated  as  cost  minimization  problems 
which  enforce  appropriate  sets  of  the  above  con¬ 
straints.  Cost  minimization  in  each  case  is  done 
using  a  Hopfield  network.  Experimental  results  on 
several  image  sequences  are  shown. 

1  Introduction 

This  paper  is  concerned  with  three-dimensional  inter¬ 
pretation  of  image  sequences  showing  multiple  objects 
in  motion.  Each  object  exhibits  smooth  motion  except 
at  certain  time  instants  when  a  motion  discontinuity 
may  occur.  A  common  type  of  motion  discontinuity  is 
in  motion  direction,  e.g.  when  an  object  undergoes  a 
collision.  Between  such  instants  of  temporal  disconti¬ 
nuity  each  object  exhibits  a  smooth  motion. 

The objects are assumed to contain point features
which are detected as images are acquired. Trajectory
detection begins with the first two frames, wherein
it amounts to the standard problem of feature matching.
As more images are acquired, existing trajectories
are extended. There is little information available
for the initial matching of features in the first two
frames, so any matching constraints are potentially
error prone and must be further confirmed against
subsequent frames. Such initial matching of features is done
based on the similarity of the image plane arrangements
of their neighbors. Clearly, this only holds for those
neighbors which are detected in both frames, and provided
the neighbors do not belong to another neighboring
object. Once the first two frames are matched,
the initial segment of each feature trajectory is found.
There is now more information available for matching
the features in the second frame with those in the
third, i.e., for trajectory extension. The extension of
trajectories must be consistent with continuation of the
three-dimensional motion of the object. However, this
is only true when the object is not undergoing a motion
discontinuity. Further, this constraint can be applied
only if features have been segmented into objects so
that the three-dimensional motion of an object can be
estimated. When the smoothness of the three-dimensional
motion cannot be enforced, trajectories are constrained
to have only two-dimensional smoothness, which is
possible once the initial trajectories are detected,
i.e., beyond the second frame, and which is correct
except across temporal discontinuities in motion.
Thus, several different constraints are used to detect
and extend trajectories, but each of these constraints
must be used only when it is applicable.

*This work was supported by the Defense Advanced Research
Projects Agency and the National Science Foundation
under grant IRI-8902728.

As  the  images  are  acquired  and  trajectories  are  es¬ 
timated,  they  are  segmented  into  subsets  each  corre¬ 
sponding  to  a  different  moving  object.  This  is  done 
by  identifying  neighboring  correspondences  and  break¬ 
ing  these  neighbor  relationships  if  they  are  found  to  be 
very  dissimilar. 

Both problems, (i) feature matching and trajectory detection
and (ii) segmentation, are formulated as cost
minimization problems. The costs are defined in terms
of the constraints mentioned above, such that only
appropriate constraints are used at any given image location
at any given time. Each of the two problems is
mapped onto a different Hopfield network which does
the cost minimization.
have been no attempts, though, to integrate cost measures
from several constraints, as discussed above, to
solve more complex scenes with multiple moving objects
exhibiting temporal discontinuities in their motion.
Also,  none  of  the  previous  approaches  offer  any  scheme 
to  eliminate  wrong  matches  which  result  from  settling 
down  to  a  local  minimum  while  performing  energy  min¬ 
imization.  This  makes  the  approach  more  robust  to 
variations  in  scene  parameters  like  inter-frame  motion 
of  objects  which  can  vary  from  0  to  20  pixels. 

Section  2  gives  the  details  of  the  formulation  of  fea¬ 
ture  matching  and  trajectory  detection  problems.  Sec¬ 
tions  3  and  4  describe  the  mapping  of  the  problems  of 
feature  correspondence  and  trajectory  finding  onto  the 
Hopfield  network.  Section  5  describes  the  algorithm  to 
eliminate  wrong  correspondences  that  result  from  de¬ 
scending  to  a  local  energy  minimum.  Section  6  describes 
a  segmentation  scheme  to  segment  the  correspondences 
into groups representing rigid objects. Finally, Section
7 contains results on several image sequences.

2  Feature  Matching  and  Trajectory 
Detection 

2.1  Feature  Matching 

The problem of feature correspondence deals with finding
a match in the current frame, if it exists, for every
point in the previous frame. In order to find the correct
match, constraints like uniqueness and image plane
similarity in the arrangement of neighbors around the
points are imposed. The uniqueness constraint implies
that a point in the previous frame can match at most
one point in the current frame and vice versa. The other
constraint implies that the point in the current frame
which is the correct match for a certain point in the
previous frame should have a similar geometrical arrangement
of neighbors around it. These constraints are then
incorporated into an energy function in such a way that
the energy function attains a minimum when the points
are either correctly matched or not matched at all. This
energy minimization is done using the gradient descent
approach that is implemented using a Hopfield network.

Let the first frame have $N_1$ points and the second
frame $N_2$ points. We have a matrix of $N_1 \times N_2$ possible
match hypotheses, where element $(i,j)$ indicates
that the $i$-th feature in the first frame matches the $j$-th
feature in the second frame. Only a subset of these
hypotheses is correct, and these represent the correct
matches. A cost is associated with every hypothesis such
that a low cost is assigned to a correct hypothesis and a
high cost is assigned to a wrong hypothesis. Since the
motion of the feature points is bounded by a maximum possible
motion, we compute the costs for only those hypotheses
where the point in the second frame is within a circle
of interest of the point in the first frame. All other
hypotheses are assigned a fixed high cost. The costs are
computed based on the image plane similarity in the
arrangement of neighbors around the two points constituting
the hypothesis. The neighbors discussed here are
the Delaunay neighbors of the points. To determine the
cost associated with a match hypothesis, we must try
to find a subset of neighbors of the point in the first
frame that match a subset of neighbors of the point in
the second frame. The similarity in the image plane
arrangement of these subsets is used to compute the cost
associated with the match. We look for similarity in the
subsets rather than the entire set of neighbors because
missing feature points and object boundaries cause
distortions in the set of neighbors. This cost computation
scheme is explained in Section 4.1. The costs associated
with all the possible $N_1 \times N_2$ hypotheses are then
incorporated into an energy function which, when minimized,
yields $N_{12}$ trajectories of length 2.
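
The gating of hypotheses by the circle of interest can be sketched as below. This is a minimal illustration, not the authors' implementation: `HIGH_COST`, `build_cost_matrix`, and the `neighbor_cost` callback (a stand-in for the neighbor-arrangement cost of Section 4.1) are assumed names, and costs are assumed to be normalized to the range 0 to 1.

```python
import math
from typing import Callable, List, Sequence, Tuple

Point = Tuple[float, float]

# Assumed value of the fixed high cost given to hypotheses outside the circle.
HIGH_COST = 1.0


def build_cost_matrix(
    prev_pts: Sequence[Point],
    curr_pts: Sequence[Point],
    radius: float,
    neighbor_cost: Callable[[Point, Point], float],
) -> List[List[float]]:
    """Return an N1 x N2 matrix of costs for all match hypotheses (i, j)."""
    costs: List[List[float]] = []
    for p in prev_pts:
        row: List[float] = []
        for q in curr_pts:
            distance = math.hypot(q[0] - p[0], q[1] - p[1])
            if distance <= radius:
                # Inside the circle of interest: use the geometric cost.
                row.append(neighbor_cost(p, q))
            else:
                # Outside: motion exceeds the assumed bound, so use the fixed cost.
                row.append(HIGH_COST)
        costs.append(row)
    return costs
```

Only hypotheses inside the circle require the more expensive neighbor comparison, which keeps the number of cost evaluations well below $N_1 \times N_2$ when the bound on motion is small.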

2.2  Trajectory  Detection 

The trajectory detection problem is an extension of
the feature correspondence problem in which the trajectories
obtained up to the previous frame are to be matched
to the points in the current frame, if such a match
exists. In addition to constraints like uniqueness of a
match and image plane similarity in the arrangement
of neighbors around points corresponding to a correct
match, other constraints like 2-D (image plane) continuity
of trajectories and 3-D motion continuity can be
imposed.

Consider the problem of extending the $N_{12}$ trajectories
obtained thus far to the third frame. The problem
can be restated as that of having to match the $N_{12}$
trajectories to the $N_3$ points in the third frame. In this
situation, we have a matrix of $N_{12} \times N_3$ hypotheses. Of
these hypotheses, there will be a subset of hypotheses
that represent matches between trajectories and points
in the third frame that lie within their circles of interest.
The costs to be associated with each of these hypotheses
have to be determined. All of the other hypotheses
(those involving trajectories and points outside their
circles of interest) will be assigned very high costs. The
costs associated with each hypothesis are representative
of the following three constraints:

1.  Similarity  between  the  arrangement  of  neighbors 
around  the  last  point  of  the  trajectory  and  the  point 
in  the  third  frame.  This  is  exactly  the  same  as  the 
constraint  imposed  on  point  correspondences. 

2. Continuity of the 3-D motion computed from the trajectories
already known.

3.  2-D  continuity  of  the  trajectories. 

The extent to which the hypotheses satisfy each of the
above constraints is determined separately. This results
in three different cost measures that need to be merged
before being incorporated into the energy function. The
second cost measure, based on continuity of 3-D motion,
is now explained. Having obtained the
trajectories up to the previous frame, any applicable motion
and structure algorithm can be used to estimate
the 3-D motion and structure of the objects, and these
estimates can be used to predict the positions of the
feature points in the current frame. Based on the predicted
positions, a cost can be associated with every
hypothesis that a trajectory extending up to the previous
frame matches a point in the current frame. Before
applying any motion model, one has to segment the trajectories
into groups that correspond to rigid objects.
This cost computation scheme is described in detail in
Section 4.2.

The  third  method  to  compute  the  cost  to  be  associ¬ 
ated  with  a  trajectory  and  a  point  in  the  third  frame 
that  lies  in  its  circle  of  interest  is  based  on  the  im¬ 
age  plane  continuity  of  the  trajectory  across  the  frames. 
This  is  done  as  follows.  The  time  axis  is  collapsed  so 
that  the  problem  of  fitting  a  function  to  the  trajectory 
becomes  a  one-dimensional  problem.  A  Lagrange  in¬ 
terpolation  is  done  between  the  points  constituting  the 
trajectory  and  the  slope  at  the  last  point  of  the  trajec¬ 
tory  is  computed.  Then,  the  slope  of  the  line  segment 
joining  the  last  point  of  the  trajectory  and  the  point 
in  the  current  frame  that  is  competing  for  the  match  is 
computed.  The  cost  associated  with  the  hypothesis  is 
then  based  on  the  similarity  of  the  two  slopes. 
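
The slope comparison described above can be sketched as follows. This is a sketch under assumptions: the trajectory is treated as a curve y(x) in the image plane with distinct x coordinates, and the normalization of the slope difference into a cost between 0 and 1 is a hypothetical choice, since the text does not specify how slope similarity is converted into a cost.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float]


def lagrange_slope_at_last(points: Sequence[Point]) -> float:
    """Derivative, at the last node, of the Lagrange polynomial through `points`.

    Assumes at least two points with pairwise distinct x coordinates.
    """
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    n = len(points)
    x_last = xs[-1]
    slope = 0.0
    for k in range(n):
        # Derivative of the k-th Lagrange basis polynomial at x_last.
        dbasis = 0.0
        for m in range(n):
            if m == k:
                continue
            term = 1.0 / (xs[k] - xs[m])
            for l in range(n):
                if l != k and l != m:
                    term *= (x_last - xs[l]) / (xs[k] - xs[l])
            dbasis += term
        slope += ys[k] * dbasis
    return slope


def continuity_cost(trajectory: Sequence[Point], candidate: Point) -> float:
    """Cost in [0, 1]: 0 when the candidate continues the trajectory's direction."""
    traj_angle = math.atan(lagrange_slope_at_last(trajectory))
    dx = candidate[0] - trajectory[-1][0]
    dy = candidate[1] - trajectory[-1][1]
    seg_angle = math.atan(dy / dx) if dx != 0 else math.pi / 2
    diff = abs(traj_angle - seg_angle)
    diff = min(diff, math.pi - diff)  # slopes describe undirected lines
    return diff / (math.pi / 2)
```

For a trajectory through (0, 0), (1, 1), (2, 4), the interpolating polynomial is y = x^2 with slope 4 at the last point, so a candidate at (3, 8) gets zero cost. Because only slopes are compared, this sketch does not penalize a reversal of direction along the same line.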

We thus have three different cost measures that are
associated with each competing hypothesis. We devised
a method to merge these three costs and associate a
single cost measure with each competing hypothesis.

2.3  Cost  Merge  Algorithm 

Three cost computation schemes have been outlined
above. Each method works well in some cases but
fails in others. The 2-D geometrical cost computation
method might not yield good results at points close to
object boundaries because of differences in the motions
of the two objects. It could also happen that though a
certain point in the current frame is the correct match
for a point in the previous frame, there might be no
subset of neighbors that match for the two points. The
cost computed using the assumption of 3-D motion
continuity across frames yields better results than the 2-D
method but again fails at object boundaries because the
rigidity assumption is violated across boundaries. This
method also fails when there is a temporal discontinuity
in the motion of objects. Finally, the cost computation
scheme based on the 2-D continuity of trajectories gives
good results at object boundaries but fails when there
is a temporal discontinuity of motion. In a situation in
which there is a temporal discontinuity in the motion,
only the cost based on the 2-D arrangement of neighbors
can be used. Table 1 lists the cases in which the
three methods fail or apply.

This suggests that there is a need to merge these costs
and determine a single cost measure for a certain match
that does not fail at either object boundaries or motion
discontinuities. One could suggest many methods to
achieve this. In this system we have taken the simple
approach of choosing the least of the three costs. In our
experiments we have seen that this method works
well, and there is no need for a more complex cost merge
scheme. Once the merged cost associated with every
hypothesis is computed, the merged costs are incorporated into an
energy function along with the uniqueness constraints.
This energy function is minimized using the Hopfield
network.
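
The merge rule stated above, taking the least of the three costs, can be sketched in a few lines. The function name `merge_costs` and the use of `None` for a cost that cannot be computed (for example, the 3-D cost before any segmentation exists) are assumptions made for illustration.

```python
from typing import Optional


def merge_costs(
    neighbor_cost: float,
    motion3d_cost: Optional[float] = None,
    continuity_cost: Optional[float] = None,
) -> float:
    """Return the least of the available costs for one match hypothesis."""
    available = [
        c for c in (neighbor_cost, motion3d_cost, continuity_cost) if c is not None
    ]
    return min(available)
```

Taking the minimum means that a hypothesis is kept cheap as long as at least one constraint supports it, which is why the merged cost does not fail at object boundaries or motion discontinuities where a single scheme would.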


3  Mapping  the  feature  correspondence 
and  trajectory  finding  problems  onto 
the  Hopfield  Network 

3.1  Mapping  the  feature  matching 
problem  onto  the  network 

Consider the case in which there are two images of the
scene taken at two different time instants, $t_1$ and $t_2$.
The feature extractor is run on the two images and two
sets of feature points are obtained. Let the first image
have $N_1$ points and the second image have $N_2$ points.
The problem at hand is to find a subset of points, $N_{12}$
in number, in the first frame that have matches in the
second frame.

The Hopfield network [6] used for the above problem
consists of a two-dimensional array of processing elements
(PEs) having $N_1$ rows corresponding to the $N_1$
points in the first frame and $N_2$ columns corresponding
to the $N_2$ points in the second frame. Each PE is
essentially a nonlinear amplifier that produces an output
$V_i$ which is related to its input $u_i$ by the equation

$$V_i = g(\lambda u_i) = \tfrac{1}{2}\left(1 + \tanh(\lambda u_i)\right) \tag{1}$$

where $\lambda$ is called the gain parameter. The input $u_i$ to
the $i$-th PE is the weighted sum of the outputs of the PEs
that are connected to it. The processing element $(i,j)$
represents the hypothesis that the $i$-th point in frame
1 matches the $j$-th point in frame 2. Each processing
element has a potential associated with it. This potential
corresponds to the quantity $V$ discussed previously
and can take on a continuum of values between 0 and
1. The value 1 represents a sure match between the
corresponding points in the two frames and the value
0 represents a nonmatch. Any value between 0 and 1
signifies the level of confidence in the match between
the corresponding points.

The  connections  between  the  processing  elements  and 
the  weights  associated  with  them  depend  on  the  energy 
function  that  has  to  be  minimized.  The  energy  function 
used  in  this  problem  has  the  form 


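$$
\text{Energy} = -\frac{1}{2}\sum_{i=1}^{N_1}\sum_{j=1}^{N_2}\sum_{k=1}^{N_1}\sum_{l=1}^{N_2} T_{ij,kl}V_{ij}V_{kl}
- \sum_{i=1}^{N_1}\sum_{j=1}^{N_2} I_{ij}V_{ij}
+ \sum_{i=1}^{N_1}\sum_{j=1}^{N_2} \frac{1}{R_{ij}}\int_{0}^{V_{ij}} g^{-1}(V)\,dV
\tag{2}
$$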
Table 1: Comparative analysis of the effectiveness of the three cost computation schemes

Cost computation scheme      | Fails                                                    | Works
2-D arrangement of neighbors | When no subsets of neighbors match                       | All other times, including when there is a temporal discontinuity in the motion
Continuity of 3-D motion     | At object boundaries and temporal motion discontinuities | All other times
Continuity of trajectories   | At temporal motion discontinuities                       | All other times

where $V_{ij}$ represents the potential of the $(i,j)$-th PE,
$T_{ij,kl}$ represents the weight of the link from the $(k,l)$-th
PE to the $(i,j)$-th PE, $I_{ij}$ represents the bias input to the
$(i,j)$-th element and $R_{ij}$ represents the resistance seen at
the input of the $(i,j)$-th PE.

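Ignoring the integral term in the previous equation,
which was present due to the introduction of an intermediate
variable $u_{ij}$, the energy function can also be
written, specifically for the problem of feature
correspondence, as

$$
\begin{aligned}
\text{Energy} ={}& \frac{A}{2}\sum_{i=1}^{N_1}\sum_{j=1}^{N_2}\sum_{k=1,k\neq j}^{N_2} V_{ij}V_{ik}
+ \frac{B}{2}\sum_{j=1}^{N_2}\sum_{i=1}^{N_1}\sum_{k=1,k\neq i}^{N_1} V_{ij}V_{kj}
+ \frac{C}{2}\Big(\sum_{i=1}^{N_1}\sum_{j=1}^{N_2} V_{ij} - N_{12}\Big)^{2} \\
&+ \frac{D}{2}\sum_{i=1}^{N_1}\sum_{j=1}^{N_2}\sum_{k=1,k\neq j}^{N_2} \big(Cost(i,j) - Cost(i,k)\big)V_{ij}V_{ik} \\
&+ \frac{E}{2}\sum_{j=1}^{N_2}\sum_{i=1}^{N_1}\sum_{k=1,k\neq i}^{N_1} \big(Cost(i,j) - Cost(k,j)\big)V_{ij}V_{kj}
\end{aligned}
\tag{3}
$$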

Each term in the above equation has a physical
explanation that is outlined below. The first term deals with
Row Inhibition. This term ensures that when the network
stabilizes, there is at most one PE in each row
that has a potential of 1 whereas all of the other
elements have value 0. The constant $A$ determines the
relative importance this term is given w.r.t. the other
terms. The second term deals with Column Inhibition.
This is the column analog of the first term. When the
network stabilizes, at most one PE in each column has a
potential of 1. Again the constant $B$ decides the relative
importance this term is given w.r.t. the other terms.
The first two terms in the above equation enforce the
concept of uniqueness of a match, i.e., a point in the first
frame can match at most one point in the second frame
and vice versa. Since both of these terms are equivalent,
we generally give them equal importance, i.e., we
have $A = B$. The third term in the energy equation
deals with Global Inhibition. This term is minimum,
i.e., 0, only when the total number of 1's in the array
is $N_{12}$. This term ensures that there are approximately
$N_{12}$ matches obtained when the network stabilizes. At
the energy minimum, in most cases, we will not have
exactly $N_{12}$ matches but some number that is close to
it. In this problem we set $N_{12}$ to $\min(N_1, N_2)$. The
fourth term deals with Cost Based Row Inhibition. For
a certain point in the first frame, all of the points in the
second frame compete for a match. If a certain point
in the second frame has a lower cost associated with it
than another point, then it tries to reduce the potential
of the other hypothesis by issuing an inhibitory signal.
This reduces the potential of the other PE. In this term
again $D$ determines the relative weight that this term
has in the final expression. Analogously, the last term
deals with Cost Based Column Inhibition. Ideally the
last two terms should also be 0 when the network
stabilizes on a solution.

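Comparing Equations (2) and (3) we obtain

$$
\begin{aligned}
T_{ij,kl} ={}& -A\,\delta_{ik}(1-\delta_{jl}) - B\,\delta_{jl}(1-\delta_{ik}) - C \\
& - D\,\big(Cost(i,j) - Cost(k,l)\big)\,\delta_{ik}(1-\delta_{jl}) \\
& - E\,\big(Cost(i,j) - Cost(k,l)\big)\,\delta_{jl}(1-\delta_{ik})
\end{aligned}
\tag{4}
$$

where $\delta_{mn} = 1$ if $m = n$ and $\delta_{mn} = 0$ if $m \neq n$. We
also obtain

$$I_{ij} = C N_{12} \tag{5}$$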
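
As a check on how the global inhibition term produces the constant $-C$ in Equation (4) and the bias of Equation (5), the square can be expanded. The sketch below assumes that the third term of Equation (3) has the form $\frac{C}{2}\big(\sum_i\sum_j V_{ij} - N_{12}\big)^2$, which is what the description of that term implies.

$$
\frac{C}{2}\Big(\sum_{i=1}^{N_1}\sum_{j=1}^{N_2} V_{ij} - N_{12}\Big)^{2}
= \frac{C}{2}\sum_{i=1}^{N_1}\sum_{j=1}^{N_2}\sum_{k=1}^{N_1}\sum_{l=1}^{N_2} V_{ij}V_{kl}
- C N_{12}\sum_{i=1}^{N_1}\sum_{j=1}^{N_2} V_{ij}
+ \frac{C}{2}N_{12}^{2}
$$

Matching the quadratic part against $-\frac{1}{2}\sum T_{ij,kl}V_{ij}V_{kl}$ gives a contribution of $-C$ to every weight $T_{ij,kl}$, and matching the linear part against $-\sum I_{ij}V_{ij}$ gives $I_{ij} = C N_{12}$. The constant $\frac{C}{2}N_{12}^{2}$ does not depend on the potentials and therefore does not affect the minimization.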

We use the gradient descent method to approach the
energy minima. The equation for gradient descent can
be written as

$$\frac{du_{ij}}{dt} = -\frac{\partial(\text{Energy})}{\partial V_{ij}} \tag{6}$$

which yields

$$\frac{du_{ij}}{dt} = \sum_{k=1}^{N_1}\sum_{l=1}^{N_2} T_{ij,kl}V_{kl} - \frac{u_{ij}}{\tau_{ij}} + I_{ij} \tag{7}$$

where $\tau_{ij} = R_{ij}C$ is the time constant. In our formulation
of the problem, we have assumed that the time constant
is the same for all processors. This does not affect the
solution but only decides the rate of convergence.

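A digital simulation of this system requires that we
integrate these equations numerically. For a sufficiently
small value of $\Delta t$, we can write

$$\Delta u_{ij} = \Big(\sum_{k=1}^{N_1}\sum_{l=1}^{N_2} T_{ij,kl}V_{kl} - \frac{u_{ij}}{\tau} + I_{ij}\Big)\Delta t \tag{8}$$

The values of $u_{ij}$ can be iteratively updated according
to the following rule.

$$u_{ij}(t+1) = u_{ij}(t) + \Delta u_{ij} \tag{9}$$

The final output potential of the PE is given by

$$V_{ij} = g(\lambda u_{ij}) = \tfrac{1}{2}\left(1 + \tanh(\lambda u_{ij})\right) \tag{10}$$

If we substitute Equations (4) and (5) into (8) we obtain
the following result:

$$
\begin{aligned}
\Delta u_{ij} = \Big[& -\frac{u_{ij}}{\tau} - A\sum_{k=1,k\neq j}^{N_2} V_{ik} - B\sum_{k=1,k\neq i}^{N_1} V_{kj}
- C\Big(\sum_{k=1}^{N_1}\sum_{l=1}^{N_2} V_{kl} - N_{12}\Big) \\
& - D\sum_{k=1,k\neq j}^{N_2} \big(Cost(i,j) - Cost(i,k)\big)V_{ik}
- E\sum_{k=1,k\neq i}^{N_1} \big(Cost(i,j) - Cost(k,j)\big)V_{kj}\Big]\Delta t
\end{aligned}
\tag{11}
$$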

Equations (9) and (11) describe the dynamics of the
network. Now only the initial values of the potentials of
the PEs ($V_{ij}$ for the $(i,j)$-th PE) have to be specified.
The output of each PE is initialized to
$V_{ij} = 1.0 - Cost(i,j)$. The initial output can then be
viewed as the probability that the corresponding hypothesis
is true. The network is then allowed to evolve
in time until it attains a steady state, i.e., a stage where
the outputs of the PEs do not change. The
process of evolution is auto-association. The advantages
of this initialization scheme are that fewer
iteration steps are needed and the solution obtained is
better than that obtained using random initialization.
After the network stabilizes, all of the PEs have outputs
equal to either 0 or 1. Those that have outputs
1 have been identified as the correct hypotheses while
those that have outputs 0 have been identified as the
wrong hypotheses.
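
The update rule of Equations (9) to (11) can be simulated with a short script. This is a sketch under assumptions rather than the authors' implementation: the values of `A` to `E`, `lam`, `tau`, `dt`, and `steps` are illustrative, costs are assumed to lie between 0 and 1, all PEs are updated synchronously, and the initial inputs are obtained by inverting Equation (10).

```python
import math
from typing import List


def run_matching_network(
    cost: List[List[float]],
    A: float = 1.0, B: float = 1.0, C: float = 1.0, D: float = 1.0, E: float = 1.0,
    lam: float = 5.0, tau: float = 1.0, dt: float = 0.01, steps: int = 2000,
) -> List[List[float]]:
    """Iterate Eqs. (9)-(11) and return the final potentials V[i][j]."""
    n1, n2 = len(cost), len(cost[0])
    n12 = min(n1, n2)  # N_12 = min(N_1, N_2), as stated in the text
    # Initialize V_ij = 1 - Cost(i, j), clipped so that atanh stays finite.
    V = [[min(max(1.0 - c, 1e-6), 1.0 - 1e-6) for c in row] for row in cost]
    # Invert Eq. (10) so the inputs are consistent with the initial outputs.
    u = [[math.atanh(2.0 * v - 1.0) / lam for v in row] for row in V]
    for _ in range(steps):
        total = sum(sum(row) for row in V)
        new_u = [row[:] for row in u]
        for i in range(n1):
            for j in range(n2):
                row_sum = sum(V[i][k] for k in range(n2) if k != j)
                col_sum = sum(V[k][j] for k in range(n1) if k != i)
                row_cost = sum((cost[i][j] - cost[i][k]) * V[i][k]
                               for k in range(n2) if k != j)
                col_cost = sum((cost[i][j] - cost[k][j]) * V[k][j]
                               for k in range(n1) if k != i)
                # Eq. (11): change in the input of PE (i, j) over one time step.
                du = (-u[i][j] / tau - A * row_sum - B * col_sum
                      - C * (total - n12) - D * row_cost - E * col_cost) * dt
                new_u[i][j] = u[i][j] + du  # Eq. (9)
        u = new_u
        # Eq. (10): output potential from the updated input.
        V = [[0.5 * (1.0 + math.tanh(lam * x)) for x in row] for row in u]
    return V
```

For a 2 x 2 cost matrix such as `[[0.1, 0.9], [0.9, 0.1]]`, the diagonal potentials stay high and the off-diagonal potentials are driven toward 0. With a finite gain the outputs approach but need not reach exactly 0 or 1, so a threshold such as 0.5 (an assumption here) can be used to read off the accepted hypotheses.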

3.2  Mapping  the  trajectory  detection 
problem  onto  the  network 

The problem of establishing the correspondence between
the trajectories computed until the previous
frame and the points in the current frame is similar,
in formulation, to the problem of feature correspondence
between two frames. The only difference is in
the method used to calculate the costs associated with
every competing hypothesis. In the case of feature
correspondence between two frames, we rely only on the
cost based on the image plane similarity in the arrangement
of neighbors around the point pairs constituting
the hypotheses. In the case of correspondence between
the trajectories and points, there are three cost measures.
These are based on the image plane similarity
in the arrangement of neighbors around the last point
of each trajectory and the point in the current frame,
continuity of 3-D motion, and continuity of trajectories.
The details of the implementation of these three
cost computation schemes are given in the next section.

4  Cost  Computation 

4.1  Two-Dimensional  Geometrical  Cost 

This cost is computed to reflect the similarity in the
image plane arrangement of the Voronoi neighbors of
the two points constituting a candidate match. Let the
Voronoi neighbors of point $i$ in the previous frame be
$(a, b, c, d, e)$ and those of $j$ in the current frame be
$(a', b', c', d')$. We have to determine which subset of
the lines $(ia, ib, ic, id, ie)$ matches a subset of the lines
$(ja', jb', jc', jd')$. This is done using a two-dimensional
Hopfield network similar to the one described in
Section 3. When this match is determined, the similarity
in the image plane arrangement of the neighbors
can be estimated. The cost function needed to initialize
this network is based on the similarity in the lengths
and orientations of the lines competing for a match.
This function is defined as

$$\mathrm{match} = F\bigl(C_1 - C_2(\Delta\mathrm{length}_{actual} - \Delta LENGTH) - C_3(\Delta\mathrm{orientation}_{actual} - \Delta ORIENTATION)\bigr) \tag{12}$$

where $F$ is a non-linear function defined by $F(x) = x$ for
$x > C_4$ and $F(x) = C_4$ for $x < C_4$, and
$C_4 = -(C_1 + C_2\,\Delta LENGTH + C_3\,\Delta ORIENTATION)$. Here
$\Delta\mathrm{length}_{actual}$ and $\Delta\mathrm{orientation}_{actual}$ are the actual
absolute differences in length and orientation between
the lines constituting a match hypothesis. $\Delta LENGTH$
and $\Delta ORIENTATION$ are constants to be provided
a priori. The similarity measures are normalized and
the cost for each match is computed as cost = 1.0 - similarity.

After the network computes the right subset of
matched lines, the quantity 1.0 - (dot product of $ij$ and $aa'$)
is computed for every matched line pair $ia$ and $ja'$.
The minimum of this quantity over all
the matched pairs is the final two-dimensional geometric
cost.

4.2  Motion  model  cost 

The correspondences obtained between the first two
frames can be segmented into groups belonging to different
rigid objects as described in Section 6. A motion
model can be used to predict the positions of the feature
points in the next frame. Depending on the number
of correspondences available, suitable motion models
can be used. We have used the local translation
model, where we fit the model to a correspondence and
its Voronoi neighbors. Based on the 3-D motion computed,
the positions of the feature points in the current
frame are predicted such that each trajectory up to the
previous frame has one predicted feature point in the
current frame associated with it. The cost of matching
a trajectory up to the previous frame to a point in the
current frame is obtained by determining the similarity in
the image plane arrangement of the neighbors of the
predicted points and the actual points in the current
frame. This is done in a manner similar to Section
4.1. As the length of the trajectories increases, more
complex motion models can be used.

4.3  Cost  based  on  the  continuity  of 
trajectories 

Consider a trajectory of length $(n + 1)$ extending from
the first frame to the $(n + 1)$th frame. Let the image
plane coordinates of the points constituting the trajectory
be $(X_0, Y_0), (X_1, Y_1), \ldots, (X_n, Y_n)$. We fit a function
to this set of points using Lagrange interpolation. The
degree of the polynomial is limited to 3 or 4. Consider
two vectors: the tangent to the function at the last point
of the trajectory, and a line joining the last point of the
trajectory to a candidate point match in the current
frame. The cost associated with this match is related
to the dot product of the two vectors.
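
As a rough illustration of this cost, the following Python sketch fits an interpolating polynomial to the last few trajectory points and compares the tangent at the last point with the direction to a candidate point. The function name `continuity_cost`, the parameterization by frame index, and the specific form cost = 1 - cos(angle) are assumptions made for illustration; the text states only that the cost is related to the dot product.

```python
import numpy as np

def continuity_cost(trajectory, candidate, max_degree=3):
    """Hypothetical continuity cost in [0, 2]; 0 means the candidate
    lies exactly along the tangent at the last trajectory point.
    Expects at least two trajectory points."""
    pts = np.asarray(trajectory, dtype=float)
    # Keep the last (max_degree + 1) points so the polynomial degree
    # stays at or below max_degree, as the text limits it to 3 or 4.
    pts = pts[-(max_degree + 1):]
    t = np.arange(len(pts), dtype=float)   # frame index as parameter
    deg = len(pts) - 1
    # A degree-(m-1) fit through m points reproduces the Lagrange
    # interpolating polynomial for each coordinate.
    px = np.polyfit(t, pts[:, 0], deg)
    py = np.polyfit(t, pts[:, 1], deg)
    tangent = np.array([np.polyval(np.polyder(px), t[-1]),
                        np.polyval(np.polyder(py), t[-1])])
    step = np.asarray(candidate, dtype=float) - pts[-1]
    denom = np.linalg.norm(tangent) * np.linalg.norm(step)
    if denom == 0.0:
        return 1.0  # direction undefined: neutral cost (assumption)
    return 1.0 - float(np.dot(tangent, step) / denom)
```

With this convention a candidate straight ahead of a linear trajectory has a cost near 0, a perpendicular candidate a cost near 1, and a candidate that reverses the motion a cost near 2.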

The three costs computed above are merged together
for every candidate match and the Hopfield network
is initialized to $v_{ij} = 1.0 - \mathrm{cost}(i,j)$, where $v_{ij}$ is the
probability that trajectory $i$ up to the previous frame
matches point $j$ in the current frame. The network is
allowed to evolve and the hypotheses that survive are
the right matches.

5  Eliminating  wrong  correspondences 

Wrong correspondences that result from the network
settling down to a local minimum are eliminated using
a one-dimensional Hopfield network where each
PE represents the probability of correctness of a
correspondence. Each PE is connected to a processing
element that represents the Voronoi neighbor
correspondence that is most similar to it. The similarity
measure is calculated as similarity = 1.0 -
cost  =  (1.0  —  angcost)lengthcost  <nigco.sl  where 

lengthcost  =  0.5[1.0 -1- tanh  (l.^CA/fiif/t/i  -  2.0))]  and 
angcost = 0.5[1.0 + tanh(0.15($\Delta$angle - 20.0))]. The
network is simulated using the following equations for
every PE

•  ^  =  +  A  + 

•  $v_i = 0.5[1.0 + \tanh(\lambda u_i)]$

where $j(i)$ is the most similar neighbor to $i$, $I_i$ is the
bias input chosen to be .50 in all the experiments,
$T_{ij(i)} = -100\,\mathrm{cost}(i, j)$, and $u_i$ and $v_i$ are the input and
output of the PE. All those PEs that survive when
the network stabilizes are retained. With this reduced
set of correspondences, the entire process is repeated until
no more correspondences are removed. This eliminates
wrong correspondences.

6  Segmentation  of  trajectories 

The underlying assumption made in the segmentation
procedure proposed in this paper is that the motion
between two frames is small due to dense sampling in time.
The segmentation algorithm is outlined below. After
the wrong correspondences are eliminated using the
network discussed in Section 5, the Voronoi neighbors of
every correspondence are determined. The lines joining
the correspondences to their neighbors are called
edges. These edges can be inter-object edges, where
the correspondences on either side belong to two different
objects, or within-the-object edges. For inter-object
edges the similarity between the two correspondences
on either end is low, whereas for within-the-object edges
it is high. We construct an energy function

$$E = \sum_{i=1}^{\#\text{of edges}} \Bigl[\lambda(1.0 - v_i)\,\mathrm{cost}_i + \alpha v_i + \int_0^{v_i} g^{-1}(v)\,dv\Bigr] \tag{13}$$

that needs to be minimized to give the right segmentation.
Here $\mathrm{cost}_i$ is the cost reflecting the similarity
of the correspondences on either end of the edge, $v_i$ is
the probability that edge $i$ is an inter-object edge,
and $\lambda$ and $\alpha$ are constants weighing smoothness of
motion versus discontinuities. This is minimized using the
gradient descent approach according to the equation

•  $\dfrac{du_i}{dt} = -u_i + \lambda\,\mathrm{cost}_i - \alpha$

where $u_i$ and $v_i$ are related as in the Hopfield network.
All the inter-object edges are discarded. This gives the
segmentation. This method fails when the motion
between frames is very large or when the motion between
neighboring objects is similar. The second case is
inherently difficult to handle because if two nearby objects
have similar motions, they act as a single object. In the
first case, a better motion model must be found that
fits unions of available segments.
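
The edge-labelling step can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the update $du_i/dt = -u_i + \lambda\,\mathrm{cost}_i - \alpha$ with a sigmoid output $v_i = 0.5[1 + \tanh(\mathrm{gain}\cdot u_i)]$, and the function name `label_edges`, the forward-Euler integration, and all parameter values are hypothetical.

```python
import numpy as np

def label_edges(costs, lam=1.0, alpha=0.5, gain=5.0, dt=0.1, steps=500):
    """Return a boolean array: True where an edge is labelled inter-object.

    costs : per-edge dissimilarity of the two correspondences (assumed in [0, 1])
    lam, alpha : weights for smoothness versus discontinuities (illustrative values)
    """
    costs = np.asarray(costs, dtype=float)
    u = np.zeros_like(costs)                      # PE inputs
    for _ in range(steps):
        # Forward-Euler step of du/dt = -u + lam * cost - alpha
        u += dt * (-u + lam * costs - alpha)
    v = 0.5 * (1.0 + np.tanh(gain * u))           # PE outputs in (0, 1)
    return v > 0.5
```

Under these assumptions each edge settles at $u_i = \lambda\,\mathrm{cost}_i - \alpha$, so an edge is discarded as inter-object exactly when $\lambda\,\mathrm{cost}_i > \alpha$; the ratio of the two constants acts as a threshold on the dissimilarity.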

7  Results 

Experiments were conducted to evaluate the performance
of the system on multiple, discontinuous motion
image sequences and smooth motion real image
sequences. The latter were taken from those made available
at the 1991 IEEE Workshop on Visual Motion.

Figure 1 shows the first frame of a sequence of 10
frames of a scene with 3 objects moving in different
directions. Their motion varies from 4 to 10 pixels.
200 feature points corresponding to local intensity maxima
and minima were automatically detected in every
frame. The trajectories determined are shown in Figure
2. Figure 3 shows the result of segmenting the
correspondences between the first two frames.

Figure 4 shows the first image of a sequence of 10 images
of a scene with one object having a temporal
discontinuity in its motion between the 5th and 6th frames.
The object moves to the left for the first five frames
and then moves towards the upper right-hand corner of
the image. Again, around 200 points were automatically
detected in each frame. Figure 5 shows the trajectories
obtained.

Figure 6 shows the second frame of a sequence of images
of a laboratory taken from a camera mounted on
a PUMA robot arm that was rotating. This caused
the entire scene to rotate around the optical axis of the
camera. The motion of the features varied from 0 to
30 pixels. Figure 7 shows the correspondences obtained
between  the  2"“*  and  frames  of  this  sequence. 

Figure 8 shows the 4th frame of a sequence of images
of an outdoor scene taken by a camera mounted on a
robot moving along the pathway seen in the image. The
feature points had a motion of 15 to 20 pixels. Figure
9 shows the correspondences obtained between the 1st
and 5th frames of this sequence.
Abstract

The problem of egomotion recovery has
been treated by using as input local image
motion, with the published algorithms
utilizing the geometric constraint relating
2-D local image motion (optical flow,
correspondence, derivatives of the image flow)
to 3-D motion and structure. Since it has
proved very difficult to achieve accurate
input (local image motion), a lot of effort has
been devoted to the development of robust
techniques. In this paper a new approach
to the problem of egomotion estimation is
taken, based on constraints of a global
nature. It is proved that local normal flow
measurements form global patterns in the
image plane. The position of these patterns
is related to the three-dimensional motion
parameters. By locating some of these
patterns, which depend only on subsets of the
motion parameters, through a simple search
technique, the 3-D motion parameters can
be found. The proposed algorithmic
procedure is very robust, since it is not affected
by small perturbations in the normal flow
measurements. As a matter of fact, since
only the sign of the normal flow
measurement is employed, the direction of
translation and the axis of rotation can be
estimated with up to 100% error in the image
measurements.


1.  Introduction 

The methodological theory of computational vision
presented by Marr [15] has formed the basis for
research in visual motion understanding. Vision is
described as a reconstruction process, that is, a problem
of creating representations of increasing levels of
abstraction, leading from 2-D images through the
primal sketch and the 2½-D sketch to object-centered
descriptions. Applied to visual motion perception this
led to the computational theory known as “structure
from motion” theory. The goal is to recover
from dynamic imagery the 3-D motion parameters
and the structure of the objects in view. The
suggested strategy attempts to solve the problem in two
stages. First, accurate image displacements between
consecutive frames have to be computed, either in
the form of point correspondences [7, 19] or as dense
motion fields (optical flow fields) [4, 8, 11]. Then,
in a second step, the 3-D motion and the structure
are computed from the equations relating them to
the 2-D image velocity [1, 5, 9, 13, 14, 17, 18]. This
computational theory has been uncritically accepted,
and as a result, most studies on visual motion
perception are to be found at the algorithmic level of the
problem, addressing either the estimation of image
motion or the recovery of 3-D motion and structure
from image motion.

*  Permanent address: Department for Pattern
Recognition and Image Processing, Institute for Automation,
Technical University Vienna, Treitlstraße 3, A-1040
Vienna, Austria

The  problem  addressed  in  this  paper  is  not  the 
general  “structure  from  motion”  problem.  We  are 
concerned  only  with  the  estimation  of  3-D  motion 
independent  of  depth.  For  a  monocular  observer  un¬ 
dergoing  unrestricted  rigid  motion  in  the  3-D  world, 
we  compute  the  parameters  describing  this  motion. 
Using  the  perspective  transformation  as  our  geomet¬ 
ric  imaging  model,  only  five  unknowns  can  be  derived 
from  2D  images,  namely  three  rotational  parameters 
and  two  parameters  describing  the  direction  of  trans¬ 
lation.  In  the  literature  this  problem  is  known  as 
“passive  navigation”. 

In this paper an alternative approach to the
problem of passive navigation is taken, which is different
from existing methods at both the computational
and the algorithmic level.

First, we do not compute the exact 2-D image
velocity. In general, the estimation of optical flow
is an ill-posed problem and additional assumptions
must be made in order to estimate it. If these
assumptions hold, as in the case of some model-based
approaches [6, 12] for special purpose applications,
optical flow can be computed and as a result 3-D
motion can be derived. In the general case,
however, any algorithm will produce imperfect output
(erroneous output, if the assumptions do not hold).
Therefore, we use as a representation for image
motion the so-called normal flow. As the only available
constraint on the flow $(u, v)$ of the time-varying image
$I(x, y, t)$ we consider the motion constraint equation
[11] $I_x u + I_y v + I_t = 0$, where the subscripts denote
partial differentiation. This constraint means that we
can only compute the projection of the flow on the
gradient direction ($(I_x, I_y) \cdot (u, v) = -I_t$).
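
To make the constraint concrete, the following Python sketch computes the normal flow value and the unit gradient direction from given image derivatives. It is an illustrative sketch only: the function name `normal_flow` and the threshold `eps` are assumptions, and the derivatives are taken as given rather than estimated from images.

```python
import numpy as np

def normal_flow(I_x, I_y, I_t, eps=1e-6):
    """Normal flow from I_x*u + I_y*v + I_t = 0.

    Dividing the constraint by |grad I| gives n . (u, v) = -I_t / |grad I|,
    where n is the unit vector in the gradient direction.
    Returns (u_n, n_x, n_y, valid); entries are 0 where the gradient vanishes.
    """
    I_x, I_y, I_t = (np.asarray(a, dtype=float) for a in (I_x, I_y, I_t))
    mag = np.hypot(I_x, I_y)
    valid = mag > eps                       # no constraint where the gradient is ~0
    safe = np.where(valid, mag, 1.0)        # avoid division by zero
    n_x = np.where(valid, I_x / safe, 0.0)
    n_y = np.where(valid, I_y / safe, 0.0)
    u_n = np.where(valid, -I_t / safe, 0.0)
    return u_n, n_x, n_y, valid
```

For example, with $(I_x, I_y) = (3, 4)$ and a true flow of $(1, 2)$, the temporal derivative is $I_t = -11$ and the recovered normal flow value is $11/5 = 2.2$, which equals the projection of $(1, 2)$ on the unit gradient $(0.6, 0.8)$.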

Recovering 3-D motion from noisy flow fields has
turned out to be a problem of extreme sensitivity,
with researchers reporting very large errors in the
motion parameter estimates under small perturbations
in the input. Even optimal algorithms [17] perform
quite poorly in the presence of moderate noise.
Although a formal proof is still lacking, it has been
argued [2] that the estimation of 3-D motion from
image motion is itself ill-posed, because it does not
continuously depend on the input.¹ Thus, in order
to estimate 3-D motion in a robust way, we have to
consider the fact that no flow measurement (neither
optical flow nor normal flow) can be perfect. In this
paper new global constraints of a geometric nature,
which relate 3-D motion to 2-D image measurements
(normal flow), are introduced.


rotational  motion  was  studied,  and  linear  equations 
relating  the  rotation  parameters  to  the  normal  flow 
were  derived.  A  similar  result  was  reported  by  Horn 
and  Weldon  [10],  who  presented  several  methods  for 
solving  the  problem  of  motion  and  structure  com¬ 
putation  not  only  in  the  purely  rotational  case,  but 
also  for  pure  translation,  for  known  rotation,  and  for 
known  structure.  The  constraint  of  positive  depth 
was  used  by  Negahdaripour  [16]  to  estimate  the  fo¬ 
cus  of  expansion  for  purely  translational  motion  and 
in  [20]  translation  and  rotation  were  estimated  for  an 
observer  rotating  around  the  direction  of  translation. 


2.  Geometric  constraints 

For a monocular observer undergoing unrestricted
rigid motion in the 3-D world we compute the
parameters describing this motion. If the coordinate
system is fixed to the observer with the center being
the nodal point of the camera and $f$ the focal length,
then the equations relating the velocity $(u, v)$ of an
image point to the 3-D velocity ($(U, V, W)$ translational
and $(\alpha, \beta, \gamma)$ rotational) and the depth $Z$ of
the corresponding scene point are [13]


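$$\begin{aligned}
u &= \frac{-Uf + xW}{Z} + \alpha\frac{xy}{f} - \beta\Bigl(\frac{x^2}{f} + f\Bigr) + \gamma y \\
v &= \frac{-Vf + yW}{Z} + \alpha\Bigl(\frac{y^2}{f} + f\Bigr) - \beta\frac{xy}{f} - \gamma x
\end{aligned} \tag{1}$$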


In  our  approach,  we  first  compute  the  rotation 
axis  and  the  direction  of  translation.  Motion  rigidity 
introduces  a  number  of  constraints  on  the  normal  flow 
values.  These  constraints  take  the  form  of  patterns  in 
the  image  plane.  In  other  words,  for  given  positions 
of  the  translational  and  rotational  axes,  the  normal 
flow  values  form  certain  global  patterns.  Our  algo¬ 
rithmic  procedure  searches  for  these  patterns.  It  uses 
data  from  different  parts  of  the  image  plane  and  con¬ 
siders  only  the  sign  of  the  normal  flow.  The  method 
for  deriving  the  direction  of  translation  and  the  rota¬ 
tion  axis  is  of  a  robust  and  global  character  and  thus 
can  handle  a  considerable  amount  of  error  in  the  in¬ 
put.  After  having  found  the  axis  of  rotation  and  the 
direction  of  translation  further  constraints  are  con¬ 
sidered,  and  the  complete  set  of  motion  parameters 
is  obtained. 


The number of motion parameters a monocular
observer is able to compute under perspective
projection is limited to five: the three rotational
parameters and the direction of translation. We therefore
introduce coordinates for the direction of translation,
$(x_0, y_0) = (Uf/W, Vf/W)$, and rewrite the right-hand
side of equation (1) as a sum of translational and
rotational components:

$$\begin{aligned}
u &= u_{trans} + u_{rot} && (2) \\
  &= (-x_0 + x)\frac{W}{Z} + \alpha\frac{xy}{f} - \beta\Bigl(\frac{x^2}{f} + f\Bigr) + \gamma y \\
v &= v_{trans} + v_{rot} && (3) \\
  &= (-y_0 + y)\frac{W}{Z} + \alpha\Bigl(\frac{y^2}{f} + f\Bigr) - \beta\frac{xy}{f} - \gamma x && (4)
\end{aligned}$$

Methods  for  estimating  3-D  motion  from  only  the 
normal  flow  field  without  going  through  the  inter¬ 
mediate  stage  of  computing  optical  flow  have  previ¬ 
ously  appeared  in  [3,  10,  16].  In  [3]  the  case  of  purely 

¹A problem is ill-posed if its solution does not exist,
is not unique, or does not continuously depend on the
input.


Since we can only compute normal flow, the
projection of flow on the gradient direction $(n_x, n_y)$ (unit
vector), only one constraint can be derived at every
point. The value $u_n$ of the normal flow vector along
the gradient direction is given by

$$u_n = u\,n_x + v\,n_y, \quad\text{or}$$

$$\begin{aligned}
u_n ={}& \Bigl((-x_0 + x)\frac{W}{Z} + \alpha\frac{xy}{f} - \beta\Bigl(\frac{x^2}{f} + f\Bigr) + \gamma y\Bigr) n_x \\
 &+ \Bigl((-y_0 + y)\frac{W}{Z} + \alpha\Bigl(\frac{y^2}{f} + f\Bigr) - \beta\frac{xy}{f} - \gamma x\Bigr) n_y
\end{aligned} \tag{5}$$

The above equation demonstrates the difficulties
of motion computation using normal flow. A monocular
observer, not being able to measure depth, is
confronted with a motion field of five unknown motion
parameters and one scaled depth component ($W/Z$)
at every point. Since there is only one constraint for
a single point and since we do not want to make
assumptions about depth, there is no straightforward
way to compute the motion parameters analytically.

2.1.  Motion  field  interpretation 

A  motion  field  is  composed  of  a  translational  and 
a  rotational  component.  Only  the  first  of  these  is 
dependent  on  the  distance  from  the  observer.  This 
suggests  the  idea  of  searching  for  a  way  of  deter¬ 
mining  the  motion  components  by  disregarding  the 
depth  components.  The  motion  under  consideration 
is  rigid.  Every  point  in  3-D  moves  relative  to  the 
observer  along  a  constrained  trajectory.  The  rigid¬ 
ity  constraint  also  imposes  restrictions  on  the  motion 
field  in  the  image  plane  and  these  restrictions  are 
reflected  in  the  normal  field  as  well.  This  is  the  mo¬ 
tivation  for  investigating  the  geometrical  properties 
inherent  in  the  normal  flow  field.  The  motion  estima¬ 
tion  problem  then  amounts  to  resolving  the  normal 
flow  field  into  its  rotational  and  translational  compo¬ 
nents. 


Figure 1: Translational motion viewed under
perspective projection: the observer is approaching the
scene.

If the observer undergoes only translational
motion, all points in the 3-D scene move along parallel
lines. Translational motion viewed under perspective
results in a motion field in the image plane, in which
every point moves along a line that passes through
a single vanishing point. This point is the
intersection of the image plane and the line parallel to the
translation direction that passes through the nodal
point. Its image coordinates are $x = Uf/W$ and
$y = Vf/W$; the flow there has value zero.

Figure 2: The intersection of the image plane with
a cone (defined by the circular path in 3-D and the
rotation axis) defines the projection of rotational
motion on the image plane.

In cases where the sensor is approaching the
scene all the image motion vectors emanate from the
vanishing point, which is then called the Focus of
Expansion (FOE) (Figure 1). Otherwise they all point into
it,  in  which  case  we  speak  of  the  Focus  of  Contraction 
(FOC).  The  direction  of  every  vector  is  determined 
by  the  location  of  the  vanishing  point,  and  the  length 
of  each  vector  is  dependent  on  the  3-D  position  of  the 
corresponding  scene  point.  The  vanishing  point  also 
constrains  the  direction  of  the  normal  flow  vector  at 
every  point;  it  can  only  be  in  the  half  plane  contain¬ 
ing  the  optical  flow  vector  at  that  point. 
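
A small Python sketch of the vanishing point follows. It is illustrative only: the function names are hypothetical, and the sign convention (positive $W$ meaning the observer approaches the scene, so that the translational flow is $(x - x_0, y - y_0)\,W/Z$ with positive depth $Z$) is an assumption chosen so that the vectors emanate from the FOE.

```python
def vanishing_point(U, V, W, f):
    """Return ((x0, y0), kind) for translation (U, V, W) and focal length f.

    kind is "FOE" when W > 0 (approaching, by the assumed convention)
    and "FOC" when W < 0.
    """
    if W == 0:
        raise ValueError("W = 0: the vanishing point lies at infinity")
    kind = "FOE" if W > 0 else "FOC"
    return (U * f / W, V * f / W), kind

def translational_flow(x, y, vp, w_over_z):
    """Translational flow at (x, y); its direction depends only on vp,
    its length on the scaled depth W/Z of the scene point."""
    x0, y0 = vp
    return ((x - x0) * w_over_z, (y - y0) * w_over_z)
```

With a positive `w_over_z` every flow vector points away from the vanishing point, and with a negative value every vector points into it, matching the FOE and FOC cases described above.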

In cases of purely rotational motion every point
in 3-D moves along a circle in a plane perpendicular
to the axis of rotation. The perspective image
of this circular path is the intersection of the image
plane with the cone which the circle defines together
with the rotation axis (see Figure 2). Depending on
the relation between the opening angle of the cone
for a specific image point and the angle the image
plane forms with the rotation axis, the field lines of
the rotational vector field (i.e., the lines which have the
property that at each point the rotational flow is
tangential to them) form second order curves of different
types: ellipses, hyperbolas, parabolas, or even circles
when the rotation axis and the optical axis coincide.
The conic sections generated by a rotational motion
are defined by the axis of rotation. A rotation axis,
given by the two parameters $\alpha/\gamma$ and $\beta/\gamma$, defines a
family $M(\alpha/\gamma, \beta/\gamma; x, y)$ of conic sections:

$$M\Bigl(\frac{\alpha}{\gamma}, \frac{\beta}{\gamma}; x, y\Bigr) = \frac{\bigl(\frac{\alpha}{\gamma}\bigr)^2 x^2 + 2xy\,\frac{\alpha}{\gamma}\frac{\beta}{\gamma} + y^2 \bigl(\frac{\beta}{\gamma}\bigr)^2 + 2xf\,\frac{\alpha}{\gamma} + 2yf\,\frac{\beta}{\gamma} + f^2}{x^2 + y^2 + f^2} = C \tag{6}$$

with $C$ in $\bigl[0 \ldots \bigl(1 + (\tfrac{\alpha}{\gamma})^2 + (\tfrac{\beta}{\gamma})^2\bigr)\bigr]$
3.  Properties  of  selected  vectors

In  this  section  geometrical  properties  of  normal  flow 
vectors  in  selected  directions  are  investigated.  To  be 
more  precise,  we  study  the  sign  of  the  normal  flow 
in  certain  directions  and  the  locations  of  normal  flow 
vectors  of  the  same  sign.  Vectors  which  are  perpen¬ 
dicular  to  rotational  vector  field  lines  and  vectors  per¬ 
pendicular  to  lines  emanating  from  a  point  are  con¬ 
sidered.  For  these  vectors  we  find  that  the  normal 
flow  values  in  the  image  are  separated  into  regions 
in  which  they  have  different  signs  by  a  second  order 
curve  and  a  straight  line. 

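The normal flow vector $\mathbf{u}_n$ is the projection of
the optical flow vector $\mathbf{u}$ on the gradient direction,
and the value of the normal flow is therefore defined
by the scalar product of the optical flow vector and
the unit vector $(n_x, n_y)$ in the gradient direction. The
flow vector can be decomposed into its translational
and rotational components, and the right-hand side
of equation (5) can be written as a sum of scalar
products:

$$\begin{aligned}
u_n ={}& \frac{W}{Z}\bigl((-x_0 + x), (-y_0 + y)\bigr)\cdot(n_x, n_y) \\
 &+ \Bigl(\alpha\frac{xy}{f} - \beta\Bigl(\frac{x^2}{f} + f\Bigr) + \gamma y,\;
 \alpha\Bigl(\frac{y^2}{f} + f\Bigr) - \beta\frac{xy}{f} - \gamma x\Bigr)\cdot(n_x, n_y)
\end{aligned} \tag{7}$$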

Our  goal  is  to  achieve  a  separation  between  trans¬ 
lation  and  rotation.  Therefore  we  classify  the  normal 
flow  vectors  according  to  their  direction  by  defining 
two  classes  which  are  motivated  by  the  concepts  of 
the  rotation  axis  and  the  FOE. 

Any possible axis given by an orientation vector
$(A, B, C)$, where $A^2 + B^2 + C^2 = 1$, defines an
infinite class of cones with axis $(A, B, C)$ and apex
at the origin. Intersecting these cones with the image
plane gives rise to a set of conic sections, hereafter
called vector field lines, or field lines of the axis
$(A, B, C)$, or just $(A, B, C)$ field lines. It is worth
noting that the $(A, B, C)$ field lines are the lines along
which the image points would move if the observer
rotated around axis $(A, B, C)$.
Normal flow vectors are combined into a single class
if they are perpendicular to the vector field lines of
the same axis $(A, B, C)$. At a point $(x, y)$ the
orientation perpendicular to the vector field lines $(A, B, C)$
is given by a vector $\mathbf{N} = (N_x, N_y)$:

$$(N_x, N_y) = \bigl(-A(y^2 + f^2) + Bxy + Cxf,\; Axy - B(x^2 + f^2) + Cyf\bigr)$$

and the unit vector $\mathbf{n} = (n_x, n_y)$ denoting the
gradient is thus $\mathbf{n} = \mathbf{N}/\|\mathbf{N}\|$. We call the vectors of the
class corresponding to the axis $(A, B, C)$ the coaxis
vectors $(A, B, C)$. In order to establish conventions
about the vectors' orientations, a vector will be said
to be of positive orientation if it is pointing in
direction $\mathbf{n} = (n_x, n_y)$, whereas, if it is pointing in
direction $(-n_x, -n_y)$, its orientation will be said to be
negative (see Figure 3).

Figure 3: Field lines corresponding to an axis
$(A, B, C)$ and positive coaxis vectors $(A, B, C)$.

Next we evaluate the translational components
of the normal flow vectors in the chosen direction.
The value $t_n$ of any translational vector component
at point $(x, y)$ in direction $(n_x, n_y)$ is given by

$$t_n = \Bigl((x - x_0,\; y - y_0)\frac{W}{Z}\Bigr)\cdot\mathbf{n}$$

Since $W/Z$ is positive when the observer is
approaching the scene, a classification into positive and
negative values independent of the distance from the
image plane is possible. The translational
components of the coaxis vectors $(A, B, C)$ are separated
by a second order curve $h(A, B, C, x_0, y_0; x, y)$ given
by

$$\begin{aligned}
h(A, B, C, x_0, y_0; x, y) ={}& x^2(Cf + By_0) + y^2(Cf + Ax_0) - xy(Ay_0 + Bx_0) \\
 &- xf(Af + Cx_0) - yf(Bf + Cy_0) + f^2(Ax_0 + By_0) = 0
\end{aligned} \tag{8}$$

When $h(x, y) > 0$ the translational normal flow
values are positive; when $h(x, y) < 0$ they are
negative; and when $h(x, y) = 0$ they have value zero.
For any selected class of coaxis vectors there exists a
curve $h$ which is uniquely defined by the two
coordinates $x_0$, $y_0$ of the FOE; furthermore it is linear in
$x_0$ and $y_0$ (see Figure 4a).
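
The relation between the curve $h$ and the coaxis direction can be checked numerically. The sketch below is an illustration with hypothetical function names; it assumes the unnormalized coaxis direction $\mathbf{N} = (-A(y^2 + f^2) + Bxy + Cxf,\; Axy - B(x^2 + f^2) + Cyf)$ and the translational flow direction $(x - x_0, y - y_0)$, and verifies that their scalar product equals $h$.

```python
def coaxis_direction(A, B, C, x, y, f):
    """Unnormalized vector N perpendicular to the (A, B, C) field lines at (x, y)."""
    return (-A * (y * y + f * f) + B * x * y + C * x * f,
            A * x * y - B * (x * x + f * f) + C * y * f)

def h_curve(A, B, C, x0, y0, x, y, f):
    """Second order curve of Eq. (8); its sign gives the sign of the
    translational component along the positive coaxis direction
    (for positive W/Z)."""
    return (x * x * (C * f + B * y0) + y * y * (C * f + A * x0)
            - x * y * (A * y0 + B * x0)
            - x * f * (A * f + C * x0) - y * f * (B * f + C * y0)
            + f * f * (A * x0 + B * y0))

def translational_sign(A, B, C, x0, y0, x, y, f):
    """+1, -1 or 0: sign of (x - x0, y - y0) . N at (x, y)."""
    nx, ny = coaxis_direction(A, B, C, x, y, f)
    dot = (x - x0) * nx + (y - y0) * ny
    return (dot > 0) - (dot < 0)
```

Because the cubic terms of the scalar product cancel, the identity holds exactly, and normalizing $\mathbf{N}$ or scaling by a positive $W/Z$ changes only the magnitude, not the sign.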

The rotational components of the flow vectors
are defined only by the three rotational parameters
$\alpha$, $\beta$ and $\gamma$. Along the positive direction of the coaxis
vectors the value $r_n$ of the rotational components is

$$r_n = \Bigl(\alpha\frac{xy}{f} - \beta\Bigl(\frac{x^2}{f} + f\Bigr) + \gamma y,\;
\alpha\Bigl(\frac{y^2}{f} + f\Bigr) - \beta\frac{xy}{f} - \gamma x\Bigr)\cdot(n_x, n_y)$$

The coaxis vectors $(A, B, C)$ and the rotational flow
vectors form a right angle for all points on a straight
line. Thus, considering only the sign of the
rotational component along the coaxis vectors $(A, B, C)$,
the image plane is separated by a straight line
$g(A, B, C, \alpha, \beta, \gamma; x, y)$ into two halves containing values
of opposite sign, where

$$g(A, B, C, \alpha, \beta, \gamma; x, y) = y(\alpha C - \gamma A) - x(\beta C - \gamma B) + \beta A f - \alpha B f = 0 \tag{9}$$

Again the sign of $g(x, y)$ at a point $(x, y)$ determines
the sign of the coaxis vectors $(A, B, C)$. The straight
line is defined by only two parameters which
characterize the axis of rotation, namely $\alpha/\gamma$ and $\beta/\gamma$ (see
Figure 4b).
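
The same kind of numerical check applies to the straight line $g$. The sketch below uses hypothetical function names and assumes the rotational flow $(\alpha xy/f - \beta(x^2/f + f) + \gamma y,\; \alpha(y^2/f + f) - \beta xy/f - \gamma x)$ together with the unnormalized coaxis direction $\mathbf{N}$; under these assumptions the scalar product of the two equals $(x^2 + y^2 + f^2)\,g$, so its sign is the sign of $g$.

```python
def rotational_flow(alpha, beta, gamma, x, y, f):
    """Rotational part of the image velocity at (x, y)."""
    return (alpha * x * y / f - beta * (x * x / f + f) + gamma * y,
            alpha * (y * y / f + f) - beta * x * y / f - gamma * x)

def coaxis_direction(A, B, C, x, y, f):
    """Unnormalized vector N perpendicular to the (A, B, C) field lines."""
    return (-A * (y * y + f * f) + B * x * y + C * x * f,
            A * x * y - B * (x * x + f * f) + C * y * f)

def g_line(A, B, C, alpha, beta, gamma, x, y, f):
    """Straight line of Eq. (9)."""
    return (y * (alpha * C - gamma * A) - x * (beta * C - gamma * B)
            + beta * A * f - alpha * B * f)

def rotational_component(A, B, C, alpha, beta, gamma, x, y, f):
    """Scalar product of the rotational flow with N (unnormalized)."""
    u, v = rotational_flow(alpha, beta, gamma, x, y, f)
    nx, ny = coaxis_direction(A, B, C, x, y, f)
    return u * nx + v * ny
```

When the rotation axis coincides with $(A, B, C)$, $g$ vanishes everywhere, which reflects the fact that the rotational flow is then tangential to the field lines.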

In  order  to  investigate  the  constraints  for  general 
motion  the  geometrical  relations  due  to  rotation  and 
due  to  translation  have  to  be  combined.  A  second 
order  curve  separating  the  plane  into  positive  and 
negative  values  and  a  line  separating  the  plane  into 
two  halfplanes  of  opposite  sign  intersect.  This  splits 
the  plane  into  areas  of  only  positive  coaxis  vectors, 
areas  of  only  negative  vectors,  and  areas  in  which 
the  rotational  and  translational  flow  have  opposite 
signs.  In  these  last  areas,  no  information  is  derivable 
without  making  depth  assumptions  (Figure  4c). 

We thus obtain the following geometrical result
for the case of general motion. Any class of coaxis
vectors $(A, B, C)$ is separated by a rigid motion into
two groups. The FOE $(x_0, y_0)$ and the rotation axis
$(\alpha/\gamma, \beta/\gamma)$ geometrically define two areas in the plane,
one containing positive and one containing negative
values. We call this structure on the coaxis vectors
the coaxis pattern. It depends on the four parameters
$x_0$, $y_0$, $\alpha/\gamma$ and $\beta/\gamma$.

For  the  second  kind  of  classification  of  the  normal 
flow  vectors,  namely  the  one  defined  as  “perpendicu¬ 
lar  to  the  lines  emanating  from  a  defined  point”  (see 
Figure  5),  similar  patterns  are  obtained.  In  this  case, 
the  rotational  components  are  separated  by  a  second 
order  curve  into  positive  and  negative  values  and  the 
translational  components  are  separated  by  a  straight 
line.  We  call  the  vectors  perpendicular  to  straight 
lines passing through a point (r, s) the copoint
vectors (r, s).¹

At point (x, y) a copoint vector n of unit length
in the positive direction is defined as

$$\mathbf{n} = \frac{(-y + s,\; x - r)}{\sqrt{(x - r)^2 + (y - s)^2}}$$

The functions which define the curves are given as
follows. The straight line $k(r, s, x_0, y_0; x, y)$ separating
the translational components is

$$k(r, s, x_0, y_0; x, y) = y(x_0 - r) - x(y_0 - s) - x_0 s + y_0 r = 0 \qquad (10)$$

and the second order curve $l(r, s, \alpha, \beta, \gamma; x, y)$
separating the rotational components is (like
$h(A, B, C, x_0, y_0; x, y)$) defined as

$$l(r, s, \alpha, \beta, \gamma; x, y) = -x^2(\beta s + \gamma f) - y^2(\alpha r + \gamma f) + xy(\alpha s + \beta r) + xf(\alpha f + \gamma r) + yf(\beta f + \gamma s) - f^2(\alpha r + \beta s) = 0 \qquad (11)$$

¹The copoint and coaxis vectors are dual to each other.

Figure 4: (a) The coaxis vectors (A, B, C) due to
translation are negative if they lie within a second
order curve defined by the FOE, and are positive at
all other locations. (b) The coaxis vectors due to
rotation separate the image plane into a halfplane
of positive values and a halfplane of negative values.
(c) A general rigid motion defines an area of positive
coaxis vectors and an area of negative coaxis vectors.
The rest of the image plane is not considered.

Figure 5: Positive copoint vectors (r, s).

The  superposition  of  translational  and  rotational 
values  again  defines  patterns  in  the  plane  which  con¬ 
sist  of  a  negative  and  a  positive  area.  These  patterns, 
called  copoint  patterns,  are  defined  by  the  same  four 
parameters  which  characterize  the  coaxis  patterns. 

4.  Search  for  motion  patterns 

Utilizing  the  geometrical  constraints  developed  in  the 
last  section,  motion  estimation  for  a  rigid  moving  ob¬ 
server  will  now  be  addressed  through  a  search  tech¬ 
nique.  The  strategy  involves  checking  constraints 
that  a  certain  solution  would  impose  on  the  normal 
flow  field  and  in  this  way  discarding  impossible  solu¬ 
tions.  The  search  is  performed  in  three  steps,  where 
at  each  step  the  constraints  become  more  restrictive; 
hence  the  number  of  possible  solutions  computed  at 
each step decreases. First a set $S_1$ of possible solutions
for the FOE and axis of rotation is estimated
by  fitting  a  small  number  of  patterns  to  the  nor¬ 
mal  flow  field.  Two  techniques,  which  use  different 
patterns  defined  on  certain  coaxis  vectors,  are  pro¬ 
posed  for  solving  this  task.  Both  fitting  processes 
use  the  input  in  a  qualitative  way,  since  only  the  sign 
of  the  normal  flow  is  employed.  In  the  second  step 
the  third  rotational  parameter  is  computed,  and  the 
space of solutions is further narrowed to a set $S_2$.
This  is  performed  by  using  normal  flow  vectors  that 
do  not  contain  translation  (certain  copoint  vectors) 
and  thus  approximating  the  remaining  rotational  pa¬ 
rameter  from  the  given  rotational  vectors.  Finally, 
in  the  last  step  all  impossible  solutions  are  discarded 
by  checking  the  validity  of  the  motion  parameters  at 
every  point,  which  results  in  a  set  S3  as  output. 

4.1.  First  step:  Pattern  fitting 

The  direction  of  translation  and  the  axis  of  rotation 
define  patterns  on  subsets  of  the  normal  flow  vectors. 
In  the  general  case  these  patterns  are  described  by 
four  independent  variables  and  searching  for  the  solu¬ 
tion  would  mean  searching  in  a  four-dimensional  pa¬ 
rameter  space.  By  concentrating,  in  an  initial  search, 
only  on  a  small  number  of  normal  flow  vectors,  we 


show how to tackle the problem. Clearly, such a restricted
use of data will generally not result in a
unique  solution,  but  it  allows  us  to  either  reduce  the 
dimensionality  of  the  problem  (algorithm  1)  or  to  em¬ 
ploy  motion  vectors  from  all  parts  of  the  image  plane 
(algorithm  2). 

Algorithm 1.1: α-, β-, and γ-pattern fitting

One  way  to  look  at  the  optical  flow  vector  is  to 
imagine  it  as  a  sum  of  five  vectors,  each  being  due 
to  only  one  of  the  motion  parameters  (either  one  of 
the two translational or one of the three rotational
components).  Consequently  the  value  of  the  normal 
flow  vector  at  a  point  is  computed  as  the  sum  of  the 
five  scalar  products  of  these  vectors  and  the  unit  vec¬ 
tor  in  the  gradient  direction.  The  scalar  product  of 
two  vectors  is  zero  if  the  vectors  are  perpendicular  to 
each  other.  Thus,  by  selecting  normal  flow  vectors 
in  particular  directions,  one  or  more  of  the  motion 
components  vanish. 
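The selection argument above can be illustrated with a minimal Python sketch. It is not the authors' code: the flow equations, their sign convention, and all names (`translational_flow`, `rotational_flows`, `w_over_z`) are assumptions chosen to agree with equations (9) and (11). For brevity the translational contribution is lumped into a single term instead of being split by parameter.

```python
import math

def translational_flow(x, y, x0, y0, w_over_z):
    """Translational flow at (x, y): it points along the line through the FOE."""
    return (w_over_z * (x - x0), w_over_z * (y - y0))

def rotational_flows(x, y, f, alpha, beta, gamma):
    """Flow contributed separately by alpha, beta and gamma (assumed convention)."""
    return {
        "alpha": (alpha * x * y / f, alpha * (y * y / f + f)),
        "beta": (-beta * (x * x / f + f), -beta * x * y / f),
        "gamma": (gamma * y, -gamma * x),
    }

def dot(a, b):
    return a[0] * b[0] + a[1] * b[1]

def unit(v):
    norm = math.hypot(v[0], v[1])
    return (v[0] / norm, v[1] / norm)

def normal_flow(x, y, n, f, x0, y0, w_over_z, alpha, beta, gamma):
    """Normal flow value: sum of the scalar products of every component with n."""
    parts = [translational_flow(x, y, x0, y0, w_over_z)]
    parts.extend(rotational_flows(x, y, f, alpha, beta, gamma).values())
    return sum(dot(p, n) for p in parts)
```

With a gradient direction perpendicular to the line through the FOE, the translational scalar product is zero; with a radial gradient direction, the γ contribution is zero. The remaining terms are the only ones that reach the normal flow value.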

The  coaxis  vectors  which  are  dependent  on  only 
two  of  the  three  rotational  parameters  correspond  to 
one  of  the  three  coordinate  axes.  These  normal  vec¬ 
tors  and  their  patterns  have  special  properties. 

The coaxis vectors (A, B, C) when the orientation
vector (A, B, C) is the Z axis are perpendicular
to circles whose center is the origin of the image
plane, and we call them γ-vectors. Similarly, when
(A, B, C) is the X or Y axis, the (A, B, C) coaxis
vectors are called α-vectors and β-vectors and the
corresponding field lines are hyperbolas whose major
axes are the image plane's x- and y-axes, respectively.
Figure 6 depicts these sets of vector field lines and the
corresponding γ-, α- and β-vectors in positive
orientation.

The values of the α-, β-, and γ-vectors due to
rotation only can be described by a one-parameter
function. Thus the dimensionality of the corresponding
patterns is also reduced by one and the search for
these patterns can be limited to a three-dimensional
parameter space. This becomes clear by substituting
into equation (9) for the triple (A, B, C) the orientation
vectors of the coordinate axes ((1, 0, 0), (0, 1, 0)
and (0, 0, 1)). The rotational components of the
γ-vectors are separated by a line passing through the
center, which has equation $y = \frac{\beta}{\alpha} x$. For the
rotational components of the α-vectors the line is parallel
to the x-axis and is defined by the equation $y = \frac{\beta}{\gamma} f$.
The β-vectors are separated by a line parallel to the
y-axis defined as $x = \frac{\alpha}{\gamma} f$.
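The substitution into equation (9) can be checked numerically. The sketch below is only an illustration: the function name `g` mirrors the text, while the numeric values are arbitrary test inputs and not data from the paper.

```python
def g(A, B, C, alpha, beta, gamma, x, y, f):
    """Left-hand side of equation (9); its zero set is the separating line."""
    return (y * (alpha * C - gamma * A)
            - x * (beta * C - gamma * B)
            + beta * A * f - alpha * B * f)

# Arbitrary illustrative rotation and focal length (not from the paper).
alpha, beta, gamma, f = 2.0, 3.0, 4.0, 1.0

# gamma-vectors, axis (0, 0, 1): the line y = (beta / alpha) * x through the center.
on_gamma_line = g(0, 0, 1, alpha, beta, gamma, 2.0, 3.0, f)

# alpha-vectors, axis (1, 0, 0): the line y = (beta / gamma) * f, for any x.
on_alpha_line = g(1, 0, 0, alpha, beta, gamma, 7.0, beta / gamma * f, f)

# beta-vectors, axis (0, 1, 0): the line x = (alpha / gamma) * f, for any y.
on_beta_line = g(0, 1, 0, alpha, beta, gamma, alpha / gamma * f, -5.0, f)
```

Each of the three values evaluates to zero, so every separating line depends on a single quotient of two rotational parameters. This is why each pattern search needs only three parameters.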

The second order curves separating the translational
components of the α-, β-, and γ-vectors are obtained
from equation (8). For the γ-vectors the curve
reduces to a circle, which has the FOE and the image
center as two diametrically opposite points. Equation
(8) reduces to

$$h(0, 0, 1, x_0, y_0; x, y) = f\left(x - \frac{x_0}{2}\right)^2 + f\left(y - \frac{y_0}{2}\right)^2 - \frac{f}{4}\left(x_0^2 + y_0^2\right) = 0$$

The curves separating the α- and β-vectors become
hyperbolas of the form

$$h(1, 0, 0, x_0, y_0; x, y) = y^2 x_0 - x y\, y_0 - x f^2 + f^2 x_0 = 0$$

and

$$h(0, 1, 0, x_0, y_0; x, y) = x^2 y_0 - x y\, x_0 - y f^2 + f^2 y_0 = 0$$

Figure 6: If the (A, B, C) axis is the Z-, X-, or Y-axis,
the corresponding vector field lines are circles
with center O (a), or hyperbolas whose axes coincide
with the coordinate axes of the image plane (b and
c). Normal flow vectors perpendicular to these field
lines are called γ-, α-, and β-vectors.
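The three special cases can be cross-checked with a short Python sketch. Equation (8) is not reproduced in this section, so the general form used in `h` below is an assumption: the translational direction (x − x0, y − y0) is projected onto an unnormalized coaxis direction chosen so that it reproduces the circle and the two hyperbolas given above.

```python
def coaxis_direction(A, B, C, x, y, f):
    """Assumed unnormalized coaxis vector direction at (x, y) for the axis (A, B, C)."""
    mx = -A * (y * y + f * f) + B * x * y + C * x * f
    my = A * x * y - B * (x * x + f * f) + C * y * f
    return mx, my

def h(A, B, C, x0, y0, x, y, f):
    """Sign function for the translational component of a coaxis vector."""
    mx, my = coaxis_direction(A, B, C, x, y, f)
    return mx * (x - x0) + my * (y - y0)

def h_gamma(x0, y0, x, y, f):
    """Circle through the image center and the FOE, as given in the text."""
    return f * ((x - x0 / 2) ** 2 + (y - y0 / 2) ** 2 - (x0 ** 2 + y0 ** 2) / 4)

def h_alpha(x0, y0, x, y, f):
    return y * y * x0 - x * y * y0 - x * f * f + f * f * x0

def h_beta(x0, y0, x, y, f):
    return x * x * y0 - x * y * x0 - y * f * f + f * f * y0
```

Under this assumption the sign of `h` is negative inside the circle for the γ-vectors, which matches the description of Figure 4a.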

Figure 7 shows the α-, β-, and γ-vectors for a
general rigid motion.

The algorithm which computes the FOE and the
axis of rotation from a given normal flow field by using
only the α-, β- and γ-vectors works as follows.
With each subset of normal flow vectors is associated
a three-dimensional parameter space that spans the
possible locations of the FOE and of a line defined
by the quotient of two of the three rotational parameters.
A search in the three-dimensional subspaces
is accomplished by checking the patterns which the
subspaces' parameter triples define for selected values
of the normal flow field. The α-patterns are fitted to
the α-vectors, which provides possible solutions for
the coordinates of the FOE, $x_0$ and $y_0$, and the quotient
β/γ. Similarly, the fitting of the β- or γ-patterns results
in solutions for $x_0$, $y_0$, and α/γ or β/α. The objective is to
find the four parameters defining the directions of the
translational and rotational axes which give rise to
three successfully fitted patterns. Therefore the three
subspaces' patterns are combined and the parameter
quadruples which define possible solutions are determined.
Since only subsets of the normal flow values
are considered in the fitting process, the fitting alone
does not uniquely define the motion, but just constitutes
a necessary condition. Usually there will be
a number of parameter quadruples $\{x_0, y_0, \alpha/\gamma, \beta/\gamma\}$
that are selected as candidate solutions through pattern
fitting.

Figure 7: α-, β-, and γ-patterns for a general rigid
motion.

The range of values for the coordinates of the
FOE and for α/γ and β/γ is [−∞, +∞]. To cope with
all possible cases a coordinate transformation on the
sphere is performed, in which case the coordinates
are expressed by two angles.

Algorithm  1.2:  Search  for  the  rotation  axis 

For any rigid motion there exists one class of
coaxis vectors which does not contain any rotational
components. This set is defined by the actual rotation
axis ($A/C = \alpha/\gamma$ and $B/C = \beta/\gamma$). The coaxis
vectors of this kind are due only to translation and the
pattern of these vectors is solely defined by the
two-parameter second order curve $h(\alpha, \beta, \gamma, x_0, y_0; x, y)$.
There is only one curve separating the positive from
the negative values and thus the pattern is defined on
the whole image plane. Since $h(\alpha, \beta, \gamma, x_0, y_0; x, y)$ is
linear in $x_0$ and $y_0$ the problem of finding the FOE
from these purely translational normal flow vectors
reduces to estimating the linear discriminant function
separating two classes (labeled positive and negative)
of values.

The pattern is due only to two parameters. In
order to find the axis of rotation a search in the
two-dimensional parameter space of α/γ and β/γ is
performed. For every possible rotation axis the data is
checked for linear discrimination. If a second order
curve can be found that separates the positive from
the negative values the quadruple $(x_0, y_0, \alpha/\gamma, \beta/\gamma)$ will
be added to the set of possible solutions.

Figure 8: Normal flow vectors perpendicular to lines
passing through the FOE are due only to rotation.
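Because h is linear in the FOE coordinates, the separability test for one candidate rotation axis can be written as a linear classification problem on three features per point. The following is a hedged sketch only. It uses a plain perceptron as a simple stand-in, not the Ho-Kashyap procedure, and the coaxis direction formula and all function names are assumptions made for illustration.

```python
import math

def coaxis_direction(A, B, C, x, y, f):
    """Assumed unnormalized coaxis vector direction for the axis (A, B, C)."""
    mx = -A * (y * y + f * f) + B * x * y + C * x * f
    my = A * x * y - B * (x * x + f * f) + C * y * f
    return mx, my

def features(A, B, C, x, y, f):
    """Unit-length phi with h = w . phi for w = (x0, y0, 1), up to a positive scale."""
    mx, my = coaxis_direction(A, B, C, x, y, f)
    phi = (-mx, -my, mx * x + my * y)
    norm = math.sqrt(sum(p * p for p in phi))
    if norm == 0.0:
        return None  # degenerate point, carries no sign information
    return tuple(p / norm for p in phi)

def perceptron(samples, max_epochs=1000):
    """samples: list of (phi, label), label in {+1, -1}. Returns weights or None."""
    w = [0.0, 0.0, 0.0]
    for _ in range(max_epochs):
        errors = 0
        for phi, label in samples:
            if label * sum(wi * pi for wi, pi in zip(w, phi)) <= 0:
                w = [wi + label * pi for wi, pi in zip(w, phi)]
                errors += 1
        if errors == 0:
            return w  # every sample is on the correct side
    return None  # no separating weights found within the epoch budget
```

If weights are returned with a nonzero third component, the quotients `w[0] / w[2]` and `w[1] / w[2]` give a candidate FOE for that rotation axis. A `None` result only means that no separation was found within the epoch budget, so the choice of `max_epochs` acts as a practical threshold.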


Concerning the computational aspect of solving
the discrimination problem, different algorithms from
the pattern recognition literature can be applied. For
example, the Ho-Kashyap algorithm decides whether
a data set is linearly discriminable and will also find
the best discrimination.

4.2. Second step: Detranslation

Proper selection of normal flow vectors also enables
the elimination of the normal flow's translational
components. This can be achieved by choosing as
normal flow vectors the copoint vectors defined by
the locus of the FOE. With the location of the FOE
the directions of the translational motion components
are defined. The optical flow vectors lie on lines passing
through the FOE. The normal flow vectors perpendicular
to these lines (the copoint vectors (r, s),
where $r = x_0$ and $s = y_0$) do not contain translational,
but only rotational components. This can be
seen from equation (5). If the selected gradient direction
at a point (x, y) is $(y_0 - y, -x_0 + x)$ the scalar
product of the translational motion component and a
vector in the gradient direction is zero. This method
of eliminating the translational component, referred
to below as "detranslation", is applied to compute
the third rotational component and to further reduce
the possible number of solutions.

For each of the possible solutions computed in
the second module, the normal flow vectors perpendicular
to the lines passing through the FOE have
to be tested to see if they are only due to rotation
(see Figure 8). This results in solving an overdetermined
system of linear equations. Since two of the
rotational parameters are already computed, there is
only one unknown, the value γ. Every point supplies
an equation of the form

$$u_n = \gamma\left(\frac{\alpha}{\gamma}\left(\frac{xy}{f}\,n_x + \left(\frac{y^2}{f} + f\right)n_y\right) - \frac{\beta}{\gamma}\left(\left(\frac{x^2}{f} + f\right)n_x + \frac{xy}{f}\,n_y\right) + (y\,n_x - x\,n_y)\right) \qquad (12)$$

Since the chosen normal flow vectors are due only
to rotation, the solution to the overdetermined system
gives the γ value. In a practical application a
threshold has to be chosen to discriminate between
possible and impossible solutions. The value of the
residual is used to confirm the presumption that the
selected normal flow values are purely rotational.
Usually "detranslation" will not result in only one
solution, but will provide a set of possible parameter
quintuples.
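The overdetermined system for γ described in this subsection can be sketched in Python as a one-unknown least-squares fit. This is an illustration under assumptions: the coefficient follows the rotational flow convention used for equation (12), and the names `gamma_coefficient`, `estimate_gamma`, `a_ratio` (α/γ) and `b_ratio` (β/γ) are hypothetical.

```python
import math

def gamma_coefficient(x, y, nx, ny, f, a_ratio, b_ratio):
    """Factor c such that a purely rotational normal flow value equals gamma * c."""
    return (a_ratio * ((x * y / f) * nx + (y * y / f + f) * ny)
            - b_ratio * ((x * x / f + f) * nx + (x * y / f) * ny)
            + (y * nx - x * ny))

def estimate_gamma(samples, f, a_ratio, b_ratio):
    """samples: list of (x, y, nx, ny, u_n). Returns (gamma, rms_residual)."""
    coeffs = [gamma_coefficient(x, y, nx, ny, f, a_ratio, b_ratio)
              for x, y, nx, ny, _ in samples]
    values = [s[4] for s in samples]
    denom = sum(c * c for c in coeffs)
    if denom == 0.0:
        raise ValueError("all coefficients vanish; gamma is not observable")
    gamma = sum(c * u for c, u in zip(coeffs, values)) / denom
    rms = math.sqrt(sum((u - gamma * c) ** 2
                        for c, u in zip(coeffs, values)) / len(samples))
    return gamma, rms
```

A candidate would be kept only if the residual stays below a threshold chosen by the user. A large residual suggests that the selected vectors still contain translation, meaning the assumed FOE is wrong.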


4.3.  Third  step:  Derotation 

The modules described so far considered only subsets
of the normal flow vectors. Clearly, after having
found possible solutions for the FOE and the axis
of rotation, we can test every candidate solution for
its correctness on any class of coaxis vectors. Since
the quadruple $(x_0, y_0, \alpha/\gamma, \beta/\gamma)$ defines a pattern on
every class of coaxis vectors, we just have to test for
the existence of this pattern. However, a pattern in
the general case is defined only on parts of the image
plane. Thus even by testing every possible class of
coaxis vectors not every normal flow vector will be
tested.

In order to eliminate all motion parameters which
are in contradiction to the given normal flow field,
every normal flow vector has to be checked. This check
is performed by a "derotation" technique. With every
parameter quintuple computed in the second step
a possible FOE and a rotation are defined. The three
rotational parameters are used to derotate the normal
flow vectors by subtracting the rotational component
$(u_{rot}, v_{rot})$. At every point the flow vector
$(u_{der}, v_{der})$ is computed:

$$u_{der} = u_n n_x - u_{rot} n_x$$

$$v_{der} = u_n n_y - v_{rot} n_y \qquad (13)$$

If  the  parameter  quintuple  defines  the  correct  so¬ 
lution,  the  remaining  normal  flow  is  purely  transla¬ 
tional.  Thus,  it  has  to  have  the  property  of  an  em¬ 
anating  motion  field.  Since  the  direction  of  optical 
flow  for  a  given  FOE  is  known,  the  possible  direc¬ 
tions  of  the  normal  flow  vectors  can  be  determined. 
The  normal  flow  vector  at  every  point  must  lie  in  a 
half  plane  (see  Figure  9).  The  technique  checks  all 
points  for  this  property  and  eliminates  solutions  that 
cannot  give  rise  to  the  given  normal  flow  field. 
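The halfplane test can be sketched as follows. This is one possible reading of the derotation step, not the authors' implementation: the rotational flow is projected onto the gradient direction before it is subtracted, the translation is assumed to be towards the scene (an emanating field), and the names `rotational_flow` and `passes_derotation` are hypothetical.

```python
def rotational_flow(x, y, f, alpha, beta, gamma):
    """Rotational optical flow at (x, y) (assumed sign convention)."""
    u = alpha * x * y / f - beta * (x * x / f + f) + gamma * y
    v = alpha * (y * y / f + f) - beta * x * y / f - gamma * x
    return u, v

def passes_derotation(samples, x0, y0, alpha, beta, gamma, f, tol=1e-9):
    """samples: list of (x, y, (nx, ny), u_n). True if no point contradicts the candidate."""
    for x, y, (nx, ny), u_n in samples:
        u_rot, v_rot = rotational_flow(x, y, f, alpha, beta, gamma)
        # Normal flow left over after removing the rotational part along n.
        residual = u_n - (u_rot * nx + v_rot * ny)
        # Component of the direction away from the candidate FOE along n.
        along = (x - x0) * nx + (y - y0) * ny
        # A purely translational, emanating field needs both to share a sign.
        if residual * along < -tol:
            return False
    return True
```

A candidate quintuple that fails at any point is discarded. The tolerance `tol` is an added assumption that keeps points with nearly vanishing values from deciding the outcome.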
Figure 9: Normal flow vectors due to translation are
constrained  to  lie  in  halfplanes. 


4.4.  The  complete  technique 

In  this  section  we  give  a  summary  of  the  complete 
technique  in  the  form  of  a  block  diagram.  The  com¬ 
putation  of  an  observer’s  egomotion  is  performed  in 
three  steps,  where  for  the  first  step  two  alternative 
modules can be chosen. The sets of candidate solutions
which are determined in the four modules are
called $S_1$, $S_2$, $S_3$. To denote single solutions or single
parameters, subscripts are used: $S_{1,i}$, $S_{2,i}$, $x_{0,i}$, $y_{0,i}$,
etc. The input to the algorithm is a normal flow field
and the outputs are all possible solutions for the direction
of translation and the rotation which can give
rise to this normal flow field.

The  algorithm  determines  the  complete  set  of  so¬ 
lutions.  If  for  a  given  normal  flow  field  the  algorithm 
finds  more  than  one  solution,  then  from  the  normal 
flow  field  alone  the  3-D  motion  cannot  be  determined 
uniquely.  In  this  case  one  may  use  matching  of  promi¬ 
nent  features  to  eliminate  the  incorrect  motion  pa¬ 
rameters. 

5.  Experiments 

Several  experiments  have  been  performed  on  syn¬ 
thetic  and  real  data.  For  different  3-D  motion  pa¬ 
rameters  normal  flow  fields  were  generated,  where  the 
depth  value  within  an  interval  and  the  gradient  di¬ 
rection  were  chosen  randomly.  In  all  experiments  on 
noiseless  data  the  correct  solution  was  found  as  the 
best  one.  Figure  10  shows  the  optical  flow  field  and 
the  normal  flow  field  for  one  of  the  generated  data 
sets. The image size is 100 × 100, the focal length is
150, the image coordinates of the FOE are (−5, +30)
and the relationship between the rotational components
is α : β : γ = 10 : 11 : 150. In Figure 11 the
fitting of the circle and the hyperbolas to the α-, β-,
and γ-vectors and the coaxis pattern (α/γ, β/γ) is
displayed. Points with positive normal flow values are
displayed in a light gray level and points with negative
values are dark. In the two modules of the first
step no quantitative use of values is made, since only
the sign of the normal flow is considered. This limited
use of data makes the module very robust, and
the correct solution for the axes of translation and
rotation will be found even in the presence of high
amounts of noise (up to 100%).

Figure 10: Flow fields of synthetic data.
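A synthetic normal flow field of the kind described above can be generated along these lines. The image size, focal length, FOE and the ratio α : β : γ are taken from the text. Everything else (the absolute rotation magnitude, the forward speed `w`, the depth interval, the grid `step`, the seed and the function name) is a hypothetical choice made only for this sketch.

```python
import math
import random

def synthetic_normal_flow(size=100, f=150.0, foe=(-5.0, 30.0),
                          rotation=(0.0010, 0.0011, 0.0150),
                          w=1.0, depth_range=(20.0, 40.0), step=10, seed=0):
    """Return a list of samples with random depth and random gradient direction."""
    rng = random.Random(seed)
    x0, y0 = foe
    alpha, beta, gamma = rotation
    half = size // 2
    field = []
    for x in range(-half, half, step):
        for y in range(-half, half, step):
            z = rng.uniform(*depth_range)  # depth drawn from an interval
            u = (w / z) * (x - x0) + alpha * x * y / f - beta * (x * x / f + f) + gamma * y
            v = (w / z) * (y - y0) + alpha * (y * y / f + f) - beta * x * y / f - gamma * x
            theta = rng.uniform(0.0, 2.0 * math.pi)  # random gradient direction
            n = (math.cos(theta), math.sin(theta))
            field.append({"x": x, "y": y, "n": n, "flow": (u, v),
                          "u_n": u * n[0] + v * n[1]})
    return field
```

Only the sign of `u_n` would be passed to the pattern fitting modules, while the value itself is needed in the detranslation and derotation steps.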

The NASA-Ames sequence² was used as a real
data  set.  In  this  sequence  the  camera  undergoes  only 
translational  motion;  we  added  different  amounts  of 
rotation.  For  all  points  at  which  the  translational 
motion  can  be  found,  the  rotational  normal  flow  is 
computed  and  the  new  position  of  each  pixel  is  eval¬ 
uated.  The  “rotated”  image  is  then  generated  by 
computing  the  new  greyvalues  through  bilinear  inter¬ 
polation.  The  images  were  convolved  with  a  Gaussian 
of kernel size 5 × 5 and standard deviation σ = 1.4.
The normal flow was computed by using 3 × 3
Sobel operators to estimate the spatial derivatives in
the x- and y-directions and by subtracting the 3 × 3
box-filtered  values  of  consecutive  images  to  estimate 
the  temporal  derivatives. 
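The derivative estimation described above can be sketched in plain Python on small images stored as nested lists. This is an illustration only: the Gaussian pre-smoothing is omitted, the Sobel responses are divided by 8 so that they approximate a per-pixel derivative, and the function names and the `min_gradient` threshold are assumptions.

```python
import math

SOBEL_X = ((-1, 0, 1), (-2, 0, 2), (-1, 0, 1))
SOBEL_Y = ((-1, -2, -1), (0, 0, 0), (1, 2, 1))
BOX = ((1, 1, 1), (1, 1, 1), (1, 1, 1))

def apply3x3(img, kernel, r, c, scale):
    """Correlate a 3 x 3 kernel with img around pixel (r, c) and divide by scale."""
    total = 0.0
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            total += kernel[dr + 1][dc + 1] * img[r + dr][c + dc]
    return total / scale

def normal_flow(frame1, frame2, min_gradient=1e-6):
    """Map interior pixels (r, c) to (u_n, (n_x, n_y)) using the brightness constancy constraint."""
    rows, cols = len(frame1), len(frame1[0])
    result = {}
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            ix = apply3x3(frame1, SOBEL_X, r, c, 8.0)
            iy = apply3x3(frame1, SOBEL_Y, r, c, 8.0)
            # Temporal derivative: difference of the box-filtered frames.
            it = apply3x3(frame2, BOX, r, c, 9.0) - apply3x3(frame1, BOX, r, c, 9.0)
            mag = math.hypot(ix, iy)
            if mag < min_gradient:
                continue  # no reliable gradient direction at this pixel
            result[(r, c)] = (-it / mag, (ix / mag, iy / mag))
    return result
```

For a horizontal brightness ramp that shifts by one pixel between frames, this yields a normal flow of one pixel along the x-direction at every interior pixel.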

When  adding  rotational  normal  flow  of  magni¬ 
tude  on  the  order  of  a  third  to  three  times  the  amount 
of  translational  flow,  the  exact  solution  was  always 
found among the best fitted parameter sets. In Figure
12 the computed normal flow vectors and the fitting
of the α-, β-, and γ-vectors for one of the
"rotated" images are shown. Areas of negative normal
flow  vectors  are  marked  by  horizontal  lines  and  ar¬ 
eas  of  positive  values  by  vertical  lines.  The  ground 

²This is a calibrated motion sequence made public for
the  Workshop  on  Visual  Motion,  1991. 
Normal flow field

Detranslation:

For every $S_{1,i}$ select the copoint vectors defined by $x_{0,i}$, $y_{0,i}$.
Check if the system of linear equations is consistent with rotation
and compute the third rotational component.

$S_2$ (set of quintuples { $x_0$, $y_0$, α, β, γ })

Complete derotation:

$S_3$ = {}
repeat until $S_2$ is empty
    For every $S_{2,i}$ derotate by $\alpha_i$, $\beta_i$, $\gamma_i$.
    If all derotated normal flow vectors lie within the allowed halfplane
    defined by { $x_{0,i}$, $y_{0,i}$ } keep the quintuple as a solution:
        $S_3 = S_3 \cup S_{2,i}$
    $S_2 = S_2 - S_{2,i}$

$S_3$ (set of quintuple(s) { $x_0$, $y_0$, α, β, γ })
Figure 11: (a), (b), (c): Positive and negative α-, β-,
and γ-vectors of synthetic data. (d), (e), (f): Fitting
of α-, β-, and γ-patterns. (g): Separation of coaxis
pattern (α/γ, β/γ).

Figure 12: Natural scene: Normal flow field and fitting
of α-, β-, and γ-vectors.
truth for the FOE is (−5, −8), the focal length is 599
pixels and the rotation between two image frames is
α = 0.0006, β = 0.0006 and γ = 0.004. The algorithm
computed the solution exactly.

6.  Conclusions 

We have presented geometric constraints on normal
flow  fields  due  to  rigid  motion,  which  take  the  form  of 
patterns  in  the  image  plane.  These  constraints  were 
exploited  to  solve  the  problem  of  computing  the  3-D 
motion  of  an  observer  relative  to  the  scene  in  a  ro¬ 
bust  way.  We  claim  robustness,  because  for  the  esti¬ 
mation  of  the  translational  and  rotational  axes  only 
the  signs  of  the  normal  flow  vectors  are  used;  and 
the  technique  is  global,  because  values  in  all  parts  of 
the  image  are  considered.  The  algorithmic  procedure 
constitutes  a  search  technique  in  a  parameter  space, 
where  appropriate  selection  of  normal  flow  values  is 
used  in  different  ways  to  reduce  the  dimensionality  of 
the  motion  estimation  problem.  In  order  to  compute 
the  axes  of  translation  and  rotation  two  different  so¬ 
lutions  were  proposed.  Either  three  different  subsets 
of  the  vector  field  are  searched  for  patterns  defined 
by  only  three  of  the  five  parameters,  or  a  search  for 
the  pattern,  defined  on  the  whole  image  plane,  which 
characterizes  the  axis  of  rotation  is  performed.  By 
selecting  values  which  are  due  only  to  rotation,  the 
complete  rotation  is  computed,  and  in  the  last  phase 
of  the  procedure  every  normal  flow  vector  is  tested  for 
consistency  with  the  computed  motion  parameters. 
Abstract

This  paper  presents  a  feature  based  approach 
to  estimating  the  kinematics  of  a  moving  cam¬ 
era  (or  cameras)  and  the  structure  of  the  ob¬ 
jects  in  a  stationary  environment  using  long, 
noisy,  monocular  and  binocular  image  se¬ 
quences.  Both  batch  and  recursive  algorithms 
are  discussed.  The  image  plane  coordinates 
of  the  feature  points  in  each  frame  are  first 
detected  and  then  matched  over  the  frames. 
These  noisy  image  coordinates  serve  as  inputs 
to  our  algorithms.  Due  to  the  nonlinear  na¬ 
ture  of  perspective  projection,  a  nonlinear  least 
squares  method  is  formulated  for  the  batch  al¬ 
gorithm,  and  a  conjugate  gradient  method  is 
then  applied  to  find  the  solution.  A  recursive 
method  using  an  Iterated  Extended  Kalman 
Filter (IEKF) for incremental estimation of motion
and structure is also presented. Since
the  plant  model  is  linear  in  our  formulation, 
closed  form  solutions  for  the  state  and  covari¬ 
ance  transition  equations  are  directly  derived. 
Experimental  results  for  several  real  image  se¬ 
quences  are  included. 

1  Introduction 

Visual  motion  analysis  from  image  sequences  is  one  of 
the  most  challenging  areas  in  the  field  of  computer  vi¬ 
sion.  The  goal  here  is  to  exploit  useful  2-D  informa¬ 
tion  from  images  to  infer  3-D  information  about  camera 
motion  and  scene  structure.  Apart  from  its  relevance 
to  the  understanding  of  the  human  visual  system,  mo¬ 
tion  analysis  has  many  applications  in  the  areas  of  target 
tracking,  passive  navigation,  mobile  robotics,  missile  and 
autonomous  vehicle  guidance,  and  space  and  underwater 
exploration.  The  task  at  hand  is  to  design  and  analyze 
algorithms  for  recovering  3-D  information  from  2-D  im¬ 
age  frames.  Many  approaches  have  been  advanced  by 
various  researchers  in  the  last  two  decades  to  address 
this  problem.  While  the  difficulties  involved  are  now 
better  understood,  experimental  results  have  had  mixed 
success.  By  its  very  nature  this  problem  is  ill-posed  and, 
as  will  be  discussed  later,  the  bulk  of  the  approaches  in¬ 
vestigated  have  proven  to  be  very  sensitive  to  even  mod¬ 
erate  levels  of  error  in  the  image  information  and  in  the 
calibration of the camera system.

Optical  flow  and  feature-based  methods  are  two 
paradigms  for  motion  analysis.  Methods  that  are  based 
on  the  computation  of  spatial  and  temporal  image  gra¬ 
dients  can  be  broadly  classified  as  optical  flow  methods 
[1]-[4]. The optical flow is by definition the apparent
movement  of  the  brightness  patterns  in  the  image.  It 
is  assumed  in  the  previously  cited  literature  to  be  iden¬ 
tical  to  the  motion  field  (the  true  2-D  velocity  field  of 
each image point), but this is seldom the case. Furthermore,
the aperture problem states that only the component
of the optical flow in the direction of the brightness
gradient — commonly  referred  to  as  the  normal  flow — 
can  be  determined.  In  order  to  compute  the  full  optical 
flow,  further  arbitrary  constraints  must  be  introduced. 
Finally,  the  estimation  of  brightness  derivatives  is  unsta¬ 
ble  and  sensitive  to  image  noise.  These  limitations  have 
resulted  in  alternative  uses  of  image  brightness  deriva¬ 
tives  to  obtain  qualitative  environmental  information  [5], 
or  in  direct  estimation  methods  [6],  or  in  methods  mak¬ 
ing  exclusive  use  of  normal  flow  information  [7].  The 
feature-based  approach  uses  discrete  image  features  such 
as  points,  lines,  or  contours  as  input  to  the  estimation 
process.  The  most  critical  problem  for  this  method  is  the 
so  called  correspondence  problem:  image  features  must 
be  extracted  and  matched  over  the  frames  both  tempo¬ 
rally  and  spatially  (in  the  case  of  stereoscopic  vision). 
Another  problem  is  the  nonlinearity  of  the  equations  in¬ 
volved  in  the  algorithm  due  to  the  perspective  projection 
model.  In  this  paper,  we  take  a  feature-based  approach. 

Much  of  the  early  work  on  feature-based  motion  anal¬ 
ysis  is  focused  on  two-  or  three-frame  problems  [8]-[13]. 
The goal is to determine the transformation between
selected feature points observed at two (or three)
successive time instants. The resulting equations are usually
nonlinear;  they  are  then  made  linear  by  increasing  the 
dimensionality  of  the  parameter  space.  The  advantage 
of  this  approach  is  that  it  does  not  assume  an  arbitrary 
kinematic  motion  model  for  the  objects  being  imaged 
and  that  conditions  for  the  uniqueness  of  the  parame¬ 
ters  to  be  estimated  can  be  found.  The  drawback  of  this 
method  is  its  sensitivity  to  the  presence  of  noise  [14]-[16]. 
Also,  because  the  rotation  center  is  implicitly  forced  to 
be  at  the  origin  of  the  world  coordinate  system,  the  mo¬ 
tion  parameters  computed  from  every  image  pair  will  be 
different,  which  makes  it  impossible  to  predict  the  pose 
of the camera or of objects at later time instants. It
has  been  shown  that  increasing  the  number  of  feature 
points  only  moderately  improves  the  motion-structure 
estimates. 

On  the  other  hand,  long-frame  based  methods  can 
fully  utilize  the  image  information  and  are  more  robust 
to  noise.  Approaches  in  this  category  usually  involve 
making  an  assumption  about  the  nature  of  the  motion 
model.  This  requires  a  certain  degree  of  smoothness  in 
the  motion  of  the  imaged  objects.  The  difficulties  in¬ 
volved  in  this  method  are  related  to  the  correspondence 
problem  mentioned  earlier  and  the  nonlinearity  of  the 
dynamic  and  observation  equations.  Thus,  methods  such 
as  least  squares  estimation  or  nonlinear  filtering  are  of¬ 
ten  adopted  to  solve  these  problems. 

2  Literature  Review 

The  following  is  a  brief  partial  review  of  relevant  liter¬ 
ature;  due  to  space  limitations,  many  other  significant 
pieces  of  related  work  are  not  mentioned  here. 

In  [17,  18],  Gennery  used  a  Kalman-like  recursive  fil¬ 
ter  for  the  tracking  of  a  known  3-D  object  using  line  fea¬ 
tures.  Weng  et  al.  [19]  introduced  a  long-frame  object 
motion  algorithm  based  on  a  Locally  Constant  Angular 
Momentum  (LCAM)  model.  Matthies  and  Shafer  [20] 
proposed  a  Kalman  filter  based  method  of  navigating  a 
robot  using  stereo  image  sequences.  Important  work  on 
outdoor  vehicle  navigation  was  done  by  Dickmanns  and 
Graefe  [21,  22]. 

Shariat  and  Price  studied  the  use  of  more  than  two 
frames  for  motion  analysis  assuming  that  the  motion 
is  approximately  constant  [23].  In  [24],  Tomasi  and 
Kanade  used  the  singular  value  decomposition  (SVD) 
technique  to  factor  a  matrix  of  image  measurements  into 
two matrices that represent shape and motion, respectively.
Zhang and Faugeras [25] developed a method of
simultaneously  performing  motion  estimation  and  object 
segmentation  from  a  long  stereo  image  sequence  using 
3-D  line  segments  as  features  (tokens)  and  using  an  Ex¬ 
tended  Kalman  filter  (EKF)  for  motion  tracking. 

In [26], Broida and Chellappa proposed a Kalman
filter-based  recursive  algorithm  for  estimating,  from  a 
long,  noisy,  monocular  image  sequence,  the  motion  and 
structure  of  a  2-D  object  undergoing  2-D  motion.  This 
work  was  extended  to  3-D  objects  and  more  general  kine¬ 
matic  models  in  [27],  where  a  batch  estimation  scheme 
was also presented. Broida et al. also used the Iterated
Extended Kalman Filter (IEKF) to effectively implement
a recursive algorithm for 3-D motion and structure [28].
A batch algorithm was used to initialize the
recursive  procedure. 

Chandrashekhar  and  Chellappa  [29]  developed  a  mov¬ 
ing  camera  algorithm  for  passive  ranging  using  a  monoc¬ 
ular  image  sequence.  Feature  points  were  automatically 
extracted using Gabor wavelets, and the feature matching
process was interleaved with the recursive estimation
algorithm,  which  uses  EKF,  thereby  reducing  the  search 
time  for  finding  matching  points.  In  [30],  Young  and 
Chellappa  proposed  a  more  general  motion  model  which 
included constant precession and acceleration, and
incorporated it into their algorithm for stereo image sequences.

3  Outline 

For  the  motion-structure  estimation  problem,  given  the 
perspective projection image model that we use, a nonlinear
least squares method is used in the batch algorithms
and an IEKF is used in the recursive algorithms.
The  case  of  a  moving  camera  and  stationary  scene,  us¬ 
ing  either  monocular  or  binocular  image  sequences,  is 
considered.  Our  approach  is  to  model  the  motion  of 
the  camera(s)  as  a  constant  translational  and  rotational 
motion  using  nine  motion  parameters,  namely  the  3-D 
vectors  of  the  position  of  the  rotation  center  and  its  lin¬ 
ear  and  angular  velocities.  The  structure  parameters  are 
the  3-D  coordinates  of  the  salient  feature  points  in  the 
inertial  coordinate  system. 

The  justification  for  choosing  this  motion  model  is  the 
smoothness  of  the  object  motion.  As  long  as  the  sam¬ 
pling  rate  is  high  enough,  the  object  motion  can  be  ap¬ 
proximated  over  a  short  period  of  time  using  only  a  first 
order  motion  model;  deviations  from  this  model  can  be 
treated  as  noise  which  can  be  taken  care  of  later  by  the 
recursive  (tracking)  filter.  A  standard  rotation  matrix  is 
used  to  describe  the  rotational  motion  rather  than  the 
quaternion  representation  used  earlier.  Under  these  con¬ 
ditions,  linear  plant  models  can  be  obtained  and  closed 
form  solutions  for  the  state  and  covariance  transition  dif¬ 
ferential  equations  can  be  directly  derived  without  the 
need  for  a  time-consuming  numerical  integration  step. 

To  handle  the  correspondence  problem,  the  Gabor 
wavelet  method  is  used  to  extract  salient  feature  points 
from each image [31], and matching and tracking of these
points  are  performed  using  the  method  originally  pro¬ 
posed  in  [32].  These  image  correspondences  are  then 
used  as  inputs  to  our  algorithms.  The  noise  in  the 
data  includes  quantization  error,  detection  error,  sys¬ 
tem  (camera)  parameter  error,  etc.  Due  to  the  motion 
of the camera(s), some feature points may be outside
the image or temporarily occluded by other objects in
the scene in some image frames; this is known as the
occlusion problem. In our batch algorithms this problem
is handled by omitting the measurements of the missing
points in the least squares criterion functions. Similarly,
in the recursive algorithms we incorporate only the
measurements obtained from the unoccluded points into the
measurement equations.

For binocular imagery, the traditional stereo triangulation
method fails when the images from the two cameras
are not taken at the same time. For our binocular
algorithm,  however,  since  asynchronism  is  allowed,  the 
two  cameras  can  function  independently. 

4  Monocular  Imagery 

The  next  two  sections  briefly  explore  the  motion- 
structure  estimation  problem  for  the  case  of  a  dynamic 
observer  moving  in  a  static  environment.  For  detailed 
derivations  and  more  experimental  results,  the  interested 
reader is referred to [33, 34].
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Figure  1:  Monocular  imaging  and  motion  models  of  the 
moving  camera. 


4.1  Image  Model 

As  shown  in  Figure  1,  a  camera  is  moving  with  respect  to 
a  fixed  environment.  A  3-D  inertial  (world)  coordinate 
system I is fixed on the ground, and the camera coordinate
system C is fixed on the camera with its z-axis
pointing along the optical axis. The focal length is f,
so the image plane is z = f in the C coordinate system.
Suppose that at time t the i-th feature point $P_i$ has 3-D
coordinates

\[
\mathbf{s}_{Ci}(t) = \left(s_{Cix}(t),\ s_{Ciy}(t),\ s_{Ciz}(t)\right)^T
\]

in coordinate system C. By the central projection model,
the coordinates on the image plane can be represented
as

\[
X_i(t) = f\,\frac{s_{Cix}(t)}{s_{Ciz}(t)} + n_{Xi}(t), \qquad
Y_i(t) = f\,\frac{s_{Ciy}(t)}{s_{Ciz}(t)} + n_{Yi}(t) \tag{1}
\]

where the n's are the additive noise variables.
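
The central projection of Equation (1) can be sketched in a few lines. This is only an illustration: the function name `project`, the zero-mean Gaussian noise model and the seeded generator are assumptions made here, since the text specifies nothing beyond "additive noise".

```python
import random

def project(point_cam, focal_length, noise_std=0.0, rng=None):
    """Central projection (Equation (1)) of a point given in camera coordinates."""
    x, y, z = point_cam
    if z == 0:
        raise ValueError("point lies in the plane z = 0; projection is undefined")
    rng = rng or random.Random(0)
    # Assumption: zero-mean Gaussian noise stands in for the additive noise terms.
    n_x = rng.gauss(0.0, noise_std) if noise_std > 0 else 0.0
    n_y = rng.gauss(0.0, noise_std) if noise_std > 0 else 0.0
    return (focal_length * x / z + n_x, focal_length * y / z + n_y)

# Noise-free image coordinates of a point two focal lengths in front of the camera.
noise_free = project((0.5, -0.25, 2.0), focal_length=1.0)
```

With `noise_std=0.0` the function returns the ideal image coordinates, which is the form used when predicted measurements are compared with observed ones.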

The coordinate systems C and I are set to be coincident
at time $t = t_0$. As time proceeds, the system C
starts to move with the camera and the system I is left
behind, fixed on the ground. The rotation center is assumed
to be fixed on the rotation axis in system C at all
times. Let $\mathbf{d}$ be the coordinates of the rotation center in
C, and denote the trajectory of this center in I at time
t by

\[
\mathbf{r}_I(t) = \left(r_{Ix}(t),\ r_{Iy}(t),\ r_{Iz}(t)\right)^T
\]

Since coordinate system C coincides with coordinate
system I at $t = t_0$, we also have

\[
\mathbf{r}_I(t_0) = \mathbf{d}
\]

4.2  Motion  Model 

Under  the  assumption  of  constant  translational  motion, 
the  camera  velocity  can  be  expressed  as 

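\[
\dot{\mathbf{r}}_I(t) = \mathbf{v}_I \tag{2}
\]

Integrating Equation (2) with the initial condition $\mathbf{r}_I(t_0) = \mathbf{d}$, we have

\[
\mathbf{r}_I(t) = \mathbf{d} + (t - t_0)\,\mathbf{v}_I \tag{3}
\]

The six time-invariant components of $\mathbf{d}$ and $\mathbf{v}_I$ are called
the translational motion parameters.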

Let $\boldsymbol{\omega}$ be the constant angular velocity of the camera
relative to the inertial coordinate system I. The vector
representations of this quantity in C and I are the same
because we have already assumed that C and I coincide
at time $t = t_0$. These three time-invariant components
of $\boldsymbol{\omega}$ are the rotational motion parameters that we want
to estimate.

Consider another coordinate system C' whose origin is
at $\mathbf{d}$ and which has the same directional vectors along the
x-, y- and z-axes as system C at all times. The camera is
rotating about the rotation center (whose location is $\mathbf{d}$)
with constant angular velocity $\boldsymbol{\omega}$. If the feature point
$P_i$ has coordinates $\mathbf{s}_{Ii}$ in I, decomposition of rotation
and translation gives

\[
\mathbf{s}_{Ii} = \left(s_{Iix},\ s_{Iiy},\ s_{Iiz}\right)^T
= \mathbf{r}_I(t) + R(\boldsymbol{\omega}, t - t_0)\,\mathbf{s}_{C'i}(t) \tag{4}
\]

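Let us denote system C' at time t by C'(t); then at time
t, vector $\mathbf{s}_{C'(t_0)i}$ rotates to $\mathbf{s}_{C'(t)i}$. Since vector $\mathbf{s}_{C'(t_0)i}$ in
$C'(t_0)$ has the same coordinates as vector $\mathbf{s}_{C'(t)i}$ in $C'(t)$,
and the total angle rotated is $|\boldsymbol{\omega}|(t - t_0)$, the coordinates
of $\mathbf{s}_{C'(t)i}$ in $C'(t_0)$ can be shown to be

\[
R_0\!\left[\frac{\omega_x}{|\boldsymbol{\omega}|},\ \frac{\omega_y}{|\boldsymbol{\omega}|},\ \frac{\omega_z}{|\boldsymbol{\omega}|};\ |\boldsymbol{\omega}|(t - t_0)\right]\mathbf{s}_{C'i}(t)
\quad \text{in } C'(t_0) \tag{5}
\]

where $R_0[n_1, n_2, n_3; \theta]$ is the standard rotation matrix
which rotates a 3-D vector representation by the angle $\theta$
with respect to the unit directional vector $(n_1, n_2, n_3)^T$.
Also, C'(t) is a shifted version of C(t), so

\[
\mathbf{s}_{C'i}(t) = \mathbf{s}_{Ci}(t) - \mathbf{d}
\]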
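
As an illustration of the rotation matrix $R_0[n_1, n_2, n_3; \theta]$, the sketch below builds it with Rodrigues' formula and derives $R(\boldsymbol{\omega}, t - t_0)$ from it. The function names are hypothetical, and Rodrigues' formula is one standard way to write such a matrix, not necessarily the form used by the authors.

```python
import math

def rotation_matrix(axis, angle):
    """R_0[n1, n2, n3; angle]: rotation by `angle` about the (normalized) axis."""
    n1, n2, n3 = axis
    norm = math.sqrt(n1 * n1 + n2 * n2 + n3 * n3)
    n1, n2, n3 = n1 / norm, n2 / norm, n3 / norm
    c, s = math.cos(angle), math.sin(angle)
    k = 1.0 - c
    return [
        [c + k * n1 * n1,      k * n1 * n2 - s * n3, k * n1 * n3 + s * n2],
        [k * n1 * n2 + s * n3, c + k * n2 * n2,      k * n2 * n3 - s * n1],
        [k * n1 * n3 - s * n2, k * n2 * n3 + s * n1, c + k * n3 * n3],
    ]

def motion_rotation(omega, dt):
    """R(omega, dt) = R_0[omega / |omega|; |omega| * dt]; identity when omega = 0."""
    mag = math.sqrt(sum(w * w for w in omega))
    if mag == 0.0:
        return [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    return rotation_matrix(omega, mag * dt)

def mat_vec(m, v):
    """Product of a 3x3 matrix with a 3-vector."""
    return [sum(m[i][j] * v[j] for j in range(3)) for i in range(3)]

# A quarter turn about the z-axis maps the x-axis onto the y-axis.
quarter_turn = mat_vec(rotation_matrix((0.0, 0.0, 1.0), math.pi / 2), [1.0, 0.0, 0.0])
```

Two properties used later in the text follow directly from this form: the transpose of `motion_rotation(omega, dt)` equals `motion_rotation(omega, -dt)`, and the vector `omega` itself is left unchanged by the rotation.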

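Using Equations (3) and (5), Equation (4) can be written
as

\[
\begin{aligned}
\mathbf{s}_{Ii} &= \left(s_{Iix},\ s_{Iiy},\ s_{Iiz}\right)^T \\
&= \mathbf{r}_I(t) + R(\boldsymbol{\omega}, t - t_0)\,[\mathbf{s}_{Ci}(t) - \mathbf{d}] \\
&= \mathbf{d} + (t - t_0)\,\mathbf{v}_I + R(\boldsymbol{\omega}, t - t_0)\,[\mathbf{s}_{Ci}(t) - \mathbf{d}]
\end{aligned} \tag{6}
\]

where

\[
R(\boldsymbol{\omega}, t - t_0) = R_0\!\left[\frac{\omega_x}{|\boldsymbol{\omega}|},\ \frac{\omega_y}{|\boldsymbol{\omega}|},\ \frac{\omega_z}{|\boldsymbol{\omega}|};\ |\boldsymbol{\omega}|(t - t_0)\right] \tag{7}
\]

Rearranging Equation (6), we get

\[
\begin{aligned}
\mathbf{s}_{Ci}(t) &= \left(s_{Cix}(t),\ s_{Ciy}(t),\ s_{Ciz}(t)\right)^T \\
&= \mathbf{d} + R^{-1}(\boldsymbol{\omega}, t - t_0)\,[\mathbf{s}_{Ii} - \mathbf{d} - (t - t_0)\,\mathbf{v}_I]
\end{aligned} \tag{8}
\]

where $R^{-1}(\boldsymbol{\omega}, t - t_0) = R^T(\boldsymbol{\omega}, t - t_0) = R(\boldsymbol{\omega}, -(t - t_0))$.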

In our algorithms, the three time-invariant components
of each $\mathbf{s}_{Ii}$ are called the structure parameters.
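
The relation between inertial and camera coordinates can be sketched as a pair of mutually inverse functions. The names `rotate`, `camera_coords` and `inertial_coords` are hypothetical; the rotation is applied with the vector form of Rodrigues' formula, and the inverse rotation is obtained by negating the elapsed time, as stated for $R^{-1}$ in the text.

```python
import math

def rotate(vec, omega, dt):
    """Apply R(omega, dt) to vec (Rodrigues' formula in vector form)."""
    mag = math.sqrt(sum(w * w for w in omega))
    if mag == 0.0:
        return list(vec)
    k = [w / mag for w in omega]
    theta = mag * dt
    c, s = math.cos(theta), math.sin(theta)
    dot = sum(ki * vi for ki, vi in zip(k, vec))
    cross = [k[1] * vec[2] - k[2] * vec[1],
             k[2] * vec[0] - k[0] * vec[2],
             k[0] * vec[1] - k[1] * vec[0]]
    return [vec[i] * c + cross[i] * s + k[i] * dot * (1.0 - c) for i in range(3)]

def camera_coords(s_I, d, v, omega, dt):
    """Inertial point -> camera coordinates: d + R^{-1}(omega, dt) [s_I - d - dt v]."""
    rel = [s_I[i] - d[i] - dt * v[i] for i in range(3)]
    back = rotate(rel, omega, -dt)  # R^{-1}(omega, dt) = R(omega, -dt)
    return [d[i] + back[i] for i in range(3)]

def inertial_coords(s_C, d, v, omega, dt):
    """Camera coordinates -> inertial point: d + dt v + R(omega, dt) [s_C - d]."""
    rel = [s_C[i] - d[i] for i in range(3)]
    fwd = rotate(rel, omega, dt)
    return [d[i] + dt * v[i] + fwd[i] for i in range(3)]

d, v, omega = [0.2, -0.1, 0.0], [0.05, 0.0, 0.3], [0.0, 0.02, 0.01]
s_I = [1.0, 0.5, 4.0]
s_C = camera_coords(s_I, d, v, omega, dt=3.0)
round_trip = inertial_coords(s_C, d, v, omega, dt=3.0)
```

At `dt = 0` the two coordinate systems coincide, so `camera_coords` returns the inertial coordinates unchanged.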
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4.3  Batch  Formulation 


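In this formulation, the unrecoverable global scaling factor
is taken care of by normalizing the translational and
structure parameters by, say, the z-component of the
M-th feature point, $s_{IMz}$. Then these normalized parameters
can be expressed as

\[
\begin{aligned}
\mathbf{d}' &= \left(\frac{d_x}{s_{IMz}},\ \frac{d_y}{s_{IMz}},\ \frac{d_z}{s_{IMz}}\right)^T \\
\mathbf{v}' &= \left(\frac{v_x}{s_{IMz}},\ \frac{v_y}{s_{IMz}},\ \frac{v_z}{s_{IMz}}\right)^T \\
\mathbf{s}'_{Ii} &= \left(s'_{Iix},\ s'_{Iiy},\ s'_{Iiz}\right)^T
= \left(\frac{s_{Iix}}{s_{IMz}},\ \frac{s_{Iiy}}{s_{IMz}},\ \frac{s_{Iiz}}{s_{IMz}}\right)^T
\end{aligned} \tag{9}
\]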


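In order to remove the redundant degree of freedom
caused by an arbitrary shift of the rotation center along
the rotation axis, we set, say, $d_z = 0$ in our algorithms.
Using Equation (9), Equation (8) can be written as

\[
\frac{1}{s_{IMz}}\left(s_{Cix}(t),\ s_{Ciy}(t),\ s_{Ciz}(t)\right)^T
= \mathbf{d}' + R^{-1}(\boldsymbol{\omega}, t - t_0)\,[\mathbf{s}'_{Ii} - \mathbf{d}' - (t - t_0)\,\mathbf{v}'] \tag{10}
\]

Equation (1) thus becomes

\[
\begin{aligned}
X_i(t) &= f\,\frac{d'_x + R^{-1}_1\,[\mathbf{s}'_{Ii} - \mathbf{d}' - (t - t_0)\,\mathbf{v}']}{R^{-1}_3\,[\mathbf{s}'_{Ii} - \mathbf{d}' - (t - t_0)\,\mathbf{v}']} + n_{Xi}(t) \\
Y_i(t) &= f\,\frac{d'_y + R^{-1}_2\,[\mathbf{s}'_{Ii} - \mathbf{d}' - (t - t_0)\,\mathbf{v}']}{R^{-1}_3\,[\mathbf{s}'_{Ii} - \mathbf{d}' - (t - t_0)\,\mathbf{v}']} + n_{Yi}(t)
\end{aligned} \tag{11}
\]

where $R^{-1}_k$ is the k-th row vector of $R^{-1}(\boldsymbol{\omega}, t - t_0)$, primes denote
parameters normalized by $s_{IMz}$, and $d_z$ is set to zero in vector $\mathbf{d}'$.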

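The 3M + 7 unknown motion and structure parameters
for motion estimation from a sequence of monocular
images are chosen as

\[
\mathbf{y}^m = \left(d'_x,\ d'_y,\ \mathbf{v}'^T,\ \boldsymbol{\omega}^T,\ \mathbf{s}'^T_{I1},\ \ldots,\ \mathbf{s}'^T_{I,M-1},\ s'_{IMx},\ s'_{IMy}\right)^T
\]

The least squares estimate of the motion and structure
parameters $\mathbf{y}^m$ for Equation (11) is obtained by finding
the minimum of the following cost function:

\[
J(\mathbf{y}^m) = \sum_{k}\ \sum_{i \in V_k}
\left\{\left[X_i(t_k) - \hat{X}_i(t_k; \mathbf{y}^m)\right]^2 + \left[Y_i(t_k) - \hat{Y}_i(t_k; \mathbf{y}^m)\right]^2\right\} \tag{12}
\]

where $\hat{X}_i$ and $\hat{Y}_i$ are the noise-free image coordinates predicted by
Equation (11) and $V_k$ is the set of points visible in frame k.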
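
A least squares criterion that omits the measurements of missing points can be sketched as follows. The plain sum of squared residuals, the function names and the toy prediction model are all assumptions made for illustration; an occluded point is represented by `None` and simply contributes nothing to the sum.

```python
def batch_cost(observed, predict, params):
    """Sum of squared image residuals over all frames and all visible points.

    observed[k][i] is the measured (X, Y) of point i in frame k, or None when
    the point is occluded or outside the image in that frame.
    """
    total = 0.0
    for k, frame in enumerate(observed):
        for i, measurement in enumerate(frame):
            if measurement is None:
                continue  # occluded point: its term is left out of the criterion
            pred_x, pred_y = predict(params, k, i)
            total += (measurement[0] - pred_x) ** 2 + (measurement[1] - pred_y) ** 2
    return total

def toy_predict(params, k, i):
    """Hypothetical stand-in for the prediction model: sideways translation only."""
    x, y, z = params["points"][i]
    return ((x - k * params["vx"]) / z, y / z)

truth = {"points": [(1.0, 0.5, 2.0), (-0.5, 0.25, 4.0)], "vx": 0.1}
observed = [[toy_predict(truth, k, i) for i in range(2)] for k in range(3)]
observed[1][0] = None  # point 0 is occluded in frame 1

cost_at_truth = batch_cost(observed, toy_predict, truth)
cost_off_truth = batch_cost(observed, toy_predict, dict(truth, vx=0.2))
```

In a full implementation `predict` would evaluate the perspective measurement model for the current parameter vector, and the cost would be passed to a nonlinear least squares routine.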


If the answer obtained by our algorithm is $\mathbf{d} = (d_x, d_y, 0)^T$,
then any vector $\tilde{\mathbf{d}}$ that has the form
$\tilde{\mathbf{d}} = \mathbf{d} + \alpha\,\boldsymbol{\omega}$ (where $\alpha$ is some constant) will also satisfy
Equation (8). This is because $\boldsymbol{\omega}$ is an eigenvector of rotation
matrix R (and also of $R^{-1}$) with eigenvalue 1. This also
confirms the fact that any point along the rotation axis
can serve as the rotation center without affecting the
image plane coordinates.
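
Both ambiguities discussed in this section can be checked numerically. The sketch below, with hypothetical function names and arbitrary test values, confirms that (a) scaling the translational and structure parameters by a common factor and (b) shifting the rotation center along the rotation axis both leave the noise-free image coordinates unchanged.

```python
import math

def rotate(vec, omega, dt):
    """Apply R(omega, dt) to vec (Rodrigues' formula in vector form)."""
    mag = math.sqrt(sum(w * w for w in omega))
    if mag == 0.0:
        return list(vec)
    k = [w / mag for w in omega]
    theta = mag * dt
    c, s = math.cos(theta), math.sin(theta)
    dot = sum(ki * vi for ki, vi in zip(k, vec))
    cross = [k[1] * vec[2] - k[2] * vec[1],
             k[2] * vec[0] - k[0] * vec[2],
             k[0] * vec[1] - k[1] * vec[0]]
    return [vec[i] * c + cross[i] * s + k[i] * dot * (1.0 - c) for i in range(3)]

def image_point(s_I, d, v, omega, dt, f=1.0):
    """Noise-free image coordinates of an inertial point after elapsed time dt."""
    rel = [s_I[i] - d[i] - dt * v[i] for i in range(3)]
    back = rotate(rel, omega, -dt)  # inverse rotation
    x, y, z = (d[i] + back[i] for i in range(3))
    return (f * x / z, f * y / z)

def scale(vec, factor):
    return [factor * c for c in vec]

d, v, omega = [0.2, -0.1, 0.0], [0.05, 0.0, 0.3], [0.0, 0.02, 0.01]
s_I, dt = [1.0, 0.5, 4.0], 3.0
reference = image_point(s_I, d, v, omega, dt)

# (a) Global scale: d, v and the structure are all multiplied by 2.5.
rescaled = image_point(scale(s_I, 2.5), scale(d, 2.5), scale(v, 2.5), omega, dt)

# (b) Rotation center moved along the rotation axis by alpha * omega.
alpha = 7.5
d_shifted = [d[i] + alpha * omega[i] for i in range(3)]
shifted = image_point(s_I, d_shifted, v, omega, dt)
```

Both `rescaled` and `shifted` agree with `reference` to rounding error, which is why one structure component is fixed by normalization and one component of the rotation center is set to zero.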


4.4  Recursive  Formulation 
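In Section 4.3, we set $d_z = 0$; thus $r'_z(t) = (t - t_0)\,v'_z$,
so that $r'_z(t)$ can be dropped out of the estimation process.
The state vector $\mathbf{x}^m(t)$ chosen for the recursive
algorithm thus consists of the following normalized motion
and structure parameters:

\[
\mathbf{x}^m(t) = \left(r'_x(t),\ r'_y(t),\ \mathbf{v}'^T,\ \boldsymbol{\omega}^T,\ \mathbf{s}'^T_{I1},\ \ldots,\ \mathbf{s}'^T_{I,M-1},\ s'_{IMx},\ s'_{IMy}\right)^T
\]

where primes denote normalization by $s_{IMz}$.

Plant Equation

Under the assumption of constant translational and
angular velocity, the time derivatives of $\mathbf{v}'$, $\boldsymbol{\omega}$,
$\mathbf{s}'_{I1}, \ldots, \mathbf{s}'_{IM}$ are all zero. Using this fact and Equation (2),
we get the linear plant equation

\[
\dot{\mathbf{x}}^m(t) = F^m\,\mathbf{x}^m(t) \tag{13}
\]

where $F^m$ is the sparse square matrix

\[
F^m = \{F_{13} = F_{24} = 1;\ \text{all other elements } F_{ij} = 0\} \tag{14}
\]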


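Measurement Equation

Let

\[
\mathbf{r}'^m(t) = \left(\frac{r_{Ix}(t)}{s_{IMz}},\ \frac{r_{Iy}(t)}{s_{IMz}},\ \frac{(t - t_0)\,v_z}{s_{IMz}}\right)^T
= \left(r'_x(t),\ r'_y(t),\ (t - t_0)\,v'_z\right)^T
\]

Substituting this into Equation (11), we obtain the measurement
equation

\[
\mathbf{z}^m(t_i) = \mathbf{h}^m[\mathbf{x}^m(t_i);\ t_i] + \mathbf{n}^m(t_i) \tag{15}
\]

where

\[
\mathbf{z}^m(t_i) = \left(X_{j_1}(t_i),\ Y_{j_1}(t_i),\ \ldots,\ X_{j_n}(t_i),\ Y_{j_n}(t_i)\right)^T, \qquad
\mathbf{n}^m(t_i) = \left(n_{Xj_1}(t_i),\ n_{Yj_1}(t_i),\ \ldots,\ n_{Xj_n}(t_i),\ n_{Yj_n}(t_i)\right)^T
\]

and $\mathbf{h}^m$ stacks, for each $j = j_1, \ldots, j_n$, the pair

\[
f\,\frac{r'_x(t_i) - (t_i - t_0)\,v'_x + R^{-1}_1(\boldsymbol{\omega}, t_i - t_0)\,[\mathbf{s}'_{Ij} - \mathbf{r}'^m(t_i)]}{R^{-1}_3(\boldsymbol{\omega}, t_i - t_0)\,[\mathbf{s}'_{Ij} - \mathbf{r}'^m(t_i)]}, \qquad
f\,\frac{r'_y(t_i) - (t_i - t_0)\,v'_y + R^{-1}_2(\boldsymbol{\omega}, t_i - t_0)\,[\mathbf{s}'_{Ij} - \mathbf{r}'^m(t_i)]}{R^{-1}_3(\boldsymbol{\omega}, t_i - t_0)\,[\mathbf{s}'_{Ij} - \mathbf{r}'^m(t_i)]}
\]

where $j_1, \ldots, j_n$ are the unoccluded feature points at
time $t_i$. Therefore, only unoccluded points are incorporated
in the measurement equation (15).

Figure 2: Binocular imaging and motion models of the
moving cameras.

Since our plant model is linear, and matrix $F^m$ is
sparse, closed form solutions for the state and covariance
transition equations can be derived directly without
a numerical integration step.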

State  Transition  Equation 

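$\mathbf{v}'$, $\boldsymbol{\omega}$, $\mathbf{s}'_{I1}$, $\mathbf{s}'_{I2}$, ..., $\mathbf{s}'_{IM}$ are all constant in t under the
rigidity assumption and the model of constant translational
and angular velocities; this together with (3) gives
the state transition equation

\[
\mathbf{x}^m(t_{i+1}) = \left[I + (t_{i+1} - t_i)\,F^m\right]\mathbf{x}^m(t_i) \tag{16}
\]

where I is the identity matrix.

Covariance Transition Equation

By (14), we see that $(F^m)^2$ is a zero matrix. Direct substitution
may then be used to verify that the covariance
transition equation has the form

\[
P^m(t_{i+1}) = \left[I + (t_{i+1} - t_i)\,F^m\right] P^m(t_i) \left[I + (t_{i+1} - t_i)\,F^m\right]^T \tag{17}
\]

where $P^m$ denotes the state error covariance matrix.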
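
The closed-form transitions follow from the sparsity of the plant matrix: because its square vanishes, the matrix exponential reduces to the identity plus a single linear term. The sketch below builds the matrix for a state of a given length and propagates a state and covariance over one step. The function names are hypothetical, and plant (process) noise is omitted, which is an assumption of this sketch.

```python
def plant_matrix(n):
    """Sparse plant matrix with F_13 = F_24 = 1 (1-based indices), zeros elsewhere."""
    F = [[0.0] * n for _ in range(n)]
    F[0][2] = 1.0  # derivative of the first position state is the first velocity state
    F[1][3] = 1.0  # derivative of the second position state is the second velocity state
    return F

def mat_mul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

def transpose(a):
    return [list(row) for row in zip(*a)]

def transition_matrix(n, dt):
    """I + dt * F, which equals exp(F * dt) because F squared is zero."""
    F = plant_matrix(n)
    return [[(1.0 if i == j else 0.0) + dt * F[i][j] for j in range(n)] for i in range(n)]

def propagate(x, P, dt):
    """One closed-form prediction step for the state x and covariance P."""
    n = len(x)
    phi = transition_matrix(n, dt)
    x_next = [sum(phi[i][j] * x[j] for j in range(n)) for i in range(n)]
    P_next = mat_mul(mat_mul(phi, P), transpose(phi))
    return x_next, P_next

# Ten states correspond to a single feature point (M = 1) in the monocular case.
x0 = [1.0, 2.0, 0.5, -0.25, 0.1, 0.0, 0.0, 0.0, 0.3, 0.2]
P0 = [[1.0 if i == j else 0.0 for j in range(10)] for i in range(10)]
x1, P1 = propagate(x0, P0, dt=2.0)
```

Only the first two state components change during prediction; all other components, being constants of the motion model, are copied through unchanged.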

5  Binocular  Imagery 

5.1  Image  and  Motion  Models 

As  shown  in  Figure  2,  the  two  cameras  are  installed  on 
a  platform  which  is  moving  in  a  stationary  environment. 
Two  camera  coordinate  systems  with  the  same  orien¬ 
tation,  LC  and  RC,  are  attached  to  the  left  and  right 
cameras,  respectively,  with  their  z-axes  coinciding  with 
the  optical  axes  of  the  cameras.  A  3-D  inertial  coordi¬ 
nate  system  I  is  chosen  to  be  coincident  with  the  left 
camera coordinate system at time $t = t_0$, without loss
of  generality.  As  time  proceeds,  the  platform  moves  and 
the  inertial  coordinate  system  is  left  behind,  fixed  on  the 
ground. 

Let a point have 3-D coordinates $\mathbf{s}_I$ in inertial coordinate
system I, and let its coordinate representations
in the left and right camera coordinate systems at time
t be $\mathbf{s}_{LC}(t)$ and $\mathbf{s}_{RC}(t)$, respectively. Suppose that the
two cameras are linked together; then the relative orientation
and the displacement vector between the cameras
remain unchanged at all times. Thus, the transformation
between these systems can be expressed as

\[
\mathbf{s}_{RC}(t) = \mathbf{s}_{LC}(t) - \mathbf{D}, \qquad \mathbf{s}_{LC}(t_0) = \mathbf{s}_I \tag{18}
\]

where $\mathbf{D}$ is the displacement vector from the origin of
the LC coordinate system to that of the RC system and
is assumed to be known.

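Suppose that at time t the i-th feature point of the
scene, $P_i$, has 3-D coordinates

\[
\mathbf{s}_{LCi}(t) = \left(s_{LCix}(t),\ s_{LCiy}(t),\ s_{LCiz}(t)\right)^T
\]

and

\[
\mathbf{s}_{RCi}(t) = \left(s_{RCix}(t),\ s_{RCiy}(t),\ s_{RCiz}(t)\right)^T
\]

in the LC and RC systems, respectively. Then, by the
central projection model, the coordinates of the images
of $P_i$ on the left and right image planes can be represented
as

\[
X_{li}(t) = f_l\,\frac{s_{LCix}(t)}{s_{LCiz}(t)} + n_{Xli}(t), \qquad
Y_{li}(t) = f_l\,\frac{s_{LCiy}(t)}{s_{LCiz}(t)} + n_{Yli}(t) \tag{19}
\]

\[
X_{ri}(t) = f_r\,\frac{s_{RCix}(t)}{s_{RCiz}(t)} + n_{Xri}(t), \qquad
Y_{ri}(t) = f_r\,\frac{s_{RCiy}(t)}{s_{RCiz}(t)} + n_{Yri}(t) \tag{20}
\]

where $f_l$ and $f_r$ are the focal lengths of the two cameras
and the n's are additive noise processes.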

Similarly, the motion of the platform can be decomposed
into a rotation about the rotation center and a
translation of this center. The rotation center is chosen
to be fixed on the platform and has coordinates $\mathbf{d}$
in the LC coordinate system at all times. Denote the
trajectory of the rotation center at time t in I by $\mathbf{r}_I(t)$.
Since systems I and LC coincide at time $t = t_0$, we have

\[
\mathbf{r}_I(t) = \mathbf{d} + (t - t_0)\,\mathbf{v} \tag{21}
\]

where $\mathbf{v} = \mathbf{v}_I$ is the translational velocity.

Suppose the platform undergoes constant translational
motion $\mathbf{v}$ as well as constant angular motion $\boldsymbol{\omega}$
with respect to system I. Following the same procedure
as in the monocular case, the transformations between
the LC and RC systems and system I are found to be

\[
\mathbf{s}_{LCi}(t) = \left(s_{LCix}(t),\ s_{LCiy}(t),\ s_{LCiz}(t)\right)^T
= \mathbf{d} + R^{-1}(\boldsymbol{\omega}, t - t_0)\,[\mathbf{s}_{Ii} - \mathbf{d} - (t - t_0)\,\mathbf{v}] \tag{22}
\]

\[
\mathbf{s}_{RCi}(t) = \left(s_{RCix}(t),\ s_{RCiy}(t),\ s_{RCiz}(t)\right)^T
= \mathbf{d} - \mathbf{D} + R^{-1}(\boldsymbol{\omega}, t - t_0)\,[\mathbf{s}_{Ii} - \mathbf{d} - (t - t_0)\,\mathbf{v}] \tag{23}
\]

The three components of $\boldsymbol{\omega}$ are the rotational motion
parameters and the six components of $\mathbf{d}$ and $\mathbf{v}$ are the
translational motion parameters, while the three components
of each $\mathbf{s}_{Ii}$ are the structure parameters. Again,
these components are all time-invariant under our model
setup.
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5.2  Batch  and  Recursive  Formulations 

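The 3M + 8 unknown motion and structure parameters
for motion estimation from a sequence of stereo images
are chosen as

\[
\mathbf{y} = \left(d_x,\ d_y,\ \mathbf{v}^T,\ \boldsymbol{\omega}^T,\ \mathbf{s}^T_{I1},\ \ldots,\ \mathbf{s}^T_{IM}\right)^T
\]

Using Equations (22) and (23), Equations (19) and (20)
become

\[
\begin{aligned}
X_{li}(t) &= f_l\,\frac{d_x + R^{-1}_1\,[\mathbf{s}_{Ii} - \mathbf{d} - (t - t_0)\,\mathbf{v}]}{R^{-1}_3\,[\mathbf{s}_{Ii} - \mathbf{d} - (t - t_0)\,\mathbf{v}]} + n_{Xli}(t) \\
Y_{li}(t) &= f_l\,\frac{d_y + R^{-1}_2\,[\mathbf{s}_{Ii} - \mathbf{d} - (t - t_0)\,\mathbf{v}]}{R^{-1}_3\,[\mathbf{s}_{Ii} - \mathbf{d} - (t - t_0)\,\mathbf{v}]} + n_{Yli}(t)
\end{aligned}
\]

and

\[
\begin{aligned}
X_{ri}(t) &= f_r\,\frac{d_x - D_x + R^{-1}_1\,[\mathbf{s}_{Ii} - \mathbf{d} - (t - t_0)\,\mathbf{v}]}{-D_z + R^{-1}_3\,[\mathbf{s}_{Ii} - \mathbf{d} - (t - t_0)\,\mathbf{v}]} + n_{Xri}(t) \\
Y_{ri}(t) &= f_r\,\frac{d_y - D_y + R^{-1}_2\,[\mathbf{s}_{Ii} - \mathbf{d} - (t - t_0)\,\mathbf{v}]}{-D_z + R^{-1}_3\,[\mathbf{s}_{Ii} - \mathbf{d} - (t - t_0)\,\mathbf{v}]} + n_{Yri}(t)
\end{aligned} \tag{24}
\]

where $R^{-1}_k$ is the k-th row vector of $R^{-1}(\boldsymbol{\omega}, t - t_0)$ and $d_z$ is set to
zero in vector $\mathbf{d}$.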

The  least  squares  estimate  of  the  motion  and  structure 
parameters  y  of  Equation  (24)  is  obtained  by  finding  the 
minimum of the cost function, which has a similar form
to  that  in  Equation  (12). 

In the recursive formulation, $\mathbf{r}_I(t)$ is chosen as one
of the states instead of $\mathbf{d}$, to account for the dynamic
property of the Kalman filter. Since $d_z$ is set to zero to fix
the rotation center, by (21) we have $r_{Iz}(t) = (t - t_0)\,v_z$.
Thus if $v_z$ is included in the state vector, $r_{Iz}(t)$ is no
longer needed in the estimation process. Hence, the state
vector chosen for the recursive algorithm consists of the
following motion and structure parameters:

\[
\mathbf{x}(t) = \left(r_{Ix}(t),\ r_{Iy}(t),\ \mathbf{v}^T,\ \boldsymbol{\omega}^T,\ \mathbf{s}^T_{I1},\ \ldots,\ \mathbf{s}^T_{IM}\right)^T
\]

The  plant,  measurement,  state  and  covariance  transi¬ 
tion  equations  can  be  derived  similarly  to  their  deriva¬ 
tion  for  the  monocular  case.  After  these  equations  are 
obtained, a standard IEKF can be used.

Detailed  proofs  of  the  uniqueness  of  the  estimated  pa¬ 
rameters  in  the  binocular  case  can  be  found  in  [33]. 


6  Experimental  Results 

This  section  describes  real  imagery  experiments  for  both 
the  monocular  and  binocular  cases.  For  results  on  simu¬ 
lated  imagery  and  additional  real  imagery  experiments, 
see  [33]. 

6.1  Monocular  Imagery 

Two  real  image  sequences  were  tested.  The  inputs  to 
our  algorithms  are  the  2-D  image  coordinates  of  the 
salient  feature  points  detected  using  the  algorithm  pro¬ 
posed  in  [32].  In  this  algorithm,  the  global  image  change 
due  to  unknown  camera  motion  is  first  compensated  by 
an  image  registration  algorithm  [35],  which  automati¬ 
cally  detects  feature  points  and  estimates  the  rotation, 
translation  and  scaling  between  two  images.  The  feature 
points  are  then  tracked  over  the  image  sequence  using  a 
weighted cross-correlation match method. The method
can only find the best matches at grid points; to obtain
subpixel accuracy feature point correspondences, a
feature point in the current frame is first transformed to
the coordinates of the next frame after the motion of the
camera has been computed using the image registration
algorithm. The four grid points nearest to this feature
point are found and their best grid point matches in the
next frame are located. An interpolation scheme is then
applied to these four grid points to get the subpixel accuracy
correspondences. These correspondences serve as
the input to our algorithms. The detailed implementation
of this point correspondence algorithm is described
in [32].

Figure 3: Locations of the feature points in the first
frame of each sequence: (a) the UMASS Rocket sequence;
(b) the NASA helicopter (ARC) sequence.
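
The subpixel step can be illustrated with bilinear weights. This is only a sketch under an explicit assumption: the text says that an interpolation scheme is applied to the four nearest grid points but does not name it, so bilinear interpolation is used here as one plausible choice, and the function name and data layout are hypothetical.

```python
import math

def interpolate_match(point, grid_matches):
    """Interpolate a subpixel match from the matches of the four nearest grid points.

    point        -- (x, y) feature location, already mapped into the next frame
    grid_matches -- dict mapping an integer grid point (gx, gy) to its best
                    match (x, y) in the next frame
    """
    x, y = point
    x0, y0 = math.floor(x), math.floor(y)
    ax, ay = x - x0, y - y0  # fractional offsets inside the grid cell
    weights = {
        (x0, y0): (1.0 - ax) * (1.0 - ay),
        (x0 + 1, y0): ax * (1.0 - ay),
        (x0, y0 + 1): (1.0 - ax) * ay,
        (x0 + 1, y0 + 1): ax * ay,
    }
    match_x = sum(w * grid_matches[g][0] for g, w in weights.items())
    match_y = sum(w * grid_matches[g][1] for g, w in weights.items())
    return (match_x, match_y)

# If every neighbouring grid point moves by (+3, -2), so does the feature point.
neighbours = [(10, 20), (11, 20), (10, 21), (11, 21)]
grid_matches = {g: (g[0] + 3.0, g[1] - 2.0) for g in neighbours}
subpixel_match = interpolate_match((10.25, 20.5), grid_matches)
```

When the four grid matches disagree, the weights favour the grid points closest to the feature, so the result varies smoothly as the feature moves across the cell.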

In  our  experiments,  for  comparison  purposes,  feature 
points  with  3-D  ground  truth  are  hand-picked  in  the  first 
frame  of  the  UMASS  Rocket  sequence.  The  point  corre¬ 
spondence  algorithm  [32]  is  then  used  for  the  subsequent 
frames.  For  the  NASA  sequence,  feature  points  were  au¬ 
tomatically  detected  and  tracked  in  all  the  frames.  The 
locations  of  the  feature  points  used  for  each  sequence  are 
marked  in  Figure  3. 




The  world  coordinate  system  is  set  to  be  the  first- 
frame  image  coordinate  system  with  the  origin  located 
at the center of the image plane. The x-axis points to the
right,  the  y-axis  points  upward,  and  by  the  right  hand 
rule,  the  z-axis  thus  points  out  of  the  image  plane.  The 
initial  guesses  of  the  parameters  for  the  batch  algorithm 
are  all  0.001. 

6.1.1  UMASS  Rocket  Sequence 

The  30-frame  UMASS  “Rocket”  ALV  sequence  [36]  is 
used  in  this  section;  the  first  and  the  twenty-first  images 
are  shown  in  Figure  4  (a)  and  (b).  Eight  feature  points 
with known ground truth were used. Experiments with
the  batch  algorithm  were  based  on  four  feature  points. 
The  computed  motion  and  structure  parameters  and  the 
normalized  structure  ground  truth  of  these  points  in  the 
first  frame  image  coordinate  system  are  shown  in  Ta¬ 
bles  1  and  2. 


Table 1: Structure estimates for the UMASS Rocket sequence.
(A "?" marks a digit that is illegible in the source.)

Object     Feature   Normalized structure      Estimated
           point     ground truth              structure
                     x      y      z           x       y      z
Box 1-4    1         0.494  0.??7  0.???       0.489   0.337  0.901
           2         0.601  0.383  1.153       0.563   0.367  1.109
           3         0.435  0.357  0.965       0.430   0.349  1.003
           4         0.030  0.343  1.0        -0.001   0.351  1.0
Cone 1-4   1         0.366  0.345  1.015       0.354   0.341  0.995
           2         0.433  0.433  1.394       0.457   0.456  1.456
           3         0.169  0.350  0.593       0.161   0.348  0.573
           4         0.165  0.341  1.0         0.163   0.347  1.0

Table 2: Motion parameter estimates for the UMASS
Rocket sequence. (A "?" marks a digit that is illegible in the source.)

Object   Estimated rotational velocity    Estimated translational velocity
Cone     -0.00017   0.0103?   0.000??     0.00353  -0.00800   0.03347
Box       0.00414  -0.00341   0.00013     0.01351  -0.00683   0.04316

From the image sequence, we see that there is almost
no rotational motion and the camera appears to be moving
along a straight line going leftward and into the image
plane. Thus the z-component of the translational
motion should be the largest one, followed by the
x-component, and the y-component should be very close
to zero. (Recall that the depths of the feature points are
all negative, so that after the normalization step, the x-
and z-components of the translational velocity should be
positive.) According to these observations, the translational
velocity computed in Table 2 for the box seems to
be accurate.

In  the  recursive  experiment,  the  set  of  four  cone  points 
in  Table  1  was  used.  Because  one  of  these  points  disap¬ 
pears  after  frame  6,  one  disappears  after  frame  10,  and 
all  the  others  are  outside  the  image  after  frame  21,  it 
was  not  possible  to  apply  the  normally  used  recursive 
algorithm.  Instead,  the  output  from  the  six-frame  batch 
algorithm  was  used  as  the  initial  guess,  and  the  Kalman 
filter runs from frame 1 to frame 21, acting like a non-causal
smoother through the first six frames. Since motion
ground truth is not available, only the computed
structure error percentages are shown in Figure 5 (due
to space limitations, only points 1 and 3 are shown).

Since  the  camera  was  not  calibrated  in  this  experi¬ 
ment,  we  also  tried  to  estimate  the  field  of  view  of  the 
camera  as  well  as  the  center  of  the  image.  It  was  found 
that  the  fields  of  view  of  the  camera  were  very  close  to 
the  given  specification,  but  the  center  of  the  image  was 
estimated  to  be  at  (240,306)  instead  of  the  assumed  lo¬ 
cation  (256,270).  This  is  because  the  original  size  of  each 
image  is  512  x  484,  and  our  location  of  the  origin  is  at 
the  lower  left  corner  of  a  512  x  512  image  plane.  The 
two  poles  in  the  far  field  were  also  tested,  but  since  they 
are  close  to  the  FOE  and  their  depths  are  large,  one- 
or  two-pixel  errors  introduced  by  the  feature  detection 
algorithm  can  cause  fairly  large  amounts  of  error  in  the 
structure  estimates. 

6.1.2  NASA  Sequence 

A  ten-frame  NASA  Helicopter  (ARC)  sequence,  as 
shown  in  Figure  4  (c)  and  (d),  was  also  used.  Since  this 
sequence  is  too  short  to  apply  the  recursive  algorithm, 
only  the  batch  algorithm  was  tested.  Also,  ground  truth 
information  and  camera  calibration  parameters  were  not 
available,  so  in  running  the  batch  algorithm,  the  field  of 
view  was  assumed  to  be  40  degrees  and  the  optical  center 
was  assumed  to  coincide  with  the  center  of  each  image. 
The  estimates  of  the  structure  and  motion  parameters 
are  shown  in  Tables  3  and  4. 
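
An assumed field of view fixes the focal length in pixel units once an image width is chosen. The sketch below shows that relation for a pinhole camera with the optical center at the image center; the 512-pixel width is a hypothetical value, since the image size of this sequence is not stated in the text.

```python
import math

def focal_length_pixels(fov_degrees, image_width_pixels):
    """Focal length (pixels) of a pinhole camera with the given horizontal field of view."""
    half_angle = math.radians(fov_degrees) / 2.0
    return (image_width_pixels / 2.0) / math.tan(half_angle)

def field_of_view_degrees(focal_pixels, image_width_pixels):
    """Inverse relation: horizontal field of view from the focal length in pixels."""
    return math.degrees(2.0 * math.atan((image_width_pixels / 2.0) / focal_pixels))

# Hypothetical 512-pixel-wide image with the 40-degree field of view assumed above.
f_pixels = focal_length_pixels(40.0, 512)
```

A wider assumed field of view gives a shorter focal length, so an error in this assumption changes the predicted image coordinates of every point and is absorbed into the motion and structure estimates.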

Table 3: Structure estimates for the NASA Helicopter
sequence.

Feature   Estimated structure
point     x           y           z
1          0.036887   -0.045965   0.971495
2         -0.047572   -0.079270   1.071193
3         -0.242800   -0.066243   1.002631
4         -0.326975   -0.045814   1.0

Table 4: Motion parameter estimates for the NASA Helicopter
sequence.

Estimated rotational velocity      -0.0003   0.0194  -0.0058
Estimated translational velocity   -0.0355  -0.0043   0.0405

6.2  Binocular  Imagery 

The  10-frame  “Forward”  stereo  image  sequence  is  used 
in  this  section.  The  first  and  last  image  pairs  are  shown 
in  Figure  6  with  the  feature  trajectories  superimposed 
on  them.  The  inertial  coordinate  system  is  taken  to  be 
the  first-frame  left  camera  coordinate  system  with  its 
origin located at (246.6, 225.6). The x-axis points to
the right, the y-axis points downward, and by the right
hand rule, the z-axis thus points into the image. The displacement
vector D is (1.0, 0, 0) inch, while the motion
of  the  platform  is  0.2  inches/frame  straight  ahead  (pure 
translation).  Structure  information  was  not  available  to 
us. 


Figure 4: Image plane trajectories of feature points: (a), (b) first and 21st frames of the UMASS Rocket sequence;
(c),  (d)  first  and  last  frames  of  the  NASA  sequence. 



Figure  5:  Normalized  structure  error  percentages  for  the  UMASS  Rocket  sequence  computed  by  the  Kalman  filter 
(cone):  (a)  point  1;  (b)  point  3. 


Figure  6:  Image  plane  trajectories  of  feature  points  (in  circled  areas)  in  the  Forward  sequence:  (a)  first  frame  of  left 
sequence,  (b)  last  frame  of  left  sequence,  (c)  first  frame  of  right  sequence,  (d)  last  frame  of  right  sequence. 


Four  feature  points  are  used  to  test  both  the  batch 
and  recursive  algorithms;  their  locations  are  shown  in 
Figure  7.  These  feature  points  are  first  detected  and 
tracked  over  the  frames  of  the  left  image  sequence  using 
the  method  proposed  in  [32].  Then  the  first  frame  of 
the  left  sequence  is  registered  with  the  first  frame  of  the 
right  sequence  to  find  the  corresponding  feature  points 
in  these  two  frames  [35].  The  correspondences  of  the 
feature  points  in  the  first  frames  of  the  left  and  right  se¬ 
quences  are  shown  in  Figure  8.  After  the  matching  points 
in  the  first  frame  of  the  right  sequence  are  found,  the 
same  tracking  algorithm  can  be  used  for  the  right  image 
sequence.  The  coordinates  of  these  feature  points  then 
serve  as  the  input  to  our  algorithms.  The  estimated  mo¬ 
tion  and  structure  parameters  using  the  10-frame  batch 
method  are  listed  in  Table  5.  The  error  in  the  velocity 
along the z-direction is around 8 percent. The outputs
of  the  2-frame  batch  method  were  used  as  the  inputs 
to  our  recursive  algorithm;  the  results  for  the  estimated 
velocity  are  shown  in  Figure  9. 
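
The quoted velocity error can be checked by simple arithmetic. The sketch assumes the true forward motion of 0.2 inches/frame stated above and takes the estimated forward velocity as 0.184, that is, the tabulated value truncated to three decimals.

```python
def percent_error(estimate, truth):
    """Relative error of an estimate with respect to the true value, in percent."""
    return abs(estimate - truth) / abs(truth) * 100.0

# Estimated forward velocity (truncated to three decimals) against 0.2 inches/frame.
z_velocity_error = percent_error(0.184, 0.2)
```

The result is about 8 percent, in agreement with the figure quoted in the text.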


Table 5: Structure and motion parameter estimates for
the Forward sequence. (A "?" marks a digit that is illegible in the source.)

Feature   Estimated structure
point     x        y       z
1         -4.414   0.527   22.338
2         -2.170   2.953   22.588
3         -0.804   1.916   22.372
4          4.168   0.542   26.788

Estimated velocities      x         y         z
Rotational (radians)      -0.0002   -0.0002   -0.0001
Translational              0.0052    0.0009    0.184?

7  Conclusion 

Complete,  automatic  algorithms,  starting  from  feature 
correspondences  and  yielding  motion  and  structure  esti¬ 
mates,  are  reported  in  this  paper.  Both  the  batch  and 
recursive  algorithms  for  motion  and  structure  recovery 
are  found  to  be  robust  in  spite  of  different  sources  and 
levels  of  noise.  Our  algorithms  perform  well  even  when 
as  few  as  four  feature  correspondences  are  used;  this  fact, 
together  with  the  ability  to  handle  occlusion,  makes  it 
possible  to  handle  appearances  and  disappearances  of 
features  as  well  as  relatively  short-span  observations  of 
features, thus reducing the computational cost associated
with maintaining feature correspondences. In the
binocular case, if images are taken at the same time by
both cameras, stereo triangulation can give us good initial
guesses for the structure parameters to speed up the
convergence of both the batch and recursive algorithms.
With little modification, the algorithms can also be applied
to situations where the two cameras undergo different
motions.

Figure 7: Locations of feature points in the first frame
of the left image sequence.

Figure 8: Corresponding feature points in the left and
right image sequences superimposed on the first frame
of the left sequence.
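
The stereo triangulation mentioned in the conclusion as a source of initial structure guesses can be sketched for the simplest configuration. The sketch assumes synchronous images, parallel cameras with equal focal lengths, and a displacement vector directed along the x-axis; the function names and numerical values are hypothetical.

```python
def project_pair(point_left, focal_length, baseline):
    """Noise-free left and right images of a point given in left camera coordinates."""
    x, y, z = point_left
    left = (focal_length * x / z, focal_length * y / z)
    right = (focal_length * (x - baseline) / z, focal_length * y / z)
    return left, right

def triangulate(left, right, focal_length, baseline):
    """Recover the 3-D point in left camera coordinates from a synchronous image pair."""
    disparity = left[0] - right[0]
    if disparity == 0:
        raise ValueError("zero disparity: the point is at infinity")
    z = focal_length * baseline / disparity
    return (left[0] * z / focal_length, left[1] * z / focal_length, z)

left_image, right_image = project_pair((-4.0, 0.5, 22.0), focal_length=500.0, baseline=1.0)
initial_guess = triangulate(left_image, right_image, focal_length=500.0, baseline=1.0)
```

Because depth is inversely proportional to disparity, a fixed pixel error produces a larger depth error for distant points, so such values are better suited as starting points for the batch or recursive estimators than as final estimates.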
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Abstract 

We  show  that  the  bounded-error  recognition 
problem  for  images  of  3D  objects  using  point 
features can be decomposed into 1D search
tasks,  along  lines  joining  the  origin  of  the  object 
coordinate  system  to  the  feature  points  chosen 
to model the object. Points are constructed
along  these  lines  at  locations  given  by  the  co¬ 
ordinates  of  the  detected  image  points;  concur¬ 
rent  bracketing  of  these  points  by  segment  tree 
search along each of these lines provides maximal
matchings between feature points and image
points. Depth of search is limited by pixel
resolution.  This  method  is  well  adapted  to  the 
task  of  tracking  objects  in  the  presence  of  vari¬ 
able  occlusion  and  clutter. 

1  Introduction 

The  task  considered  here  is  model-based  recognition  and 
tracking  using  point  features  to  describe  3D  objects.  We 
formulate  the  problem  using  the  techniques  introduced 
by Baird [1] and extended by Cass [4], Breuel [2, 3],
and  others  [7,  8].  The  approach  consists  of  matching 
model  features  and  image  features  to  determine  the  pose 
of  the  object,  while  assuming  that  there  are  spurious 
or  missing  image  features,  and  uncertainties  in  detect¬ 
ing  these  features.  The  image  features  are  assumed  to 
be  located  somewhere  within  bounded  regions  around 
their  detected  locations;  the  problem  posed  with  this 
uncertainty  model  is  sometimes  called  the  bounded-error 
recognition  problem.  Authors  have  mostly  applied  this 
model  to  the  matching  of  2D  images  and  the  recognition 
of  flat  objects.  One  of  the  major  obstacles  to  the  practi¬ 
cal  extension  of  past  work  to  general  non-planar  objects 
has  been  that  the  search  in  the  general  case  has  to  be 
performed  in  an  8-dimensional  transformation  space  [4]. 
The proposed method reduces this to 1D search by segment
trees along lines defined in the object model.

We introduce new equations for expressing the relationship
between model points and image points in a
perspective model of projection, generalizing a formulation
introduced for iteratively computing object pose
(the  POSIT  algorithm;  see  Appendix  and  [5]).  These 
equations  place  the  nonlinear  terms  of  the  transforma¬ 
tion  on  the  right  hand  side,  in  combination  with  the  im¬ 


age  coordinate  terms.  The  advantage  of  this  formulation 
is  that  when  initial  estimates  of  these  nonlinear  terms  are 
made, the uncertainty in these estimates can be modeled
as additional image uncertainty. We obtain linear constraints
on two 4D vectors I and J proportional to the
first and second rows of the homogeneous transformation
matrix  of  the  object.  These  linear  constraints  represent 
slabs  of  space  which  are  perpendicular  to  feature  vectors 
(joining  the  origin  of  the  object  coordinate  system  to 
the  object  feature  points)  at  points  that  depend  (m  the 
image  coordinates  of  these  feature  points.  Regions  of 
space  where  the  largest  numbers  of  slabs  intersect  corre¬ 
spond  to  maximal  matchings  between  object  points  and 
image  points.  To  find  these  regions  we  ad:q>t  the  binary 
tree  search  advocated  by  Breuel  [3]  for  this  type  of  prolv 
lem;  in  our  formulatimi,  the  search  can  be  decomposed 
into  ID  searches  by  segment  trees  [9]  along  the  feature 
vectors.  Simultaneous  searches  are  performed  for  the 
regioiu  containing  I  and  J,  and  are  pruned  by  mutual 
constraints  resulting  from  the  fact  that  I  and  J  belong 
to  slabs  corresponding  to  the  same  image  points.  Other 
pruning  criteria  use  the  fact  that  the  first  three  compo¬ 
nents  of  I  and  J  define  vectors  which  are  perpendicular 
and  equal  in  length.  When  an  object  is  tracked,  the  non¬ 
linear  terms  of  the  equations  can  be  evaluated  from  the 
pose  results  obtained  for  the  previous  image  frame,  and 
the  dimensions  of  the  initial  search  space  can  be  reduced 
because  the  position  and  extent  of  this  search  space  can 
be  deduced  from  predictive  techniques  and  bounds  on 
admissible  motions.  This  method  accommodates  the  dis- 
iq>pearance  of  features  due  to  self-occlusion  during  the 
object’s  motion. 

2  Notation 

In Figure 1, we show the classic pinhole camera model,
with its center of projection O, its image plane at distance
f (the focal length) from O, its axes Ox and Oy
pointing along the rows and columns of the camera sensor,
and its third axis Oz pointing along the optical axis.
The unit vectors for these three axes are called i, j and
k. In this paper, the focal length and the intersection of
the optical axis with the image plane (image center C)
are assumed known.

An object with feature points $M_0, M_1, \ldots, M_i, \ldots, M_n$
is located in the field of view of the camera. The object
coordinate frame is $(M_0u, M_0v, M_0w)$. The coordinates
$(U_i, V_i, W_i)$ of the points $M_i$ in this frame are known.
The images of the points $M_i$ are called $m_i$, and the image
coordinates $(x_i, y_i)$ of each $m_i$ are known. In the
recognition problem, one of the goals is to be able to say
that $m_i$ is indeed the image of $M_i$, which is not obvious
since the pose of the object is also unknown. In other
words, the correspondences between the image points
and the object points have to be found.

Figure 1: Perspective projections $m_i$ for object points $M_i$.

The rotation matrix R and translation vector T of the
object in the camera coordinate system can be grouped
into a single 4x4 transformation matrix which will be
called the pose matrix P in what follows:

$$P = \begin{bmatrix} i_u & i_v & i_w & T_x \\ j_u & j_v & j_w & T_y \\ k_u & k_v & k_w & T_z \\ 0 & 0 & 0 & 1 \end{bmatrix} \tag{1}$$

To obtain the coordinates of an object point $M_i$ in the
camera coordinate system using this pose matrix P instead
of the more traditional rotation matrix and translation
vector, one simply multiplies P by the coordinates
of point $M_i$ or vector $M_0M_i$ in the object coordinate
system. This operation requires that point $M_i$ or vector
$M_0M_i$ be given a fourth coordinate (a fourth dimension)
equal to 1. The four coordinates are said to be
the homogeneous coordinates of the point or vector. In
the following, vectors and points are four-dimensional
(4D) entities in this homogeneous space, unless otherwise
specified.

The first line of the matrix P is a row vector that we
call $P_1$. The other row vectors are called $P_2$, $P_3$ and
$P_4$. The coordinates $(T_x, T_y, T_z, 1)$ in the fourth column of
the matrix are the coordinates of the translation vector
T (in Figure 1, the translation vector T is the vector
$OM_0$). In the first row vector, $P_1$, the coordinates
$i_u, i_v, i_w$ are the coordinates of a 3D vector, i, which is
also the first row of the rotation matrix R of the transformation.
Notice that i is also the unit vector for the
x-axis of the camera coordinate system expressed in the
object coordinate system $(M_0u, M_0v, M_0w)$. Similarly,
in the second row vector, $P_2$, the coordinates $j_u, j_v, j_w$
are the coordinates of a 3D vector j which is the second
row vector of the rotation matrix; j is also the unit vector
for the y-axis of the camera coordinate system, expressed
in the object coordinate system $(M_0u, M_0v, M_0w)$. In
the third row vector, $P_3$, the coordinates $k_u, k_v, k_w$ are
the coordinates of a 3D vector k which is the cross product
of i and j. In the following, we show how the vectors
$I = (f/T_z) P_1$ and $J = (f/T_z) P_2$ can be computed. Once they are
obtained, $T_z$ is found by noticing that the first three coordinates
of I and J define 3D vectors $R_1$ and $R_2$ with
norms equal to $f/T_z$. The completion of the object pose
matrix P is then straightforward (see step 3 in the Appendix).

3  Fundamental Equations

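The fundamental relations which relate the row vectors
$P_1$, $P_2$ of the pose matrix, the coordinates of the object
vectors $M_0M_i$ in the object coordinate system, and the
coordinates $x_i$ and $y_i$ of the perspective images $m_i$ of $M_i$
are

$$M_0M_i \cdot I = x_i', \qquad M_0M_i \cdot J = y_i', \tag{2}$$

with

$$I = \frac{f}{T_z} P_1, \qquad J = \frac{f}{T_z} P_2, \tag{3}$$

$$x_i' = x_i (1 + \varepsilon_i), \qquad y_i' = y_i (1 + \varepsilon_i), \tag{4}$$

and

$$\varepsilon_i = M_0M_i \cdot P_3 / T_z - 1. \tag{5}$$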


It is useful to introduce the unknown coordinates
$(X_i, Y_i, Z_i)$ of the point $M_i$ in the camera coordinate
system for the sole purpose of demonstrating that these
equations are correct. Recall that the dot product
$M_0M_i \cdot P_1$ is the operation performed when multiplying
the first row of the transformation matrix P by
the coordinates of an object point in the object frame
of reference to obtain the x-coordinate $X_i$ of $M_i$ in the
camera coordinate system. Thus $M_0M_i \cdot P_1 = X_i$. For
the same reason, the dot product $M_0M_i \cdot P_3$ is equal to
$Z_i$, thus $(1 + \varepsilon_i) = Z_i/T_z$. Also, in perspective projection,
the relation $x_i = f X_i / Z_i$ holds between image
point coordinates and object point coordinates in the
camera coordinate system. Using these expressions in
the equations above leads to identities, which proves the
validity of the equations.
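
The identity can also be checked numerically. The sketch below builds an arbitrary pose, projects one object point, and compares both sides of Equations (2). The rotation (about the optical axis), the translation, the focal length, and the function names are illustrative assumptions, not values from the paper.

```python
import math

def dot(a, b):
    """Dot product of two equal-length sequences."""
    return sum(p * q for p, q in zip(a, b))

def equation2_residuals(R, T, f, point):
    """Return (M0Mi.I - x_i', M0Mi.J - y_i') for one object point.

    R is a 3x3 rotation (rows i, j, k), T = (Tx, Ty, Tz), f is the focal
    length and point = (U, V, W) in the object frame. All inputs used below
    are made-up test values.
    """
    M = [point[0], point[1], point[2], 1.0]       # homogeneous coordinates of M0Mi
    P1 = list(R[0]) + [T[0]]
    P2 = list(R[1]) + [T[1]]
    P3 = list(R[2]) + [T[2]]
    X, Y, Z = dot(M, P1), dot(M, P2), dot(M, P3)  # camera coordinates of Mi
    x, y = f * X / Z, f * Y / Z                   # perspective projection
    eps = dot(M, P3) / T[2] - 1.0                 # Equation (5)
    I = [f / T[2] * c for c in P1]                # Equation (3)
    J = [f / T[2] * c for c in P2]
    return dot(M, I) - x * (1.0 + eps), dot(M, J) - y * (1.0 + eps)

angle = 0.4  # hypothetical rotation about the optical axis, in radians
R = [[math.cos(angle), -math.sin(angle), 0.0],
     [math.sin(angle), math.cos(angle), 0.0],
     [0.0, 0.0, 1.0]]
residuals = equation2_residuals(R, (0.5, -0.3, 20.0), 760.0, (1.0, -2.0, 0.5))
print(residuals)  # both entries should vanish up to rounding error
```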

Two problems must be addressed before applying
these equations:

1. The terms $\varepsilon_i$ are generally unknown. These terms
depend on $P_3$ (Equation (5)), which can be computed
only after I and J have been computed (Section 2).

2. These equations can be used only after the correspondences
between image points and object points
have been established. Only then can the correct
values for the image coordinate $x_i$ be written on
the right hand side for a given vector $M_0M_i$ on the
left hand side.
Starting with the first problem of dealing with unknown
$\varepsilon_i$, notice that $x_i' = x_i(1 + \varepsilon_i)$ and $y_i' = y_i(1 + \varepsilon_i)$
are the image coordinates of the object points $M_i$ when
we use a scaled orthographic projection model. Indeed
$x_i = f X_i / Z_i$ can be written as $x_i = f X_i / (T_z (1 + \varepsilon_i))$, and
we obtain $x_i' = f X_i / T_z$. In other words, image points
$(x_i', y_i')$ are obtained by "flattening" the object by orthographic
projection of its points onto the plane $z = T_z$
through $M_0$ before performing a perspective projection.
To obtain estimates for I and J, we first use $x_i$ and $y_i$
instead of $x_i'$ and $y_i'$ in Equations (2), thereby making
errors $x_i \varepsilon_i$ and $y_i \varepsilon_i$ which are added to the estimates of
the image errors. Once estimates for I and J have been
obtained, these estimates can be used to find more precise
values of $\varepsilon_i$, which in turn lead to better estimates
of I and J.

Regarding problem (2), in some computer vision applications
the correspondences can be obtained prior to
the pose information. For example, in calibration applications,
the feature points may be the centroids of marks
that are easy to distinguish on the target calibration object.
Then Equations (2) can be solved iteratively by
first making rough estimates of the $\varepsilon_i$'s (setting them to
zero when no information is available), solving the linear
systems for I and J, finding better estimates for $\varepsilon_i$,
and repeating the process. A least-squares object pose
is generally found in a few iterations. The Appendix
summarizes the steps of this algorithm, which was introduced
in a more geometrical form in [5]. The rest of
this paper addresses the more difficult situation where
the correspondences between image points and object
points cannot be obtained prior to the pose information.

4  Geometric  Constraints  for  the 
Solutions  I  and  J 

The following discussion shows that the solutions I and
J are located within small polyhedral regions which can
be identified with respect to the 4D homogeneous coordinate
system of the object.


Figure 2: Left: In the absence of uncertainties, the head
of vector I belongs to a plane orthogonal to the feature
vector $M_0M_i$ at the x-point located at abscissa
$x_i'/|M_0M_i|$. Right: Because of uncertainties in image
detection and $\varepsilon_i$, the head of I lies in a 4D slab perpendicular
to $M_0M_i$.

Equations such as $M_0M_i \cdot I = x_i'$ (Equation (2)) can
be viewed as geometric constraints on the vector I in
space with respect to the feature vectors $M_0M_i$: If the
foot of vector I coincides with $M_0$, the head of I must
project on the feature line $M_0M_i$ onto a point $H_{xi}$ with
abscissa $x_i'/|M_0M_i|$. Equivalently, the head of I must
belong to a plane perpendicular to $M_0M_i$ at $H_{xi}$ (Figure 2).
In the following, the points $H_{xi}$ are called x-points.
They are points constructed on the feature lines
$M_0M_i$ using the x-coordinates of the image points. Similarly,
the points $H_{yi}$ considered in constructing the vector
J have abscissae $y_i'/|M_0M_i|$ and are called y-points.

In most situations, the terms $x_i' = x_i(1 + \varepsilon_i)$ are
known only approximately. Bounds for the uncertainties
in these terms can be computed by adding the image
error e and the error e' made in estimating $x_i \varepsilon_i$.
The $\varepsilon_i$ terms are the projections of the vectors $M_0M_i$
on the camera optical axis, divided by the distance
from the camera to the point $M_0$ along the camera optical
axis (Equation (5)). Therefore an upper bound for
the magnitude of these terms is R/D, where R is the radius of a sphere
centered at $M_0$ containing the object and D is a lower
bound for the distance $T_z$. Clearly, estimating tight upper
bounds for these errors is made easier if we have
some idea of the range in which the object is expected
to be found. When the object is being tracked and an
approximate pose for the object has been found from a
previous image, $\varepsilon_i$ can be estimated from this previous
pose, and the uncertainty interval can be reduced and
centered around $x_i(1 + \varepsilon_i)$.

Because of these uncertainties, all we can say is that
the head of I projects onto the feature vector within
the uncertainty interval around an x-point $H_{xi}$. Equivalently,
the head of I must belong to a slab perpendicular
to $M_0M_i$ at x-point $H_{xi}$ and with thickness defined by
the uncertainty interval (Figure 2).

For I to be a solution of a system of n Equations (2)
for $i = 1, 2, 3, \ldots, n$, the head of vector I must belong
to the slab $S_{11}$ defined at x-point $H_{x1}$ on the feature
vector $M_0M_1$, the slab $S_{22}$ defined at $H_{x2}$ on $M_0M_2$,
etc. Therefore I must belong to the intersection of these
n slabs. A necessary condition for this to occur is that
there exists a region E in space contained in at least n
slabs $S_{ii}$ (Figure 3).

Similarly, J must belong to the intersection of n slabs
$T_{ii}$. A necessary condition for this to occur is that there
exists a region $\Theta$ in space contained in at least n slabs $T_{ii}$
(Figure 3).
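
Testing whether a candidate head lies in a slab takes a single dot product. The sketch below assumes each slab is described by a feature vector in homogeneous coordinates, a target value ($x_i'$ or $y_i'$), and a symmetric half-width on the dot product; the function names and sample numbers are illustrative only.

```python
def slab_contains(head, feature, target, half_width):
    """True if |feature . head - target| <= half_width (head lies in the slab)."""
    projection = sum(a * b for a, b in zip(feature, head))
    return abs(projection - target) <= half_width

def count_slabs(head, features, targets, half_width):
    """Number of slabs (one per feature vector) that contain the candidate head."""
    return sum(1 for m, t in zip(features, targets)
               if slab_contains(head, m, t, half_width))

# Hypothetical object: four feature vectors (U, V, W, 1) and a candidate I.
features = [(1.0, 0.0, 0.0, 1.0), (0.0, 1.0, 0.0, 1.0),
            (0.0, 0.0, 1.0, 1.0), (1.0, 1.0, 0.0, 1.0)]
candidate = (3.0, -1.0, 2.0, 5.0)
# Targets are the exact dot products (8, 4, 7, 7) perturbed by small image errors.
targets = [8.2, 3.9, 7.1, 6.8]
inside = count_slabs(candidate, features, targets, 0.5)
outside = count_slabs((0.0, 0.0, 0.0, 0.0), features, targets, 0.5)
```

Here the half-width is expressed on the dot product itself; dividing by $|M_0M_i|$ gives the slab thickness measured along the feature line.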

Furthermore, the n slabs $S_{ii}$ that contain region E and
the n slabs $T_{ii}$ that contain region $\Theta$ must be defined
by the same feature vectors $M_0M_i$ and the same image
points $m_i$. Therefore the n slabs $T_{ii}$ at the y-points $H_{yi}$
computed from the same points $m_i$ which produced the
slabs $S_{ii}$ must intersect in a non-empty polyhedral region
$\Theta$.

As additional conditions, the solution vectors I and
J are constrained in relative amplitude and orientation:
the first three coordinates of I and J define two 3D vectors
$R_1$ and $R_2$. These vectors are proportional to i and
j respectively, with the same coefficient of proportionality
$f/T_z$. Therefore $R_1$ and $R_2$ must be orthogonal and
have equal lengths. Therefore a pair of regions E and
$\Theta$ can contain the heads of vectors I and J only if (1)
the range of 3D distances from $M_0$ to the points of E
overlaps the range of distances from $M_0$ to the points of
$\Theta$, and (2) the extrema of the 3D dot products of pairs
of vectors with heads in each region have opposite signs.

Figure 3: The head of I is found at the intersection of the
slabs corresponding to the x-coordinates $x_i'$ (corrected
by $1 + \varepsilon_i$) of the feature points $M_i$. The vector J is
then found at the intersection of the slabs using the y-coordinates
$y_i'$ and the same correspondence. A further
verification is obtained from the property that I and J
(or 3D vectors defined by the first three coordinates of I
and J in 4D) are perpendicular and have the same length.

5  Finding  Solution  Regions  With 
Unknown  Correspondences 

In the problem addressed here, the correspondences between
the N feature points and some of the $n'$ detected
image points are not known. Given an object point $M_i$,
we do not know which image point among $m_1, m_2, m_3$,
etc., is the image of $M_i$. Furthermore, some points $M_i$
may not have images, and some image points may not
correspond to any of the object points. Let us assume
for the moment that the number $n_0$ of image points that
match the object points is known (finding this number
is the objective of the next section).

The best we can then do is to consider, for the N
feature points $M_i$, all the slabs defined by the $n'$ detected
image points. On each feature vector $M_0M_i$ we can
construct an x-point $H_{xij}$ for each detected image point
$m_j$, and consider the corresponding slab. Slabs obtained
from a given feature vector $M_0M_i$ are parallel, and slabs
obtained from two different feature vectors intersect each
other (object feature points $M_i$ can be chosen so that the
lines $M_0M_i$ are well separated).

The proposed method finds small regions of space E
and $\Theta$ that contain the heads of vectors I and J. If
indeed at least $n_0$ of the detected image points are images
of object feature points, and if the bounds for the image
and $\varepsilon_i$ uncertainties are correct, there exists a pair of
regions (E, $\Theta$) such that E is contained in at least $n_0$
slabs, defined by x-points $H_{xij}$ located on feature vectors
$M_0M_i$ and corresponding to image points $m_j$, and such
that $\Theta$ is also contained in at least $n_0$ slabs, defined
by y-points $H_{yij}$ located on feature vectors $M_0M_i$ and
corresponding to the same image points $m_j$.

The method is based on eliminating pairs of regions
of space which do not satisfy the geometric constraints
defined in the previous section, proceeding from coarse
to fine regions by bisection of space. A region E' and a
region $\Theta'$ cannot contain the heads of I and J if:

1. E' or $\Theta'$ is not intersected by $n_0$ slabs (then no point
inside the region can be contained in $n_0$ slabs).

2. E' and $\Theta'$ are not intersected by $n_0$ slabs constructed
from the same image points.

3. The range of 3D distances from $M_0$ to the points
of E' does not overlap the range of distances from
$M_0$ to the points of $\Theta'$. (Hence the two regions
cannot respectively contain heads of I and J at equal
distances from $M_0$.)

4. The extrema of the 3D dot products of pairs of vectors
with heads in each region have the same sign.
(Hence the two regions cannot respectively contain
heads of perpendicular vectors I and J.)
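
Criteria 3 and 4 can be evaluated in closed form when the regions are axis-aligned boxes, as in the bisection search. The sketch below is one possible way to do this, not the authors' implementation: a box is assumed to be a pair (lo, hi) of 4D corner lists, only the first three coordinates are used, and all names and sample boxes are hypothetical.

```python
def distance_range(lo, hi):
    """Smallest and largest distance from the origin M0 to points of a box."""
    near_sq = far_sq = 0.0
    for a, b in zip(lo, hi):
        nearest = 0.0 if a <= 0.0 <= b else min(abs(a), abs(b))
        near_sq += nearest ** 2
        far_sq += max(abs(a), abs(b)) ** 2
    return near_sq ** 0.5, far_sq ** 0.5

def dot_range(lo_a, hi_a, lo_b, hi_b):
    """Extrema of the dot product of a vector in box a with a vector in box b."""
    low = high = 0.0
    for a0, a1, b0, b1 in zip(lo_a, hi_a, lo_b, hi_b):
        products = (a0 * b0, a0 * b1, a1 * b0, a1 * b1)
        low += min(products)    # coordinates vary independently, so the
        high += max(products)   # per-axis extrema simply add up
    return low, high

def pair_is_eliminated(box_i, box_j):
    """Apply criteria 3 and 4 to a pair of boxes for the heads of I and J."""
    (lo_i, hi_i), (lo_j, hi_j) = box_i, box_j
    near_i, far_i = distance_range(lo_i[:3], hi_i[:3])
    near_j, far_j = distance_range(lo_j[:3], hi_j[:3])
    if far_i < near_j or far_j < near_i:       # criterion 3: no equal lengths
        return True
    low, high = dot_range(lo_i[:3], hi_i[:3], lo_j[:3], hi_j[:3])
    return low > 0.0 or high < 0.0             # criterion 4: never perpendicular

box_x = ([2.9, -0.1, -0.1, 0.0], [3.1, 0.1, 0.1, 1.0])      # around (3, 0, 0)
box_y = ([-0.1, 2.9, -0.1, 0.0], [0.1, 3.1, 0.1, 1.0])      # around (0, 3, 0)
box_far = ([-0.1, 5.9, -0.1, 0.0], [0.1, 6.1, 0.1, 1.0])    # around (0, 6, 0)
```

With these sample boxes, the pair (box_x, box_y) survives, (box_x, box_far) fails the length test, and (box_x, box_x) fails the perpendicularity test.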

There exists a pair of regions (E, $\Theta$) which cannot be
eliminated by the above criteria, and we can recognize
and label in the image the $n_0$ points that contributed
to these regions. We find these regions by simultaneously
performing two recursive bisections of space. We
simultaneously explore two binary trees by depth-first
search, pairing branches of the two trees and pruning
paired branches excluded by the above criteria.

To further verify the matching of the $n_0$ points (and
also to provide a more accurate pose matrix), we proceed
as follows: The terms $\varepsilon_i$ can now be computed
from the pose matrix, using Equation (5). The terms
$x_i' = x_i(1 + \varepsilon_i)$ and $y_i' = y_i(1 + \varepsilon_i)$ can then be computed,
as well as reduced uncertainty intervals. These
corrected coordinates and intervals define thinner slabs
at slightly different locations which result in smaller regions
E and $\Theta$, less ambiguity between possible matchings,
and a more accurate pose.

6  Finding  the  Best  Regions 

The explanations so far have focused on finding regions E
and $\Theta$ in the intersection of at least $n_0$ slabs. We would
actually like to find regions contained in the intersection
of the highest number of slabs, because this provides the
maximal number of matches between image points and
object points. We start the search with $n_0 = n'$, the
total number of image points detected. Usually, some
image points have no matches, and the search quickly
fails. We then decrement $n_0$ until a search succeeds.
7  Searching for Regions E and $\Theta$

A binary tree search was advocated by Breuel for this
type of problem [3]. To search for a single region, say E,
we start with a large box which is guaranteed to contain
all the regions of interest, and recursively divide the box
into two child boxes. At depth 1, the plane used to divide
the box is perpendicular to the x-axis of the 4D space; at
depth 2 the plane is perpendicular to the y-axis, at depth
3 to the z-axis, and at depth 4 to the w-axis. At depth
5 we are back to a division perpendicular to the x-axis,
and so on. Eventually, we have divided the space into
boxes so small that at least one of them is contained in
the region E, in the intersection of $n_0$ slabs. This process
is illustrated in 2D in Figure 4.


Figure 4: A search by bisection of space locates a box
contained in a region in the intersection of three slabs
(white box). Boxes which are not intersected by three
slabs are pruned (black box). Tests for intersections between
boxes and slabs are performed using box projections
on the feature vectors.
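
The single-region search, combined with the decrement strategy of Section 6, can be sketched in a few lines. This is a simplified illustration under stated assumptions rather than the authors' implementation: it works in any dimension (the toy run is 2D, like Figure 4), every image point contributes one slab on every feature vector, slabs are given by target values of the dot product with a common half-width, and intervals are tested directly instead of through sorted x-point lists. All names and numbers are hypothetical.

```python
def project_box(lo, hi, v):
    """Interval spanned by v . p when p ranges over the axis-aligned box [lo, hi]."""
    low = sum(min(c * a, c * b) for c, a, b in zip(v, lo, hi))
    high = sum(max(c * a, c * b) for c, a, b in zip(v, lo, hi))
    return low, high

def count_features(lo, hi, features, targets, half_width, contained):
    """Count feature vectors having a slab that contains (or intersects) the box."""
    count = 0
    for v in features:
        low, high = project_box(lo, hi, v)
        if contained:
            hit = any(t - half_width <= low and high <= t + half_width
                      for t in targets)
        else:
            hit = any(low <= t + half_width and t - half_width <= high
                      for t in targets)
        count += 1 if hit else 0
    return count

def search(lo, hi, features, targets, half_width, n0, min_size, depth=0):
    """Depth-first bisection; returns a box lying inside n0 slabs, or None."""
    if count_features(lo, hi, features, targets, half_width, False) < n0:
        return None                      # pruned: fewer than n0 slabs reach the box
    if count_features(lo, hi, features, targets, half_width, True) >= n0:
        return lo, hi                    # success: the box lies inside n0 slabs
    if max(b - a for a, b in zip(lo, hi)) < min_size:
        return None                      # resolution limit reached
    axis = depth % len(lo)               # cycle through the coordinate axes
    mid = 0.5 * (lo[axis] + hi[axis])
    left_hi = hi[:axis] + [mid] + hi[axis + 1:]
    right_lo = lo[:axis] + [mid] + lo[axis + 1:]
    rest = (features, targets, half_width, n0, min_size, depth + 1)
    return search(lo, left_hi, *rest) or search(right_lo, hi, *rest)

def best_region(lo, hi, features, targets, half_width, min_size):
    """Try n0 = n', n' - 1, ... until a search succeeds."""
    for n0 in range(len(targets), 0, -1):
        box = search(lo, hi, features, targets, half_width, n0, min_size)
        if box is not None:
            return n0, box
    return 0, None

# Toy data: three feature directions; the point (2, 3) yields the values 2, 3, 5,
# and 9.5 plays the role of a spurious image point.
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
targets = [5.0, 2.0, 9.5, 3.0]
n0, box = best_region([-8.0, -8.0], [8.0, 8.0], features, targets, 0.25, 0.01)
```

Because the correspondences are unknown, the returned box may lie in any region covered by $n_0$ slabs; in this toy case both (2, 3) and (3, 2) qualify, which is the kind of ambiguity the simultaneous search for I and J is meant to reduce.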

7.1  Simultaneously  Searching  for  Two  Regions 

We start with an initial box $A_0$ large enough to contain
the head of I and an initial box $B_0$ large enough to contain
the head of J (from Equations (2) and (3), one can
show that the coordinates of these vectors can be expressed
in pixels and are smaller than the largest image
point coordinates). Box $A_0$ is divided into two boxes $A_1$
and $A_2$, and box $B_0$ into two boxes $B_1$ and $B_2$. The elimination
criteria of Section 5 are then applied
to the pair of boxes $A_1$ and $B_1$. If none of the criteria
applies, the two boxes are themselves divided, and the
process is repeated recursively. If any elimination criterion
applies, another pair of boxes at the same depth, say
$A_1$ and $B_2$, is considered. If all four possible pairs are
eliminated, we backtrack to a pair which was not considered
at the previous depth. If the previous depth is the
root depth, the search has failed. The search succeeds
when two boxes of small predefined size (a few pixels, if
the coordinates of I and J are expressed in pixels) survive
the elimination criteria, and are contained in $n_0$ corresponding
slabs; this second test is added when the boxes
become small enough to fit into the slabs; the elimination
criteria use only the necessary (but not sufficient)
condition that a box be intersected by $n_0$ slabs.

7.2  Tests  for  Intersection  of  a  Box  With  n 
Slabs 

If a box does not intersect n slabs, no subdivision of
this box will be contained in n slabs, and this branch of
the tree can be pruned [3]. This is one of the elimination
criteria defined above. The tests for containment and intersection
are simpler here than in Breuel's formulation.
A box intersects n slabs if, for each of n feature vectors
$M_0M_i$, the 1D projection of the box on $M_0M_i$ intersects
the uncertainty interval around an x-point $H_{xij}$.

Instead of checking for intersections of intervals, we
augment the interval of the box projection on each side
by the amplitude of the uncertainty interval, and we
count the x-points contained in this interval. The uncertainty
intervals, and the lengths $p_{di}$ of the projections
of a box at depth d on the feature vectors $M_0M_i$, are
precomputed (Figure 5).


Figure 5: The projection segment of a child box shares
one bound with the projection segment of its parent box.
A binary search finds the other bound of this segment in
the sorted parent list of x-points. This position provides
the element count and the address for the child list.

The count of intersections between boxes and slabs is
incremented by 1 for each feature vector when we find
at least one x-point inside the (augmented) projection
interval of the box. Thus we have to keep track of which
x-points are contained in the augmented projection interval
of the box. Having done this already for the parent
box, we know the list of x-points contained in the parent
interval. Each child interval shares one bound with the
parent interval. The other bound is inside the parent interval
and must be located with respect to the x-points
(Figure 5). This is achieved by bisection of the parent
list of x-points (the root list of x-points was sorted). The
location of this bound provides the element count for the
new list. The list of x-points for the left child has the
same address as the parent list, whereas the list for the
right child has an address offset by the position of its left
bound in the parent list.
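
The bookkeeping described above maps naturally onto a binary search over a sorted array. The sketch below represents each list as a (start, count) pair into one sorted array of x-point abscissae for a given feature vector and uses Python's standard `bisect` module; the function name, the argument layout, and the sample values are assumptions made for illustration.

```python
from bisect import bisect_left, bisect_right

def child_lists(xpoints, start, count, left_upper, right_lower):
    """Split the parent's run xpoints[start:start + count] between two children.

    left_upper is the upper bound of the left child's augmented interval and
    right_lower is the lower bound of the right child's augmented interval;
    each child shares its other bound with the parent. Returns two
    (start, count) pairs, so no x-point is ever copied.
    """
    end = start + count
    left_end = bisect_right(xpoints, left_upper, start, end)
    right_start = bisect_left(xpoints, right_lower, start, end)
    return (start, left_end - start), (right_start, end - right_start)

# Sorted root list of x-points on one feature vector (made-up abscissae).
xpoints = [1.0, 2.5, 4.0, 4.2, 7.0, 9.0]
# Parent projection [3, 7] augmented by 1 on each side gives [2, 8], which
# holds indices 1..4. The box is split at 5, so the children see [2, 6] and [4, 8].
left, right = child_lists(xpoints, 1, 4, 6.0, 4.0)
```

A feature vector contributes to the intersection count of a child box whenever the corresponding count is non-zero.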
7.3  Tests for Containment of a Box in a Region

We verify that a box at depth d is contained in $n_0$ slabs
by verifying that for each of at least $n_0$ feature vectors
$M_0M_i$, the 1D projection of this box on $M_0M_i$ is
contained in the uncertainty interval around an x-point
$H_{xij}$.

The depth for which all the lengths $p_{di}$ of the projections
of boxes at depth d on feature vectors $M_0M_i$ are
smaller than the lengths $u_i$ of the uncertainty intervals
is also precomputed, and we start checking for containment
of the box projections in the uncertainty intervals
only when the tree search has reached this depth. Instead
of checking whether a projection is contained in
an uncertainty interval around a point $H_{xij}$, we check
whether there is a point $H_{xij}$ in the interval of length
$(u_i - p_{di})$.
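
The equivalence behind this shortcut can be written out explicitly. As an assumption for this note, let the projection of the box on $M_0M_i$ be the interval $[a, a + p_{di}]$ and let the uncertainty interval around an x-point at abscissa $h$ be $[h - u_i/2, h + u_i/2]$; the symbols $a$ and $h$, and the centering of the interval on the x-point, are introduced here for illustration only. Then

$$h - \frac{u_i}{2} \le a \quad \text{and} \quad a + p_{di} \le h + \frac{u_i}{2} \iff a + p_{di} - \frac{u_i}{2} \le h \le a + \frac{u_i}{2},$$

so the projection is contained in the uncertainty interval exactly when the x-point falls in an interval of length $u_i - p_{di}$, which is non-empty only once $p_{di} \le u_i$. This is why the containment test is deferred until the search reaches the precomputed depth.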

8  Tracking 

In a tracking task, we assume that the object has been
found in the previous image frame, and its pose has been
estimated after finding acceptable vectors I and J (Section 2).
We perform the tracking in the 4D space where
the search for I and J takes place. First, the previous
pose allows us to compute estimates for the terms
$\varepsilon_i$, and to take $x_i' = x_i(1 + \varepsilon_i)$ and $y_i' = y_i(1 + \varepsilon_i)$ as
defining the positions of the x-points and y-points on
the feature vectors. The error made by using $\varepsilon_i$ from a
previous frame contributes to the uncertainty intervals
around these points and can be computed from upper
bounds $d\theta$ and $dT$ on possible rotation angle and translation
increments between the two frames. Second, the
vector I is transformed into I' = I + dI between the two
frames, and the search for the head of I' can be limited
to a box centered around the head of I and of size larger
than |dI|, also depending on $d\theta$ and $dT$. Predictive techniques
can be used to predict I' and the uncertainty on
I' in order to further reduce the size of the initial search
box for I', and similarly for J'.

9  Summary 

We have introduced new equations that express the relationship
between model points and image points in a perspective
model of projection. The nonlinear terms of the
perspective transformation are placed on the right hand
side in combination with the image coordinate terms.
The uncertainty in the estimates of these nonlinear terms
can be modeled as additional image uncertainty. We obtain
linear constraints on two 4D vectors I and J proportional
to the first and second rows of the homogeneous
transformation matrix of the object. These constraints
represent slabs of space which are perpendicular to feature
vectors (joining the origin of the object coordinate
system to object feature points) at points that depend on
the image coordinates of these feature points. Regions of
space where the largest numbers of these slabs intersect
are used to locate the vectors I and J and correspond
to maximal matchings between object points and image
points. Simultaneous binary tree searches are performed
in regions containing I and J, and are pruned by mutual
constraints resulting from the fact that I and J belong
to slabs corresponding to the same image points. Other
pruning criteria use the fact that the first three components
of I and J define vectors which are perpendicular
and equal in length. Most of the search is 1D search by
segment trees along the feature vectors.

Appendix:  Iterative  Pose  Computation 
from  Point  Correspondences 

Here we summarize a simple iterative algorithm for finding
the pose of an object when a matching between object
feature points and image points is known. It is
an analytic formulation of the POSIT (Pose from Orthography
and Scaling with ITerations) algorithm [5] in
homogeneous form, which removes the need to locate
the image of the origin $M_0$ of the object coordinate system.
Note that this pose calculation is presented independently
of the search method described above, which
finds the matching and the pose by binary search of space
when the matching is not known.

The equations to be solved are Equations (2). The
steps of the iterative pose algorithm can be summarized
as follows:

1. $\varepsilon_i$ = best guess, or $\varepsilon_i = 0$ if no pose information is
available.

2. Start of loop: Solve for I and J in the following
systems (see next paragraph):

$$M_0M_i \cdot I = x_i', \qquad M_0M_i \cdot J = y_i'$$

with

$$x_i' = x_i(1 + \varepsilon_i), \qquad y_i' = y_i(1 + \varepsilon_i).$$

3. From I, get

$$R_1 = (I_1, I_2, I_3), \qquad f/T_z = |R_1|, \qquad i = (T_z/f) R_1, \qquad P_1 = (T_z/f) I.$$

Similar operations yield j and $P_2$ from J.

4. $k = i \times j$, $P_3 = (k_u, k_v, k_w, T_z)$, $\varepsilon_i = M_0M_i \cdot P_3 / T_z - 1$.

5. If all $\varepsilon_i$ are close enough to the $\varepsilon_i$ from the previous
loop, EXIT, else go to step 2.

6. $P_1$, $P_2$, $P_3$, along with $P_4 = (0, 0, 0, 1)$, are the four
rows of the pose matrix.

We now provide details about finding I and J for step
2 of the iterative algorithm. The equations for I are

$$M_0M_i \cdot I = x_i'.$$

The unknowns are the four coordinates $(I_1, I_2, I_3, I_4)$ of
I, and we can write one equation for each of the object
points $M_i$ for which we know its position $m_i$ in the image
and its image coordinate $x_i$. One such equation has the
form $U_i I_1 + V_i I_2 + W_i I_3 + I_4 = x_i'$, where $(U_i, V_i, W_i, 1)$ are
the four coordinates of $M_i$. These equations for several
object points $M_i$ constitute a linear system of equations
which can be written in matrix form as $A I = V_x$, where
A is a matrix with i-th row vector $A_i = (U_i, V_i, W_i, 1)$,
and $V_x$ is a column vector with i-th coordinate $x_i'$.
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Similarly,  J  can  be  found  by  solving  the  linear  system 
AJ  =  Vy,  where  A  is  the  same  matrix,  and  V,  is  a 
column  vector  with  t-th  coordinate 
Since  there  are  four  unknown  coordinates  in  vectors 
I  and  J,  matrix  A  must  have  at  least  rank  4  for  the 
systems  to  provide  solutions.  This  requirement  is  sat¬ 
ined  if  A  has  at  least  four  rows  and  the  object  points 
are  noncoplanar;  therefore  at  least  four  noncoplanar  ob¬ 
ject  pmnts  and  their  corresponding  image  points  are  re¬ 
quire.  The  pseudo-inversion  operation  is  ^plied  to  ma¬ 
trix  A;  the  pseudo-inverse  is  called  the  object  matrix  B. 
Since  A  is  defined  in  terms  of  the  known  coordinates  of 
the  object  points  in  the  object  coordinate  system,  B  de¬ 
pends  only  on  the  geometry  of  these  object  points  and 
can  be  precomputed. 
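
The loop in steps 1 to 6 can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the function names (`solve`, `least_squares`, `iterative_pose`) are hypothetical, the least-squares systems $A\mathbf{I} = V_x$ and $A\mathbf{J} = V_y$ are solved through the normal equations rather than a precomputed object matrix $B$, and $T_z$ is taken from $\mathbf{I}$ alone, as step 3 suggests.

```python
import math

def solve(a, b):
    """Solve the square system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[r]] for r, row in enumerate(a)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[p] = m[p], m[c]
        for r in range(c + 1, n):
            factor = m[r][c] / m[c][c]
            for k in range(c, n + 1):
                m[r][k] -= factor * m[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][k] * x[k] for k in range(r + 1, n))) / m[r][r]
    return x

def least_squares(a, b):
    """Least-squares solution of a x = b through the normal equations."""
    cols = len(a[0])
    ata = [[sum(row[i] * row[j] for row in a) for j in range(cols)] for i in range(cols)]
    atb = [sum(row[i] * v for row, v in zip(a, b)) for i in range(cols)]
    return solve(ata, atb)

def iterative_pose(object_points, image_points, focal, max_iter=100, tol=1e-12):
    """Return the four rows P1..P4 of the pose matrix.

    object_points: (U, V, W) coordinates relative to the reference point M0.
    image_points:  matching (x, y) image coordinates.
    """
    a = [[u, v, w, 1.0] for (u, v, w) in object_points]
    eps = [0.0] * len(object_points)            # step 1: no prior pose information
    for _ in range(max_iter):
        # Step 2: solve A I = Vx and A J = Vy with corrected image coordinates.
        big_i = least_squares(a, [x * (1.0 + e) for (x, _), e in zip(image_points, eps)])
        big_j = least_squares(a, [y * (1.0 + e) for (_, y), e in zip(image_points, eps)])
        # Step 3: f / Tz = |R1|; unit vectors i and j.
        n1 = math.sqrt(sum(c * c for c in big_i[:3]))
        n2 = math.sqrt(sum(c * c for c in big_j[:3]))
        tz = focal / n1
        i = [c / n1 for c in big_i[:3]]
        j = [c / n2 for c in big_j[:3]]
        # Step 4: k = i x j and the new epsilon values (M0Mi . k) / Tz.
        k = [i[1] * j[2] - i[2] * j[1], i[2] * j[0] - i[0] * j[2], i[0] * j[1] - i[1] * j[0]]
        new_eps = [(u * k[0] + v * k[1] + w * k[2]) / tz for (u, v, w) in object_points]
        # Step 5: stop when the epsilon values no longer change.
        converged = max(abs(n - e) for n, e in zip(new_eps, eps)) < tol
        eps = new_eps
        if converged:
            break
    # Step 6: assemble the pose matrix rows.
    return [i + [big_i[3] / n1], j + [big_j[3] / n2], k + [tz], [0.0, 0.0, 0.0, 1.0]]
```

With at least four noncoplanar object points the normal-equation matrix is invertible; in a real application the pseudo-inverse would be computed once per object, as the text notes.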

Experiments [5] show that this iterative approach generally provides an accurate pose of the object in a few iteration steps, as long as the points $M_i$ are contained within a camera field of view of less than 90 degrees.

Abstract

An automatic egomotion compensation based point correspondence and motion detection algorithm is presented. First, the motion of the camera is compensated using a computational vision based image registration algorithm. Then consecutive frames are transformed to the same coordinate system and the feature correspondence and motion detection problems are solved as though for a stationary camera. For point correspondence, feature points are detected using a Gabor wavelet decomposition and local interaction based algorithm. Methods of subpixel accuracy feature matching and tracking are introduced. For motion detection, we first determine the changed regions from the camera motion compensated frame differences. We then detect moving objects by grouping these changed regions and estimate the object motion parameters. Experimental results on several real image sequences are presented.

1  Introduction 

A practical issue in camera pose estimation and structure from motion problems in computer vision research is the motion correspondence problem: automatic detection and tracking of features over successive frames. The basic task is to locate the same features over consecutive frames, a non-trivial problem when the camera motion between the frames is complicated, for example when there is significant camera rotation and translation between the frames and the camera motion is irregular. Tracking becomes even more difficult when dealing with automatically detected features since they may not always be located at significant points such as the corners of buildings. Various methods for solving the correspondence problem have been studied [3, 7]. In general, feature displacements over consecutive frames can be approximately decomposed into two components: (i) displacements due to camera motion that can be compensated by image rotation, scaling, and translation; (ii) displacements due to object motion and/or perspective deformation. The displacements due to camera motion are usually much larger and more irregular than the displacements caused by object motion and perspective deformation. Most existing methods require the camera motion to be small and smooth. When tracking features in a long sequence, the problem of feature drift exists, i.e. the features may gradually change due to spatial sampling and to the deformation of the image due to the motions of objects and/or the camera. Feature drift alters the target being tracked and does not contain any useful information. Feature drift correction is especially important for tracking automatically detected features, which may be located in relatively smooth areas and hence may be more vulnerable to steady location drift.

In this paper, we introduce a two-step approach to solving the feature point correspondence problem. First, the motion of the camera is compensated using a computational vision based image registration algorithm [14]. A method for subpixel accuracy feature matching is then implemented to improve the camera motion compensation. Then consecutive frames are transformed to the same coordinate system and the feature point correspondence problem is posed as one of tracking moving objects using a still camera. A subpixel interpolation method is used to suppress feature drift.

Our approach is fully automatic and robust to various kinds of camera motion. It is simple because it compensates for camera motion at the first step, which significantly simplifies the matching process and reduces the computation. No higher level primitives such as edges, and no structure information, are required for the tracking step, and no post processing method such as relaxation [6] or Kalman filtering [5] is required. Good results have been obtained for several real image sequences acquired from indoor and outdoor scenes under different camera motions. Successful motion estimates based on the method of trajectory estimation reported here are described in [8, 11].

Automatic motion detection is an important practical problem. Intuitively, motion detection can be accomplished by taking differences of successive images and detecting non-zero parts. The problem becomes more complicated if the camera is on a moving platform so that translation, rotation, and scale change between the two images have to be taken into account. We introduce a two-step motion segmentation algorithm for detecting moving objects in image sequences acquired from a moving platform. First, the camera motion is compensated using a subpixel accuracy image registration algorithm. We then select regions where changes may have occurred. A method of detecting actual moving objects and estimating their motion parameters is introduced.

In  the  following  sections,  we  first  present  the  camera 
motion  estimation  algorithm,  then  describe  a  subpixel- 
accuracy  motion  correspondence  algorithm,  and  finally 
discuss  the  motion  detection  algorithm.  Experimental 
results  on  motion  correspondence  and  motion  detection 
from  several  real  sequences  are  presented  at  the  end  of 
the  paper. 

2  Camera  Motion  Estimation 


2.1 Review of an Earlier Algorithm

Let $(x_i, y_i)$ be the image frame coordinates, measured with respect to the position of the camera at time $t_i$, for $i = \{1, 2\}$. The relationship between the two frames can be approximated by [14] a transformation (1) that combines a scaling by the factor $s$, a rotation by the angle $\theta$ between the two frames, and a translation $(\Delta x_2, \Delta y_2)$ measured in the image coordinate system of frame $t_2$. The effect of camera motion is characterized by the four parameters $\Delta x_2$, $\Delta y_2$, $\theta$, and $s$.

We use the image registration algorithm reported in [14] to estimate $\Delta x_2$, $\Delta y_2$, $\theta$, and $s$. The camera rotation is estimated and compensated early in the matching process using an illuminant direction estimator [12]. A small number of feature points are then located using a Gabor wavelet model for detecting local curvature discontinuities [4]. The feature points extracted from different frames are matched using a metric often used in area-based correlation techniques. Here, however, correlation is performed using feature points. Multiresolution transform-and-correct matching is implemented to obtain accurate estimates of camera motion parameters. At each resolution, frame $t_1$ is first transformed to the coordinates of frame $t_2$ using the estimated camera motion parameters $(\theta, \Delta x, \Delta y, s)$; then matching refinement is performed on the feature points of frame $t_2$.

2.2 Subpixel Matching

The area correlation matching used here can only find the best grid point to grid point matches. Further processing is required for subpixel accuracy matching. Since a good initial match has been obtained by the area correlation step, a simple and effective way to achieve subpixel accuracy is by using an image differential method [1, 10]. Assume $f_2$ is offset by $(\delta x, \delta y)$ relative to $f_1$; then the frame difference can be written as

$$d(i, j) = \frac{\partial f_1}{\partial x}\,\delta x + \frac{\partial f_1}{\partial y}\,\delta y \tag{2}$$

where $\partial f_1/\partial x$ and $\partial f_1/\partial y$ are derivatives of $f_1$ and can be approximated by forward differences. For a small neighborhood around the feature point,

$$-w_d \le i, j \le w_d$$

we have $(2w_d + 1)^2$ simultaneous equations

$$I = GD \tag{3}$$

The offset vector $D = (\delta x, \delta y)^T$ can be computed as

$$D = (G^T G)^{-1} G^T I \tag{4}$$

where

$$I = \begin{pmatrix} d(-w_d, -w_d) \\ \vdots \\ d(w_d, w_d) \end{pmatrix} \tag{5}$$

$$G = \begin{pmatrix} \dfrac{\partial f_1}{\partial x}(-w_d, -w_d) & \dfrac{\partial f_1}{\partial y}(-w_d, -w_d) \\ \vdots & \vdots \\ \dfrac{\partial f_1}{\partial x}(w_d, w_d) & \dfrac{\partial f_1}{\partial y}(w_d, w_d) \end{pmatrix} \tag{6}$$

In our implementation of this algorithm we use a small $w_d$, for example $w_d = 4$, to reduce the computation and achieve better localization. Incorporation of subpixel accuracy feature matching significantly improves the accuracy of camera motion estimation. A test involving the estimation of simulated camera motion is presented in Section 5.
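
The least-squares step of the differential method can be sketched as follows. This is a minimal illustration under assumptions that the text does not spell out: images are lists of rows indexed as `f[y][x]`, the frame difference is taken as $d = f_2 - f_1$ with $f_2(x, y) \approx f_1(x + \delta x, y + \delta y)$, and the hypothetical function name `subpixel_offset` is not from the paper. The 2x2 normal equations $G^T G\, D = G^T I$ are solved in closed form.

```python
def subpixel_offset(f1, f2, cx, cy, w=4):
    """Estimate (dx, dy) with f2(x, y) ~ f1(x + dx, y + dy) near (cx, cy).

    Uses the (2w + 1)^2 pixels around (cx, cy); the window and its forward
    neighbors must lie inside both images.
    """
    sxx = sxy = syy = bx = by = 0.0
    for j in range(-w, w + 1):
        for i in range(-w, w + 1):
            x, y = cx + i, cy + j
            gx = f1[y][x + 1] - f1[y][x]   # forward difference in x
            gy = f1[y + 1][x] - f1[y][x]   # forward difference in y
            d = f2[y][x] - f1[y][x]        # frame difference d(i, j)
            sxx += gx * gx
            sxy += gx * gy
            syy += gy * gy
            bx += gx * d
            by += gy * d
    det = sxx * syy - sxy * sxy            # determinant of G^T G
    if abs(det) < 1e-12:
        raise ValueError("gradient matrix is singular; the offset is not unique")
    return ((syy * bx - sxy * by) / det, (sxx * by - sxy * bx) / det)
```

A singular $G^T G$ corresponds to a neighborhood whose gradients all point the same way (for example a straight edge), where only one component of the offset can be recovered.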


3  Motion  Correspondence 

3.1  Overview 

As pointed out in Section 1, the fundamental task in motion correspondence is to track features over the image sequence. When the motion of the camera is compensated using the registration algorithm discussed in Section 2, the displacement of a feature point in the new frame can only be caused by perspective distortion and the motions of objects. In this paper we use intensity based area correlation to match features over consecutive frames. A problem associated with intensity based feature matching is that the locations of feature points are defined in terms of local intensity variations and may drift away after several frames. This can be caused by quantization of the feature locations and/or perspective deformation of the local image. In this paper, a method using subpixel accuracy tracking and weighted correlation is introduced to overcome this feature drift problem.

Consider the matching process illustrated in Figure 1. Although a feature point is located at grid point $P_1$ in the first frame, its location will be transformed to $P_1'$ to compensate for the camera motion. The exact location of $P_1'$ is usually not at a grid point. Since correlation-based matching can only locate the best matches at grid points, conventional approaches which approximate a feature location to its nearest grid point will result in two types of approximation errors, resulting from approximating $P_1'$ and $P_2$ to their nearest grid points. Both errors cause the feature locations to migrate. The cumulative errors can be quite large when working with a long sequence. A subpixel accuracy tracking method is needed in order to obtain accurate trajectories. A three step subpixel accuracy matching algorithm is used here. First, an initial matching to the nearest grid point is achieved by a weighted cross correlation match. Secondly, the differential method discussed in Section 2.2 is used to achieve subpixel accuracy matching for all four nearest grid points. Finally, the feature location is interpolated from the matches of the four nearest grid points using an interpolation function.

Figure 1: Feature matching graph

The main steps in the algorithm are as follows. Given an image sequence, we first compute the camera motion parameters using the image registration algorithm presented in Section 2. Then, starting from the feature points detected in the first frame,¹ for every two consecutive frames we transform the first frame and the coordinates of its feature points to the coordinate system of the second frame. We then search the neighborhood of the anticipated feature locations for the best match. The locations of the feature points are then refined using the subpixel accuracy matching method discussed in Section 2.2 and an interpolation formula introduced in Section 3.3.

¹ The feature points can also be obtained by other methods, for example manual selection, as in many motion analysis algorithms. The motion correspondence part is independent of the selection of feature points.

3.2 Weighted Correlation

As pointed out in Section 3.1, there are two sources of errors that cause feature drift: the error due to feature location quantization and the error due to image deformation caused by the relative motion between the camera and the objects. The feature location quantization error can be suppressed by using subpixel accuracy tracking. On the other hand, the error due to image deformation is usually more difficult to remove, especially when the area correlation method is used. In implementing the area correlation matching method, there is a trade-off in the selection of the area size used in computing correlation: the larger the correlation area, the better the selectivity over similar features, but the less accurate the feature locations. For example, when occlusions occur within the correlation area, or when a feature point is near the border of an object and the motion of the camera changes the background significantly, the best cross-correlation will be away from the correct location, causing the feature locations to migrate. One way to solve this dilemma is to use a hierarchy of windows: use a large window to locate the correct matching peak, and gradually reduce the window size to better locate the feature points. We present a simple method using weighted correlation, in which greater weights are put on the neighbors which are closer to the feature center. The weighted correlation method possesses good selectivity since a large area is used, and at the same time it possesses good feature localization since the central parts have higher weights. The modified matching criterion is

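$$C_{f_1 f_2} = \frac{1}{\sigma_1 \sigma_2 \Gamma} \sum_{i,j} \gamma_{ij}\, \bar{f}_1(m+i, n+j)\, \bar{f}_2(u+i, v+j) \tag{7}$$

where

$$\begin{aligned}
\bar{f}_1(m+i, n+j) &= f_1(m+i, n+j) - \mu_1 \\
\bar{f}_2(u+i, v+j) &= f_2(u+i, v+j) - \mu_2 \\
\sigma_1^2 &= \frac{1}{\Gamma} \sum_{i,j} \gamma_{ij}\, \bar{f}_1^2(m+i, n+j) \\
\sigma_2^2 &= \frac{1}{\Gamma} \sum_{i,j} \gamma_{ij}\, \bar{f}_2^2(u+i, v+j) \\
\mu_1 &= \frac{1}{\Gamma} \sum_{i,j} \gamma_{ij}\, f_1(m+i, n+j) \\
\mu_2 &= \frac{1}{\Gamma} \sum_{i,j} \gamma_{ij}\, f_2(u+i, v+j) \\
\Gamma &= \sum_{i,j} \gamma_{ij}
\end{aligned} \tag{8}$$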
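
A weighted, mean-removed, normalized correlation of this kind can be sketched in Python as below. The sketch assumes the weighted-mean and weighted-variance form of the criterion, indexes images as `f[row][column]`, and takes the weights as an arbitrary non-negative table; the helper `center_heavy_weights` is a hypothetical stand-in that simply decays with $\max\{|i|, |j|\}$ and is not the weighting rule of the paper.

```python
import math

def weighted_correlation(f1, f2, m, n, u, v, weights):
    """Weighted normalized correlation of the patch of f1 centered at (m, n)
    with the patch of f2 centered at (u, v).

    weights maps offsets (i, j) to non-negative values gamma_ij.
    """
    total = sum(weights.values())
    mu1 = sum(g * f1[m + i][n + j] for (i, j), g in weights.items()) / total
    mu2 = sum(g * f2[u + i][v + j] for (i, j), g in weights.items()) / total
    cov = var1 = var2 = 0.0
    for (i, j), g in weights.items():
        a = f1[m + i][n + j] - mu1      # mean-removed intensity in f1
        b = f2[u + i][v + j] - mu2      # mean-removed intensity in f2
        cov += g * a * b
        var1 += g * a * a
        var2 += g * b * b
    if var1 == 0.0 or var2 == 0.0:
        raise ValueError("a patch is constant; the correlation is undefined")
    return cov / math.sqrt(var1 * var2)

def center_heavy_weights(w):
    """Hypothetical weights that decrease with the distance max(|i|, |j|)."""
    return {(i, j): 1.0 / (1 + max(abs(i), abs(j)))
            for i in range(-w, w + 1) for j in range(-w, w + 1)}
```

By the Cauchy inequality the returned value lies in [-1, 1] for any non-negative weights, so candidate matches can be ranked directly by this score.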

The $\gamma_{ij}$ are non-negative weights. In our implementation, we let the $\gamma_{ij}$ be the same for pixels at the same distance from the center, where the distance is defined as the maximum distance along the $x$ or $y$ direction. To be more specific, we define the distance to be

$$k = d_{ij} = \max\{|i|, |j|\} \tag{9}$$

Notice that $N_k$, the total number of neighbors at the same distance $k$, is

$$N_k = \begin{cases} 1 & \text{for } k = 0 \\ 8k & \text{for } k \ge 1 \end{cases} \tag{10}$$

If we further require the sum of weights at each distance $k$ to be constant (note that $N_k$ increases as $k$ increases, hence as a pixel moves farther away from the feature center its weight decreases), the weighting coefficient $\gamma_{ij}$ can be generated using

$$\gamma_{ij} = \begin{cases} \gamma_{00} & \text{for } k = 0 \\ \dfrac{\gamma_{00}}{\epsilon N_k} & \text{for } k \ge 1 \end{cases}, \qquad k = \max\{|i|, |j|\} \tag{11}$$

where the weight for the central point, $\gamma_{00}$, is positive, for example $\gamma_{00} = 1$, and $\epsilon$ is the ratio of $\gamma_{00}$ to the weights of the outer layers. As shown in (11), the $\gamma_{ij}$ are positive and their values decrease as points move away from the central point; hence intensity similarities in the central part are given higher weights. Using Cauchy's inequality we can show [13] that the value of the weighted correlation (7) is in the range $[-1, 1]$, reaching the boundaries of the range if and only if

$$\frac{f_1(m+i, n+j) - \mu_1}{f_2(u+i, v+j) - \mu_2} = \text{constant} \qquad \forall (i, j) \in \Omega \tag{12}$$

For example, when the two areas are identical we have

$$f_1(m+i, n+j) - \mu_1 = f_2(u+i, v+j) - \mu_2$$

and $C_{f_1 f_2} = 1$. On the other hand, when $f_2$ is the negative of $f_1$ we have

$$f_1(m+i, n+j) - \mu_1 = -[f_2(u+i, v+j) - \mu_2]$$

and $C_{f_1 f_2} = -1$.

3.3 Feature Location Interpolation

Using the differential method we can obtain subpixel accuracy matches for grid points. But for the feature tracking problem, after camera motion compensation, the features are usually not located at grid points. Consider the general feature matching situation illustrated in Figure 1. A feature point $p_1$ in frame $t_1$ is transformed to the coordinates of frame $t_2$ and located at $p_1' = (x, y)$. Its four nearest grid points are $(x_{11}, y_{11})$, $(x_{12}, y_{12})$, $(x_{21}, y_{21})$, and $(x_{22}, y_{22})$, determined by

$$\begin{aligned}
x_{11} = x_{21} &= \mathrm{int}(x), & y_{11} = y_{12} &= \mathrm{int}(y) \\
x_{12} = x_{22} &= \mathrm{int}(x) + 1, & y_{21} = y_{22} &= \mathrm{int}(y) + 1
\end{aligned}$$

where $\mathrm{int}(\cdot)$ is the truncation function.

Assuming that the matches to these points in frame $t_2$ are $(\hat{x}_{11}, \hat{y}_{11})$, $(\hat{x}_{12}, \hat{y}_{12})$, $(\hat{x}_{21}, \hat{y}_{21})$, and $(\hat{x}_{22}, \hat{y}_{22})$ respectively, we define the feature location $(\hat{x}, \hat{y})$ in frame $t_2$ to be

$$\begin{cases} \hat{x} = a_1 x + a_2 y + a_3 xy + a_4 \\ \hat{y} = b_1 x + b_2 y + b_3 xy + b_4 \end{cases} \tag{13}$$

where $a_i$ and $b_i$, $i = \{1, 2, 3, 4\}$, are coefficients determined from the matches:

$$\begin{cases} \hat{x}_i = a_1 x_i + a_2 y_i + a_3 x_i y_i + a_4 \\ \hat{y}_i = b_1 x_i + b_2 y_i + b_3 x_i y_i + b_4 \end{cases} \tag{14}$$

where $i = \{11, 12, 21, 22\}$. The interpolation function (13) can be expressed relative to the match of $(x_{11}, y_{11})$:

$$\begin{aligned}
\hat{x} - \hat{x}_{11} &= a_1 \epsilon_x + a_2 \epsilon_y + a_3 \epsilon_x \epsilon_y + a_4 \\
\hat{y} - \hat{y}_{11} &= b_1 \epsilon_x + b_2 \epsilon_y + b_3 \epsilon_x \epsilon_y + b_4 \\
\epsilon_x &= x - x_{11} \\
\epsilon_y &= y - y_{11}
\end{aligned} \tag{15}$$

Note that

$$\begin{aligned}
x_{12} &= x_{11} + 1, & y_{12} &= y_{11} \\
x_{21} &= x_{11}, & y_{21} &= y_{11} + 1 \\
x_{22} &= x_{11} + 1, & y_{22} &= y_{11} + 1
\end{aligned}$$

The coefficients (the $a_i$ and $b_i$) are [13]

$$\begin{aligned}
a_1 &= \hat{x}_{12} - \hat{x}_{11}, & b_1 &= \hat{y}_{12} - \hat{y}_{11} \\
a_2 &= \hat{x}_{21} - \hat{x}_{11}, & b_2 &= \hat{y}_{21} - \hat{y}_{11} \\
a_3 &= \hat{x}_{22} + \hat{x}_{11} - \hat{x}_{12} - \hat{x}_{21}, & b_3 &= \hat{y}_{22} + \hat{y}_{11} - \hat{y}_{12} - \hat{y}_{21} \\
a_4 &= 0, & b_4 &= 0
\end{aligned} \tag{16}$$

and the interpolation formula becomes

$$\begin{cases}
\hat{x} = \hat{x}_{11} + (\hat{x}_{12} - \hat{x}_{11})\epsilon_x + (\hat{x}_{21} - \hat{x}_{11})\epsilon_y + (\hat{x}_{22} + \hat{x}_{11} - \hat{x}_{12} - \hat{x}_{21})\epsilon_x \epsilon_y \\
\hat{y} = \hat{y}_{11} + (\hat{y}_{12} - \hat{y}_{11})\epsilon_x + (\hat{y}_{21} - \hat{y}_{11})\epsilon_y + (\hat{y}_{22} + \hat{y}_{11} - \hat{y}_{12} - \hat{y}_{21})\epsilon_x \epsilon_y
\end{cases} \tag{17}$$

For convenience in further discussion, we note that

$$\begin{aligned}
p_1 &= (x_{11}, y_{11}), & \hat{p}_1 &= (\hat{x}_{11}, \hat{y}_{11}) \\
p_2 &= (x_{12}, y_{12}), & \hat{p}_2 &= (\hat{x}_{12}, \hat{y}_{12}) \\
p_3 &= (x_{22}, y_{22}), & \hat{p}_3 &= (\hat{x}_{22}, \hat{y}_{22}) \\
p_4 &= (x_{21}, y_{21}), & \hat{p}_4 &= (\hat{x}_{21}, \hat{y}_{21})
\end{aligned} \tag{18}$$

Equation (17) can be considered as a generalization of the affine transform. When quadrangle $\hat{p}_1\hat{p}_2\hat{p}_3\hat{p}_4$ is a parallelogram we have

$$\begin{aligned}
\hat{x}_{22} + \hat{x}_{11} - \hat{x}_{12} - \hat{x}_{21} &= 0 \\
\hat{y}_{22} + \hat{y}_{11} - \hat{y}_{12} - \hat{y}_{21} &= 0
\end{aligned} \tag{19}$$

and (17) becomes

$$\begin{cases}
\hat{x} = \hat{x}_{11} + (\hat{x}_{12} - \hat{x}_{11})\epsilon_x + (\hat{x}_{21} - \hat{x}_{11})\epsilon_y \\
\hat{y} = \hat{y}_{11} + (\hat{y}_{12} - \hat{y}_{11})\epsilon_x + (\hat{y}_{21} - \hat{y}_{11})\epsilon_y
\end{cases} \tag{20}$$

which is the well-known affine transform. For more general quadrangles, the affine transform cannot guarantee correct matches at all four nearest grid points, while (17) does, as discussed below. Compared to the well-known second-order warp transform

$$\begin{cases}
\hat{x} = a_0 x^2 + a_1 y^2 + a_2 xy + a_3 x + a_4 y + a_5 \\
\hat{y} = b_0 x^2 + b_1 y^2 + b_2 xy + b_3 x + b_4 y + b_5
\end{cases} \tag{21}$$

(17) is simpler. Equation (21) has 12 coefficients to be determined, but only 4 matching pairs (8 measurements) are available. Hence (21) is an ill-posed problem, while (17) has only 8 coefficients and can be easily determined from the matches at the four corners. In addition to its simplicity, (17) possesses the nice properties discussed below:


Proposition 1 The four corners $p_1$, $p_2$, $p_3$ and $p_4$ are mapped into $\hat{p}_1$, $\hat{p}_2$, $\hat{p}_3$, and $\hat{p}_4$, respectively.

Proof This is the constraint we use to define our mapping; hence it is always true. It can also be verified by substituting the coordinates of $p_i$, $i = \{1, 2, 3, 4\}$, into (17). □

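Proposition 2 Lines parallel to the $x$ and $y$ axes are mapped to lines in the transformed domain.

Proof For lines parallel to the $x$ axis we have $\epsilon_y$ = constant and the interpolation functions can be written as

$$\begin{aligned}
\hat{x} &= \hat{x}_{11} + a_1 \epsilon_x + a_2 \epsilon_y + a_3 \epsilon_x \epsilon_y = \hat{x}_{11} + a_2 \epsilon_y + (a_1 + a_3 \epsilon_y)\epsilon_x \\
\hat{y} &= \hat{y}_{11} + b_1 \epsilon_x + b_2 \epsilon_y + b_3 \epsilon_x \epsilon_y = \hat{y}_{11} + b_2 \epsilon_y + (b_1 + b_3 \epsilon_y)\epsilon_x
\end{aligned} \tag{22}$$

which is the equation of a 2-D line written in parametric form.

Similarly, for lines parallel to the $y$ axis we have $\epsilon_x$ = constant and the interpolation functions can be written as

$$\begin{aligned}
\hat{x} &= \hat{x}_{11} + a_1 \epsilon_x + a_2 \epsilon_y + a_3 \epsilon_x \epsilon_y = \hat{x}_{11} + a_1 \epsilon_x + (a_2 + a_3 \epsilon_x)\epsilon_y \\
\hat{y} &= \hat{y}_{11} + b_1 \epsilon_x + b_2 \epsilon_y + b_3 \epsilon_x \epsilon_y = \hat{y}_{11} + b_1 \epsilon_x + (b_2 + b_3 \epsilon_x)\epsilon_y
\end{aligned} \tag{23}$$

which again is the equation of a 2-D line written in parametric form. □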


Proposition 3 The four boundaries of the rectangle $p_1 p_2 p_3 p_4$ are mapped into the boundaries of the quadrangle $\hat{p}_1\hat{p}_2\hat{p}_3\hat{p}_4$.

Proof The boundaries of the rectangle $p_1 p_2 p_3 p_4$ are parallel to the axes, and from Proposition 2 we know they are mapped into line segments in the transformed domain. From Proposition 1 we have $\{p_i \rightarrow \hat{p}_i,\ i = 1, \ldots, 4\}$. Since a line is uniquely determined by two points, and the transform functions in (17) are continuous, Proposition 3 follows. □


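Proposition 4 The bisectors of the rectangle $p_1 p_2 p_3 p_4$ are mapped into the bisectors of the quadrangle $\hat{p}_1\hat{p}_2\hat{p}_3\hat{p}_4$.

Proof Note that the bisectors of the rectangle $p_1 p_2 p_3 p_4$ are parallel to the axes, and from Proposition 2 their images are lines. To show that these lines are the bisectors of the quadrangle $\hat{p}_1\hat{p}_2\hat{p}_3\hat{p}_4$ we only need to prove that they bisect the opposite sides. First consider the bisector parallel to the $X$ axis:

$$\begin{cases} x = x_{11} + \epsilon_x, & 0 \le \epsilon_x \le 1 \\ y = y_{11} + 1/2 \end{cases} \tag{24}$$

Its image in the transform domain is

$$\begin{cases} \hat{x} = \hat{x}_{11} + \tfrac{1}{2}a_2 + (a_1 + \tfrac{1}{2}a_3)\epsilon_x \\ \hat{y} = \hat{y}_{11} + \tfrac{1}{2}b_2 + (b_1 + \tfrac{1}{2}b_3)\epsilon_x \end{cases} \tag{25}$$

When $\epsilon_x = 0$ we have

$$\begin{cases} \hat{x} = \hat{x}_{11} + \tfrac{1}{2}a_2 = \tfrac{1}{2}(\hat{x}_{11} + \hat{x}_{21}) \\ \hat{y} = \hat{y}_{11} + \tfrac{1}{2}b_2 = \tfrac{1}{2}(\hat{y}_{11} + \hat{y}_{21}) \end{cases} \tag{26}$$

which is the center of $\hat{p}_1$ and $\hat{p}_4$. Similarly at the other end we have $\epsilon_x = 1$ and

$$\begin{cases} \hat{x} = \hat{x}_{11} + \tfrac{1}{2}a_2 + (a_1 + \tfrac{1}{2}a_3) = \tfrac{1}{2}(\hat{x}_{12} + \hat{x}_{22}) \\ \hat{y} = \hat{y}_{11} + \tfrac{1}{2}b_2 + (b_1 + \tfrac{1}{2}b_3) = \tfrac{1}{2}(\hat{y}_{12} + \hat{y}_{22}) \end{cases} \tag{27}$$

which is the center of $\hat{p}_2$ and $\hat{p}_3$. So (25) bisects two opposite sides of the quadrangle $\hat{p}_1\hat{p}_2\hat{p}_3\hat{p}_4$. Similarly we can show the bisector parallel to the $Y$ axis bisects the other two opposite sides of the quadrangle $\hat{p}_1\hat{p}_2\hat{p}_3\hat{p}_4$. □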

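Proposition 5 The center of the rectangle $p_1 p_2 p_3 p_4$ is mapped into the intersection of the bisectors of the quadrangle $\hat{p}_1\hat{p}_2\hat{p}_3\hat{p}_4$, which is the center of the corners of the quadrangle $\hat{p}_1\hat{p}_2\hat{p}_3\hat{p}_4$.

Proof The center of the rectangle $p_1 p_2 p_3 p_4$ is at the intersection of the two bisectors. From Proposition 4, its image must be on the bisectors of the quadrangle $\hat{p}_1\hat{p}_2\hat{p}_3\hat{p}_4$, i.e. the intersection of the bisectors of the quadrangle $\hat{p}_1\hat{p}_2\hat{p}_3\hat{p}_4$.

The center of the rectangle $p_1 p_2 p_3 p_4$ is $(x_{11} + \tfrac{1}{2},\ y_{11} + \tfrac{1}{2})$. Its image in frame $t_2$ is

$$\begin{aligned}
\hat{x} &= \hat{x}_{11} + \tfrac{1}{2}(\hat{x}_{12} - \hat{x}_{11}) + \tfrac{1}{2}(\hat{x}_{21} - \hat{x}_{11}) + \tfrac{1}{4}(\hat{x}_{22} + \hat{x}_{11} - \hat{x}_{12} - \hat{x}_{21}) \\
&= \tfrac{1}{4}(\hat{x}_{11} + \hat{x}_{12} + \hat{x}_{21} + \hat{x}_{22}) \\
\hat{y} &= \hat{y}_{11} + \tfrac{1}{2}(\hat{y}_{12} - \hat{y}_{11}) + \tfrac{1}{2}(\hat{y}_{21} - \hat{y}_{11}) + \tfrac{1}{4}(\hat{y}_{22} + \hat{y}_{11} - \hat{y}_{12} - \hat{y}_{21}) \\
&= \tfrac{1}{4}(\hat{y}_{11} + \hat{y}_{12} + \hat{y}_{21} + \hat{y}_{22})
\end{aligned} \tag{28}$$

Hence

$$(\hat{x}, \hat{y}) = \tfrac{1}{4}(\hat{p}_1 + \hat{p}_2 + \hat{p}_3 + \hat{p}_4) \tag{29}$$

This shows that the center of the rectangle $p_1 p_2 p_3 p_4$ is mapped into the centroid of the corners of the quadrangle $\hat{p}_1\hat{p}_2\hat{p}_3\hat{p}_4$. □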

Proposition 6 For a point inside the rectangle $p_1 p_2 p_3 p_4$, its image is also inside the quadrangle $\hat{p}_1\hat{p}_2\hat{p}_3\hat{p}_4$ provided the quadrangle $\hat{p}_1\hat{p}_2\hat{p}_3\hat{p}_4$ is convex.

Proof Any point inside the rectangle $p_1 p_2 p_3 p_4$ in frame $t_1$ can be expressed as

$$\begin{cases} p = (x_{11} + \delta_x,\ y_{11} + \delta_y) \\ 0 \le \delta_x \le 1 \\ 0 \le \delta_y \le 1 \end{cases} \tag{30}$$

Its image in frame $t_2$ can be computed as [13]

$$\hat{p} = (1 - \delta_y)[(1 - \delta_x)\hat{p}_1 + \delta_x \hat{p}_2] + \delta_y[(1 - \delta_x)\hat{p}_4 + \delta_x \hat{p}_3] \tag{31}$$

From Proposition 3 we know that $\hat{p}' = (1 - \delta_x)\hat{p}_1 + \delta_x \hat{p}_2$ is on boundary $\hat{p}_1\hat{p}_2$ and $\hat{p}'' = (1 - \delta_x)\hat{p}_4 + \delta_x \hat{p}_3$ is on boundary $\hat{p}_4\hat{p}_3$. Hence the linear combination $\hat{p} = (1 - \delta_y)\hat{p}' + \delta_y \hat{p}''$ is inside the quadrangle $\hat{p}_1\hat{p}_2\hat{p}_3\hat{p}_4$ when the quadrangle $\hat{p}_1\hat{p}_2\hat{p}_3\hat{p}_4$ is convex. □

Proposition 7 The convexity of a quadrangle $a_1 a_2 a_3 a_4$ can be verified by checking the areas of the triangles $\triangle a_1 a_2 a_3$, $\triangle a_1 a_4 a_3$, $\triangle a_2 a_3 a_4$, and $\triangle a_2 a_1 a_4$. The quadrangle $a_1 a_2 a_3 a_4$ is convex if and only if

$$S_{\triangle a_1 a_2 a_3} + S_{\triangle a_1 a_4 a_3} = S_{\triangle a_2 a_3 a_4} + S_{\triangle a_2 a_1 a_4} \tag{32}$$

Proof Four different triangles can be formed from the corners of the quadrangle. If the quadrangle is convex, then the sum of the areas of the complementary triangles equals the area of the quadrangle. Hence equality holds for (32). On the other hand, when the quadrangle is concave, the sum of one pair of complementary triangles still equals the area of the quadrangle. But the other two complementary triangles contain areas outside the quadrangle. Hence the sum of their areas is larger than the area of the quadrangle and the two sides of (32) are not equal.

This is a simple way to check the convexity of a quadrangle. The area of a triangle $\triangle a_1 a_2 a_3$ can be calculated as

$$S_{\triangle a_1 a_2 a_3} = \pm\frac{1}{2} \begin{vmatrix} x_1 & y_1 & 1 \\ x_2 & y_2 & 1 \\ x_3 & y_3 & 1 \end{vmatrix} \tag{33}$$

where $x_i, y_i$, $i = \{1, 2, 3\}$, are the coordinates of $a_i$, $i = \{1, 2, 3\}$. □
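
The interpolation formula (17) and the convexity test of Proposition 7 can be sketched as follows. The function names are hypothetical and points are plain `(x, y)` tuples; `interpolate` takes the four matched corner locations and the fractional offsets $(\epsilon_x, \epsilon_y)$, and `is_convex` takes the corners in cyclic order (the matches of $p_1, p_2, p_3, p_4$) and compares the two triangle-area sums with a small tolerance, which is an assumption added for floating-point arithmetic.

```python
def interpolate(p11, p12, p21, p22, ex, ey):
    """Bilinear interpolation of a feature location from four corner matches.

    p11, p12, p21, p22 are the matches of the grid points (x11, y11),
    (x11 + 1, y11), (x11, y11 + 1) and (x11 + 1, y11 + 1);
    ex and ey are the offsets of the feature from (x11, y11), each in [0, 1].
    """
    return tuple(
        p11[c]
        + (p12[c] - p11[c]) * ex
        + (p21[c] - p11[c]) * ey
        + (p22[c] + p11[c] - p12[c] - p21[c]) * ex * ey
        for c in (0, 1)
    )

def triangle_area(a, b, c):
    """Unsigned area of the triangle abc (half the absolute determinant)."""
    return abs((b[0] - a[0]) * (c[1] - a[1]) - (c[0] - a[0]) * (b[1] - a[1])) / 2.0

def is_convex(a1, a2, a3, a4, tol=1e-9):
    """Convexity test for the quadrangle a1 a2 a3 a4 given in cyclic order."""
    left = triangle_area(a1, a2, a3) + triangle_area(a1, a4, a3)
    right = triangle_area(a2, a3, a4) + triangle_area(a2, a1, a4)
    return abs(left - right) <= tol * max(1.0, left, right)
```

Checking `is_convex` on the four matches before interpolating is one way to detect a bad corner match, since Proposition 6 only guarantees that the interpolated location stays inside the quadrangle when it is convex.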

3.4 Displacement Prediction

The compensation for global rotation, scale, and translation between the two frames makes the displacement of feature locations between consecutive frames more predictable. Accordingly, the search window for matching can be decreased to reduce computations and mismatches. Nevertheless, since the amount of displacement is inversely proportional to the depth of the object along the optical axis and proportional to the distance between the feature point and the focus of expansion in the image domain, for an object moving close to the camera, the displacements of corresponding features can be quite large. Fortunately, when the motion of the camera is compensated first, feature displacements can only be translations, due to the motions of objects, and "pure" perspective distortions due to changes in the depths of the objects. Hence the displacements of the features in the new frame can be predicted using results from previous frames. We have used a zeroth order predictor; the displacement of a feature in the two preceding frames is used as the bias in tracking the feature in the current frame.
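
A minimal sketch of such a predictor is given below. It assumes that feature locations are stored as `(x, y)` tuples in the camera-motion-compensated coordinate system, and that the search for the best match is carried out over a square window of grid points centered on the predicted location; both function names and the window shape are illustrative assumptions.

```python
def predict_location(prev2, prev1):
    """Predict the next location by repeating the last observed displacement.

    prev2 and prev1 are the feature locations in the two preceding frames.
    """
    dx = prev1[0] - prev2[0]
    dy = prev1[1] - prev2[1]
    return (prev1[0] + dx, prev1[1] + dy)

def search_window(center, half_width):
    """Grid points of a square search window around the predicted location."""
    cx, cy = int(round(center[0])), int(round(center[1]))
    return [(x, y)
            for y in range(cy - half_width, cy + half_width + 1)
            for x in range(cx - half_width, cx + half_width + 1)]
```

Because the prediction absorbs most of a steady displacement, `half_width` can stay small even for objects that move quickly in the image.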

4  Motion  Detection 

4.1  Change  Detection 

With the camera motion parameters estimated using the procedure discussed in Section 2, we transform the images into a common coordinate system and take the difference between successive frames. We then detect changed parts from the camera motion compensated frame difference. Two factors need to be addressed. First, since images always contain noise, an intensity change between frames can be due to either object motion or image noise. A difference between intensity changes due to object motion and changes due to image noise is that changes due to object motion usually occur in conjunction with changes in the intensities of nearby pixels, while changes due to image noise are usually isolated. Thus by checking the values of both the average of the local intensity changes and the on-site intensity change, we can suppress image noise and detect changes due to object motion. Secondly, since the value of the intensity change is dependent on the local contrast of the image, in high-contrast areas intensity interpolation due to camera motion compensation can cause intensity changes. Hence the threshold used for segmenting moving parts should be adjusted according to the local image contrast.

In this paper we classify a pixel $(i, j)$ as belonging to a changing part if

$$\min\{\bar{d}_{ij},\ d_{ij}\} > \max\{\gamma D_{ij},\ T\} \tag{34}$$

where

$\bar{d}_{ij}$ : average of the intensity changes $d$ over the neighborhood $\omega$ of $(i, j)$

$D_{ij}$ : local image contrast over the neighborhood $\omega$ of $(i, j)$

$d_{ij} = |f_{ij} - f'_{ij}|$

$f_{ij}$ : value in the first frame

$f'_{ij}$ : value in the second frame

$\gamma$ : threshold for relative difference

$T$ : threshold for absolute difference

$\omega$ : the neighborhood clique

In our implementation, we took $\omega$ to be the 3 x 3 neighborhood.
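
The change test can be sketched as below. The sketch is hedged in two places: the local contrast measure is taken to be the intensity range (maximum minus minimum) of the first frame over the 3 x 3 neighborhood, which is a hypothetical choice and not necessarily the measure used in the paper, and border pixels are simply left unclassified. The function name and default thresholds are likewise illustrative.

```python
def detect_changes(f1, f2, gamma=0.5, T=10.0):
    """Mark pixels whose change is both locally supported and above threshold.

    f1, f2: camera-motion-compensated frames as lists of rows.
    gamma:  threshold for the difference relative to local contrast.
    T:      threshold for the absolute difference.
    """
    rows, cols = len(f1), len(f1[0])
    d = [[abs(f1[r][c] - f2[r][c]) for c in range(cols)] for r in range(rows)]
    changed = [[False] * cols for _ in range(rows)]
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            nbrs = [(r + dr, c + dc) for dr in (-1, 0, 1) for dc in (-1, 0, 1)]
            local_avg = sum(d[a][b] for a, b in nbrs) / 9.0
            values = [f1[a][b] for a, b in nbrs]
            contrast = max(values) - min(values)   # hypothetical contrast measure
            changed[r][c] = min(local_avg, d[r][c]) > max(gamma * contrast, T)
    return changed
```

Taking the minimum of the local average and the on-site change rejects isolated noise pixels, whose neighborhood average is small, while the maximum on the right-hand side raises the threshold in high-contrast areas.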

4.2  Motion  Estinuition 

When  more  than  two  successive  frames  are  available,  we 
can  segment  moving  objects  from  the  background  uid 
estimate  their  velocities. 

Let  n  be  the  set  of  pixels  which  belong  to  a  moving 
object,  and  let  Di,  (23  and  Os  be  these  sets  of  pixels  in 
the  camera  motion  compensated  frames  fi,  ^3  and  <3  re¬ 
spectively.  Using  the  method  discussed  in  Section  4.1  we 
can  detect  Si  =  Hi  U(23  from  the  difference  of  frames  ti 
and  <3  and  53  =  03  U  Da  from  the  difference  of  frames 
I3  and  ti.  In  general,  due  to  the  similarity  between 
pixels  belonging  to  the  same  object,  the  virtually  de¬ 
tected  changed  parts  are  Sj,  which  is  a  subset  of  5,,  for 
i  =  1,2.  Assuming  that  the  displacement  from  Qi  to  Q3 
is  i^itVi),  we  have 

S{=  Si -St  (35) 

where 

51  =  {(•>i)l(ij)6ninna,/i{»j)=yi(«-“iv;-*'i)) 
Similarly,  assuming  that  the  displacement  from  Hi  to  fla 
is  (u3,  V3)  we  have 

^2  =  53  -  ^  (36) 

where 

52  —  {(••  i)l(t,j)6nannj./3(«j)=/j(i-ujJ-»a)} 


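Note that

$(i, j) \in \Omega_1 \Leftrightarrow (i + u_1, j + v_1) \in \Omega_2 \Leftrightarrow (i + u_1 + u_2, j + v_1 + v_2) \in \Omega_3$

So

$\Omega_2 = \Omega_1 \oplus (u_1, v_1)$ (37)

$\Omega_3 = \Omega_2 \oplus (u_2, v_2) = \Omega_1 \oplus (u_1 + u_2, v_1 + v_2)$ (38)

where $\oplus$ adds the coordinates of each pixel in the left
operand to those in the right operand.

When $u_1 = u_2 = u$ and $v_1 = v_2 = v$ we have

$(i, j) \in S'_1 \Leftrightarrow (i + u, j + v) \in S'_2$ (39)

so that

$S'_2 = S'_1 \oplus (u, v)$ (40)

Let $(\bar{i}_t, \bar{j}_t)$ be the mass center of $S'_t$; then

$\bar{i}_1 = \frac{1}{N_{S'_1}} \sum_{(i,j) \in S'_1} i, \qquad \bar{j}_1 = \frac{1}{N_{S'_1}} \sum_{(i,j) \in S'_1} j$ (41)

$\bar{i}_2 = \frac{1}{N_{S'_2}} \sum_{(i,j) \in S'_2} i = \frac{1}{N_{S'_1}} \sum_{(i,j) \in S'_1} (i + u) = \bar{i}_1 + u, \qquad \bar{j}_2 = \bar{j}_1 + v$ (42)

where $N_{S'_t}$ is the number of pixels in set $S'_t$ for $t = 1, 2$.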
The  motion  velocity  (u,  v)  can  then  be  estimated  from 
the  change  in  the  mass  center  of  the  detected  changed 
area. 
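
The mass-center argument above can be illustrated with a minimal sketch. It assumes the changed regions are available as sets of integer pixel coordinates; the function names `shift_set`, `mass_center` and `estimate_velocity` are hypothetical and chosen only for this illustration.

```python
def shift_set(pixels, u, v):
    """The 'add coordinates' operator: translate every pixel by (u, v)."""
    return {(i + u, j + v) for (i, j) in pixels}


def mass_center(pixels):
    """Mean coordinates of a non-empty set of pixels."""
    n = len(pixels)
    if n == 0:
        raise ValueError("empty pixel set")
    return (sum(i for i, _ in pixels) / n, sum(j for _, j in pixels) / n)


def estimate_velocity(s1, s2):
    """Velocity estimate as the change in mass center between two regions."""
    (i1, j1), (i2, j2) = mass_center(s1), mass_center(s2)
    return (i2 - i1, j2 - j1)


# Toy data (hypothetical): a small region and its copy translated by (3, -1).
s1 = {(10, 4), (11, 4), (10, 5), (12, 6)}
s2 = shift_set(s1, 3, -1)
print(estimate_velocity(s1, s2))
```

In this idealized case the second region is an exact translate of the first, so the difference of mass centers recovers the displacement; with real detections the two regions only approximately correspond, which is why a refinement step follows.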

We then refine the velocity estimate by shifting the
first frame by (u, v) and computing the subpixel accuracy
match using (4). While doing this, only pixels belonging to
$S'_1$ are used to get I and G in (5) and (6). We next
construct a candidate set for $\Omega_1$, $\bar{\Omega}_1$, as the closure of
$S'_1$. $\bar{\Omega}_1$ is recursively generated from $S'_1$ by connecting
any two points which are inside $\bar{\Omega}_1$ and have the same
value for one of their coordinates. Finally, we detect
$\Omega_1$ by translating frame $f_1$ by the estimated (u, v) (after
subpixel accuracy refinement) to get $f'_1$, and detecting the
zeros of $|f'_1 - f_2|$. A pixel in $\bar{\Omega}_1$ is classified as belonging
to $\Omega_1$, the moving object, if $|f'_{1,ij} - f_{2,ij}| < \epsilon$ and at least one
of the following conditions is satisfied: (1) $(i, j) \in S'_1$;
or (2) at least one of its 3 x 3 neighbors belongs to $\Omega_1$,
where $\epsilon$ is a threshold for detecting the zeros in the frame
difference.


5  Experiments 

5.1  Motion  Correspondence 

Our first example shows the improvements in camera
motion estimation that result from using the subpixel
accuracy feature matching technique discussed in Section
2.2. Figure 2 shows a pair of balloon images with
synthetic camera motion [14]. The true camera motion
parameters are $\theta = 90^\circ$, $\delta x = 50$, $\delta y = 50$, and $s = 0.9$.

In [14] it was reported that the estimated parameters
were $\theta = 89.8083^\circ$, $\delta x = 49.907$, $\delta y = 50.047$, and
$s = 0.90138$. For the same set of images, the improved
image registration algorithm discussed in Section 2.2
yields $\theta = 90.000^\circ$, $\delta x = 50.0010$, $\delta y = 49.9997$, and
$s = 0.900018$. The improvement is significant: the error in
every parameter is reduced by more than an order of magnitude.
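
As a quick check of the figures quoted above, the following sketch computes the absolute error of each estimate against the true synthetic motion. The dictionary keys are labels chosen here for illustration; the numbers are the ones given in the text.

```python
# True synthetic camera motion and the two sets of estimates quoted in the text.
true_params = {"theta_deg": 90.0, "dx": 50.0, "dy": 50.0, "scale": 0.9}
reported_in_14 = {"theta_deg": 89.8083, "dx": 49.907, "dy": 50.047, "scale": 0.90138}
improved = {"theta_deg": 90.000, "dx": 50.0010, "dy": 49.9997, "scale": 0.900018}


def abs_errors(estimate, truth):
    """Absolute error of each parameter."""
    return {name: abs(estimate[name] - truth[name]) for name in truth}


err_old = abs_errors(reported_in_14, true_params)
err_new = abs_errors(improved, true_params)
for name in true_params:
    print(f"{name}: {err_old[name]:.6f} -> {err_new[name]:.6f}")
```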



Figure  2:  Images  with  synthetic  camera  motion. 

Our first tracking example is the PUMA2 sequence
[2]. The sequence consists of thirty 256 x 256 images.^
The camera motion is a continuous rotation. For discussion
purposes, in this example we give the tracking
results on twenty manually selected points.^ Figure 3.a
shows the selected feature points on the first frame.
Figures 3.b and 3.c show their trajectories tracked up to
frames 15 and 30 before the subpixel accuracy tracking
technique was implemented. Figures 3.d and 3.e show
the trajectories tracked up to frames 15 and 30 after
the subpixel accuracy tracking technique was used.
Figure 3.f shows the feature points hand-picked from frame
30. Note that without enforcing subpixel accuracy tracking,
most of the trajectories gradually drift away from
the initially chosen points. Comparison of Figures 3.c, 3.e
and 3.f shows that the error in feature drift has been


^There are black strips on the tops of the original images.
In Figure 3 these black strips have been removed, i.e. only
the bottom 256 x 242 is displayed.

^Results of tracking another set of manually selected
points and a set of automatically detected feature points are
given in [13].




Figure 3: Motion correspondence for PUMA2 sequence. (a) Feature points are manually picked in the first frame.
(b) Trajectories of the feature points tracked up to the 15th frame without using the subpixel accuracy tracking
technique. (c) Trajectories of the feature points tracked up to the 30th frame without using the subpixel accuracy
tracking technique. (d) Trajectories of the feature points tracked up to the 15th frame after using the subpixel
accuracy tracking technique. (e) Trajectories of the feature points tracked up to the 30th frame after using the
subpixel accuracy tracking technique. (f) The same feature points manually picked in the 30th frame.

significantly  reduced.  Table  1  lists  the  coordinates  of 
the  feature  points  obtained  by  our  algorithm  as  well  as 
those  selected  manually.  The  differences  in  the  feature 
coordinates  are  within  the  limits  of  accuracy  for  manual 
selection. 

Figure 4 shows an example using an outdoor image
sequence. The original sequence consists of thirty 512 x 512
images. In this sequence, the motion of the camera is not
smooth. Also, there are rapidly moving clouds causing
significant illumination changes. In our experiments we
first cut 36 lines from the top and bottom of each frame
to remove the black strips on the bottoms of the original
images, so the sequence we are working with consists of
the central 512 x 440 subimages of the original sequence.

Figure 4.a shows feature points automatically detected
from the first frame. Figures 4.b and 4.c show their
trajectories tracked up to frames 15 and 30 respectively.

As shown in Figure 4, the trajectories are ragged due to
non-smooth camera motion. Nevertheless, the algorithm
tracks the same features.

Our third tracking example is a sequence taken from a
camera on a helicopter which was flying along a straight
line path between rows of trucks parked on the runway
[9]. The original sequence consists of ten 512 x 512 images.
The images in this sequence exhibit perspective
deformations of close objects. Figure 5.a shows the feature
points detected automatically in the first frame.
Figures 5.b and 5.c show the trajectories up to frames 6
and 10 respectively. The tracking is successful.

Figure 4: Motion correspondence for the rocket sequence.
(a) Feature points are automatically detected
in the first frame. (b) Trajectories of the feature points
tracked up to the 15th frame. (c) Trajectories of the
feature points tracked up to the 30th frame.

Figure 5: Motion correspondence for NASA helicopter
line sequence. (a) Feature points are automatically detected
in the first frame. (b) Trajectories of the feature
points tracked up to the 6th frame. (c) Trajectories of
the feature points tracked up to the 10th frame.

Figure 6: Moving object detection from a helicopter sequence (input frames 7 and 8). (a) Input frames. (b) Changing regions segmented from
camera motion compensated frame differences. (c) Moving part detected using our algorithm. (d) Superimposition
of the contour of the detected moving part on the image to show the correctness of motion detection.

Figure 7: Moving object detection from an infra-red image sequence (input frames 46 and 47). (a) Input frames. (b) Changing regions
segmented from camera motion compensated frame differences. (c) Moving part detected using our algorithm. (d)
Superimposition of the contour of the detected moving part on the image to show the correctness of motion detection.

5.2  Motion  Detection 

Figures 6 and 7 show two motion detection results
obtained by our algorithm. Figure 6 is the detection of
a flying helicopter from an image sequence taken from
another helicopter. Figure 7 is the detection of a moving
automobile from an infra-red image sequence. In
both examples, (a) shows two frames of the input image
sequence; (b) shows the changing parts segmented
from the camera motion compensated frame difference;
(c) shows the detected object. The estimated velocity
for the moving helicopter is (u, v) = (-2.94, 0.02) from
frames #6 to #7 and (u, v) = (-2.92, 1.26) from frames
#7 to #8. The estimated velocity for the moving car
is (u, v) = (15.56, 0.47) from frames #45 to #46 and
(u, v) = (13.87, 0.36) from frames #46 to #47. (d) shows
the contours of detected moving objects superimposed
on the camera motion compensated images. As we see,
in spite of the complex background, poor image quality,
and unknown camera motion, motion detection is
quite good. The moving car is detected correctly from
the infra-red image sequence. In the flying helicopter
sequence, some parts of the background are also detected
as belonging to the moving object. This is due to the
fact that these background areas are uniform, yielding
zeros in the translated frame difference.
Abstract

In the first part of this paper we show that a
new technique exploiting 1D correlation of 2D
or even 1D patches between successive frames
may be sufficient to compute a satisfactory
estimation of the optical flow field. The algorithm
is well-suited to VLSI implementations.

The sparse measurements provided by the technique
can be used to compute qualitative properties
of the flow for a number of different visual
tasks. In particular, the second part of the
paper shows how to combine our 1D correlation
technique with a scheme for detecting expansion
or rotation ([1]) in a simple algorithm
which also suggests interesting biological
implications. The algorithm provides a rough
estimate of time-to-crash. It was tested on real
image sequences. We show its performance and
compare the results to previous approaches.

1  Introduction 

The problem of how to efficiently compute estimates of
the optical flow at sparse locations is of critical importance
for practical implementations in a number of different
tasks. A specific example is the detection of expansion
of the visual field with a rough estimate of time-to-crash
(TTC). The question also has interesting relations
with biology, as we will discuss later. In this paper we
propose an efficient algorithm for computing the optical
flow which performs well in a number of experiments
with sequences of real images and is well suited to a VLSI
implementation.

Optical flow algorithms based on patchwise correlation
of filtered images perform in a satisfactory way [2] and
better in practice than most other approaches (see [3]).
Their main drawback is computational complexity, which
at present forbids useful VLSI implementations. In this
paper we show that 1D patchwise correlation may provide
a sufficiently accurate estimate of the optical flow*.

*In this paper we use mainly the $L_2$ distance rather than
the correlation itself. Since the two measures are equivalent

We will then show with experiments on real image
sequences how to apply this technique to measure
time-to-crash, by exploiting a recently proposed scheme [1]. The
latter scheme, which is robust and invariant to the position
of the focus of expansion or the center of rotation,
relies on sparse measurements of either the normal or
the tangential component of the optical flow (relative to
a closed contour). We will also discuss some broad
implications of this work for the practical computation of
the optical flow and for biology, in particular its relation
to Reichardt-type models.

There  are  two  main  and  quite  separate  contributions 
in  this  paper: 

1. an efficient 1D correlation scheme to estimate the
optical flow along a desired direction

2. the experimental demonstration that a previously
proposed algorithm for estimating time-to-crash
performs satisfactorily in a series of experiments
with real images in which the elementary measurements
of the flow are obtained by the new 1D
correlation scheme.

2  Computing  the  Optical  Flow  along  a 
Direction 

How can the component of the optical flow be measured
efficiently along a certain desired direction? As argued
by Verri and Poggio [4] a qualitative estimate is often
sufficient for many visual tasks. For the task of detecting
a potential crash, for instance, it has been suggested
([1]) that a precise measurement of the normal component
of the flow may not be necessary, since the precise
definition of the optical flow is itself somewhat arbitrary:
it is sufficient that the estimate be qualitatively consistent
with the values of the perspective 2D projection
of the "true" 3D velocity field for the particular stimulus.
In other words, even estimates that don't really
measure image-plane velocity (like Reichardt's correlation
model or equivalent energy models), since they also


for the purposes of this paper, we will often use the terms
"correlation" and "distance" in an interchangeable way.
depend on spatial structure of the image, may be acceptable
for several visual tasks, if their estimates are
consistent over the visual field. Certain uses of a crash
detector are good examples. It turns out that even a
rough estimate of time-to-crash (TTC) is possible using
approximate estimates of the optical flow field. Flies and
other insects rely for landing on what appears to be a
qualitative estimate of the time-to-crash!

2.1 1D correlation of 2D patches

A possible approach for an approximate estimate of
the optical flow is to use a 1D correlation scheme between
two successive frames, instead of 2D correlation,
as in [2]. The basic idea underlying the full 2D correlation
technique, which we label 2D-2D in this paper^, is to
measure, for each desired location, the (x, y) shift that
maximizes the correlation between 2D patches centered
around the desired location in successive frames. The
patchwise correlation between the image at time $t$ and
at time $t + \delta t$ is defined as

$\phi(\delta_x, \delta_y) = \int I^w(\xi, \eta; t)\, I(\delta_x + \xi, \delta_y + \eta; t + \delta t)\, d\xi\, d\eta$ (1)

where $I^w(\xi, \eta; t)$ is the image at time $t$ windowed to the
patch of interest and set to 0 outside it. The $L_2$ distance^
has very similar properties to the correlation measure.
In the context of this paper, minimizing the $L_2$ distance
is exactly equivalent to maximizing the correlation (the
observation is due to F. Girosi). As noticed before [2],
the previous idea can be regarded as an approximation
of a regularization solution to the problem of computing
the optical flow^. Usually, one does not use grey values
directly but rather some filtered version of the image,
for instance through a Laplacian-of-a-Gaussian filter (see
[2]), possibly at different resolutions.

Figure 1: The search space for the 1D-2D scheme used
for the computation of the x and y components of the
optical flow.

Let us call $D(\delta_x, \delta_y)$ the $L_2$ distance between 2 patches
in 2 frames at location (x, y) as a function of the shift
vector $(\delta_x, \delta_y)$. The "winner-take-all" scheme finds $s^* =
(\delta_x^*, \delta_y^*)$ that minimizes $D$ (or maximizes the correlation
function $\phi(\delta_x, \delta_y)$) and assumes that the optical flow
estimate is $u^* = s^*/\Delta t$, where $\Delta t$ is the interframe
interval.

It is natural to consider whether the component of
$u^*$ along a given direction, for instance x, may be
estimated in a satisfactory way simply by computing the
$\delta_x$ that minimizes $D(\delta_x, 0)$, that is the patchwise
correlation as a function of x shifts only. We have found
in our experiments that 1D correlation of a 2D patch
provides estimates of $\delta_x^*$ that are very close to the
estimates obtained from the 2D-2D technique. We label
this technique 1D-2D, since it involves one-dimensional
correlations on 2D patches.

^It is also called the winner-take-all method.

^The $L_2$ distance is in this case the square root of the
sum of the squares of the differences between values of
corresponding pixels. Other "robust" distance metrics may be
used, such as the sum of absolute values.

^And in turn several definitions of the optical flow, such
as Horn and Schunck's, can be shown to be approximations of
the correlation technique [5].


If we combine horizontal and vertical motion detectors
of our 1D, winner-take-all type (see fig. 1), we obtain an
appealing scheme to estimate the optical flow field at
one point. The optical flow at one point is the vector
sum of the x and y components computed by using such
motion detectors. The key aspect of this approach is its
reduction of the complexity of the problem, while maintaining
a good estimation of the flow field: a complete
two-dimensional search required in the winner-take-all
scheme [2] is reduced to two one-dimensional searches.
Let us call $v_{max}$ the maximum velocity expected on the
image plane. In [2] the search space size to scan is
$(2 v_{max} + 1)^2$ for each point; in our approach, its size
is limited to $2(2 v_{max} + 1)$.
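
The difference between the two searches can be sketched in a few lines. This is only an illustration under stated assumptions: it uses the sum of squared differences (the squared $L_2$ distance, which has the same minimizer), integer shifts, and a hypothetical periodic test pattern (`texture`) chosen so that the two 1D searches recover the 2D answer exactly; on real images the agreement is only approximate. All function names are made up for this sketch.

```python
def texture(x, y):
    # Hypothetical test pattern, periodic with period 5 along x and along y.
    a = (0, 3, 1, 4, 2)
    b = (0, 7, 2, 9, 5)
    return a[x % 5] + b[y % 5]


def make_frame(size, u, v):
    # Frame showing the pattern translated by (u, v) pixels.
    return [[texture(x - u, y - v) for x in range(size)] for y in range(size)]


def patch_distance(f0, f1, x, y, dx, dy, r):
    # Squared L2 distance between the patch of radius r around (x, y) in f0
    # and the patch around (x + dx, y + dy) in f1.
    total = 0
    for j in range(-r, r + 1):
        for i in range(-r, r + 1):
            d = f0[y + j][x + i] - f1[y + j + dy][x + i + dx]
            total += d * d
    return total


def flow_2d2d(f0, f1, x, y, r, vmax):
    # Full winner-take-all search over (2*vmax + 1)**2 candidate shifts.
    shifts = [(dx, dy) for dy in range(-vmax, vmax + 1)
              for dx in range(-vmax, vmax + 1)]
    best = min(shifts, key=lambda s: patch_distance(f0, f1, x, y, s[0], s[1], r))
    return best, len(shifts)


def flow_1d2d(f0, f1, x, y, r, vmax):
    # Two independent 1D searches over 2*(2*vmax + 1) candidate shifts.
    rng = range(-vmax, vmax + 1)
    dx = min(rng, key=lambda s: patch_distance(f0, f1, x, y, s, 0, r))
    dy = min(rng, key=lambda s: patch_distance(f0, f1, x, y, 0, s, r))
    return (dx, dy), 2 * len(rng)


f0 = make_frame(16, 0, 0)
f1 = make_frame(16, 1, -2)
print(flow_2d2d(f0, f1, 8, 8, 2, 2))
print(flow_1d2d(f0, f1, 8, 8, 2, 2))
# Search-space sizes for vmax = 9, as used in the experiments of this paper:
print((2 * 9 + 1) ** 2, 2 * (2 * 9 + 1))  # 361 38
```

The patch here spans exactly one period of the pattern, which removes the cross-talk between the x and y searches; that is a property of this toy input, not a claim about images in general.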

2.2 1D correlation of 1D patches

So far we have argued that 1D correlation of 2D
patches gives a satisfactory estimate of the optical flow
between two successive frames, while reducing the search
space of corresponding points. This is equivalent to saying that

$\min_{\delta_x} \phi(\delta_x, 0)$

and

$\min_{\delta_y} \phi(0, \delta_y)$

give a satisfactory estimate of

$\min_{\delta_x, \delta_y} \phi(\delta_x, \delta_y)$.

This suggests a further simplification: instead of $\phi(\delta_x, 0)$
consider a projection on x of $\phi(\delta_x, \delta_y)$ obtained by some
form of averaging operation on y, that is

$\phi * h_2$
where $h_2$ is a 2D filter such as a Gaussian elongated in
the y direction and $*$ stands for the convolution operator.
By well known properties of the Gaussian function, $h_2$
can always be written as

$h_2 = h * h$,

where $h$ are Gaussian functions of appropriate variance.
Assuming that we can neglect the patch size in the definition
of $\phi$, we can write:

$\phi * h_2 = (h * I_t) \otimes (I_{t + \delta t} * h)$ (2)

where $I_t = I(x, y, t)$.

Thus, in the approximation of a large patch size,
projecting the correlation function is equivalent to appropriately
filtering the two images before correlation. Since
it is usually better to discount the average intensity as
well as small gradients through a high-pass filtering
operation, in order to estimate the x-component of u, we
just perform a Gaussian smoothing in the y direction, as
shown in eq. 2, and then perform an additional convolution
with the first or second derivative of a Gaussian
function elongated in the x direction. Therefore the
intensity function that is used in practice in the correlation
operation is:

$\tilde{I}_t = (G_{\sigma_y}(y) * I_t) * G''_{\sigma_x}(x)$ (3)

where $\sigma_x$ and $\sigma_y$ define the receptive field of such an
elementary motion detector. After this filtering step, it
is sufficient to evaluate the maximum of the correlation
function only on 1D patches to obtain an estimate of the
x component of the flow. The previous argument does
not strictly apply to the $L_2$ distance measure that we
have used in our experiments. The very close similarity
between correlation and distance, however, suggests a
very similar behavior in both cases. We label this technique
the 1D-1D scheme since it involves 1D correlations
of 1D patches.

Figure 2: A TTC detector consisting of elementary motion
detectors (see figure 1) at several locations along a
closed contour. Each of the elementary motion detectors
could be replaced by a single detector normal to the
circle.
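
A minimal sketch of the 1D-1D idea follows, under several assumptions that are not taken from the paper: the Gaussian and its second derivative are simply sampled at integer positions and truncated at about three standard deviations, borders are clamped, shifts are integers, and the squared $L_2$ distance replaces the correlation. The names `gaussian_kernel`, `convolve_1d`, `prefilter` and `flow_x_1d1d` are hypothetical.

```python
import math


def gaussian_kernel(sigma, order=0):
    # Sampled Gaussian (order 0) or its second derivative (order 2).
    radius = int(3 * sigma + 0.5)
    kernel = []
    for t in range(-radius, radius + 1):
        g = math.exp(-t * t / (2.0 * sigma * sigma))
        if order == 2:
            g *= (t * t - sigma * sigma) / sigma ** 4
        kernel.append(g)
    return kernel


def convolve_1d(values, kernel):
    # Same-size 1D convolution; samples outside the signal are clamped.
    r = len(kernel) // 2
    n = len(values)
    out = []
    for c in range(n):
        acc = 0.0
        for k, w in enumerate(kernel):
            idx = min(max(c + r - k, 0), n - 1)
            acc += w * values[idx]
        out.append(acc)
    return out


def prefilter(img, sigma_x, sigma_y):
    # Smooth along y with a Gaussian, then apply a second-derivative
    # Gaussian along x, in the spirit of eq. (3).
    h, w = len(img), len(img[0])
    gy = gaussian_kernel(sigma_y)
    gxx = gaussian_kernel(sigma_x, order=2)
    cols = [convolve_1d([img[y][x] for y in range(h)], gy) for x in range(w)]
    smoothed = [[cols[x][y] for x in range(w)] for y in range(h)]
    return [convolve_1d(row, gxx) for row in smoothed]


def flow_x_1d1d(g0, g1, x, y, r, vmax):
    # 1D search along x on a 1D patch (one row segment) of the filtered frames.
    def dist(dx):
        return sum((g0[y][x + i] - g1[y][x + i + dx]) ** 2
                   for i in range(-r, r + 1))
    return min(range(-vmax, vmax + 1), key=dist)


def pattern(x, y):
    # Hypothetical smooth, non-periodic test pattern.
    return math.sin(0.7 * x) + math.cos(0.31 * x + 0.2 * y)


size = 40
frame0 = [[pattern(x, y) for x in range(size)] for y in range(size)]
frame1 = [[pattern(x - 2, y) for x in range(size)] for y in range(size)]
g0 = prefilter(frame0, 1.0, 2.0)
g1 = prefilter(frame1, 1.0, 2.0)
print(flow_x_1d1d(g0, g1, 20, 20, 5, 3))
```

The example uses a pure horizontal translation so that the filtered rows match exactly at the true shift; with vertical motion or real imagery the 1D estimate is only expected to be close.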

3 A crash detector: the Green theorem
scheme

As described in [1] (see also [6]), the divergence of the
optical field $\nabla \cdot \mathbf{u}(x, y)$ is a differential measure of the local
expansion ($\nabla \cdot \mathbf{u}(x, y) = \partial u_x / \partial x + \partial u_y / \partial y$). For a linear
field (i.e. $\mathbf{u}(\mathbf{x}) = A\mathbf{x}$), the divergence of $\mathbf{u}$ is the same
everywhere. In the case of linear fields (and all fields can
be approximated by linear fields close to the singularity),
the integral of the divergence over an area is invariant
with respect to the position of the center of expansion.
Green's theorems show that the integral over a surface
patch $S$ of the divergence of a field $\mathbf{u}$ is equal to the
integral along the patch boundary of the component of
the field which is normal to the boundary ($\mathbf{u} \cdot \mathbf{n}$). In
formula

$\int_S \nabla \cdot \mathbf{u}(x, y)\, dx\, dy = \oint \mathbf{u} \cdot \mathbf{n}\, dl.$ (4)


Therefore, since for a linear field $\nabla \cdot \mathbf{u} = 2/\tau$, where $\tau$
is the time to crash (TTC), a TTC detector that exploits
the Green theorem just needs to sum over a closed
contour, say a circle, the normal component of the flow
measured at n points along the contour. We assume
that the task is to compute time to crash (TTC) for
pure translational motion. Possibly the simplest TTC
detector of this type, shown in figure 3, is composed of
just 4 elementary motion detectors. In this case we have
to sum the x-component of u for the horizontal detectors
and the y-component of u for the vertical ones, with the
correct sign.
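
The detector just described can be sketched as follows, assuming an ideal linear expansion field $\mathbf{u} = (\mathbf{x} - \mathbf{x}_0)/\tau$ (whose divergence is $2/\tau$) and exact flow samples at n equally spaced points on a circle. The function names and the discretization of the contour integral are illustrative assumptions, not the implementation used in the experiments.

```python
import math


def linear_expansion_flow(x, y, foe, tau):
    # Ideal expansion field with focus of expansion `foe` and time-to-crash `tau`.
    return ((x - foe[0]) / tau, (y - foe[1]) / tau)


def ttc_from_contour(flow, center, radius, n):
    # Sum the normal flow component at n points on a circle (Green's theorem),
    # convert the flux to a mean divergence, and use divergence = 2 / tau.
    total = 0.0
    for k in range(n):
        angle = 2.0 * math.pi * k / n
        nx, ny = math.cos(angle), math.sin(angle)
        ux, uy = flow(center[0] + radius * nx, center[1] + radius * ny)
        total += ux * nx + uy * ny
    flux = total * (2.0 * math.pi * radius / n)      # approximates the contour integral
    divergence = flux / (math.pi * radius * radius)  # mean divergence over the disc
    return 2.0 / divergence  # undefined for a static or purely rotating field


def example_flow(x, y):
    # Hypothetical scene: focus of expansion off-center, 25 frames to contact.
    return linear_expansion_flow(x, y, (13.0, -7.0), 25.0)


print(ttc_from_contour(example_flow, (0.0, 0.0), 40.0, 4))
print(ttc_from_contour(example_flow, (0.0, 0.0), 80.0, 32))
```

With four detectors the sum reduces to the signed x-components at the left and right points plus the signed y-components at the top and bottom points. For this ideal field the estimate does not depend on where the focus of expansion lies, which is the invariance discussed in the text.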

Due to the invariance with respect to the position of
the focus of expansion (or contraction) we can in principle
arrange a certain number of them (see fig. 4) on the
image plane. Our simulations suggest that one detector
with a large radius (fig. 3) is better than several
"smaller" detectors (fig. 4) in situations in which the
whole visual field expands, probably because of better
numerical stability of the estimates. Of course, a "large"
detector has a poorer spatial resolution and this may be
a problem in some applications (but not ours).

We have discussed so far schemes for detecting expansion.
Similar arguments hold for rotation. The Green's
theorem relevant to this case is usually called Stokes'
theorem and takes the form

$\int_S \nabla \wedge \mathbf{u}(x, y) \cdot d\mathbf{S} = \oint_C \mathbf{u} \cdot d\mathbf{r}$ (5)

which says that the total flux of the differential measure
of "rotationality" of the field $\nabla \wedge \mathbf{u}$ across the surface
patch $S$ is equal to the integral along the boundary $C$
of the surface patch of the component of the field which
is tangential to the boundary. As described in [1], each




Figure 3: Time-to-crash detector that exploits Green's
theorem.

Figure 4: A possible arrangement of TTC detectors in
the image plane that is not as efficient as a single TTC
detector with greater radius but has higher spatial
resolution.


elementary  detector  evaluates  the  tangential  flow  com¬ 
ponent  at  the  contour  of  the  receptive  field  (see  fig.5). 
In  this  case  a  detector  has  to  compute  the  component  of 
u  along  the  tangential  direction  at  the  contour. 

4  Experimental  results 

4.1  The  1D-2D  scheme 

We have extensively tested our approach on real image
sequences. Each sequence was acquired from a camera
mounted on a mobile platform moving at constant velocity.


Figure  5:  Motion  detector  that  exploits  Stokes’  theorem. 


In all experiments the movement of the vehicle was a
forward translation along a straight trajectory. We have
verified the results obtained from our 1D-2D approach
against the standard winner-take-all (2D-2D) scheme [2]
[3].

Figure 9 shows the first and last image of a sequence
composed of 100 frames. Each image of the sequence is
first convolved with a Gaussian filter having $\sigma = 0.5$. In
both algorithms we have used $v_{max} = 9$ and $\nu = 20$
pixels, where $v_{max}$ is the maximum expected velocity of
the points on the image plane and $\nu$ is the radius of the
patch used for the evaluation of $\phi$. In other words, the
correlation window used for the optical flow computation
is 41 x 41 pixels and the search space used is 19 x 19
by 2D-2D and 19 + 19 by 1D-2D. Figure 10 shows the
optical flows computed by the two methods using two
successive images of the sequence. The position of the
focus of expansion was computed by using the approach
described in [7].

We have used the methods described in [7] and [1] to
verify the TTC estimation. To compute the TTC at a
point by using the method in [7], we used an area of
81 x 81 pixels around that point. The points were 10
pixels apart. To compute TTC by using the method described
in [1], we used a lattice of overlapping motion detectors.
The distance between two points on the lattice
was 10 pixels. Each detector had a receptive field of radius
r = 40 pixels. In fig. 11, we compare the results obtained
by using the 2D-2D estimation of the optical flow with
the 1D-2D one, by using the two different methods in the
first stage of the TTC. Performing a linear best fit on the
TTC measurements, we obtain a slope of m = -1.036
by using the optical flows computed by 2D-2D and the
method described in [7], and m = -1.139 by using the
optical flows computed by 1D-2D and the method described
in [1]. Comparing the true TTC (straight line
in fig. 11) with the TTC measures obtained by using
the second method, we estimate an absolute error in the
Figure 6: 2D distance function. The arrows indicate the
position of the minimum.


mean of 2.63, with a standard deviation of 3.35 frame
units. In terms of relative units, the error in the mean is
5.7% with a standard deviation of 6.1%.
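
For a vehicle moving at constant velocity, the TTC expressed in frames should drop by one at each frame, so the ideal slope of the fit is -1. The following sketch shows how such a slope and the error statistics can be computed with an ordinary least-squares line fit. The measurements in the example are invented for illustration and are not the data of the experiment.

```python
import math


def fit_line(xs, ys):
    # Ordinary least-squares fit y = slope * x + intercept.
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def error_stats(measured, truth):
    # Mean and standard deviation of the absolute error.
    errors = [abs(m - t) for m, t in zip(measured, truth)]
    mean = sum(errors) / len(errors)
    var = sum((e - mean) ** 2 for e in errors) / len(errors)
    return mean, math.sqrt(var)


frames = list(range(10))
true_ttc = [100.0 - f for f in frames]  # hypothetical ground truth
measured_ttc = [t + d for t, d in zip(true_ttc, [2, -1, 3, 0, -2, 1, 4, -3, 0, 2])]

print(fit_line(frames, true_ttc))
print(fit_line(frames, measured_ttc))
print(error_stats(measured_ttc, true_ttc))
```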

4.2 The 1D-1D scheme

In this section we compare the results obtained by using
2D-2D and 1D-1D. In both techniques, we have used
$v_{max} = 9$ and $\nu = 20$ pixels. In other words, the
correlation window used for the optical flow estimation is
41 x 41 pixels for 2D-2D and 41 pixels for 1D-1D. In
the filtering step we have used $\sigma_y = 6$ and $\sigma_x = 3$ pixels
for computing the x-component of the optical flow.
These values of $\sigma$ produce a receptive field of an elementary
motion detector equal to that used by 2D-2D (1681
pixels). Fig. 6 shows a plot of the 2D correlation
function used in 2D-2D over a 2D search space and a
2D integration area. Figures 7 and 8 show plots of the
1D correlation functions used by the 1D-1D technique
to estimate both components of the optical flow. In this
case we used two 1D search spaces (in the x and y directions
respectively) and a 1D integration area. Notice that this
approach is capable of computing a reliable estimation
of the flow vectors, while reducing the complexity of the
problem.

Figures (13), (17), (21) show the first and the last image
of three sequences acquired from a camera mounted
on a mobile platform moving at constant velocity, along
a straight trajectory. Figures (14), (18), (22) show the
optical flows computed by the two methods, by using
two successive frames of the sequences. The mean
(continuous line) and the standard deviation (dashed line)
of the error on the optical flow estimation are shown in
figures (15), (19), (23). Figures (16), (20), (24) show
the TTC estimation by using the two different methods.
In each experiment, we have used only one TTC detector,
with a receptive field of r = 80 pixels, composed of 32
elementary motion detectors (see fig. (2)).

Figure 7: 1D distance function computed on the x direction
by using 1D-1D.

Figure 8: 1D distance function computed on the y direction
by using 1D-1D.

5  Conclusions 

5.1  Extensions  of  the  optical  flow  algorithm 

There  are  several  directions  in  which  we  plan  to  improve 
and  extend  our  scheme: 

• it may be possible to reduce further the number of
sample points for $D_p$ (i.e. the number of shifts) by
using techniques for learning from examples such
as the RBF technique ([8]) to approximate $D_p(\delta x)$
as $D_p(\delta x) = \sum_n c_n G(x - t_n)$, and then find the
minimum of $D_p$ in terms of the dynamical system
$dx/dt = -\epsilon \sum_n c_n G'(x - t_n)$. An alternative strategy
is to try to learn directly the function $\min D_p(x)$
from the samples of $D_p$, using a few examples of $D_p$
"typical" for the specific situation. The conjecture
is that the RBF technique may be able to learn the
mapping $\min D_p(x)$ from examples of functions of
the same class (compare Poggio and Vetter, 1992).
A similar idea is to try to learn how to sample the
correlation function as a function of past sampled
values. Again, the training examples would be functions
of the same class. This would provide at each
t an estimate of the most appropriate correlation
shifts to try.

• instead of simply measuring $D_{t,t-1}$, that is the
distance between frame t and frame t - 1, we could
measure in addition also $D_{t,t-2}$, $D_{t,t-3}$, ... and combine
them in an estimate of the optical flow component
by taking the average of $D_{t,t-1}/\Delta t$, $D_{t,t-2}/2\Delta t$,
$D_{t,t-3}/3\Delta t$, etc. This technique may be improved
further by using a Kalman filter.

•  the  same  basic  scheme  of  figure  1  may  be  used  to 
compute  horizontal  and  vertical  disparities  among 
the  two  frames  of  a  stereo  pair. 

•  confidence  measures  will  be  developed  to  further  im¬ 
prove  the  performance  of  the  technique. 
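
The multi-frame averaging proposed in the list above can be illustrated with a minimal sketch. It is not the implementation used in the experiments: the function name `average_flow` and the sample displacement values are hypothetical, and the sketch assumes that the displacement between frame t and frame t-k has already been measured for each k.

```python
def average_flow(displacements, dt=1.0):
    """Average the velocity estimates d_k / (k * dt).

    displacements[k - 1] is the displacement measured between
    frame t and frame t - k (hypothetical input format).
    """
    if not displacements:
        raise ValueError("need at least one displacement")
    rates = [d / (k * dt) for k, d in enumerate(displacements, start=1)]
    return sum(rates) / len(rates)


# Hypothetical displacements (in pixels) over 1, 2 and 3 frames
# for a point moving at roughly 2 pixels per frame.
estimate = average_flow([2.1, 3.9, 6.0])
```

Each longer baseline contributes one more estimate of the same velocity, so measurement noise on individual displacements is averaged down; a Kalman filter would replace the plain mean with a weighted, recursive update.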

5.2 Biological implications of our 1D technique

Poggio et al. ([1]) conjectured that "the specific type
of elementary motion detectors that are used to provide
the estimate of the normal component of the flow
is probably not critical. Radially oriented (for expansion
and contraction), two input elementary motion
detectors such as the correlation model [9; 10; 11;
12], or approximations of it, are likely to be adequate.
The critical property is that they should measure motion
with the correct sign." Our results, which confirm their
conjecture (since they suggest that 1D correlation, or
distance estimation, is sufficient for an adequate estimate
of qualitative properties of the optical flow), have
interesting implications for biology. Consider a 2D array
of Reichardt's detectors (for motion in the x direction)
with spacing $\Delta x$, and also detectors with spacings $2\Delta x$,
etc. Take the sum of all detectors with the same spacing
over a 2D patch. Perform a winner-take-all operation on
these sums. Select the set with optimal spacing as the
one corresponding to the present estimate of optical flow.
This scheme is analog in time but otherwise equivalent
to the one we have implemented. In formulae,

$$ D_k(t) = \sum_i \left( I_i(t) - I_{i+k}(t - \Delta t) \right)^2 $$

where $\Delta t$ is the interframe interval in our implementation
and is the delay in Reichardt's model, k represents
the shift in our computation of D and represents the
separation between the inputs to Reichardt's modules, $I_i(t)$
is the image value (in general spatially and temporally
filtered) at location i and time t, and the sum is taken
over the 2D patch of detectors of the same type.
(We have written here the quadratic version of Reichardt's
model; the same argument carries over to the standard model
with multiplication. For the basic equivalence of the
quadratic and multiplication versions see [11].)

Thus an array of Reichardt's models with different
spacings of the 2 inputs (in x) could be used in a plausible
way to estimate the optical flow component along
the direction of the two-input detectors. Notice that
a plausible implementation in terms of Reichardt's detectors
of the 2D correlation based algorithm would be
much harder, since it would effectively require detectors
with all possible 2D spacings. This seems implausible
and contrary to experimental evidence in the fly, where
only a small number of separations and directions (as
few as 3) seem present.

The above description is equivalent to our 1D-2D
scheme and involves the summation over x and y
"patches" of elementary 1D motion detectors. In the
fly this is plausible, given the known summation properties
of specific wide field lobula plate cells. Our 1D-1D
scheme, on the other hand, would require a summation
over the x dimension only (in our example) but an oriented
filtering of the image with receptive fields elongated
in y before the elementary motion detectors. It
is possible that this second scheme may be used in the
fly by different summation cells with smaller receptive
fields. It is also possible that the wide field lobula plate
cells effectively implement a scheme between the 1D-2D
and the 1D-1D by using some oriented filtering before
motion detection and limited y integration of the output
of the elementary motion detectors. Similar considerations
may apply to some of the motion selective cortical
cells.
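
The winner-take-all selection over detector spacings discussed in this section can be sketched in a few lines. The sketch is an illustration under simplifying assumptions, not the scheme as implemented: it works on a single 1D row of samples rather than a 2D patch, it uses discrete frames rather than an analog delay, it normalises each sum by the number of overlapping samples, and the names `distance` and `best_shift` as well as the sample data are hypothetical.

```python
def distance(curr, prev, shift):
    """Mean squared difference between curr[i] and prev[i + shift]
    over the samples where both indices are valid."""
    total = 0.0
    count = 0
    for i in range(len(curr)):
        j = i + shift
        if 0 <= j < len(prev):
            total += (curr[i] - prev[j]) ** 2
            count += 1
    return total / count


def best_shift(curr, prev, max_shift):
    """Winner-take-all: the shift whose summed detector output
    (here a quadratic distance) is smallest."""
    shifts = range(-max_shift, max_shift + 1)
    return min(shifts, key=lambda k: distance(curr, prev, k))


# Hypothetical 1D intensity rows: the pattern in `prev` appears
# two samples further to the right in `curr`.
prev = [0, 0, 1, 4, 9, 4, 1, 0, 0, 0]
curr = [0, 0, 0, 0, 1, 4, 9, 4, 1, 0]
shift = best_shift(curr, prev, max_shift=3)
```

With this sign convention the winning shift is the offset into the previous frame at which the current pattern is found, so the image displacement is its negative.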

5.3 The Time-to-Crash detector

The  TTC  detector  we  have  simulated  is  not  the  only 
possible  scheme.  Others  are  possible  (see  for  instance 
[6])  that  take  into  account  more  complex  motions  than 
just  frontal  approach  to  a  horizontal  surface. 

It is also conceivable that the scheme we suggest may
be simplified even further in certain situations. For instance,
it may be sufficient in the summation stage to use
the value of the correlation for a fixed (and reasonable)
shift instead of an estimate of the optical flow, that is,
the shift that maximizes correlation. This is equivalent
to using directly the output of Reichardt's correlation nets
instead of using the result of a winner-take-all operation
on a set of Reichardt's nets with different spacings (or
delays).

Another  related  idea  is  to  continuously  adjust  the  cor¬ 
relation  shifts  in  order  to  track  as  closely  as  possible  the 
maximum  of  the  correlation  (or  the  minimum  of  the  dis¬ 
tance):  in  this  way  it  may  be  possible  to  reduce  the  com¬ 
putation  of  the  correlation  to  just  a  few  shifts,  especially 
if  time-filtering  techniques  are  also  used. 
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Figure  10:  An  example  of  optical  flows  computed  by 
using (a) 2D-2D, (b) 1D-2D and (c) 1D-1D. In most
frame pairs in a sequence the three flows are much more
similar to each other.
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Figure  11:  TTC  measurements  comparison  by  using  2D- 
2D  and  1D-2D.  In  this  and  the  following  figures  the  ab¬ 
scissa  gives  the  time  in  terms  of  elapsed  frames;  the  or¬ 
dinate  gives  the  estimate  of  the  time  to  crash  in  frame 
units.

Figure 13: (a) First and (b) last image of the sequence.


Figure 14: An example of a flow field computed at one
point in time in the above sequence, obtained by using
(a) 2D-2D and (b) 1D-1D.


Figure 15: Mean (dotted line) and standard deviation
(dashed line) of the error of the optical flow estimation.


Figure 16: TTC estimation by using one TTC detector,
with receptive field of r = 80 and 32 elementary motion
detectors. The slope of the true TTC, computed by using
the optical flows obtained by 2D-2D, is m = -0.672.
The slope of the straight line, computed by using the
TTC measures obtained by 1D-1D, is m = -0.64. A
comparison of the TTC measures obtained by 1D-1D
with the true TTC yields a mean absolute error of 9.02,
with a standard deviation of 9.54. The relative error in
the mean is 10.79% with a standard deviation of 9.49%.
In order to evaluate the error in the time to crash estimation,
the following steps have been performed. The
true time to crash was estimated from a linear best fit of
the TTC measures obtained by using the 2D-2D scheme
for the optical flow estimation. The figures show the
straight line that represents the theoretical behavior of
the TTC. A linear best fit of the TTC measures obtained
by using the 1D-1D scheme for the optical flow estimation
was then performed in order to evaluate the slopes
of the two straight lines. The absolute and relative error
between the "true" TTC and the one measured by the
1D-1D scheme was then estimated. Let us call $\tau^*$ the
true TTC and $\tau$ the measured one. The absolute error is
$E_a = |\tau^* - \tau|$ and the relative error is
$E_r = |\tau^* - \tau| / |\tau|$.
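
The error evaluation described in this caption (a linear best fit to obtain the reference TTC, followed by absolute and relative errors) can be sketched as follows. This is only an illustration of the procedure: the helper names `linear_fit` and `ttc_errors` are hypothetical, and the TTC values are invented for the example and are not data from the experiments.

```python
def linear_fit(xs, ys):
    """Least-squares straight line y = m * x + b; returns (m, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    m = sxy / sxx
    return m, mean_y - m * mean_x


def ttc_errors(frames, ttc_reference, ttc_measured):
    """Slope of the reference line plus the mean absolute and mean
    relative error of the measured TTC against that line."""
    m, b = linear_fit(frames, ttc_reference)
    abs_errors, rel_errors = [], []
    for f, tau in zip(frames, ttc_measured):
        tau_star = m * f + b                 # "true" TTC from the fit
        abs_errors.append(abs(tau_star - tau))
        # Denominator |tau| follows the definition given in the text.
        rel_errors.append(abs(tau_star - tau) / abs(tau))
    n = len(frames)
    return m, sum(abs_errors) / n, sum(rel_errors) / n


# Invented example values (frame units), not measurements.
frames = [0, 1, 2, 3, 4]
reference = [100.0, 99.0, 98.0, 97.0, 96.0]
measured = [110.0, 90.0, 98.0, 100.0, 96.0]
slope, mean_abs, mean_rel = ttc_errors(frames, reference, measured)
```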


Figure 17: (a) First and (b) last image of the sequence.


Figure 19: Mean (dotted line) and standard deviation
(dashed line) of the error relative to optical flow estimation.


Figure 20: TTC estimation by using one TTC detector,
with receptive field of r = 80 and 32 elementary motion
detectors. The slope of the true TTC, computed by using
the optical flows obtained by 2D-2D, is m = -0.77.
The slope of the straight line, computed by using the
TTC measures obtained by 1D-1D, is m = -0.83. Comparing
the TTC measures obtained by 1D-1D with the
true TTC, we had a mean absolute error of 8.02, with a
standard deviation of 8.97. With respect to the relative
error we had a mean of 10.9% and a standard deviation
of 9.72%.


Figure 21: (a) First and (b) last image of the sequence.


Figure 22: Flow field obtained by using (a) 2D-2D and
(b) 1D-1D.


Figure 23: Mean (dotted line) and standard deviation
(dashed line) of the error relative to optical flow estimation.


Figure 24: TTC estimation by using one TTC detector,
with receptive field of r = 80 and 32 elementary motion
detectors. The slope of the true TTC, computed by using
the optical flows obtained by 2D-2D, is m = -1.24.
The slope of the straight line, computed by using the
TTC measures obtained by 1D-1D, is m = -1.14. Comparing
the TTC measures obtained by 1D-1D with the
true TTC, we had a mean absolute error of 7.6, with a
standard deviation of 7.9. With respect to the relative
error we had a mean of 11.4% and a standard deviation
of 10.3%.
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Abstract 

In this paper, we present a method for recovering
both the shape of an object and its motion relative
to the camera, from a sequence of images of the
object, using feature points tracked throughout the
sequence. Our method uses a projection model
known as paraperspective projection, which
closely approximates perspective projection by
modelling two effects not modelled under
orthographic projection: the apparent change in
size of an object as it moves along the camera's
optical axis, and the different angle from which an
object is viewed as it moves parallel to the image
plane. Our paraperspective factorization method
can be applied to a wide range of motion scenarios,
and can recover the distance from the camera
to the object in each image. The method assumes
no model of the motion or of the object's shape,
and recovers the shape and motion accurately even
for distant objects.¹

1.  Introduction 

Recovering the geometry of a scene and the motion
of the camera from a stream of images is an important
task in a variety of applications, including navigation,
robotic manipulation, and aerial
cartography. While this is possible in principle,
traditional methods have failed to produce reliable
results in many situations [Broida et al., 1990].

Tomasi and Kanade [1991a, 1991b] developed a
robust and efficient method for accurately recovering
the shape and motion of an object from a
sequence of images using extracted feature points.
Their method uses an orthographic projection
model, which is described by linear equations. It
achieves its accuracy and robustness by using a
large number of images and feature points, and by
directly computing shape without computing the
depth as an intermediate step. The method was
tested on a variety of real and synthetic images,
and was shown to perform well even for distant
objects.


1. This research was partially supported by the Avionics Laboratory,
Wright Research and Development Center, Aeronautical
Systems Division (AFSC), U.S. Air Force,
Wright-Patterson AFB, Ohio 45433-6543 under Contract F33615-90-
C-1465, ARPA Order No. 7597.

There are, however, some limitations of the method
due to its use of the orthographic projection model.
The model contains no notion at all of the distance
from the camera to the object. As a result, image
sequences containing large translations toward or
away from the camera often produce deformed
object shapes, as the method tries to explain the
size differences in the images by creating size differences
in the object. It also supplies no estimation
of translation along the camera's optical axis,
which limits its usefulness for certain tasks.

Fortunately, there exist several perspective approximations
which capture more of the effects of perspective
projection while remaining linear. Scaled
orthographic projection, sometimes referred to as
"weak perspective" [Mundy and Zisserman, 1992],
accounts for the scaling effect of an object as it
moves towards and away from the camera. Paraperspective
projection, first introduced by Ohta
[1981] and named by Aloimonos [1990], accounts
for the scaling effect as well as the different angle
from which an object is viewed as it moves in a
direction parallel to the image plane.

In this paper, we present a new factorization
method based on the paraperspective projection
model. The paraperspective factorization method
takes a set of points extracted from the images and
tracked from one image to the next, and computes
the shape of the object and the motion of the camera.
The method is still fast and robust with respect
to noise. However, it can be applied to a wider
realm of situations than the original factorization
method, such as sequences containing significant
depth translation or containing objects close to the
camera, and can be used in applications where it is
important to recover the distance to the object in
each image, such as navigation.

We begin by describing our camera and world reference
frames and introduce the mathematical
notation that we use. We review the original factorization
method as defined in [Tomasi and Kanade,
1991b], presenting it in a slightly different manner
in order to make its relation to the paraperspective
method more apparent. We then present our paraperspective
factorization method. We conclude with
the results of some experiments which demonstrate
the practicality of our system.

2.  Problem  Description 

In a shape-from-motion problem, we are given a
sequence of F images taken from a camera that is
moving relative to an object. We locate P prominent
feature points in the first image, and track
these points from each image to the next, recording
the coordinates $(x_{fp}, y_{fp})$ of each point p in each
image f. Each feature point p that we track corresponds
to a single world point, located at position
$\mathbf{s}_p$ in some fixed world coordinate system. Each
image f was taken at some camera orientation,
which we describe by the orthonormal unit vectors
$\mathbf{i}_f$, $\mathbf{j}_f$, and $\mathbf{k}_f$, where $\mathbf{k}_f$ points along the camera's
line of sight, $\mathbf{i}_f$ corresponds to the camera image
plane's x-axis, and $\mathbf{j}_f$ corresponds to the camera
image's y-axis. We describe the position of the
camera in each frame f by the vector $\mathbf{t}_f$ pointing to
the camera's focal point. This formulation is illustrated
in Figure 1.


Figure  1.  Coordinate  system 


The result of the feature tracker is a set of P feature
point coordinates $(x_{fp}, y_{fp})$ for each of the F frames
of the image sequence. From this information, our
goal is to recover the estimated shape of the object,
given by the position $\hat{\mathbf{s}}_p$ of every point, and the
estimated motion of the camera, given by $\hat{\mathbf{i}}_f$, $\hat{\mathbf{j}}_f$,
and $\hat{\mathbf{k}}_f$ for each frame in the sequence. Rather than
recover $\hat{\mathbf{t}}_f$ in world coordinates, we generally
recover the three separate components $\hat{\mathbf{t}}_f \cdot \hat{\mathbf{i}}_f$, $\hat{\mathbf{t}}_f \cdot \hat{\mathbf{j}}_f$,
and $\hat{\mathbf{t}}_f \cdot \hat{\mathbf{k}}_f$.

Implicit in this formulation is the requirement that
every feature point be visible in every frame. Handling
occlusion for the orthographic factorization
method has been covered in [Tomasi and Kanade,
1991b], and handling occlusion for the paraperspective
method will be covered in a future paper.

3.  The  Orthographic  Factorization  Method 

This  section  presents  a  summary  of  the  orthogra¬ 
phic  factorization  method.  A  more  detailed 
description of the method can be found in [Tomasi
and Kanade, 1991a, 1991b].

3.1.  Orthographic  Projection 

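The orthographic projection model assumes that all
rays are projected from the object point parallel to
the camera's optical axis so that they strike the
image plane orthogonally, as illustrated in Figure 2.


Figure 2. Orthographic projection in two dimensions

Dotted lines indicate true perspective projection

Under orthographic projection, a point p whose
location is $\mathbf{s}_p$ will be observed in frame f at image
coordinates $(x_{fp}, y_{fp})$, where

$$ x_{fp} = \mathbf{i}_f \cdot (\mathbf{s}_p - \mathbf{t}_f) \qquad y_{fp} = \mathbf{j}_f \cdot (\mathbf{s}_p - \mathbf{t}_f) \qquad (1) $$

We can rewrite these equations as

$$ x_{fp} = \mathbf{m}_f \cdot \mathbf{s}_p + cx_f \qquad y_{fp} = \mathbf{n}_f \cdot \mathbf{s}_p + cy_f \qquad (2) $$

where

$$ cx_f = -(\mathbf{t}_f \cdot \mathbf{i}_f) \qquad cy_f = -(\mathbf{t}_f \cdot \mathbf{j}_f) \qquad (3) $$

$$ \mathbf{m}_f = \mathbf{i}_f \qquad \mathbf{n}_f = \mathbf{j}_f \qquad (4) $$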
3.2.  Decomposition 

We organize all of the feature point coordinates
$(x_{fp}, y_{fp})$ into a $2F \times P$ measurement matrix $W$.
Each column of the measurement matrix contains
all the observations for a single point, while each
row contains all the observed x-coordinates or y-coordinates
for a single frame. We combine equation (2)
for all points and frames into the matrix
equation

$$ W = MS + T\,[1 \; \cdots \; 1], \qquad (6) $$

where $M$ is the $2F \times 3$ motion matrix, $S$ is the $3 \times P$
shape matrix, and $T$ is a $2F \times 1$ translation vector.

Up to this point we have not put any restrictions on
the location of the world origin, except that it be
stationary with respect to the object. For simplicity,
we set the world origin at the center-of-mass of the
object, denoted by $\mathbf{c}$, so that

$$ \mathbf{c} = \frac{1}{P} \sum_{p=1}^{P} \mathbf{s}_p = \mathbf{0}. \qquad (7) $$

This enables us to compute the i-th element of the
translation vector $T$ directly from $W$, simply as the
average of the i-th row of the measurement matrix.
We then subtract the translation from $W$, leaving us
with a "registered" measurement matrix $W^*$.
Because $W^*$ is the product of a $2F \times 3$ motion
matrix $M$ and the $3 \times P$ shape matrix $S$, its rank is
at most 3. We use singular value decomposition to
factor $W^*$ into

$$ W^* = \hat{M}\hat{S}. \qquad (8) $$
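
The registration and rank-3 factorization steps of this section can be sketched with NumPy. This is an illustrative sketch on synthetic, noise-free orthographic data, not the authors' code: the sizes, the random seed, and the choice to split the singular values evenly between the two factors are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
F, P = 5, 8                                   # frames, points (arbitrary)

# Synthetic shape with the world origin at the centre of mass.
S_true = rng.normal(size=(3, P))
S_true -= S_true.mean(axis=1, keepdims=True)

rows_i, rows_j, tx, ty = [], [], [], []
for f in range(F):
    Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))  # orthonormal camera axes
    i_f, j_f = Q[0], Q[1]
    t_f = rng.normal(size=3)
    rows_i.append(i_f)
    rows_j.append(j_f)
    tx.append(-t_f @ i_f)                     # cx_f of the orthographic model
    ty.append(-t_f @ j_f)                     # cy_f of the orthographic model

M_true = np.vstack(rows_i + rows_j)           # 2F x 3 motion matrix
T_true = np.array(tx + ty)                    # 2F translation vector
W = M_true @ S_true + T_true[:, None]         # W = M S + T [1 ... 1]

# Registration: each entry of T is the average of the matching row of W.
T = W.mean(axis=1)
W_star = W - T[:, None]

# Rank-3 factorization of the registered matrix by SVD.
U, sing, Vt = np.linalg.svd(W_star, full_matrices=False)
root = np.sqrt(sing[:3])
M_hat = U[:, :3] * root                       # 2F x 3
S_hat = root[:, None] * Vt[:3]                # 3 x P
```

The factors `M_hat` and `S_hat` reproduce the registered matrix but are determined only up to an invertible 3 x 3 matrix, which is the ambiguity that the normalization step removes.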

3.3.  Normalization 

The decomposition of equation (8) is not unique. In
fact, any $3 \times 3$ non-singular matrix $A$ and its inverse
could be inserted between $\hat{M}$ and $\hat{S}$, and their product
would still equal $W^*$. Thus the actual motion
and shape are given by

$$ M = \hat{M}A \qquad S = A^{-1}\hat{S} \qquad (9) $$

when the $3 \times 3$ invertible matrix $A$ is selected
appropriately. We observe that the motion matrix
$M$ must be of a certain form. Because $\mathbf{i}_f$ and $\mathbf{j}_f$ are
unit vectors, we derive from equation (4) that

$$ |\mathbf{m}_f| = 1 \qquad |\mathbf{n}_f| = 1, \qquad (10) $$

and because they are orthogonal,

$$ \mathbf{m}_f \cdot \mathbf{n}_f = 0. \qquad (11) $$

Equations (10) and (11) give us 3F equations
which we call the metric constraints. Using these
constraints, we solve for the $3 \times 3$ matrix $A$ which,
when multiplied by $\hat{M}$, produces the motion matrix
$M$ that best satisfies these constraints. Once the
matrix $A$ has been found, the shape and motion are
computed from equation (9).
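
One way to solve the orthographic metric constraints is sketched below. It is an assumption-laden illustration rather than the method as published: rows of the motion matrix are treated as row vectors, so the constraints are linear in the six unique entries of the symmetric matrix A Aᵀ; these are found by linear least squares, and A is then taken from an eigendecomposition. The helper names `quad_row` and `solve_metric` and the synthetic test data are hypothetical.

```python
import numpy as np


def quad_row(a, b):
    """Coefficients of a^T Q b in the 6 unique entries of symmetric Q,
    ordered (Q00, Q01, Q02, Q11, Q12, Q22)."""
    return np.array([
        a[0] * b[0],
        a[0] * b[1] + a[1] * b[0],
        a[0] * b[2] + a[2] * b[0],
        a[1] * b[1],
        a[1] * b[2] + a[2] * b[1],
        a[2] * b[2],
    ])


def solve_metric(M_hat):
    """Return A such that M_hat @ A best satisfies |m_f| = |n_f| = 1
    and m_f . n_f = 0 in the least-squares sense."""
    F = M_hat.shape[0] // 2
    rows, rhs = [], []
    for f in range(F):
        m, n = M_hat[f], M_hat[F + f]
        rows += [quad_row(m, m), quad_row(n, n), quad_row(m, n)]
        rhs += [1.0, 1.0, 0.0]
    q, *_ = np.linalg.lstsq(np.array(rows), np.array(rhs), rcond=None)
    Q = np.array([[q[0], q[1], q[2]],
                  [q[1], q[3], q[4]],
                  [q[2], q[4], q[5]]])
    eigvals, L = np.linalg.eigh(Q)
    if np.any(eigvals <= 0):
        raise ValueError("Q is not positive definite")
    return L @ np.diag(np.sqrt(eigvals))      # A with A @ A.T == Q


# Synthetic check: distort a valid motion matrix by an unknown matrix.
rng = np.random.default_rng(1)
F = 6
pairs = [np.linalg.qr(rng.normal(size=(3, 3)))[0][:2] for _ in range(F)]
M_true = np.vstack([p[0] for p in pairs] + [p[1] for p in pairs])
A0 = rng.normal(size=(3, 3)) + 3.0 * np.eye(3)
M_hat = M_true @ np.linalg.inv(A0)
M = M_hat @ solve_metric(M_hat)
```

The recovered `M` agrees with the true motion only up to a rotation of the world frame, which is why the check below tests the constraints rather than equality with `M_true`.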

4.  Paraperspective  Factorization  Method 

4.1.  Paraperspective  Projection 

In this paper, we use an approximation to perspective
projection known as paraperspective projection,
which was introduced by Ohta in order to
solve a shape from texture problem. Paraperspective
projection closely approximates perspective
projection by modelling both the scaling effect and
the position effect (so called because the amount of
apparent rotation depends on the object's position
in the image relative to the center of projection
[Aloimonos, 1990]), while retaining the linear
properties of orthographic projection. The paraperspective
projection of an object onto an image,
illustrated in Figure 3, involves two steps.

1. The points of the object are projected along the
direction of the line connecting the focal point of
the camera to the object's center-of-mass, onto a
plane parallel to the image plane and passing
through the object's center-of-mass.

2. These points are then projected onto the image
plane using perspective projection. Because the
points are all on a plane parallel to the image
plane, this is equivalent to simply scaling the
image by the ratio of the camera focal length and
the distance between the two planes.¹

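In general, the projection of a point $\mathbf{p}$ along direction
$\mathbf{r}$, onto the plane with normal $\mathbf{n}$ and distance
from the origin $d$, is given by the equation
$\mathbf{p}' = \mathbf{p} - \frac{\mathbf{p} \cdot \mathbf{n} - d}{\mathbf{r} \cdot \mathbf{n}}\,\mathbf{r}$.
We project the object point $\mathbf{s}_p$
along the direction $\mathbf{c} - \mathbf{t}_f$, which is the direction
from the camera's focal point to the object's center-of-mass,
onto the plane defined by normal $\mathbf{k}_f$ and
distance from the origin $\mathbf{c} \cdot \mathbf{k}_f$, giving

$$ \mathbf{s}'_{fp} = \mathbf{s}_p - \frac{(\mathbf{s}_p \cdot \mathbf{k}_f) - (\mathbf{c} \cdot \mathbf{k}_f)}{(\mathbf{c} - \mathbf{t}_f) \cdot \mathbf{k}_f}\,(\mathbf{c} - \mathbf{t}_f) \qquad (12) $$


Figure 3. Paraperspective projection in two dimensions

Dotted lines indicate true perspective projection;
arrow marks indicate parallel lines.


The perspective projection of these points onto the
image plane is given by subtracting $\mathbf{t}_f$ from $\mathbf{s}'_{fp}$ to
give the position of the point in the camera's coordinate
system, and then scaling the result by the
ratio of the camera's focal length $l$ to $z_f$, the depth to
the object's center-of-mass, where $z_f = (\mathbf{c} - \mathbf{t}_f) \cdot \mathbf{k}_f$.
This yields the coordinates of the projection in the
image plane,

$$ x_{fp} = \frac{l}{z_f}\,\mathbf{i}_f \cdot (\mathbf{s}'_{fp} - \mathbf{t}_f) \qquad y_{fp} = \frac{l}{z_f}\,\mathbf{j}_f \cdot (\mathbf{s}'_{fp} - \mathbf{t}_f) \qquad (13) $$

Substituting (12) into (13) and simplifying yields
the general paraperspective equations for $x_{fp}$ and
$y_{fp}$:

$$ x_{fp} = \frac{l}{z_f}\left\{\left[\mathbf{i}_f - \frac{\mathbf{i}_f \cdot (\mathbf{c} - \mathbf{t}_f)}{z_f}\,\mathbf{k}_f\right] \cdot (\mathbf{s}_p - \mathbf{c}) + (\mathbf{c} - \mathbf{t}_f) \cdot \mathbf{i}_f\right\} $$

$$ y_{fp} = \frac{l}{z_f}\left\{\left[\mathbf{j}_f - \frac{\mathbf{j}_f \cdot (\mathbf{c} - \mathbf{t}_f)}{z_f}\,\mathbf{k}_f\right] \cdot (\mathbf{s}_p - \mathbf{c}) + (\mathbf{c} - \mathbf{t}_f) \cdot \mathbf{j}_f\right\} \qquad (14) $$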


1. The scaled orthographic projection model (also known as
"weak perspective") is similar to paraperspective projection,
except that the direction of the initial projection is parallel to
the camera's optical axis rather than parallel to the line connecting
the object's center-of-mass to the camera's focal point.
This model captures the scaling effect of perspective projection,
but not the position effect. See [Poelman and Kanade,
1992] for a scaled orthographic factorization method.


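For simplicity, we assume unit focal length, $l = 1$.

We have not up to this point put any requirements
on our world coordinate system except that it be
stationary with respect to the object. We simplify
our equations by placing the world origin at the
object's center-of-mass, or setting $\mathbf{c} = \mathbf{0}$, reducing
(14) to

$$ x_{fp} = \frac{1}{z_f}\left\{\left[\mathbf{i}_f + \frac{\mathbf{i}_f \cdot \mathbf{t}_f}{z_f}\,\mathbf{k}_f\right] \cdot \mathbf{s}_p - (\mathbf{t}_f \cdot \mathbf{i}_f)\right\} $$

$$ y_{fp} = \frac{1}{z_f}\left\{\left[\mathbf{j}_f + \frac{\mathbf{j}_f \cdot \mathbf{t}_f}{z_f}\,\mathbf{k}_f\right] \cdot \mathbf{s}_p - (\mathbf{t}_f \cdot \mathbf{j}_f)\right\} \qquad (15) $$

These equations can be rewritten as

$$ x_{fp} = \mathbf{m}_f \cdot \mathbf{s}_p + cx_f \qquad y_{fp} = \mathbf{n}_f \cdot \mathbf{s}_p + cy_f \qquad (16) $$

where

$$ z_f = -\mathbf{t}_f \cdot \mathbf{k}_f \qquad (17) $$

$$ cx_f = -\frac{\mathbf{t}_f \cdot \mathbf{i}_f}{z_f} \qquad cy_f = -\frac{\mathbf{t}_f \cdot \mathbf{j}_f}{z_f} \qquad (18) $$

$$ \mathbf{m}_f = \frac{\mathbf{i}_f - cx_f\,\mathbf{k}_f}{z_f} \qquad \mathbf{n}_f = \frac{\mathbf{j}_f - cy_f\,\mathbf{k}_f}{z_f} \qquad (19) $$

Notice that equation (16) is identical to its counterpart
for orthographic projection, equation (2),
although the corresponding definitions of $cx_f$, $cy_f$,
$\mathbf{m}_f$, and $\mathbf{n}_f$ differ. This similarity enables us to perform
the basic decomposition of the matrix in
exactly the same manner as we did for orthographic
projection.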
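
The equivalence between the two-step geometric construction and the linear form of equation (16) can be checked numerically. The sketch below is an illustration under the same simplifications as the text (unit focal length and world origin at the center-of-mass); the helper names and the numeric camera and point values are hypothetical.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def sub(a, b):
    return [x - y for x, y in zip(a, b)]


def scale(a, s):
    return [x * s for x in a]


def two_step_projection(s, t, i, j, k):
    """Project s along the line from the focal point t to the origin
    (the centre of mass) onto the plane through the origin parallel to
    the image plane, then scale by 1 / z (unit focal length)."""
    z = -dot(t, k)                              # depth of the centre of mass
    direction = scale(t, -1.0)                  # c - t with c = 0
    s_plane = sub(s, scale(direction, dot(s, k) / dot(direction, k)))
    rel = sub(s_plane, t)                       # camera-centred position
    return dot(rel, i) / z, dot(rel, j) / z


def linear_form(s, t, i, j, k):
    """The same projection written as x = m . s + cx, y = n . s + cy."""
    z = -dot(t, k)
    cx, cy = -dot(t, i) / z, -dot(t, j) / z
    m = scale(sub(i, scale(k, cx)), 1.0 / z)
    n = scale(sub(j, scale(k, cy)), 1.0 / z)
    return dot(m, s) + cx, dot(n, s) + cy


# Hypothetical camera looking along +z from 10 units behind the object.
i_f, j_f, k_f = [1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]
t_f = [0.5, -0.2, -10.0]
s_p = [0.3, 0.4, 0.2]
geometric = two_step_projection(s_p, t_f, i_f, j_f, k_f)
linear = linear_form(s_p, t_f, i_f, j_f, k_f)
```

Because the second form is linear in the point position, stacking it over all points and frames yields a matrix product plus a translation term, as in the orthographic case.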


4.2. Decomposition


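We can combine equation (16), for all points p
from 1 to P, and all frames f from 1 to F, into the
single matrix equation

$$ \begin{bmatrix} x_{11} & \cdots & x_{1P} \\ \vdots & & \vdots \\ x_{F1} & \cdots & x_{FP} \\ y_{11} & \cdots & y_{1P} \\ \vdots & & \vdots \\ y_{F1} & \cdots & y_{FP} \end{bmatrix} = \begin{bmatrix} \mathbf{m}_1 \\ \vdots \\ \mathbf{m}_F \\ \mathbf{n}_1 \\ \vdots \\ \mathbf{n}_F \end{bmatrix} \begin{bmatrix} \mathbf{s}_1 & \cdots & \mathbf{s}_P \end{bmatrix} + \begin{bmatrix} cx_1 \\ \vdots \\ cx_F \\ cy_1 \\ \vdots \\ cy_F \end{bmatrix} \begin{bmatrix} 1 & \cdots & 1 \end{bmatrix}, \qquad (20) $$

or in short

$$ W = MS + T\,[1 \; \cdots \; 1], \qquad (21) $$

where $W$ is the $2F \times P$ measurement matrix, $M$ is
the $2F \times 3$ motion matrix, $S$ is the $3 \times P$ shape
matrix, and $T$ is the $2F \times 1$ translation vector. We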
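have set $\mathbf{c} = \mathbf{0}$, so by definition

$$ \mathbf{c} = \frac{1}{P} \sum_{p=1}^{P} \mathbf{s}_p = \mathbf{0}. \qquad (22) $$

Using this and equation (16) we can write

$$ \sum_{p=1}^{P} x_{fp} = \sum_{p=1}^{P} (\mathbf{m}_f \cdot \mathbf{s}_p + cx_f) = \mathbf{m}_f \cdot \sum_{p=1}^{P} \mathbf{s}_p + P\,cx_f = P\,cx_f $$

$$ \sum_{p=1}^{P} y_{fp} = \sum_{p=1}^{P} (\mathbf{n}_f \cdot \mathbf{s}_p + cy_f) = \mathbf{n}_f \cdot \sum_{p=1}^{P} \mathbf{s}_p + P\,cy_f = P\,cy_f \qquad (23) $$

Therefore we can compute $cx_f$ and $cy_f$ immediately
from the image data as

$$ cx_f = \frac{1}{P} \sum_{p=1}^{P} x_{fp} \qquad cy_f = \frac{1}{P} \sum_{p=1}^{P} y_{fp}. \qquad (24) $$

We subtract these values from the corresponding
rows in $W$, giving the registered measurement
matrix

$$ W^* = \begin{bmatrix} x_{11} - cx_1 & \cdots & x_{1P} - cx_1 \\ \vdots & & \vdots \\ x_{F1} - cx_F & \cdots & x_{FP} - cx_F \\ y_{11} - cy_1 & \cdots & y_{1P} - cy_1 \\ \vdots & & \vdots \\ y_{F1} - cy_F & \cdots & y_{FP} - cy_F \end{bmatrix} = \begin{bmatrix} \mathbf{m}_1 \\ \vdots \\ \mathbf{m}_F \\ \mathbf{n}_1 \\ \vdots \\ \mathbf{n}_F \end{bmatrix} \begin{bmatrix} \mathbf{s}_1 & \cdots & \mathbf{s}_P \end{bmatrix} \qquad (25) $$

Since $W^*$ is the product of two matrices each of
rank at most 3, $W^*$ has rank at most 3, just as it did
in the orthographic projection case. If there is noise
present, the rank of $W^*$ will not be exactly 3, but by
computing the singular value decomposition
(SVD) of $W^*$ and only retaining the largest 3 singular
values, we can factor $W^*$ into

$$ W^* = \hat{M}\hat{S}, \qquad (26) $$

where $\hat{M}$ is a $2F \times 3$ matrix and $\hat{S}$ is a $3 \times P$ matrix.
Using the SVD to perform this factorization guarantees
that the product $\hat{M}\hat{S}$ is the best possible rank
3 approximation to $W^*$, in the sense that it minimizes
the sum of squares difference between corresponding
elements of $W^*$ and $\hat{M}\hat{S}$.


4.3. Normalization


Just as in the orthographic case, the decomposition
of $W^*$ into the product of $\hat{M}$ and $\hat{S}$ is not unique. We
need to find the matrix $A$ that gives the true shape
and motion

$$ M = \hat{M}A \qquad S = A^{-1}\hat{S} \qquad (27) $$

Again, we determine this matrix $A$ by observing
that the motion matrix $M$ must be of a certain form.
We take advantage of the fact that $\mathbf{i}_f$ and $\mathbf{j}_f$ are unit
vectors and are orthogonal. According to equation
(19), we observe that

$$ |\mathbf{m}_f|^2 = \frac{1 + cx_f^2}{z_f^2} \qquad |\mathbf{n}_f|^2 = \frac{1 + cy_f^2}{z_f^2} \qquad (28) $$

We know the values of $cx_f$ and $cy_f$ from our initial
registration step, but because we do not know the
value of the depth $z_f$, we cannot impose individual
constraints on the magnitudes of $\mathbf{m}_f$ and $\mathbf{n}_f$ as we
did in the orthographic factorization method. However,
from equation (28) we see that

$$ \frac{|\mathbf{m}_f|^2}{1 + cx_f^2} = \frac{|\mathbf{n}_f|^2}{1 + cy_f^2} = \frac{1}{z_f^2} \qquad (29) $$

Therefore we adopt the following constraint on the
magnitudes of $\mathbf{m}_f$ and $\mathbf{n}_f$:

$$ \frac{|\mathbf{m}_f|^2}{1 + cx_f^2} - \frac{|\mathbf{n}_f|^2}{1 + cy_f^2} = 0 \qquad (30) $$

In the case of orthographic projection, one constraint
on $\mathbf{m}_f$ and $\mathbf{n}_f$ was that they each have unit
magnitude, as required by equation (10). In the
above paraperspective version of those constraints,
we simply require that their magnitudes be in a certain
ratio.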


There is also a constraint on the angle relationship
of $\mathbf{m}_f$ and $\mathbf{n}_f$. From the definition of $\mathbf{m}_f$ and $\mathbf{n}_f$, we
have

$$ \mathbf{m}_f \cdot \mathbf{n}_f = \frac{cx_f\,cy_f}{z_f^2} \qquad (31) $$

The problem with this constraint is that, again, $z_f$ is
unknown. We could choose to use either value
from equation (29) for $1/z_f^2$, since theoretically
they should be equal, but we use the average of the
two quantities. We choose the arithmetic mean
over the geometric mean or some other measure in
order to keep the constraints linear in $Q = A^T A$.
Thus our second constraint becomes

$$ \mathbf{m}_f \cdot \mathbf{n}_f - \frac{cx_f\,cy_f\,|\mathbf{m}_f|^2}{2(1 + cx_f^2)} - \frac{cx_f\,cy_f\,|\mathbf{n}_f|^2}{2(1 + cy_f^2)} = 0 \qquad (32) $$

This is the paraperspective version of the orthographic
constraint given by equation (11), which
required that the dot product of $\mathbf{m}_f$ and $\mathbf{n}_f$ be zero.


Equations (30) and (32) are homogeneous constraints,
which could be trivially satisfied by the
solution $M = 0$. To avoid this solution, we impose
the additional constraint that

$$ |\mathbf{m}_1| = 1. \qquad (33) $$

This does not affect the final solution except by a
scaling factor.

Equations (30), (32), and (33) are the paraperspective
version of the metric constraints, and we compute
the $3 \times 3$ matrix $A$ such that $M = \hat{M}A$ best
satisfies the metric constraints in the least sum-of-squares
error sense. This is a simple problem
because we have been careful to ensure that they
are linear constraints in the 6 unique elements of
the symmetric $3 \times 3$ matrix $Q = A^T A$. We use the
metric constraints to compute $Q$, compute its
Jacobi Transformation $Q = L \Lambda L^T$, where $\Lambda$ is the
diagonal eigenvalue matrix, and as long as $Q$ is
positive definite, $A = (L \Lambda^{1/2})^T$.

4.4.  Shape  and  Motion  Recovery 

Once  the  matrix  a  has  been  determined,  we  com¬ 
pute  the  shape  matrix  s  and  the  motion  matrix  M 
using  equation  (27).  For  each  frame  /,  however, 
there  is  a  more  complex  relationship  between  the 
actual  translation  and  rotation  vectors  and  the  m/ 
and  Tif  vectors,  which  are  the  rows  of  the  matrix  M. 
From  equation  (19)  we  can  see  that 

if  =  zyin,+  cxjkf  3/  =  */•>/+  cy/fe/-  (34) 

From  this  and  the  knowledge  that  if,  j/,  and  must 
be  onhonormal,  we  determine  that 

l,x3/=  (t/atf+cx^fi  X  (zfSf+cyjkf)  =  if 

liJ  =  I  V”/+ "/**/!  =  i 

IjJ  =  !*/“/+ 1 

Again,  we  do  not  know  a  value  for  z^,  but  using  the 
relations  specified  in  equation  (29)  and  the  addi¬ 
tional  knowledge  that  |k/|  =  i,  equation  (35)  can 
be  reduced  to 

Gjkf  =  Hf,  (36) 

where 


■(m/xn/)' 

-  1  - 

G,= 

-  "/  - 

«/  = 

-eXf 

.-cyf 

Wf  =  if  =  (38) 

We  compute  if  simply  as 

k,  =  G/Hf  (39) 

and  then  compute 


if  =  i/x  if  \f  =  fe/x  m,  (40) 

There is no guarantee that the $i_f$ and $j_f$ given by
this equation will be orthonormal, because $m_f$ and
$n_f$ may not have exactly satisfied the metric constraints.
Therefore we actually compute the
orthonormal $i_f$ and $j_f$ which are closest to the values
given by equation (40). Due to the arbitrary
world coordinate orientation, to obtain a unique
solution we then rotate the computed shape and
motion to align the world axes with the first
frame’s camera axes, so that $i_1 = [1\ 0\ 0]^T$ and
$j_1 = [0\ 1\ 0]^T$.
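
The recovery of the camera axes for one frame can be sketched as follows. This is an illustration under stated assumptions, not the authors' C implementation: the function names are hypothetical, the inputs `mt` and `nt` are assumed to be the already rescaled vectors $z_f m_f$ and $z_f n_f$ (so that, by equation (34), $i_f = $ `mt` $+ cx_f k_f$), and the final step of finding the closest orthonormal pair is omitted.

```python
def cross(a, b):
    """Cross product of two 3-vectors given as lists."""
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]


def dot(a, b):
    """Dot product of two 3-vectors."""
    return sum(x * y for x, y in zip(a, b))


def det3(m):
    """Determinant of a 3x3 matrix (rows as lists) via the triple product."""
    return dot(m[0], cross(m[1], m[2]))


def solve3(G, H):
    """Solve the 3x3 linear system G k = H by Cramer's rule."""
    d = det3(G)
    k = []
    for col in range(3):
        Gc = [row[:] for row in G]
        for r in range(3):
            Gc[r][col] = H[r]      # replace one column by the right-hand side
        k.append(det3(Gc) / d)
    return k


def camera_axes(mt, nt, cx, cy):
    """Recover (i, j, k) for one frame.

    mt, nt : assumed to be z_f * m_f and z_f * n_f (hypothetical inputs)
    cx, cy : the quantities written cx_f and cy_f in the text
    """
    G = [cross(mt, nt), mt, nt]    # rows of G_f
    H = [1.0, -cx, -cy]            # H_f
    k = solve3(G, H)               # equation (39)
    i = cross(nt, k)               # equation (40)
    j = cross(k, mt)
    return i, j, k


if __name__ == "__main__":
    # Build consistent inputs from a known orthonormal frame, then recover it.
    i_true, j_true, k_true = [0.8, 0.6, 0.0], [-0.6, 0.8, 0.0], [0.0, 0.0, 1.0]
    cx, cy = 0.2, -0.1
    mt = [i_true[n] - cx * k_true[n] for n in range(3)]
    nt = [j_true[n] - cy * k_true[n] for n in range(3)]
    print(camera_axes(mt, nt, cx, cy))
```

With noisy data the returned `i` and `j` are generally not exactly orthonormal, which is why the text follows this step with a projection onto the nearest orthonormal pair.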

All that remain to be computed are the translations
for each frame. We calculate the depth $z_f$ from
either part or some combination of the parts of
equation (29). Since we already know $cx_f$, $cy_f$, $i_f$,
and $j_f$, we can calculate $t_f$ using equations (17) and
(18).

5.  Experiments 
5.1.  Parameters 

To test our method, we created synthetic point
sequences using a perspective projection model of
objects undergoing motion. We perturbed the coordinates
of each point by adding gaussian noise,
whose standard deviation was varied from 0 to 4
pixels (of a 512x512 pixel image). We used 3 different
object shapes, each of unit size and containing
approximately 60 feature points. All of the test
runs consisted of 60 image frames of the object
rotating through a total of 30 degrees each of roll,
pitch, and yaw. The depth, representing the distance
from the camera’s focal point to the front of
the object in the first frame, was varied from 3 to
60 times the object size. In generating our synthetic
images, for each depth we chose the largest focal
length which would keep the object in the field of
view. For each combination of object, depth, and
noise, we performed three tests, using different
random noise each time.

We used several different methods to recover the
shape and motion of the object and compared the
accuracy of the results: the orthographic factorization
method, the scaled orthographic factorization
method (see [Poelman and Kanade, 1992]), the
paraperspective factorization method, and the full
perspective method which iteratively solves the
perspective projection equations (see [Poelman and
Kanade, 1992]). Because iterative methods are
generally very sensitive to initial conditions, we
tested the latter method using two different starting
values: the results of the paraperspective factorization
method, and the true shape and motion. The
last method is unfortunately not an option in real
systems, but indicates what is essentially an upper
bound on the accuracy achievable using a least
sum-of-squares difference formulation of the full
perspective projection model, without making
assumptions about the motion or object shape.

5.2.  Error  Measurement 

We present here the total shape error, rotation error,
X-Y offset error, and Z offset (depth) error. The
term “offset” refers to translations along the camera
components; the X offset is $t_f \cdot i_f$, the Y offset is
$t_f \cdot j_f$, and the Z offset is $t_f \cdot k_f$. The shape and translations
are only determined up to a scaling factor,
since it’s not possible to distinguish a house 50m
away from a 1/10th scale model of the house 5m
away. To compute the shape error, we find a scaling
factor which minimizes the root-mean-square
(RMS) error between the true and computed shape,
and then return this error. We use the same method
for the X-Y offset, and for the Z offset. The rotation
error is computed as the RMS of the size in
radians of the angle by which a computed camera
frame must be rotated about some axis to produce
the true camera frame.
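
The two kinds of error measure described above can be sketched as follows. This is a minimal illustration with hypothetical function names; it assumes the true and computed shapes are already expressed in the same coordinate frame, that the RMS is taken over per-point distances, and that camera frames are given as 3x3 rotation matrices stored as nested lists.

```python
import math


def best_scale(true_pts, est_pts):
    """Scale s minimising sum |t - s*e|^2 over all points (closed form)."""
    num = sum(t * e for tp, ep in zip(true_pts, est_pts)
              for t, e in zip(tp, ep))
    den = sum(e * e for ep in est_pts for e in ep)
    return num / den


def scaled_rms_error(true_pts, est_pts):
    """RMS distance between the true shape and the optimally scaled estimate."""
    s = best_scale(true_pts, est_pts)
    sq = [sum((t - s * e) ** 2 for t, e in zip(tp, ep))
          for tp, ep in zip(true_pts, est_pts)]
    return math.sqrt(sum(sq) / len(sq))


def rotation_angle(R_true, R_est):
    """Angle in radians of the single rotation relating the two frames."""
    # trace(R_true * R_est^T) equals the sum of elementwise products.
    tr = sum(R_true[r][c] * R_est[r][c] for r in range(3) for c in range(3))
    cos_angle = max(-1.0, min(1.0, (tr - 1.0) / 2.0))  # guard against round-off
    return math.acos(cos_angle)


def rms(values):
    """Root-mean-square of a list of numbers, e.g. per-frame rotation angles."""
    return math.sqrt(sum(v * v for v in values) / len(values))
```

Here `rms([rotation_angle(Rt, Re) for Rt, Re in frames])` would give a rotation error over a sequence, where `frames` is a hypothetical list of (true, computed) rotation pairs.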

5.3.  Results 

We  found  that  the  paraperspective  method  per¬ 
formed  significantly  better  than  the  orthographic 
factorization  method  in  image  sequences  in  which 
there  was  depth  translation  or  the  object  was  not 
centered  in  the  image.  In  the  experiments  in  which 
the  object  was  centered  in  the  image  and  there  was 
no  depth  translation,  we  found  that  the  orthogra¬ 
phic  factorization  method  performed  well,  and  the 
paraperspective  factorization  method  provided  no 
significant  improvement.  This  is  not  surprising, 
since  the  orthographic  method  is  in  effect  incorpo¬ 
rating  knowledge  about  the  object  motion  -  that  the 
object  is  centered  in  the  image  and  not  translating 
toward  or  away  from  the  camera. 

The  average  error  results  were  very  similar  for  all 
of  the  objects,  so  our  graphs  show  the  average  error 
over  all  3  runs  of  all  3  objects.  Figure  4  shows  how 
the  various  methods  performed  in  a  scenario  in 
which  the  object  moved  across  the  screen  one  unit 
horizontally  and  one  unit  vertically,  and  moved 


away  from  the  camera  by  a  total  of  one  half  the 
object’s  initial  distance  from  the  camera  (thus  in  a 
test  case  in  which  the  object’s  depth  in  the  first 
frame  was  3.0,  its  depth  in  the  last  frame  was  4.5). 
These  tests  were  done  using  a  noise  standard  devia¬ 
tion  of  2  pixels,  which  we  consider  a  rather  high 
noise  level.  At  low  depths,  perspective  distortion  is 
a  significant  source  of  error  in  the  computed 
results.  Interestingly,  our  experiments  show  that  for 
objects  farther  from  the  camera  than  7  times  the 
object  size,  refining  the  paraperspective  solution 
using  the  perspective  iteration  technique  improves 
the  rotation  and  translation  very  little.  However, 
even  at  depths  beyond  30  times  the  object  size,  the 
perspective  refinement  method  significantly 
improves  the  shape. 


Figure  4.  Methods  compared  for  a  typical  case 


The  behavior  of  the  paraperspective  factorization 
method  for  the  same  motion  scenario  over  a  range 
of  noise  levels  is  shown  in  Figure  5.  Once  the 
object  is  far  enough  from  the  camera  that  perspec¬ 
tive  effects  are  minor,  the  error  in  the  computed 
solution  is  nearly  proportional  to  the  amount  of 
noise  in  the  input. 

We implemented the methods in C and performed
the experiments on a Sun 4/65. Solving the system
of 60 frames and 60 points required about 20-24
seconds for each of the three factorization methods,
with much of this time spent computing the
singular value decomposition of the measurement
matrix. The perspective iteration method required
about 250 seconds to solve the same system.

Figure 5. … recovery by noise level

6.  Conclusions 

The principle that the measurement matrix has rank
3, as put forth by Tomasi and Kanade in [1991a],
depended on the use of an orthographic projection
model. We have shown in this paper that this
important result also holds for the case of paraperspective
projection, which closely approximates
perspective projection, and have devised a paraperspective
factorization method based on this model.

In general image sequences in which the object
being viewed translates significantly toward or
away from the camera or across the camera’s field
of view, the paraperspective factorization method
performs significantly better than the orthographic
method. In image sequences in which the object is
close to the camera, the paraperspective factorization
method still provides accurate motion results,
and provides a good approximation of the object
shape; this solution can be further refined using an
iterative perspective method.


The  paraperspective  factorization  method  com¬ 
putes  the  distance  from  the  camera  to  the  object  in 
each  image,  which  enables  its  use  in  a  wider  range 
of  scenarios.  The  method  performs  well  over  a 
wide range of motions, is efficient, and is robust

Abstract

In Structure from Motion algorithms, the error in the
estimated motion affects each reconstructed 3D point in
a systematic way. This paper attempts to isolate the
effect of the motion error (as correlations in the structure
error) and shows theoretically that these correlations
can improve existing multi-frame Structure from
Motion techniques. Finally it is shown that new experimental
results and previously reported work confirm the
theoretical predictions.

1  Introduction 

Due to problems in two-frame Structure from Motion
(SFM) [4], an obvious solution has been to use more than
two frames to reconstruct the environment. Although it
is theoretically conceivable that using enough different
views should make it possible to achieve any required
accuracy, achieving stable and reliable 3D reconstructions
is still difficult. Based on an algorithm presented
by Thomas and Oliensis [18] [12] we show that in order
to obtain a stable and reliable reconstruction (for
general motion), the effect of the interframe motion error
has to be taken into account; we argue that ignoring
this component could have resulted in the failure of
previous (recursive) multi-frame SFM (MFSFM) algorithms.
The theoretical and experimental evidence for
the critical role of motion error (in MFSFM) is the main
contribution of this paper.

2  Problems  in  MFSFM 

Using multiple images introduces different sets of problems,
depending on whether the algorithms are batch
methods or recursive methods. Due to the large search
spaces involved, batch methods impose restraints on the
camera motion ([13] [2] [10] [5] [17]) or constrain the
camera model ([20]). However, recursive MFSFM algorithms
need not impose such restraints (although early
research typically was constrained; cf. discussion in Section
4). Recursive MFSFM algorithms are also more
practical for robot navigation applications since neither
time nor storage is lost waiting until enough frames have
been acquired. However, in order to recursively refine
the 3D structure, a reliable estimate of the error in the
3D structure is required. If the estimate of the error is
unreliable, this results in random behavior or possibly
systematically erroneous behavior. One of the biggest
problems in MFSFM is that it is difficult to represent
the error in the structure reliably.

*This work was supported by DARPA (via TACOM) under
contract number DAAE07-91-C-R035 and by NSF under
grant CDA-8922572.

One representation of the reconstruction error is a
complete covariance matrix. That is, if the scene is
reconstructed by n 3D points, then the reconstruction
error is represented by a covariance matrix of size $9n^2$
(a $3n \times 3n$ matrix).
This covariance matrix is difficult to compute, expensive
to store, and computationally complex to manipulate.
Presumably for these reasons, almost all of the work in
recursive MFSFM has only used a portion of the covariance
matrix, with poor results. It is argued in
[19] that, in general, every entry of the covariance matrix
is meaningful; arbitrarily neglecting entries in the
matrix could amount to a bad approximation of the actual
reconstruction error (for general camera motion).
A simplistic explanation is as follows. The main source
of error in all structure from motion algorithms is the
error in the estimated camera motion. The motion error
affects all the 3D coordinates of the reconstruction
in a systematic way; i.e., the errors in all the 3D coordinates
are correlated. For example, if the translation
component of the camera motion is erroneous, each 3D
coordinate would be displaced along the same direction.
Since every element of the covariance matrix represents
the correlation of the error between pairs of 3D points,
arbitrarily neglecting portions of the matrix (e.g. all off-diagonal
terms) could have serious consequences. The
following section is a theoretical analysis of the meaning
and the role of cross-correlations in recursive MFSFM
algorithms.

3  Theoretical  Motivation  for  Using 
Cross-correlations 

Let R be the entire reconstruction, made up of n 3D
points ($R_i$, $i = 1 \ldots n$).

Since each $R_i$ is obtained from a two-frame algorithm
effectively by triangulation, it has two sources of error.
The first source is the error in the interframe motion,
or the relative orientation of the cameras. The second
source of error is the noise in the image coordinates. A
reasonable approximation of the total error in $R_i$ is to
express it (using first order terms) as the sum of the error
due to the interframe motion and the error due to the
image coordinates; the error in $R_i$ is

$$dR_i = \frac{\partial R_i}{\partial W}\,dW + \frac{\partial R_i}{\partial I_i}\,dI_i \tag{1}$$

where $W$ represents the interframe motion and $I_i$ represents
the image coordinates of the point; $dW$ and $dI_i$
represent the respective errors.

When the error in R is represented as a covariance
matrix the elements of this matrix are given by the following
equation:

$$COV(dR\,dR^T) = \begin{pmatrix} E(dR_1\,dR_1^T) & \cdots & E(dR_1\,dR_n^T) \\ \vdots & \ddots & \vdots \\ E(dR_n\,dR_1^T) & \cdots & E(dR_n\,dR_n^T) \end{pmatrix} \tag{2}$$

where $E(x)$ denotes the expected value of $x$.

In  this  theoretical  analysis,  in  order  to  bring  out  the 
meaning  and  role  of  the  cross-correlation  terms  clearly, 
we  will  assume  that  we  have  a  reconstruction  consisting 
of  just  two  points. 


3.1  The  Meaning  of  Cross-correlations:  The 
Two  Point  Case 

In this case, the covariance matrix is reduced to

$$COV(dR\,dR^T) = \begin{pmatrix} E(dR_1\,dR_1^T) & E(dR_1\,dR_2^T) \\ E(dR_2\,dR_1^T) & E(dR_2\,dR_2^T) \end{pmatrix} \tag{3}$$

This covariance matrix has four correlation terms, two
of which are equivalent ($E(dR_1\,dR_2^T)$ and $E(dR_2\,dR_1^T)$).
The other two ($E(dR_1\,dR_1^T)$ and $E(dR_2\,dR_2^T)$) are the
covariance of the error in $R_1$ and $R_2$; these are typically
assumed to represent the complete error. However,
here we will concentrate on the cross-correlation term,
$E(dR_1\,dR_2^T)$.

Using Equation 1 the cross-correlation term can be
expanded as in Equation 4:

$$\begin{aligned}
E(dR_1\,dR_2^T) = {} & \frac{\partial R_1}{\partial W} E(dW\,dW^T) \left(\frac{\partial R_2}{\partial W}\right)^T
 + \frac{\partial R_1}{\partial W} E(dW\,dI_2^T) \left(\frac{\partial R_2}{\partial I_2}\right)^T \\
 & + \frac{\partial R_1}{\partial I_1} E(dI_1\,dW^T) \left(\frac{\partial R_2}{\partial W}\right)^T
 + \frac{\partial R_1}{\partial I_1} E(dI_1\,dI_2^T) \left(\frac{\partial R_2}{\partial I_2}\right)^T
\end{aligned} \tag{4}$$

Since it is realistic to assume that any two arbitrary
image coordinates (of chosen points) are corrupted by
independent noise, i.e.

$$E(dI_1\,dI_2^T) = 0 \tag{5}$$

one of the terms in the expansion of Equation 4 will
vanish. The resultant expansion is given in Equation 6:

$$\begin{aligned}
E(dR_1\,dR_2^T) = {} & \frac{\partial R_1}{\partial W} E(dW\,dW^T) \left(\frac{\partial R_2}{\partial W}\right)^T \\
 & + \frac{\partial R_1}{\partial W} E(dW\,dI_2^T) \left(\frac{\partial R_2}{\partial I_2}\right)^T
 + \frac{\partial R_1}{\partial I_1} E(dI_1\,dW^T) \left(\frac{\partial R_2}{\partial W}\right)^T
\end{aligned} \tag{6}$$

Given Equation 6, the only way the cross-correlation
term will end up being zero is when the three terms fortuitously
cancel; in all other cases the cross-correlation
term has an effect on the performance of the recursive
MFSFM algorithm. Furthermore, the situations in
which the three terms cancel each other out are most
likely rare.

For the sake of exposition let us assume that the coordinates
of the two points have changed considerably
between the two images, resulting in large optical flow.
Therefore, a small error in the optical flow (which corresponds
to a small error in I) has little effect on the error
in the motion, dW; i.e.

$$E(dW\,dI_i^T) \approx 0, \qquad i = 1, 2 \tag{7}$$

For this particular case, the expansion of Equation 6 is:

$$E(dR_1\,dR_2^T) = \frac{\partial R_1}{\partial W}\,COV(dW\,dW^T) \left(\frac{\partial R_2}{\partial W}\right)^T \tag{8}$$

Equation 8 shows that the cross-correlation is directly
proportional to the motion error, represented as the covariance
of the error in the motion (dW). If Equation
7 does not hold the situation is more complicated: the
cross-correlation is influenced not only by the motion
error but also (indirectly) by the error in the image coordinates.
In either case, the cross-correlation term is
closely related to the motion error.

3.2 The Effect of Cross-correlations in Kalman
Filtering

In this section the analysis is extended to study the effect
of cross-correlations on refining reconstructions using
the Kalman filter.

The goal of the Kalman filter is to optimally fuse the
reconstructions over time and obtain the best reconstruction
(by limiting the reconstruction error). If we assume
that the noise in every new reconstruction (R(t) at time
t) is Gaussian, then the optimal fused reconstruction is
the sum of the individual reconstructions weighted by
the inverse of their covariances. Given this, the optimal
fused reconstruction ($\hat{R}$) at time t is as follows (i.e. standard
Kalman filtering [6])

$$\hat{R}(t) = N \sum_{\tau=1}^{t} COV(R(\tau))^{-1} R(\tau) \tag{9}$$

In order to determine the exact contribution of a single
reconstruction ($COV(R)^{-1} R$, or $R^w$ for weighted R)
at any time (t) the covariance can be expanded using
Equation 3 (and assuming Equation 7 is valid) in the
following way:


$$COV(R) = \begin{pmatrix} S_{11} + M_{11} & M_{12} \\ M_{21} & S_{22} + M_{22} \end{pmatrix} \tag{10}$$

where

$$M_{ij} = \frac{\partial R_i}{\partial W}\,COV(dW\,dW^T) \left(\frac{\partial R_j}{\partial W}\right)^T \tag{11}$$

and

$$S_{ii} = \frac{\partial R_i}{\partial I_i}\,COV(dI_i\,dI_i^T) \left(\frac{\partial R_i}{\partial I_i}\right)^T \tag{12}$$

$S_{ii}$ represents the error in the 3D coordinates due to the
error in the image coordinates (dI) assuming that the
motion is perfectly known; $M_{ij}$ represents the error in
the 3D coordinates due to the error in the interframe
camera motion (dW) assuming that the image coordinates
are perfectly known.

N (in Equation 9) is a normalising term which is irrelevant
for this analysis.

The weighted R can now be written as:

$$R^w = \begin{pmatrix} S_{11} + M_{11} & M_{12} \\ M_{21} & S_{22} + M_{22} \end{pmatrix}^{-1} \begin{pmatrix} R_1 \\ R_2 \end{pmatrix} \tag{13}$$

Equation 13 can be expanded after Bar-Shalom and
Fortmann [14]. In this analysis, let us now concentrate
on the effect of the cross-correlation on a single optimally
fused 3D coordinate ($R_1$); all of the relevant information
is contained in the first row in the expansion of Equation
13, which is:

$$R^w_1 = \left(S_{11} + M_{11} - M_{12}(S_{22} + M_{22})^{-1} M_{12}^T\right)^{-1} \left(R_1 - M_{12}(S_{22} + M_{22})^{-1} R_2\right) \tag{14}$$

The second term (in Equation 14) can be thought of as
a Corrected $R_1$:

$$\text{Corrected } R_1 = R_1 - M_{12}(S_{22} + M_{22})^{-1} R_2 \tag{15}$$

If there is no error in the motion (i.e. $M_{12}$ is zero),
the Corrected $R_1$ is identical to $R_1$. However, since this
is generally not true in practice, the value of $R_2$ has a
corrective effect on $R_1$. The magnitude of the correction
depends on the size of the cross-correlation $M_{12}$.
Since we have shown that the cross-correlation captures
the motion error (cf. Section 3.1), the magnitude of the
correction depends on the (shared) motion error that
corrupts both $R_1$ and $R_2$.

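The covariance of Corrected $R_1$ is

$$COV(\text{Corrected } R_1) = E\left(\left[R_1 - M_{12}(S_{22} + M_{22})^{-1} R_2\right]^2\right) \tag{16}$$

Again, this can be simplified to obtain:

$$COV(\text{Corrected } R_1) = S_{11} + M_{11} - M_{12}(S_{22} + M_{22})^{-1} M_{12}^T \tag{17}$$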

As  stipulated  by  Kalman  filtering,  any  contribution  (to¬ 
wards  the  fused  optimal  estimate)  has  to  be  weighted 
by  the  inverse  of  its  covariance.  Thus  we  expect  that 
Corrected  Ri  (Equation  15)  should  be  weighted  by  the 
inverse  of  its  covariance.  Since  the  right-hand  side  of 
Equation  17  turns  out  to  be  equal  to  the  first  term  of 
Equation  14  above,  this  is  exactly  the  case. 
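
The agreement between Equations 13 to 17 can be checked numerically in the simplest setting. The sketch below assumes that every block is a scalar (one coordinate per point), so that matrix inverses become divisions; the function names and the numbers in the usage example are illustrative only.

```python
def weighted_first_row(S11, S22, M11, M12, M22, R1, R2):
    """First component of COV(R)^-1 R, by direct inversion of the 2x2 matrix."""
    a, b, d = S11 + M11, M12, S22 + M22      # COV(R) = [[a, b], [b, d]]
    det = a * d - b * b
    # inverse of [[a, b], [b, d]] is 1/det * [[d, -b], [-b, a]]
    return (d * R1 - b * R2) / det


def corrected_form(S11, S22, M11, M12, M22, R1, R2):
    """Same quantity assembled from the Corrected R1 and its covariance."""
    corrected = R1 - M12 / (S22 + M22) * R2              # Equation 15
    cov_corrected = S11 + M11 - M12 ** 2 / (S22 + M22)   # Equation 17
    weighted = corrected / cov_corrected                 # Equation 14
    return weighted, corrected, cov_corrected


if __name__ == "__main__":
    args = dict(S11=1.0, S22=2.0, M11=3.0, M12=2.0, M22=2.0, R1=10.0, R2=8.0)
    print(weighted_first_row(**args))
    print(corrected_form(**args))
```

Setting `M12` to zero in this sketch leaves the corrected value equal to `R1`, which mirrors the remark that the correction vanishes when there is no shared motion error.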

This  analysis  reveals  that  the  cross-correlation  terms 
are  important.  If  the  interframe  motion  error  is  large, 
then  the  cross-correlation  terms  become  significant  and 
play  a  crucial  role.  Since  in  SFM  the  motion  error  is  typ¬ 
ically  large  [4]  we  predict  that  without  cross-correlations 
the  benefits  of  Kalman  filtering  are  lost,  i.e.  the  fused 
reconstruction  would  be  neither  stable  nor  accurate.  In 
the  next  section  we  present  experimental  evidence  to  this 
effect. 

4  Experimental  Data 

The previously reported MFSFM algorithms conform to
the prediction of the last section. Heel [7] approximates
the entire covariance matrix by just the error terms relating
to one coordinate Z (i.e., when reconstructing n 3D
points, his covariance matrix has n elements rather than
the full $9n^2$ elements); only qualitative results are reported
and the camera motion is restricted to a straight
line. Matthies et al. [11] also use only n elements
to represent the reconstruction error. However, in their
experiment the camera motion is known accurately, in
which case the cross-correlations should not play a role;
their reconstruction is within 0.5% error using 11 images.
Shigang, Tsuji, and Imai [15] also use only n terms to
approximate their error, but consider more general motions
than Heel. When they allow the camera to move
freely in a plane, their reconstruction error is 15% even
with as many as 40 images. Ando [1] also uses n elements
(for general camera motion) but only simulation
experiments are reported.

The next category of approximations involves using
9n elements to approximate the $9n^2$ covariance matrix.
Stephens et al. [16] report reconstructions within 1%
error for 1 point after 50 frames in the case of motion
straight ahead. Cui, Weng and Cohen [3] also use 9n
elements to approximate the full covariance matrix and
apply the algorithm for the case of general camera motion.
The reported accuracy of the reconstruction (from
a real image sequence) fluctuates randomly. Since no
comparison with the ground truth is reported it is unclear
how well this algorithm really does.

The algorithm developed by Thomas and Oliensis [18]
[12] is the only recursive MFSFM algorithm (for general
motion) that uses the full covariance matrix with
$9n^2$ elements. Apart from using cross-correlations, their
algorithm is similar to previous recursive (Kalman filter)
MFSFM algorithms. Highly accurate reconstructions
(as accurate as the ground truth) have already
been reported by Thomas and Oliensis [18] for image
sequences with no constraints on the robot camera motion.
Here, their algorithm is used to test for the effect of
cross-correlations in real image sequences by comparing
results from the same algorithm with and without cross-correlations.
Such a comparison has not been previously
done; it will be presented in the following section for two
real image sequences.†
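
To give a feel for the three levels of approximation discussed in this section, the sketch below counts the covariance entries each one stores. The function and key names are hypothetical; the point counts in the usage example (35 and 29) are the ones used in the two experiments reported in this section.

```python
def covariance_entries(n):
    """Number of stored covariance entries for n reconstructed 3D points."""
    return {
        "depth_only": n,              # one variance per point (n elements)
        "per_point_blocks": 9 * n,    # one 3x3 block per point (9n elements)
        "full": (3 * n) ** 2,         # complete 3n x 3n matrix (9n^2 elements)
    }


if __name__ == "__main__":
    for n in (35, 29):
        print(n, covariance_entries(n))
```

The quadratic growth of the full matrix is the storage side of the cost; inverting it is the more expensive part.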

4.1  Experiment  I:  Reconstruction  of  a 
Rotating  Box 

A  box  was  rotated  in  steps  of  approximately  4  degrees 
around  its  vertical  axis,  and  nine  images  were  obtained 
by  a  stationary  camera  (cf.  Fig  1).  The  camera  parame¬ 
ters  are:  focal  length  6cm,  fov  (23.4°,  22.4°),  and  image 
size  256  x  242  pixels.  The  available  ground  truth  had 
an  accuracy  of  1.5mm.  In  this  experiment  35  corners  of 
the  white  squares  on  the  box  were  chosen  to  be  recon¬ 
structed;  the  corners  were  tracked  using  the  algorithm  of 
Williams  and  Hanson  [23].  Due  to  the  well  known  scale 
ambiguity  in  SFM  [21],  the  scale  of  the  reconstruction 
was  input  as  a  single  number  at  the  beginning  of  the 
process. 

In order to determine how well the shape of the box is
reconstructed, each reconstruction is rotated and translated
(rigidly) to align with the ground truth. The
mismatch between the aligned reconstruction and the
ground truth is the error in the shape. The alignment
that minimizes the mismatch error can be determined
exactly (in closed form) by Horn’s absolute orientation
algorithm [8]. The overall error of the entire reconstruction
is reported as an average of the individual mismatch
errors over the set of reconstructed points.

†Thanks to Harpreet Sawhney and Rakesh Kumar for
these sequences and the ground truth measurements.

The performance of the algorithm with and without
the cross-correlation terms is presented here. For comparison
we include the results from a standard two-frame
approach: Horn’s relative orientation algorithm [9].‡

From  the  graph  (in  Fig.  2)  it  can  be  observed  that  the 
error  in  the  two-frame  reconstruction  is  fairly  high  (the 
average  error  is  8.8  mm;  the  dimensions  of  the  box  are 
133  mm  x  157  mm  x  70  mm  and  the  distance  between 
any  two  points  ranges  from  15  mm  to  207.19  mm).  The 
random  and  high  fluctuations  (e.g.  in  frame  4  and  frame 
7)  make  the  two-frame  reconstructions  unreliable.  In 
the  case  when  cross-correlations  of  the  error  are  ignored 
in  the  multi-frame  algorithm  developed  in  [18],  we  can 
see  that  after  an  initial  drop  in  error,  the  error  fluctuates 
around  11  mm,  but  has  a  very  slow  decrease.  Notice  also 
that  in  frames  4  and  7  the  error  increases,  showing  that 
the  algorithm  is  unable  to  ignore  the  erroneous  individ¬ 
ual  two-frame  reconstructions.  However,  when  cross¬ 
correlations  are  not  ignored,  the  average  reconstruction 
error  falls  monotonically  and  remains  as  low  as  the  er¬ 
ror  in  the  ground  truth  (1.5  mm)  for  the  last  4  frames. 
Note  that  the  final  reconstruction  (after  9  frames)  of  the 
MFSFM  approach  without  cross-correlations  is  6  times 
more  erroneous  than  the  final  reconstruction  when  cross¬ 
correlations  are  considered. 

4.2  Experiment  II:  Reconstruction  of  the 
Computer  Science  Lobby 

A sequence of ten pictures was taken of the Computer
Science lobby by a camera mounted on a Denning
mobile robot moving almost straight ahead (Fig.
3). The camera parameters are: focal length 16mm, fov
(29.3°,  22.9°),  and  image  size  256  x  242  pixels.  For  this 
experiment  29  points  were  selected  from  the  first  image, 
making  sure  that  each  point  was  visible  in  the  rest  of 
the  images.  Point  tracking  was  done  as  in  the  previous 
experiment.  The  error  in  the  reconstruction  is  plotted  in 
Fig.  4.  In  this  case  we  are  interested  in  reconstructing 
the  structure  of  the  scene,  and  not  merely  the  shape.  The 
error  in  the  reconstructed  3D  coordinates  is  reported  as 
a  percentage  of  the  distance  of  the  true  3D  coordinates 
from  the  camera,  averaged  over  the  set  of  reconstructed 
points. 

From  the  graph  (Fig.  4)  it  can  be  observed  that  the 
error  in  the  two-frame  reconstructions  is  high  and  fluc¬ 
tuates  randomly  (around  8%).  Also,  from  the  graph  we 
can  see  that  the  MFSFM  approach  that  ignores  cross¬ 
correlations  leads  to  better  accuracies  than  individual 
two-frame  results.  However,  the  reconstruction  error 
does  not  decrease  monotonically;  instead  it  fluctuates 
around  an  error  of  3.3  %. 

‡Horn’s algorithm provides the input for Thomas and
Oliensis’ MFSFM algorithm.


Again, using cross-correlations yields the best accuracy
of the three approaches compared here. Fig. 4
shows that the average reconstruction error falls almost
monotonically, with a final error of 2.16% after
ten frames. The final reconstruction (after ten frames)
of the same MFSFM algorithm which ignores cross-correlations
is 65% more erroneous than the final reconstruction
which uses cross-correlations.

5  Conclusion 

We have argued that the cross-correlation terms capture
the interframe motion error and account for it. Ignoring
the cross-correlations seems to have direct consequences
on the accuracy and usefulness of the reconstructed models
of the environment.

Although the cross-correlations have presumably been
ignored because of their computational complexity, we
have shown that they are crucial enough to warrant an
attempt to make using them computationally feasible.
Since the bottleneck of including cross-correlations is
the time required to invert large matrices, one solution
is a straightforward parallel implementation of the algorithm
on a SIMD parallel machine such as the Image Understanding
Architecture [22]. Future research will also
be directed towards discovering other ways of reducing
computational time such as using smaller (intersecting)
subsets of points which are yet large enough to capture
the underlying motion error.

References 

[1]  H.Ando,  “Dynamic  Reconstruction  of  3D  Surfaces  and 
3D  Motion,”  Proc.  IEEE  vorkshop  on  visual  motion, 
Princeton,  NJ,  pp.  101-110,  1991. 

[2]  T.  J.  Broida  and  R.  Chellappa,  “Estimating  the  Kine¬ 
matics  and  Structure  of  a  Rigid  Object  from  a  Sequence 
of  Monocular  Images”,  IEEE  Transactions  on  Pattern 
Analysis  and  Machine  Intelligence,  vol.  13,  no.  6,  pp. 
497-513,  1991. 

[3]  N.  Cui,  J.  Weng  and  P.  Cohen,  “Extended  Struc¬ 
ture  and  Motion  Analysis  from  Monocular  Image  Se¬ 
quences,”  Proceedings  3rd  IEEE  International  Confer¬ 
ence  on  Computer  Vision,  Osaka,  Japan,  1990,  pp.  222- 
229. 

[4]  R.  Dutta  and  M.  Snyder,  “Robustness  of  Correspon¬ 
dence  Based  Structure  from  Motion,”  Proceedings  3rd 
IEEE  International  Conference  on  Computer  Vision, 
Osaka,  Japan,  Dec.  1990. 

[5]  W.O.  Fransen,  “Structure  and  Motion  from  Uniform  3- 
D  Acceleration,"  Proc.  IEEE  workshop  on  xnsual  motion, 
Princeton,  NJ,  pp.  14-20,  1991. 

[6]  Technical  Staff,  The  Analytical  Sciences  Corp.,  and  A. 
Gelb,  ed..  Applied  Optimal  Estimation,  MIT  Press,  1986. 

[7]  J.Heel,  “Temporal  Surface  Reconstruction,”  CVPR, 
Haww,  June,  1991,  pp.  607-612. 

[8]  B.  K.  P.  Born,  “Closed  Form  Solution  of  Absolute  Ori¬ 
entation  Using  Unit  Quaternions,”  J:  Opt.  Soc.  Am.  A, 
vol.  4,  pp.  629-642,  1987. 

[9]  B.  K.  P.  Horn,  “Relative  Orientation,”  International 
Journal  of  Computer  Vuion,  Vol.  4,  pp.  59-78,  1990. 

[10]  R.  V.  R.  Kumar,  A.  Tirmalai  and  R.C.  Jain,  “A  Nonlin¬ 
ear  Optimization  Algorithm  for  the  Estimation  of  Struc¬ 
ture  and  Motion  Parameters,”  CVPR,  San  Diego,  CA, 
pp.  136-143,  1989. 


694 


[n]  L.  Matthies,  T.  Kanade,  and  R.  Szeliski,  ‘‘Kalman 
FUtet-Based  Algorithms  for  Estimating  Depth  from  Im¬ 
age  Sequences,”  International  Journal  of  Computer  Vi¬ 
sion,  vol  3,  pp.  209-236,  1969. 

[12]  J.  Oliensis  and  J.  I.  Thomas,  “Incorporating  Motion  Er¬ 
ror  in  Multi-frame  Structure  from  Motion,"  Proceedings 
IEEE  Workshop  on  Visual  Motion,  Princeton,  pp  6-13, 
1991. 

[13] H. S. Sawhney, J. Oliensis, and A. R. Hanson,
“Description and Reconstruction from Image Trajectories of
Rotational Motion,” in ICCV, Osaka, Japan, December,
1990, pp. 494-496.

[14] Y. Bar-Shalom and T. E. Fortmann, Tracking and Data
Association, Academic Press, Orlando, FL, 1991, pp.
277.

[15] L. Shigang, S. Tsuji and M. Imai, “Determining of
Camera Rotation from Vanishing Points of Lines on
Horizontal Planes,” Proceedings 3rd IEEE International
Conference on Computer Vision, Osaka, Japan, 1990, pp.
499-502.

[16] M. J. Stephens, R. J. Blissett, D. Charnley, E. P. Sparks and
J. M. Pike, “Outdoor Vehicle Navigation Using Passive
3D Vision,” CVPR, San Diego, CA, pp. 556-562, 1989.

[17] C. J. Taylor, D. J. Kriegman, and P. Anandan, “Structure
and Motion in Two Dimensions from Multiple Images:
A Least Squares Approach,” Proc. IEEE Workshop on
Visual Motion, Princeton, NJ, pp. 242-246, 1991.

[18] J. Inigo Thomas and J. Oliensis, “Recursive Structure
from Multi-frame Motion,” Proc. DARPA Image
Understanding Workshop, San Diego, CA, 1992.

[19]  J.  Inigo  Thomas,  Scene  Reconstruction  Using  an  Image 
Sequence,  (in  progress)  Ph.D.  Dissertation,  University 
of  Massachusetts,  Amherst. 

[20]  C.  Tomasi  and  T.  Kanade,  “Factoring  Image  Sequences 
into  Shape  and  Motion,"  Proc.  IEEE  workshop  on  visual 
motion,  Princeton,  NJ,  pp.  21-26,  1991. 

[21]  R.Y.Tsai  and  T.S. Huang,  “Uniqueness  and  estimation  of 
3-D  motion  parameters  and  surface  structures  of  rigid 
objects,”  pp.  135-171. 

[22]  C.C. Weems,  S.P.Levitan,  A.R.Hanson,  E.R.  Riseman, 
D.B.Shu  and  J.G.Nash,  “The  Image  Understanding  Ar¬ 
chitecture,”  IJCV,  2,  pp.  251-262. 

[23] L. R. Williams and A. R. Hanson, “Translating Optical
Flow into Token Matches and Depth from Looming,”
Proc. 2nd Intl. Conf. on Computer Vision, pp. 441-446,
1988.




Fig.  2:  Box  Shape  Error 


Fig.  3:  Lobby  Sequence 


Fig. 4: Lobby Reconstruction Error
Abstract

We  introduce  a  low-level  description  of  image 
motion  called  the  local  translational  de¬ 
composition  (LTD).  This  description  asso¬ 
ciates  with  image  features  or  small  image  ar¬ 
eas,  a  three-dimensional  unit  vector  describing 
the  direction  of  motion  of  the  corresponding 
environmental  feature  or  small  surface  area. 

The  local  translational  decomposition  is  de¬ 
rived  by  applying  a  procedure  for  processing 
purely  translational  motion  to  small  overlap¬ 
ping  image  areas.  This  intermediate  represen¬ 
tation  of  motion  considerably  simplifies  the  in¬ 
ference  of  motion  parameters  for  ego-motion 
and  can  support  qualitative  inferences  for  non- 
rigid  motions.  We  first  show  how  to  compute 
the  LTD  from  optic  flow  fields  and  then  show 
how  the  LTD  is  used  to  recover  the  parameters 
of rigid body motions. We present three cases
for which the recovery of motion parameters
is particularly robust: motion constrained to
a determined plane (the normal to the plane
is known); motion constrained to an undetermined
plane (the normal to the plane is not
known); and arbitrary motion relative to locally
planar surfaces. We then discuss techniques
for computing the local translational decomposition
directly from real image sequences
without the initial extraction of optic flow, and
other areas for future work.

1  Introduction 

In  previous  work  [Lawton,  1982],  we  developed  a  tech¬ 
nique  to  process  relative  translational  motion  of  a  sensor 
with  respect  to  a  stationary  environment  or  indepen¬ 
dently  translating  objects.  This  and  related  algorithms 
[Burger  and  Bhanu,  1989;  Jain,  1983]  are  based  on  the 
strong  geometric  constraints  on  image  motion  in  the  case 
of  translation  -  radial  motion  of  image  features  from  a 
focus  of  expansion  (or  contraction)  determined  by  the 
intersection  of  the  axis  of  translation  with  an  imaging 

* This research is supported by the Advanced Research
Projects Agency of the Department of Defense and is
monitored by the U. S. Army Topographic Engineering Center
under contract No. DACA76-92-C-0016.


surface  [Gibson,  1950;  Lee,  1980].  The  technique  pre¬ 
sented  in  [Lawton,  1982]  was  based  on  optimizing  a  mea¬ 
sure  which  described  the  quality  of  feature  matches  re¬ 
stricted  to  lie  along  the  radial  flow  paths  associated  with 
a  potential  axis  of  translation.  The  optimization  pro¬ 
cess  involved  searching  over  the  surface  of  a  unit  sphere 
where  each  point  corresponded  directly  to  a  possible  di¬ 
rection  of  translation.  The  optimization  combined  the 
determination  of  the  direction  of  translation  and  the  cor¬ 
responding  image  displacements  into  a  single,  mutually 
constraining  computation.  It  was  possible  to  determine 
the  direction  of  translation  to  within  a  few  degrees  in 
small  image  areas  using  a  few  distinctive  features. 

In  this  paper  we  extend  the  translational  processing 
algorithm  to  work  with  general  rigid  body  and  other 
cases  of  motion  by  applying  the  translational  procedure 
to  local  portions  of  a  flow  field.  This  processing  asso¬ 
ciates  a  direction  of  relative  environmental  motion  with 
a  local  portion  of  a  flow  field  and  also  an  error  mea¬ 
sure  reflecting  the  validity  of  the  translational  approx¬ 
imation.  We  call  this  description  of  image  motion  the 
local  translational  decomposition  (LTD).  Comput¬ 
ing  the  LTD  begins  by  decomposing  a  flow  field  into 
small  overlapping  neighborhoods  and  then  approximat¬ 
ing  the  motion  for  each  neighborhood  as  being  produced 
by  translational  motion  of  the  corresponding  portion  of 
the  environment.  This  approximation  associates  a  unit 
vector  describing  the  direction  of  environmental  motion 
with  local  portions  of  a  flow  field.  Each  unit  vector  has 
an  associated  fit-value  reflecting  the  validity  of  the  trans¬ 
lational  approximation. 

The  LTD  is  a  low  level  representation  of  environmen¬ 
tal  motion  which  considerably  simplifies  the  recovery  of 
the  sensor  motion  parameters.  The  local  directions  of 
motion  and  corresponding  error  measures  are  used  as 
constraints  to  determine  the  actual  parameters  of  motion 
and  to  recover  the  structure  and  layout  of  environmental 
surfaces.  This  is  broken  into  four  cases.  For  motion  con¬ 
strained  to  a  plane  of  a  known  orientation  (See  Section 
2.1),  the  local  translational  approximation  is  recovered 
directly  from  the  intersection  of  flow  vectors  with  the 
horizon  line  determined  by  the  plane  of  motion.  For  mo¬ 
tion  constrained  to  a  plane  of  unknown  orientation  (See 
Section  2.2),  all  of  the  computed  LTD  vectors  must  be 
perpendicular  to  the  normal  of  the  unknown  plane.  This 
constraint  leads  to  a  direct  fitting  procedure  to  recover 
the plane of motion. For motion relative to locally planar
surfaces  (see  Section  2.3),  the  combination  of  local  pla¬ 
narity  and  rigidity  is  used.  For  arbitrary  motion,  rigidity 
between  environmental  points  is  used  to  recover  motion 
parameters  from  a  small  number  of  image  locations  (See 
Section  2  and  Section  3.1). 

The remainder of this section introduces the notation
used  throughout  this  paper.  Section  2  describes  how  the 
local  direction  of  translation  is  estimated  from  a  flow 
field  and  cases  of  motion  for  which  this  is  particularly  ro¬ 
bust.  Section  3  describes  how  the  parameters  of  relative 
sensor  motion  can  be  recovered  from  the  estimated  local 
directions  of  translation.  Section  4  discusses  computing 
the  local  translational  decomposition  directly  from  real 
image  sequences  without  the  initial  extraction  of  optic 
flow  and  other  areas  for  future  work. 

1.1  Notation 

The coordinate system used in this paper is shown in Figure 1.
The origin of this right-handed coordinate system
lies at the focal point of the camera. The image plane
is parallel to the xy-plane and is centered on the point
$(0, 0, f)$, where $f$ is the focal length of the camera. A
three-dimensional environmental point will be referred to
as $\mathbf{P}_{i,j} = (X_{i,j}, Y_{i,j}, Z_{i,j})$. The corresponding image point
is $\mathbf{p}_{i,j} = (x_{i,j}, y_{i,j})$. The first subscript $i$ is used to
differentiate between points. The second subscript denotes
the time interval. Thus, $\mathbf{p}_{i,j}$ refers to the $i$th point at
time $j$. A three-dimensional displacement which transforms
$\mathbf{P}_{i,j}$ into $\mathbf{P}_{i,j+1}$ forms a vector. This vector will be
referred to as $\mathbf{v}_{i,j}$. The corresponding optic flow vector
on the image plane is the displacement from $\mathbf{p}_{i,j}$ to
$\mathbf{p}_{i,j+1}$. In Section 2 a method for
estimating $\mathbf{v}_{i,j}$ is presented. This estimated vector will
be referred to as $\hat{\mathbf{v}}_{i,j}$. If $\hat{\mathbf{v}}_{i,j}$ is correct, it will be parallel
to $\mathbf{v}_{i,j}$, but its depth will be unknown. $\hat{\mathbf{v}}_{i,j}$ can be
positioned anywhere along the rays of projection which pass
through $\mathbf{p}_{i,j}$ and $\mathbf{p}_{i,j+1}$. Unless specified otherwise, $\hat{\mathbf{v}}_{i,j}$
will be positioned at the image plane.

The motion of the camera can be described by six parameters.
Let $\mathbf{r} = (r_x, r_y, r_z)$ denote the axis of rotation,
and $\mathbf{t} = (t_x, t_y, t_z)$ the direction of translation. We
assume the axis of rotation passes through the origin of
the camera coordinate system. The magnitude of $\mathbf{r}$ is
equal to the angle of rotation, and $\mathbf{t}$ is a unit vector.

2  Estimating  Local  Translation 

In  this  section  we  show  how  to  determine  an  axis  of  trans¬ 
lation  consistent  with  a  local  portion  of  a  computed  flow 
field.  In  section  4  we  briefly  discuss  how  to  compute  this 
directly  from  textured  images  without  the  initial  extrac¬ 
tion  of  a  flow  field. 

Figure  1  shows  that  the  plane  formed  by  a  flow  vec¬ 
tor  and  the  focal  point  of  the  camera  must  include  the 
estimated  local  translation  vector  (we  refer  to  this  as 
the  flow-vector  plane  for  a  given  flow  vector).  In  the 
case  of  purely  translational  motion,  the  estimated  local 
translation  vector  will  be  the  same  for  all  flow  vectors  in 
the neighborhood. Therefore, the estimated local translation
vector is the vector which is parallel to all of the
flow  vector  planes  in  the  neighborhood.  This  observation 
leads  directly  to  a  method  of  solving  for  the  estimated 
local  translation. 

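The plane formed by the flow vector at $\mathbf{p}_{i,j}$ and the focal
point of the camera must include $\hat{\mathbf{v}}_{i,j}$. Let this plane be
designated by its normal $\mathbf{n}_{i,j}$, where the image points are
treated as three-dimensional vectors $(x, y, f)$:

$$\mathbf{n}_{i,j} = \mathbf{p}_{i,j} \times \mathbf{p}_{i,j+1} \quad (1)$$

Since $\mathbf{n}_{i,j}$ is perpendicular to $\hat{\mathbf{v}}_{i,j}$,

$$\mathbf{n}_{i,j} \cdot \hat{\mathbf{v}}_{i,j} = 0 \quad (2)$$

In the case of purely translational motion, the direction
of $\hat{\mathbf{v}}_{i,j}$ is constant for all $i$. Therefore, Equation 2 can be
rewritten as

$$\mathbf{n}_{i,j} \cdot \hat{\mathbf{v}}_{j} = 0 \quad (3)$$

where $\hat{\mathbf{v}}_{j} = \hat{\mathbf{v}}_{i,j}$ for all $i$. This equation is linear with
three unknowns, and can be solved using a least squares
technique.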
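The least squares solution of Equation 3 can be sketched as follows. This is a minimal illustration rather than the authors' implementation: because the system is homogeneous, the sketch assumes the solution is taken as the unit vector minimizing the sum of squared dot products with the normalized plane normals, which is the eigenvector of their scatter matrix with the smallest eigenvalue. The function names and the closed-form symmetric 3x3 eigenvalue routine are choices made for this example, and the flow vectors are assumed to be nonzero.

```python
import math

def cross(a, b):
    """Cross product of two 3-vectors."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def unit(a):
    """Scale a 3-vector to unit length."""
    n = math.sqrt(sum(c * c for c in a))
    return tuple(c / n for c in a)

def smallest_eigenvector(M):
    """Unit eigenvector of a symmetric 3x3 matrix for its smallest eigenvalue."""
    q = (M[0][0] + M[1][1] + M[2][2]) / 3.0
    off = M[0][1] ** 2 + M[0][2] ** 2 + M[1][2] ** 2
    p = math.sqrt((sum((M[k][k] - q) ** 2 for k in range(3)) + 2.0 * off) / 6.0)
    if p == 0.0:
        raise ValueError("normals give no preferred direction")
    B = [[(M[r][c] - (q if r == c else 0.0)) / p for c in range(3)] for r in range(3)]
    det = (B[0][0] * (B[1][1] * B[2][2] - B[1][2] * B[2][1])
           - B[0][1] * (B[1][0] * B[2][2] - B[1][2] * B[2][0])
           + B[0][2] * (B[1][0] * B[2][1] - B[1][1] * B[2][0]))
    phi = math.acos(max(-1.0, min(1.0, det / 2.0))) / 3.0
    lam = q + 2.0 * p * math.cos(phi + 2.0 * math.pi / 3.0)  # smallest eigenvalue
    # Every row of (M - lam*I) is orthogonal to the eigenvector, so the
    # cross product of two independent rows points along it.
    rows = [tuple(M[r][c] - (lam if r == c else 0.0) for c in range(3)) for r in range(3)]
    cands = [cross(rows[0], rows[1]), cross(rows[0], rows[2]), cross(rows[1], rows[2])]
    best = max(cands, key=lambda v: sum(c * c for c in v))
    if sum(c * c for c in best) < 1e-24:
        raise ValueError("plane normals are (nearly) parallel")
    return unit(best)

def estimate_local_translation(points_t0, points_t1, f):
    """Estimate the local translation direction for one neighborhood.

    points_t0, points_t1: matching lists of (x, y) image points at times j, j+1.
    f: focal length. Returns a unit 3-vector whose sign is not determined.
    """
    M = [[0.0] * 3 for _ in range(3)]
    for (x0, y0), (x1, y1) in zip(points_t0, points_t1):
        n = unit(cross((x0, y0, f), (x1, y1, f)))  # Equation 1, normalized
        for r in range(3):
            for c in range(3):
                M[r][c] += n[r] * n[c]
    return smallest_eigenvector(M)
```

The returned direction is defined only up to sign; deciding between expansion and contraction would need the direction of the flow relative to the projected axis.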

Figure 2: Local translation associated with a rotating
line

An error measure is used to evaluate the validity of
the local translation approximation. The error measure
we use is the average, taken over the local neighborhood,
of the angle between each flow vector plane and the local
translation. Using the normals $\mathbf{n}_{i,j}$ from Equation 1, the
error measure is defined as

$$E_j = \frac{1}{m} \sum_{i=1}^{m} \sin^{-1}\!\left( \frac{|\mathbf{n}_{i,j} \cdot \hat{\mathbf{v}}_{j}|}{\|\mathbf{n}_{i,j}\| \, \|\hat{\mathbf{v}}_{j}\|} \right) \quad (4)$$

where  m  is  the  number  of  flow  vectors  in  the  local  neigh¬ 
borhood.  Alternatively  (and  with  greater  expense),  this 
measure  could  be  optimized  directly  by  a  search  proce¬ 
dure  to  determine  an  axis  of  translation. 
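The error measure described above, the mean angle between each flow-vector plane and the local translation, can be sketched as follows. The arcsine form is an assumption of this example: the angle between a plane and a vector is the complement of the angle between the plane's normal and that vector, so it equals the arcsine of the normalized absolute dot product. The function name `ltd_error` is hypothetical.

```python
import math

def ltd_error(normals, v_hat):
    """Mean angle in radians between each flow-vector plane and v_hat.

    normals: plane normals n = p_j x p_{j+1} for the neighborhood (3-vectors).
    v_hat:   candidate local translation direction (3-vector, any length).
    """
    v_len = math.sqrt(sum(c * c for c in v_hat))
    total = 0.0
    for n in normals:
        n_len = math.sqrt(sum(c * c for c in n))
        s = abs(sum(a * b for a, b in zip(n, v_hat))) / (n_len * v_len)
        total += math.asin(min(1.0, s))  # clamp guards against rounding above 1
    return total / len(normals)
```

A value of zero means every flow-vector plane contains the candidate direction; larger values indicate that the purely translational model fits the neighborhood poorly.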

In general, $\hat{\mathbf{v}}_{i,j}$ is not constant for all $i$. However, in
local areas $\hat{\mathbf{v}}_{i,j}$ is approximately constant. For example, in
Figure 2, points which are nearby on a line segment are
shown to have approximately the same local translations
when the line is rotated about its midpoint. Points near
the axis of rotation would not have a good translational
approximation, as would be reflected in the corresponding
error measure. Note that if the motion is composed
of both a rotation and a translation, the approximation
will also be affected by environmental points at different
depths, especially at occlusion boundaries. Since the
flow  vectors  in  the  area  of  an  occlusion  boundary  will 
not  consistently  emanate  from  a  focus  of  expansion,  the 
error  measure  given  in  Equation  4  returns  a  high  value 
in  these  areas.  Using  the  error  measure,  the  unreliable 
occlusion  areas  can  be  avoided  when  computing  the  pa¬ 
rameters  of  motion.  Figure  3  shows  the  flow  field  for  a 
scene  containing  multiple  depths  and  undergoing  an  ar¬ 
bitrary  motion.  The  error  function  derived  from  this  flow 
field  is  shown  in  Figure  4.  The  scene  contains  two  planes 
which  occlude  a  planar  background  as  well  as  each  other. 
The  planes,  as  well  as  the  background,  are  skewed  with 
respect  to  the  image  plane  (i.e.  the  planes  are  receding 
in  depth).  The  locations  of  the  occlusion  boundaries  are 
obvious  from  the  figure. 

The method of LTD estimation discussed above was
tested on several synthetic optic flow fields like the one
shown in Figure 5. This flow field is the result of a
rotation of 5.73° about the axis (5,4,1), followed by a
translation of (100,25,-75). All units are given in pixels.
The field of view of the camera is 90° in both the
X and Y directions. The image is 63x63, and the focal
length is 31. The rectangle overlaid on the flow
field represents the neighborhood over which the
translational approximation is performed. The actual angles
between the correct local translational vectors and the
approximated local translational vectors at each position
in the flow field are shown in Figure 6. The computed
error measure based upon Equation 4 is shown as a surface
plot in Figure 7. Notice that the computed error measure


Figure  3:  Flow  field  for  an  image  containing  occlusion 


Figure  4:  Error  function  for  an  image  containing  occlu¬ 
sion 

in  Figure  7  reflects  a  strong  correspondence  between  the 
approximated  translational  vectors  with  the  least  error 
and  the  correct  translational  axes.  This  correspondence 
has  been  found  to  be  typical.  Figure  8  (a)-(c)  shows 
the  correct  local  directions  of  translation  with  the  val¬ 
ues  of  each  component  displayed  as  separate  intensity 
plots.  Since  the  translational  vectors  are  represented 
as  three-dimensional  unit  vectors  with  each  component 
in  the  range  of  -1.0  to  1.0,  Figure  8  displays  the  x,  y, 
and z components of the local translation vectors with
pure  white  corresponding  to  the  value  of  1.0  and  pure 
black  corresponding  to  -1.0.  Figure  8  (d)-(f)  shows  the 
local  translational  values  that  were  derived  from  the  op¬ 
tic  flow  field  using  the  approximation  procedure.  The 
Figure 7: Evaluated error measure for flow in Figure 5


Figure 5: Optic flow field for a rotation of 5.73° about
the axis (5,4,1), translation of (100,25,-75)


Figure 6: Actual errors for flow in Figure 5


derived  LTD  vector  components  have  been  thresholded 
using  the  error  measure  given  in  Equation  4,  so  that 
only  the  best  values  are  shown.  These  are  then  used  for 
inferring  the  overall  parameters  of  motion.  The  corre¬ 
sponding  areas  removed  by  the  thresholding  are  shown 
by  the  enclosed  white  regions  which  contain  a  T. 

2.1  Motion  Constrained  to  a  Determined 
Plane 

It  is  particularly  simple  to  recover  the  local  translation 
from  flow  fields  produced  by  environmental  motion  con¬ 
strained  to  a  determined  plane  (the  normal  to  the  plane 
is known). In this case, the environmental displacement
vector $\mathbf{v}_{i,j}$ must be perpendicular to the normal of the


Figure  8:  LTD  vector  components  of  an  arbitrary 
rigid  body  motion  (a)  x-component  (b)  y-component 
(c)  z-component  (d)  derived  x-component  (e)  derived  y- 
component  (f)  derived  z-component 


Figure  9:  Motion  constrained  to  a  plane 

plane of motion. We know from Section 2 that $\mathbf{v}_{i,j}$ also
lies in the plane determined by its corresponding flow
vector and the focal point of the camera. The estimated
direction of motion $\hat{\mathbf{v}}_{i,j}$ therefore lies along the
intersection of these two planes and can be determined by
intersecting them. Figure 9
shows the geometry, where the plane of motion is positioned
so that it intersects the image plane at the base
of the flow vector. In terms of image geometry, this
corresponds to intersecting the horizon line, determined
by the plane of motion through the focal point, with a
flow vector. The point of intersection is a Focus of
Expansion for the local axis of translation (or a Focus of
Contraction, depending on the direction of the flow
vector relative to the point of intersection). Computing
the LTD in this case has been found to give extremely
low errors (small fractions of a degree) in the estimated
local translations.
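The plane intersection described above reduces to a single cross product, sketched below. This is an illustration under the stated geometry, not the authors' code: the direction common to the plane of motion and the flow-vector plane is perpendicular to both normals. The function name `planar_local_translation` is hypothetical, and the sign of the result is left undetermined.

```python
import math

def cross(a, b):
    """Cross product of two 3-vectors."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def unit(a):
    """Scale a 3-vector to unit length."""
    n = math.sqrt(sum(c * c for c in a))
    return tuple(c / n for c in a)

def planar_local_translation(p0, p1, f, plane_normal):
    """Direction of motion for one flow vector when the plane of motion is known.

    p0, p1:       (x, y) image positions of the feature at times j and j+1.
    f:            focal length.
    plane_normal: normal of the known plane of motion (3-vector).
    """
    # Normal of the plane through the focal point and the flow vector.
    flow_plane_normal = cross((p0[0], p0[1], f), (p1[0], p1[1], f))
    # The direction shared by both planes is perpendicular to both normals.
    # This fails (zero vector) when the two planes coincide.
    return unit(cross(plane_normal, flow_plane_normal))
```

Because a single flow vector suffices here, no neighborhood fit is involved, which may explain why this case is reported to be the most accurate.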

Motion  constrained  to  a  plane  is  typical  in  terrestrial 
circumstances.  Several  indoor  robotic  environments  in¬ 
volve  robot  motion  constrained  to  a  plane.  In  vehicular 
environments,  the  translational  approximation  is  usually 
valid  due  to  limitations  in  vehicle  turning  radii,  mean¬ 
ing  that  the  overall  motion  of  a  vehicle  can  be  locally 
approximated  as  a  translation. 

2.2  Motion  Constrained  to  an  Undetermined 
Plane 

Processing  in  the  case  of  motion  constrained  to  an  unde¬ 
termined  plane  is  similar  to  that  of  motion  constrained  to 
a  determined  plane.  The  only  difference  is  that  an  esti¬ 
mate  of  the  plane  of  motion  must  first  be  recovered.  Us¬ 
ing  the  technique  described  in  Section  2  the  local  trans¬ 


Figure  10:  Optic  flow  field  for  a  planar  motion 


lation  is  computed  at  each  flow  vector.  Since  the  motion 
that  produced  these  local  translations  is  constrained  to 
a  plane,  each  of  the  local  translations  must  be  parallel 
to  this  plane.  This  constraint  can  be  written  as 

$$\hat{\mathbf{v}}_{i,j} \cdot \mathbf{n} = 0 \quad (5)$$

where  n  is  a  vector  normal  to  the  plane  of  motion.  Us¬ 
ing  this  equation,  n  can  be  computed  by  a  linear  least 
squares  technique. 

An example of processing in this case is shown in
Figure 10 to Figure 12. Figure 10 shows the flow field
produced by a rotation of 4.58° about the axis (-1,1,2),
followed by a translation of (120,20,50). Units are given
in pixels. This motion is constrained to lie in the plane
perpendicular to the normal (-1,1,2). However, the
plane is unknown, so initially the local translation
vectors must be computed by the method used for cases of
arbitrary motion.

The angles between the correct local translational
values and the derived local translational values are
plotted in Figure 11. The error measure is shown in
Figure 12. Since we have an error measure associated with
each point describing the error of the translational
approximation, we can select several positions of minimal
error for use in Equation 5. Using the error measure from
Equation 4, the best 15 local translations were selected
for the least squares fit. The recovered plane normal is
(-0.4107,0.4129,0.8129), which is off by an angle
of 0.37° from the correct value. We can then use this
estimate to evaluate the directions of motion using the
technique for motion constrained to a determined plane
from the previous section. The computed directions of
motion are shown in Figure 13 (d)-(f). As in the case
of motion constrained to a known plane, there is very
little error in the derived LTD vectors. The mean angle
between derived and actual LTD vectors was 0.176° and
the maximum angle was 1.274°.
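The reported angular error of the recovered normal can be checked directly from the printed values. The short sketch below computes the angle between the true normal (-1,1,2) and the recovered normal; the helper name `angle_between_deg` is introduced only for this example. Since the printed components are rounded to four decimals, the result (roughly 0.36° to 0.37°) should be read as agreeing with the reported 0.37° only to within that rounding.

```python
import math

def angle_between_deg(a, b):
    """Angle in degrees between two 3-vectors, stable for small angles."""
    cx = (a[1] * b[2] - a[2] * b[1],
          a[2] * b[0] - a[0] * b[2],
          a[0] * b[1] - a[1] * b[0])
    sin_part = math.sqrt(sum(c * c for c in cx))   # |a||b| sin(theta)
    cos_part = sum(x * y for x, y in zip(a, b))    # |a||b| cos(theta)
    return math.degrees(math.atan2(sin_part, cos_part))

true_normal = (-1.0, 1.0, 2.0)
recovered_normal = (-0.4107, 0.4129, 0.8129)
error_deg = angle_between_deg(true_normal, recovered_normal)
print(error_deg)
```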
Figure 11: Actual errors for an unknown planar motion

Figure 12: Evaluated error measure for unknown planar
motion


2.3  Local  Planarity  and  Rigidity-based  LTD 
Estimation 

Another  algorithm  for  computing  the  LTD  is  based  on 
the  constraints  provided  by  assuming  motion  relative  to 
locally  planar,  rigid  environmental  surfaces.  The  algo¬ 
rithm  begins  by  searching  over  the  half-plane  defined  by 
a  flow  vector  and  the  focal  point  of  the  camera  as  shown 
in  Figure  1  (this  plane  is  designated  a  half-plane  because 
we only need to search over 180°). Each candidate LTD
vector  is  used  to  solve  for  other  LTD  vectors  in  a  local 
neighborhood  by  making  an  assumption  of  surface  pla¬ 
narity  within  the  neighborhood.  The  consistency  of  this 
local  neighborhood  of  LTD  vectors  is  then  evaluated  by 
calculating  the  relative  depths  of  the  LTD  vectors.  This 
results  in  an  error  measure  which  is  associated  with  each 
candidate  LTD  vector.  The  candidate  LTD  vector  with 
the  lowest  associated  error  is  selected  as  the  correct  LTD 
vector. The remainder of this section describes this
algorithm in greater detail.

Figure 13: LTD vector components of an undetermined
planar motion (LTD estimated using the determined planar
motion technique) (a) x-component (b) y-component
(c) z-component (d) derived x-component (e) derived
y-component (f) derived z-component

2.3.1  Local  Planarity  Assumption 

Given  a  candidate  LTD  vector,  we  wish  to  solve  for 
other  nearby  LTD  vectors.  In  order  to  derive  a  rela¬ 
tionship  between  LTD  vectors  within  a  neighborhood, 
we  will  assume  that  surfaces  are  locally  planar.  In  this 
case  directional  derivatives  of  the  LTD  vectors  along  the 
image plane are constant. Let $\mathbf{p}_{i-1,k}$, $\mathbf{p}_{i,k}$, and $\mathbf{p}_{i+1,k}$ be
three collinear points on the image plane. Under the planar
surface assumption, we have the following constraint

$$\frac{\hat{\mathbf{v}}_{i+1,k} - \hat{\mathbf{v}}_{i,k}}{\|\mathbf{p}_{i+1,k} - \mathbf{p}_{i,k}\|} = \frac{\hat{\mathbf{v}}_{i,k} - \hat{\mathbf{v}}_{i-1,k}}{\|\mathbf{p}_{i,k} - \mathbf{p}_{i-1,k}\|} \quad (6)$$

Letting $\hat{\mathbf{v}}_{i,k}$ be the current candidate LTD vector,
Equation 6 consists of two independent equations and six
unknowns. The remaining equations needed to solve for
these six unknowns can be provided by the LTD vectors'
corresponding optic flow vectors. Figure 1 shows that
the plane formed by a flow vector and the focal point
of the camera must include the LTD vector. This
constraint can be written as

$$(\mathbf{p}_{i-1,k} + \hat{\mathbf{v}}_{i-1,k}) \times \mathbf{p}_{i-1,k+1} = 0 \quad (7)$$

$$(\mathbf{p}_{i+1,k} + \hat{\mathbf{v}}_{i+1,k}) \times \mathbf{p}_{i+1,k+1} = 0 \quad (8)$$

This provides four additional independent equations.
Therefore, using the system defined by Equations 6, 7,
and 8, we can solve for the neighborhood LTD vectors
$\hat{\mathbf{v}}_{i-1,k}$ and $\hat{\mathbf{v}}_{i+1,k}$.

2.3.2  Error  Measure 

The  final  step  in  evaluating  a  candidate  LTD  vector  is 
to  construct  an  error  measure  from  the  neighborhood  of 
derived  LTD  vectors.  The  relative  depth  of  all  the  LTD 
vectors in a 3x3 neighborhood is calculated by positioning
the candidate vector at the image plane. Using the
depth values, a plane is fit to the neighborhood points.
The error measure is defined as

$$\frac{1}{m} \sum_{i=1}^{m} \|\alpha_i \mathbf{p}_{i,k} - \mathbf{q}_{i,k}\| \quad (9)$$

where $m$ is the number of points in the neighborhood,
$\alpha_i$ is the depth scale factor and $\mathbf{q}_{i,k}$ is the point of
intersection of the fitted plane and the ray of projection
defined by $\mathbf{p}_{i,k}$. Section 3.1 shows how to solve for the
depth scale factor $\alpha_i$.

An example of processing an arbitrary motion using
the rigidity-based method is shown in Figures 5 and 14.
Figure 5 shows the flow field produced by a rotation of
5.73° about the axis (5,4,1), followed by a translation
of (100,25,-75). Units are given in pixels. Figure 14
(a)-(c) shows the correct local translational values as
intensity plots of the vector components. Figure 14 (d)-(f)
shows the local translational values that were derived
from the optic flow field. As in the case of motion
constrained to a known plane, there is very little error in the
derived LTD vectors. The mean angle between derived
and actual LTD vectors was 0.425° and the maximum
angle was 2.647°.

3  Inferring  Parameters  of  Motion  from 
the  LTD 

In  this  section  we  develop  a  technique  to  recover  the  pa¬ 
rameters  of  motion  given  a  flow  field  and  the  LTD.  The 
method  presented  in  this  section  is  based  upon  using 
rigidity  to  solve  for  the  relative  depth  of  environmental 
points  associated  with  LTD  vectors.  The  key  result  is 
that  it  is  possible  to  infer  the  parameters  of  motion  us¬ 
ing  only  three  determined  LTD  vectors  computed  from 
locations anywhere within the flow field. Thus, the inference
can be done with a sparse LTD field which may
have  been  strongly  filtered  by  the  validity  of  the  mea¬ 
sures  reflecting  the  translational  fit.  Once  the  relative 
depth  has  been  determined,  the  solution  for  the  param¬ 
eters  of  motion  becomes  straightforward. 

3.1  General  Rigidity  Constraint 

In order to find the parameters of motion we will first
solve for the relative depth of the LTD vectors using the
rigidity constraint. Once the relative depth has been
determined, the solution for the parameters of motion
becomes trivial.

Figure 14: LTD vector components of an arbitrary rigid
body motion (LTD vectors were derived using the local
planar method) (a) x-component (b) y-component
(c) z-component (d) derived x-component (e) derived
y-component (f) derived z-component

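Two LTD vectors $\hat{\mathbf{v}}_{i,k}$ and $\hat{\mathbf{v}}_{j,k}$ are assumed to have
undergone identical rigid body motions. We wish to find
the relative depth of these two vectors. Figure 15 shows
the relationship between the two vectors. One of the
vectors, $\hat{\mathbf{v}}_{i,k}$, is fixed in depth so that it emanates from
the image plane at the point $\mathbf{p}_{i,k}$. The unknown depth
of the other vector can be expressed as $\alpha \mathbf{p}_{j,k}$ where $\alpha$
is some unknown scale factor. Since both of the LTD
vectors are the result of the same rigid body motion, we
have the following constraint

$$\|\alpha \mathbf{p}_{j,k} - \mathbf{p}_{i,k}\| = \|\alpha(\mathbf{p}_{j,k} + \hat{\mathbf{v}}_{j,k}) - (\mathbf{p}_{i,k} + \hat{\mathbf{v}}_{i,k})\| \quad (10)$$

Squaring both sides and solving for $\alpha$, Equation 10 can
be reduced to

$$(2\mathbf{p}_{j,k} \cdot \hat{\mathbf{v}}_{j,k} + \hat{\mathbf{v}}_{j,k} \cdot \hat{\mathbf{v}}_{j,k})\,\alpha^2 - 2(\mathbf{p}_{j,k} \cdot \hat{\mathbf{v}}_{i,k} + \mathbf{p}_{i,k} \cdot \hat{\mathbf{v}}_{j,k} + \hat{\mathbf{v}}_{i,k} \cdot \hat{\mathbf{v}}_{j,k})\,\alpha + (2\mathbf{p}_{i,k} \cdot \hat{\mathbf{v}}_{i,k} + \hat{\mathbf{v}}_{i,k} \cdot \hat{\mathbf{v}}_{i,k}) = 0 \quad (11)$$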
Figure 15: Relative depth of two LTD vectors


This  equation  is  quadratic  in  a  and  results  in  two  feasible 
solutions  for  the  relative  depth  between  two  LTD  vectors. 
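The quadratic for the relative depth can be solved numerically as sketched below. The coefficients follow from squaring the rigidity constraint and collecting powers of the scale factor; the function name `relative_depth_candidates` and the handling of the degenerate cases are assumptions of this example. Both image points are passed as three-dimensional vectors on the image plane, and both LTD vectors are positioned at the image plane.

```python
import math

def dot(a, b):
    """Dot product of two 3-vectors."""
    return sum(x * y for x, y in zip(a, b))

def relative_depth_candidates(p_i, v_i, p_j, v_j):
    """Real roots of the rigidity quadratic for the scale factor alpha.

    p_i, p_j: image points as 3-vectors (x, y, f).
    v_i, v_j: LTD vectors positioned at the image plane (3-vectors).
    Returns a sorted list with zero, one or two candidate values of alpha.
    """
    a = 2.0 * dot(p_j, v_j) + dot(v_j, v_j)
    b = -2.0 * (dot(p_j, v_i) + dot(p_i, v_j) + dot(v_i, v_j))
    c = 2.0 * dot(p_i, v_i) + dot(v_i, v_i)
    if abs(a) < 1e-12:            # quadratic term vanishes: linear equation
        return [] if abs(b) < 1e-12 else [-c / b]
    disc = b * b - 4.0 * a * c
    if disc < 0.0:                # no real solution, e.g. inconsistent estimates
        return []
    root = math.sqrt(disc)
    return sorted([(-b - root) / (2.0 * a), (-b + root) / (2.0 * a)])
```

Choosing between the two candidates requires extra information, for example that depths are positive or that a third LTD vector must satisfy the same constraint.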

3.2  Inferring  the  Parameters  of  Motion 

Once  we  have  determined  the  relative  depth  between 
LTD  vectors  the  estimation  of  the  parameters  of  motion 
is  trivial.  The  problem  is  equivalent  to  that  of  estimating 
the  motion  parameters  from  actual  three-dimensional  en¬ 
vironmental  surface  positions.  A  rigid  body  motion  can 
be  expressed  as 

$$\alpha_{i,j} \hat{\mathbf{v}}_{i,j} = \mathbf{r} \times \alpha_{i,j} \mathbf{p}_{i,j} + \mathbf{t} \quad (12)$$

where  r  is  the  axis  of  rotation  and  t  is  the  direction  of 
translation.  This  expression  is  linear  and  can  be  solved 
using  a  least  squares  technique.  The  expression  con¬ 
sists  of  six  parameters  and  two  independent  equations. 
Therefore,  it  can  be  solved  using  a  minimum  of  three 
(non-collinear)  LTD  vectors. 
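A least squares solution of the rigid motion expression can be sketched as follows. This is an illustration under simplifying assumptions, not the authors' implementation: the inputs are taken to be already depth-scaled positions and displacements, so each point contributes three linear equations in the six unknowns, and the translation is recovered in the same arbitrary scale as the depths rather than as a unit vector. The function names and the use of normal equations with Gaussian elimination are choices made for this example.

```python
def solve_linear(M, y):
    """Solve the square system M x = y by Gaussian elimination with pivoting."""
    n = len(y)
    aug = [list(M[i]) + [y[i]] for i in range(n)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[piv][col]) < 1e-12:
            raise ValueError("degenerate configuration (e.g. collinear points)")
        aug[col], aug[piv] = aug[piv], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        rest = sum(aug[i][c] * x[c] for c in range(i + 1, n))
        x[i] = (aug[i][n] - rest) / aug[i][i]
    return x

def motion_from_points(points, displacements):
    """Least squares fit of v = r x P + t over all point/displacement pairs.

    points:        depth-scaled 3D positions P = alpha * p.
    displacements: matching depth-scaled displacement vectors alpha * v_hat.
    Returns (r, t) as two 3-tuples.
    """
    AtA = [[0.0] * 6 for _ in range(6)]
    Atb = [0.0] * 6
    for (X, Y, Z), v in zip(points, displacements):
        # Rows of the linear system in the unknowns (r_x, r_y, r_z, t_x, t_y, t_z).
        rows = [(0.0, Z, -Y, 1.0, 0.0, 0.0),
                (-Z, 0.0, X, 0.0, 1.0, 0.0),
                (Y, -X, 0.0, 0.0, 0.0, 1.0)]
        for row, rhs in zip(rows, v):
            for a in range(6):
                Atb[a] += row[a] * rhs
                for b in range(6):
                    AtA[a][b] += row[a] * row[b]
    x = solve_linear(AtA, Atb)
    return tuple(x[:3]), tuple(x[3:])
```

With exactly three non-collinear points the system is just determined; supplying more points turns the same call into an overdetermined fit.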

3.3  Motion  Parameter  Inference  Results 

The  rigidity  constraints  were  used  to  compute  the  pa¬ 
rameters  of  motion  from  the  derived  LTDs  presented  in 
Section  2.  The  results  are  shown  for  the  case  of  arbi¬ 
trary  motion,  motion  constrained  to  a  determined  plane, 
motion  constrained  to  an  undetermined  plane,  and  the 
rigidity-based  method  applied  to  arbitrary  motion.  In 
the  previous  section  we  noted  that  the  parameters  of 
motion  can  actually  be  estimated  using  only  three  LTD 
vectors.  The  feasibility  of  estimating  the  parameters  of 
motion  from  a  minimal  set  of  data  is  demonstrated  in 
the  results  presented  below. 


3.3.1  Motion  Constrained  to  a  Determined 
Plane 

In the case of motion constrained to a determined
plane, the LTD vector estimates tend to be highly
accurate over an entire flow field. Typically, when using
three LTD vectors selected at random from the derived
local translations, the estimates of the axes of rotation
and translation are almost always within a degree of the
correct axes, and the angle of rotation is determined to
within a hundredth of a degree.

3.3.2  Motion  Constrained  to  an  Undetermined 
Plane 

The  case  of  motion  constrained  to  an  undetermined 
plane  is  similar  to  the  case  of  motion  constrained  to  a 
determined  plane  in  that  the  LTD  vector  estimates  are 
very  good  over  the  entire  image.  Three  LTD  vectors  were 
selected  at  random  from  the  derived  local  translations 
shown  in  Figure  13.  The  estimate  of  the  axis  of  rotation 
was off by 0.99°, the angle of rotation was off by 0.04°,
and the direction of translation was off by 0.83°.

3.3.3  Local  Planar  Method 

The  rigidity-based  method  presented  in  Section  3.1  is 
also  capable  of  accurate  LTD  estimates  over  the  entire 
flow  field.  Three  LTD  vectors  were  selected  at  random 
from  the  derived  local  translations  shown  in  Figure  14. 
The estimate of the axis of rotation was off by 2.26°, the
angle of rotation was off by 0.18°, and the direction of
translation was off by 2.84°.

The camera was moved about a randomly curved surface.
The optic flow field produced by this surface is
shown in Figure 16. The three-dimensional environmental
surface was reconstructed from this flow field.
Figure 17 (a) shows a plot of the original surface,
Figure 17 (b) shows the results of the surface reconstruction,
and Figure 17 (c) shows the resulting error in the
reconstruction. The surface shown in this example is
not planar. However, the reconstruction is fairly accurate,
despite the violation of the planarity assumption.
Experiments indicated that surfaces which are approximately
planar in a local neighborhood can be successfully
reconstructed. Therefore, any continuous surface can be
reconstructed, given an appropriate density of optic flow
vectors.

Figure 17: (a) Curved surface (b) Reconstructed surface
(c) Error

3.3.4  Arbitrary  Motion 

Using the error measure shown in Figure 7 and the
derived LTD vectors shown in Figure 8, the three best
LTD vectors were selected and used to compute the parameters
of motion. The estimate of the axis of rotation
was off by 8.13°, the angle of rotation was off by 1.09°,
and the direction of translation was off by 12.02°. In the
previous section it was shown that the minimum number
of LTD vectors which can be used to estimate the
parameters of motion is three. However, we can use a
larger set of LTD vectors in a least squares procedure
to obtain more accurate results. For example, when the
ten best LTD vectors were used, the axis of rotation was
off by 3.65°, the angle of rotation was off by 0.44°, and
the direction of translation was off by 9.32°.

4  Summary  and  Future  Work 

We  have  introduced  the  local  translational  decomposi¬ 
tion  (LTD)  as  a  low  level  representation  of  environmen¬ 
tal  motion  which  can  simplify  the  inference  of  motion 
parameters  from  optic  flow  fields.  We  have  found  that 
this  is  particularly  robust  and  simple  for  cases  of  motion 
constrained  to  a  determined  or  undetermined  plane,  and 
motion relative to locally planar surfaces. In addition, it
is possible to infer motion parameters from sparse LTDs.

Areas  for  further  work  include: 

•  Develop criteria to determine the best set of estimated
local translation vectors for estimating motion
parameters, in order to take advantage of the limited
number of points for which the local translation
needs to be determined to infer motion parameters.

•  Investigate  local  translational  analysis  with  the  use 
of  multiple  cameras  and  longer  image  sequences. 

•  The  local  translation  decomposition  is  similar  to  an 
array  of  localized  looming  detectors  which  deter¬ 
mine  whether  things  are  coming  towards  or  away 
from  an  observer  at  a  particular  image  position.  It 
may  be  possible  to  use  such  a  distributed  represen¬ 
tation  of  motion  relative  to  environmental  surfaces 
to  control  navigation  and  other  behaviors  directly, 
without  the  inference  of  motion  parameters  from 
the  LTD. 

•  The local translation approximation can be used
as a criterion for computing flow to determine the
LTD directly without the initial computation of a
flow field. In the experiments presented above, we
have assumed a uniformly dense flow field of high
resolution. The translation procedure developed in
[Lawton, 1982] was not applied to computed flow
fields, but to successive images for which interesting
points had been extracted from the initial image.
Given distinctive features (at least two), it
was possible to compute the direction of translation
in a small image area. This use of the translational
procedure can be seen as a local constraint
on the determination of image displacements such
that the corresponding environmental motion can
be interpreted as being translational. For egomotion,
this wouldn't require computation over the
entire flow field since only three LTD vectors are
needed. Where the translational approximation is
poor there will be a large value of the error measure,
reflecting weaker confidence in the validity of
the approximation.
Abstract

This paper proposes a region-based method to solve
the structure and motion problem. Region matching
between images is done by using affine invariants.
No feature extraction or segmentation is needed in
the matching process. Having recovered the region
matches, a closed-form solution for the camera motion
and the 3D structures in the regions can be
obtained. Structure consists of the location and
orientation of a local planar patch approximating
the 3D surface.

1  Introduction 

Two primary approaches to the estimation of 3D
structure and camera motion are based on feature
correspondence and optical flow, respectively. An
interesting example of the first approach is that taken
in [4], in which the object of interest is assumed to
be a 3D planar surface patch that is roughly perpendicular
to the optical axis of the camera (the shallow
structure assumption). The object is then tracked
and its depth is recovered by estimating four affine
parameters through line matches. An example of the
latter approach is [3], which computes the first-order
optical flow and estimates the camera motion and
the orientation of the planar regions. Our approach
is a generalization in which we assume that the 3D
surface patch is planar, with arbitrary orientation
and location, and estimate both this plane
and the camera motion from two images. Both the
process of estimating feature correspondence and that
of computing the optical flow are somewhat noise
sensitive. Our approach is region based and less sensitive
to the noise of individual features or data points.

The assumption in this paper is that objects of
interest  are  well  approximated  by  groups  of  planar 
patches,  and  each  planar  patch  is  small  compared 
with  the  patch-to-camera  distance.  Then  the  weak 
perspective  model  applies  for  the  projection  of  the 
3D  surface  into  image  planes,  and  a  pair  of  images 
of  points  on  the  3D  planar  surface  patch  are  related 
by  an  affine  transformation  [1,  4]. 
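
The weak perspective assumption above can be checked numerically with a small sketch. The code below is illustrative only and is not taken from the paper: the function names and the two test patches are hypothetical. It compares full perspective projection (each point divided by its own depth) with weak perspective (every point divided by the depth of the patch centroid) and reports the largest gap between the two.

```python
def perspective(point, f=1.0):
    """Full perspective projection: divide by the point's own depth."""
    x, y, z = point
    return (f * x / z, f * y / z)


def weak_perspective(point, centroid_depth, f=1.0):
    """Weak perspective: divide by one shared depth (the patch centroid's)."""
    x, y, _ = point
    return (f * x / centroid_depth, f * y / centroid_depth)


def max_projection_gap(points, f=1.0):
    """Largest image-plane distance between the two projections of a patch."""
    depth = sum(p[2] for p in points) / len(points)
    gap = 0.0
    for p in points:
        u, v = perspective(p, f)
        uw, vw = weak_perspective(p, depth, f)
        gap = max(gap, ((u - uw) ** 2 + (v - vw) ** 2) ** 0.5)
    return gap


# Hypothetical slanted patch, about 2 units across, seen at two distances.
grid = (-1.0, 0.0, 1.0)
near = [(x, y, 10.0 + 0.1 * x) for x in grid for y in grid]
far = [(x, y, 200.0 + 0.1 * x) for x in grid for y in grid]
print(max_projection_gap(near), max_projection_gap(far))
```

The gap shrinks as the patch-to-camera distance grows relative to the patch size, which is the regime the paper assumes.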

The  primary  contribution  of  this  paper  is  a  set 
of  low  computational  cost  explicit  expressions  for 
the  location  and  orientation  parameters  for  these  3D 
planar  surface  patches  based  on  a  pair  of  images  and 
for  the  motion  specifying  the  two  camera  positions 
from  which  the  pair  of  images  were  taken.  A  second 
contribution  is  a  low  computation  cost  algorithm  for 
matching small corresponding regions in a pair of
images  through  use  of  a  new  class  of  affine  moment 
invariants.  The  region  to  be  matched  can  be  at  any 
arbitrary  position  and  all  the  data  points  inside  the 
region  are  to  be  used.  No  feature  extraction  or  seg¬ 
mentation  is  needed  in  the  matching  process. 


2  Equations  of  the  Apparent  Motion 


Consider the system shown in Fig. 1. The camera
moves from Frame 1 at time $t_1$ to Frame 2 at time
$t_2$ through a rotation $R$ followed by a translation $T$.
Let $G$ be a planar region on a 3D surface and $P$ be
a particular point on it. The 3D coordinates of $P$
w.r.t. Frame 1 and Frame 2 are denoted by $(x, y, z)$
and $(x', y', z')$, respectively; $(u, v)$ and $(u', v')$ are
the image coordinates of $P$ at $t_1$ and at $t_2$.

By using the weak perspective projection, the projections
$(u, v)$ and $(u', v')$ at $t_1$ and $t_2$ are as follows:

$$\begin{pmatrix} u \\ v \end{pmatrix} = \frac{f}{l}\begin{pmatrix} x \\ y \end{pmatrix}, \qquad \begin{pmatrix} u' \\ v' \end{pmatrix} = \frac{f}{k}\begin{pmatrix} x' \\ y' \end{pmatrix} \tag{2.1}$$

where $f$ is the focal length of the camera and $l$, $k$ are
the depths ($z$ component of the 3D coordinates) of
the centroid of region $G$ w.r.t. camera Frame 1 and
camera Frame 2, respectively. With small rotation,
the 3D coordinates of $P$ w.r.t. camera Frame 1 and
camera Frame 2 are related by:


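$$\begin{pmatrix} x \\ y \\ z \end{pmatrix} = \begin{pmatrix} 1 & -w_z & w_y \\ w_z & 1 & -w_x \\ -w_y & w_x & 1 \end{pmatrix}\begin{pmatrix} x' \\ y' \\ z' \end{pmatrix} + \begin{pmatrix} t_x \\ t_y \\ t_z \end{pmatrix} \tag{2.2}$$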


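Represent region $G$ by the equation $ax' + by' + z' + d = 0$
w.r.t. Frame 2. By combining equations (2.1)
and (2.2), the 2D apparent motion of the projections
of $P$ between image 2 and image 1 can be expressed
by:

$$\begin{pmatrix} u \\ v \end{pmatrix} = \frac{k}{l}\begin{pmatrix} 1 - a w_y & -w_z - b w_y \\ w_z + a w_x & 1 + b w_x \end{pmatrix}\begin{pmatrix} u' \\ v' \end{pmatrix} + \frac{f}{l}\begin{pmatrix} t_x - d w_y \\ t_y + d w_x \end{pmatrix} \tag{2.3}$$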

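Clearly, (2.3) represents an affine transformation for
the apparent motion. Denote the six affine parameters
as follows:

$$\begin{aligned} h_1 &= \tfrac{k}{l}(1 - a w_y) & h_4 &= \tfrac{k}{l}(1 + b w_x) \\ h_2 &= \tfrac{k}{l}(-w_z - b w_y) & h_5 &= \tfrac{f}{l}(t_x - d w_y) \\ h_3 &= \tfrac{k}{l}(w_z + a w_x) & h_6 &= \tfrac{f}{l}(t_y + d w_x) \end{aligned} \tag{2.4}$$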

Thus, the projections of region $G$ onto Image 2
and Image 1 are related by the affine parameters
$h_1, h_2, \ldots, h_6$. Without loss of generality, we assume
the camera focal length to be unity in the following
derivations.
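
As a sanity check on the derivation in this section, the following sketch builds the six affine parameters from a plane and a small camera motion and compares them against a direct projection of plane points. It assumes the small-rotation motion model of (2.2), weak perspective with centroid depths $l$ (Frame 1) and $k$ (Frame 2), and the parameter layout $h_1, \ldots, h_6$ of (2.4) as read here. The function names and all numeric values are hypothetical.

```python
def affine_params(a, b, d, w, t, k, l, f=1.0):
    """Six affine parameters of the apparent motion, image 2 -> image 1."""
    wx, wy, wz = w
    tx, ty, _tz = t
    s = k / l
    h1 = s * (1 - a * wy)
    h2 = s * (-wz - b * wy)
    h3 = s * (wz + a * wx)
    h4 = s * (1 + b * wx)
    h5 = (f / l) * (tx - d * wy)
    h6 = (f / l) * (ty + d * wx)
    return h1, h2, h3, h4, h5, h6


def project_point(x2, y2, a, b, d, w, t, k, l, f=1.0):
    """Return ((u, v), (u', v')) for the plane point with Frame 2 coords (x2, y2)."""
    wx, wy, wz = w
    tx, ty, _tz = t
    z2 = -(a * x2 + b * y2 + d)        # point lies on a x' + b y' + z' + d = 0
    # Small-rotation motion model: Frame 1 coordinates from Frame 2 coordinates.
    x1 = x2 - wz * y2 + wy * z2 + tx
    y1 = wz * x2 + y2 - wx * z2 + ty
    return (f * x1 / l, f * y1 / l), (f * x2 / k, f * y2 / k)


# Illustrative values only.
plane = (0.15, 0.05, -153.75)          # a, b, d
w, t = (0.06, 0.05, 0.02), (5.0, 8.0, 2.0)
k, l = 150.0, 151.9
h1, h2, h3, h4, h5, h6 = affine_params(*plane, w, t, k, l)

worst = 0.0
for x2, y2 in [(18.0, 14.0), (20.0, 15.0), (22.0, 17.0)]:
    (u, v), (u2, v2) = project_point(x2, y2, *plane, w, t, k, l)
    worst = max(worst, abs(u - (h1 * u2 + h2 * v2 + h5)),
                abs(v - (h3 * u2 + h4 * v2 + h6)))
print("largest mismatch:", worst)
```

Under these assumptions the affine map reproduces the projected positions up to floating-point rounding, for any choice of $l$ and $k$.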

3  Problem  Definition  and  Our 
Approach 

In the preceding section, we derived the expressions for
the parameters of the affine transformation that describes
the apparent motion of a planar region under
weak perspective projection. In order to estimate
the camera motion and the 3D scene structure, two
problems have to be solved:

1.  For each 3D planar surface patch, find its two
corresponding 2D regions, one in each image (called
2D region matching), and recover the associated
affine transformation between images.

2.  Based on the results in (1), solve for the camera
motion and 3D structure parameters of the regions.

Before explaining the scheme to solve the first problem
listed above, we first give, in the next section, explicit
solutions for camera motion and 3D scene structure
based on knowledge of the affine transformation
describing apparent motion. Then, we come back to
present the 2D region matching algorithm and the
estimation of the affine transformation based on
moment invariants.

4  Motion  and  Structure  Estimation 

4.1  Solving  the  Motion  and  the  Structure 
Parameters 

As shown in equation (2.4), a 2D matched region
pair has six equations relating the affine parameters
with the surface patch parameters and camera
motion. The unknown parameters are $k$, $l$, $a$, $b$,
$d$, $w_x$, $w_y$, $w_z$, $t_x$, $t_y$, $t_z$. However, not all of the
unknown parameters are independent. Let $(u'_A, v'_A)$
denote the projection of the centroid of $G$ onto image 2.
Then the centroid of $G$ w.r.t. camera frame 2
is $(ku'_A, kv'_A, k)$. By observing the last equation of
(2.2), we obtain an equation relating the depths of
the centroid of $G$ w.r.t. camera frame 1 and camera
frame 2:

$$l = k(-w_y u'_A + w_x v'_A + 1) + t_z.$$

Since $(ku'_A, kv'_A, k)$ is the centroid of region $G$, it
must also satisfy the plane equation $ax' + by' + z' + d = 0$.
Thus, we have

$$a k u'_A + b k v'_A + k + d = 0.$$

As a result, we have 8 equations for a given 2D
matched region pair. In these 8 equations, it is observed
that $l$ is a scale factor which cannot be determined
and $k, a, b, d, w_x, w_y, w_z, t_x, t_y, t_z$ are the
10 unknown parameters. Thus, it is impossible to
solve for both the structure and the motion parameters
given the apparent motion of a region that is the
image of only one planar patch.

Now, suppose we know the apparent motion of
two regions. In addition to the region $G$ described
before, we have another region $G_B$ with equation
$a'x' + b'y' + z' + d' = 0$ w.r.t. camera frame 2,
and the depths of its centroid w.r.t. camera frame
2 and camera frame 1 are $p$ and $q$, respectively.
Additionally, denote by $(u'_B, v'_B)$ the projection of the
centroid of $G_B$ onto image 2; thus the centroid
of $G_B$ w.r.t. camera frame 2 is $(pu'_B, pv'_B, p)$. Again,
there are 8 equations for region $G_B$:

$$\begin{aligned} h_7 &= \tfrac{p}{q}(1 - a' w_y) & h_{10} &= \tfrac{p}{q}(1 + b' w_x) \\ h_8 &= \tfrac{p}{q}(-w_z - b' w_y) & h_{11} &= \tfrac{1}{q}(t_x - d' w_y) \\ h_9 &= \tfrac{p}{q}(w_z + a' w_x) & h_{12} &= \tfrac{1}{q}(t_y + d' w_x) \end{aligned}$$

$$q = p(-w_y u'_B + w_x v'_B + 1) + t_z,$$

$$a' p u'_B + b' p v'_B + p + d' = 0 \tag{$*$}$$

Given two matched region pairs, 16 equations in 16
variables are obtained. Again, $l$ is a scale factor
which cannot be determined. As a result, we have
an overdetermined system with 16 equations in 15
variables. Our approach to handling this overdetermined
system is to use the first fifteen equations to
solve for the unknown parameters. Then, the last
equation (*) is used to verify the solutions. In [2] we
obtained the following polynomial after some
manipulations of the fifteen equations:

$$\lambda_1 w^4 + \lambda_2 w^3 + \lambda_3 w^2 + \lambda_4 w + \lambda_5 = 0$$

where the coefficients $\lambda_i$ are functions of $h_1, \ldots, h_4$
and $h_7, \ldots, h_{10}$.

The above equation is a fourth-degree polynomial
in a single rotation component $w$. For a fourth-degree
polynomial, there is a closed-form solution for the four
roots. Given $w$, the other variables can be solved
successively, giving one set of solutions per root.
The detailed derivation is in [2]. Thus,
we have four solutions in total. Then equation (*) is
used to eliminate redundant solutions. For each of
the examples we tried, only one of the four solutions
satisfied (*). A numerical example is presented in
the next section to illustrate this.

4.2  Numerical  Example  to  Illustrate  the 
Choice  among  the  Four  Solutions 

Consider the camera motion parameters $w_x = 0.06$,
$w_y = 0.05$, $w_z = 0.02$, $t_x = 5$, $t_y = 8$, $t_z = 2$. Two planar
patches are given, having equations
$0.15x' + 0.05y' + z' - 153.75 = 0$ and
$0.1333x' + 0.2y' + z' - 201.3333 = 0$, and
centroids (4,6,200) and (20,15,150), respectively, in
camera frame 2. Equation (2.4) is used to generate
the affine parameters. The four solutions are
obtained by solving the fifteen equations described
before. In order to exploit equation (*) to verify the
solutions, we compute the bias, where

$$\text{bias} = a' p u'_B + b' p v'_B + p + d'.$$

The correct solutions should be the ones with zero
bias. Table 4.1 shows the four solutions with their
bias and the ground truth values.

4.3  Practical  Considerations 

In real applications, the recovered affine coefficients
for the matched regions are corrupted by noise, and
thus the biases for each of the four motion and structure
solutions are not zero in general. One reasonable
choice of solution is the one having minimum bias.
Since each solution specifies the camera motion and
the parameters of the 3D planar surfaces associated
with the two 2D regions, for each point inside the
two regions in the reference image its corresponding
location in the other image can be computed, and the
intensity difference between them can be computed.
The sum of all the intensity differences associated
with all the points in the two regions is called the
projective intensity difference F. Thus, an alternate
way to select the solution is to pick the one with
minimum projective intensity difference. We employed
this method in our experiments.
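
The minimum-bias selection rule can be sketched in a few lines. This is an illustration under stated assumptions, not the paper's implementation: the bias is taken to be the residual of the plane equation (*) for the second region's centroid, each candidate solution is a plain dictionary with hypothetical keys `a2`, `b2`, `d2` (plane parameters of the second region) and `p` (its centroid depth in frame 2), and the candidate values are made up.

```python
def bias(solution, centroid_image2):
    """Residual of the plane equation (*) for the second region's centroid."""
    a2, b2, d2, p = (solution[key] for key in ("a2", "b2", "d2", "p"))
    u_b, v_b = centroid_image2
    return a2 * p * u_b + b2 * p * v_b + p + d2


def pick_solution(candidates, centroid_image2):
    """Choose the candidate whose bias is closest to zero."""
    return min(candidates, key=lambda s: abs(bias(s, centroid_image2)))


# Hypothetical candidates; the second one satisfies (*) exactly.
candidates = [
    {"a2": 0.3, "b2": -0.1, "d2": -90.0, "p": 120.0},
    {"a2": 0.1, "b2": 0.2, "d2": -105.0, "p": 100.0},
    {"a2": -0.2, "b2": 0.4, "d2": -60.0, "p": 80.0},
]
centroid_image2 = (0.2, 0.15)
best = pick_solution(candidates, centroid_image2)
print(best, bias(best, centroid_image2))
```

With noisy affine coefficients none of the biases is exactly zero, which is why the rule takes the smallest magnitude rather than testing for equality. The projective intensity difference plays the same role but needs the image intensities, so it is not sketched here.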
5  Recovering the apparent motion

The property of affine apparent motion of a region
between two images explained in section 2 enables
us to use a powerful tool: affine invariance. Affine
invariants are functions of geometric structure which
remain unchanged under affine transformation. In
the following paragraphs, we briefly describe the idea
of 2⁺D data, 2⁺D affine invariants, and algorithms to
find 2D region correspondence between images using
the invariants and to recover the associated affine
transformation parameters.


5.1  2⁺D Data

Regions $g$ and $g'$, the projections into camera image
planes 1 and 2, respectively, of the 3D planar region
$G$, are related by an affine transformation.
Let $(u_i, v_i)$, $(u'_i, v'_i)$ be a matched point
pair in $g$ and $g'$, respectively. Then

$$\begin{pmatrix} u_i \\ v_i \end{pmatrix} = \begin{pmatrix} h_1 & h_2 \\ h_3 & h_4 \end{pmatrix}\begin{pmatrix} u'_i \\ v'_i \end{pmatrix} + \begin{pmatrix} h_5 \\ h_6 \end{pmatrix} \tag{5.1.1}$$

Let the centers of $g$ and $g'$ be $(m_x, m_y)$ and $(m'_x, m'_y)$,
respectively. By simple calculation,

$$\begin{pmatrix} u_i - m_x \\ v_i - m_y \end{pmatrix} = \begin{pmatrix} h_1 & h_2 \\ h_3 & h_4 \end{pmatrix}\begin{pmatrix} u'_i - m'_x \\ v'_i - m'_y \end{pmatrix} \tag{5.1.2}$$

Assume $G$ is a Lambertian surface. Then $(u_i, v_i)$
and $(u'_i, v'_i)$ appear with the same intensity, say $I_i$, in
both frame 1 and frame 2. Multiplying both sides of
(5.1.2) by $I_i$, we get

$$\begin{pmatrix} (u_i - m_x) I_i \\ (v_i - m_y) I_i \end{pmatrix} = \begin{pmatrix} h_1 & h_2 \\ h_3 & h_4 \end{pmatrix}\begin{pmatrix} (u'_i - m'_x) I_i \\ (v'_i - m'_y) I_i \end{pmatrix} \tag{5.1.3}$$

For convenience, denote the above equation as

$$\begin{pmatrix} a_{i1} \\ a_{i2} \end{pmatrix} = \begin{pmatrix} h_1 & h_2 \\ h_3 & h_4 \end{pmatrix}\begin{pmatrix} a'_{i1} \\ a'_{i2} \end{pmatrix} \tag{5.1.4}$$

From (5.1.4), it is realized that $(a_{i1}, a_{i2})$, $(a'_{i1}, a'_{i2})$
are still an affine pair. Thus, we construct two
new data sets, $\{(a_{i1}, a_{i2})\}$ for all points in $g$ and
$\{(a'_{i1}, a'_{i2})\}$ for all points in $g'$. These data sets are
related by parameters $h_1, h_2, h_3, h_4$ and contain
information about not only the location of the image
points but also about their intensities. The more
interesting thing is that even with the additional
intensity information, the dimension of the data set
remains two and not three. So, we call them 2⁺D
data sets.

5.2  2⁺D Affine Moment Invariants

In [5], a new framework is introduced for generating
affine moment invariants. The particular moment
invariants developed are the eigenvalues of certain
matrices, whose coefficients are algebraic functions
of the location of 2D data. The computational cost
of computing these invariants is low. As mentioned
in the previous section, the dimension of the 2⁺D
data is two. Thus, we can apply the 2D moment
invariants directly to our 2⁺D data, and we call them
2⁺D moment invariants. In our experiment, five moment
invariants are used, which are the eigenvalues
of a 2x2 and a 3x3 matrix. Thus, little computation
is involved in the evaluation of the invariants.

5.3  Region Matching and Recovery of the
Affine Parameters

In [2], we introduced a two-stage scheme to perform
region matching and to recover the associated
affine transformation using 2⁺D affine invariants.
The idea of matching is to first compute the affine
invariants for the region to be matched in the reference
image, then locate the matched region in
the second image such that its invariants are closest
to those of the reference region. Having found the
matched region pair, the invariants also permit a
trivial computation that provides a first estimate of
the affine transformation. We then get an improved
estimate of this affine transformation. Details of
the algorithm are given in [2].
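
The construction of the 2⁺D data in Section 5.1 can be illustrated with a short sketch. It rests on assumptions that are not in the paper: the region center is taken to be the mean of the sampled points, the function names are hypothetical, and the affine parameters, points and intensities are made-up values. The code forms the intensity-weighted, center-relative coordinates in both images and checks that the two data sets are related by the linear part $h_1, \ldots, h_4$ alone.

```python
def centroid(points):
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)


def two_plus_d(points, intensities):
    """Intensity-weighted, center-relative coordinates of a region's points."""
    mx, my = centroid(points)
    return [((u - mx) * i, (v - my) * i) for (u, v), i in zip(points, intensities)]


def apply_affine(h, point):
    """Map an image-2 point to image 1 with parameters (h1, ..., h6)."""
    h1, h2, h3, h4, h5, h6 = h
    u, v = point
    return (h1 * u + h2 * v + h5, h3 * u + h4 * v + h6)


# Hypothetical affine parameters, image-2 points, and shared intensities.
h = (1.1, -0.2, 0.3, 0.9, 5.0, -2.0)
points2 = [(0.0, 0.0), (2.0, 0.0), (2.0, 3.0), (0.0, 3.0), (1.0, 1.0)]
intensities = [10.0, 20.0, 30.0, 40.0, 50.0]
points1 = [apply_affine(h, p) for p in points2]

data1 = two_plus_d(points1, intensities)
data2 = two_plus_d(points2, intensities)
residual = max(
    max(abs(a1 - (h[0] * b1 + h[1] * b2)), abs(a2 - (h[2] * b1 + h[3] * b2)))
    for (a1, a2), (b1, b2) in zip(data1, data2)
)
print("largest residual:", residual)
```

The translation part $(h_5, h_6)$ drops out because the centers themselves transform affinely, so any invariant of the linear map can be evaluated on the weighted data directly.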


6  Experiments 

6.1  Experiment  1 

This experiment simulates a general camera motion,
described by the translation $T = (1.563, 1.172,
0.391)$ in focal units and rotation angles $\Omega = (0.05,
-0.06, 0.08)$ in radians. The scene consists of two
distinct planar surfaces (left and right), represented
by equations $0.5x' + 0.2y' + z' - 19.531 = 0$ and
$0.2x' + 0.6y' + z' - 23.438 = 0$. Fig. 6.1(a) and
Fig. 6.1(b) show the images taken before and after
the camera motion, respectively.

Two circular regions are chosen in the reference
image as shown in Fig. 6.1(b). The matching results
are depicted in Fig. 6.1(a). In Table 6.1, we present
the results of the recovered motion and structure
parameters along with the ground truth values.

We see that the recovered values of the camera
motion and the structure parameters are in good
agreement with the ground truth values. Take the
left plane as an example: the true normal to the
plane and the recovered one differ only by 4.5 degrees
and the error in the depth is 0.2%.
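
The two error figures quoted above (angle between plane normals, relative depth error) can be computed as in the following sketch. It assumes planes written as $ax' + by' + z' + d = 0$, so that the normal is $(a, b, 1)$. The function names and the "recovered" values are hypothetical; they are not the entries of Table 6.1.

```python
import math


def normal_angle_deg(plane_a, plane_b):
    """Angle in degrees between the normals (a, b, 1) of two planes."""
    n1 = (plane_a[0], plane_a[1], 1.0)
    n2 = (plane_b[0], plane_b[1], 1.0)
    dot = sum(p * q for p, q in zip(n1, n2))
    norm = math.sqrt(sum(p * p for p in n1)) * math.sqrt(sum(q * q for q in n2))
    # Clamp to guard against rounding slightly outside [-1, 1].
    return math.degrees(math.acos(max(-1.0, min(1.0, dot / norm))))


def relative_depth_error(true_depth, recovered_depth):
    return abs(recovered_depth - true_depth) / abs(true_depth)


true_plane = (0.5, 0.2)            # (a, b) of the left plane in Experiment 1
recovered_plane = (0.45, 0.25)     # hypothetical recovered (a, b)
print(normal_angle_deg(true_plane, recovered_plane))
print(relative_depth_error(100.0, 100.2))
```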


6.2  Experiment  2 


This experiment is based on two images of a real
scene taken by a moving camera. Fig. 6.2(a) and
Fig. 6.2(b) show the images taken by the camera before
and after its motion. To begin, the algorithm
partitions the reference image into 64 circular windows
as shown in Fig. 6.2(b). The windows are numbered
from $g_0$ to $g_{63}$.

Prior  to  initiating  the  matching  process,  based 
on  the  intensity  histograms,  windows  with  small  in¬ 
tensity  variation  are  discarded  because  they  do  not 
contain  sufficient  information  for  reliable  matching. 
Fig.6.2(c)  and  Fig.6.2(d)  show  the  matching  results. 
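
The window pre-filtering step can be sketched as follows. The paper bases the decision on intensity histograms and does not give the rule; this sketch substitutes the population standard deviation of the window's intensities as the measure of variation, and the threshold, window labels and pixel values are hypothetical.

```python
from statistics import pstdev


def keep_textured_windows(windows, min_std):
    """Return labels of windows whose intensity spread is at least min_std."""
    return [label for label, pixels in windows.items() if pstdev(pixels) >= min_std]


# Hypothetical windows: one nearly uniform, one with strong intensity variation.
windows = {
    "g0": [100, 101, 100, 99],
    "g1": [10, 200, 50, 180],
}
print(keep_textured_windows(windows, min_std=5.0))
```

A nearly uniform window gives almost identical 2⁺D data wherever it is placed, so discarding it before matching avoids unreliable correspondences.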


It  is  observed  that  all  the  windows  but  g\^  have 
good  matches.  One  reason  for  the  mismatch  is  that 
this  window  lies  on  two  surfaces,  thus  resulting  in 
a  large  affine  intensity  difference  A.  Such  a  large 
difference  is  easily  detected  by  the  system.  Thus, 
the  system  collected  all  the  regions  which  have  good 
affine  matches. 

In section 4, we showed that two matched region
pairs are needed in order to recover the motion and
structure parameters. So the system paired $g_{30}$ with
each one of the other regions and computed the motion and
structure for each pair in parallel. As mentioned in
section 4.3, the projective intensity difference F is
also computed, which is a measure of the goodness
of the recovery. By using this measure, it is found
that the recovery of the motion and structure parameters
for all the pairs described above was good
except for the parameters found by the matched region
pair of $g_{30}$ and $g_{20}$. The result found by
that pair is discarded and the results of the others
are listed in Table 6.2. In Table 6.2, the mean and
the standard deviation of the motion parameters and

the  structure  parameter  associated  with  ^33  are  also 
computed.  The  recoveries  are  highly  consistent,  es¬ 
pecially  in  the  estimations  of  Wajtj,  d. 

The  3D  reconstruction  of  the  regions  is  displayed 
in Fig. 6.3 by setting $l$ equal to 10000. The equation

of  the  planar  surface  associated  with  is  formed 
by the means of $a$, $b$, $d$ shown in Table 6.2. From the
top  left  corner  of  Fig.6.3,  we  see  that  the  recovered 

surfaces  associated  with  ^17,  g^^,  g^^  and  the  ones 
associated  with  g^^  and  g^^  formed  the  two  sides  of 
the  big  box.  The  patches  associated  with  g^  and  g^^ 
are  sitting  on  the  top  of  the  box.  The  surface  of  the 
book on the bottom left and the box on the far right
are  estimated  well. 

7  Conclusion 

This paper presents a new approach to the estimation
of 3D surface structure and camera motion,
based on two images. 3D surface structure is approximated
by planar patches. The solutions are
explicit, and reliability is possible because the pair
of images can be taken from two positions far apart,
i.e., using a large baseline, and because the matching
is area based and thus resistant to noise. The required
matching that is used to solve for the parameters
of the affine transformation describing apparent
motion is computationally modest because we
use geometric invariants. Our approach can be used
for aligning pairs of aerial photos taken from
completely different azimuth and elevation angles.
Abstract

A learning-based approach to object recognition
under variable object characteristics is presented.
The approach supports the system capability of
recognizing objects in dynamic environments by
adapting the object models to perceived changes in
object characteristics, for example those caused by
variable perceptual conditions. This adaptation is
performed by the evolution of object models over
attribute space, which is realized by integrating
a vision module with an incremental learning module
within a closed loop. While the initial
acquisition of object models is driven by a teacher,
the later evolution of these models is performed
over a sequence of images without the help of a
teacher. Object models are applied to recognize
objects on the next images. The effectiveness of
such recognition and object extraction is monitored,
and when it is decreasing the system selects new
training data and activates learning processes to
improve its models. These processes are related to
active modeling performed by a system through
interaction with the dynamic environment. We have
implemented the model evolution approach within a
system and tested it for gradually changing
resolution and lighting conditions. The
experiments presented have
been performed for recognition and segmentation of
texture areas on a sequence of gray scale images.

1.  Introduction 

Most research on object recognition has been
focused on learning and recognizing objects under
stationary perceptual conditions such as lighting,
resolution and positioning. Relatively little has
been done on the problem of recognizing objects
under dynamic conditions, particularly when the
change of these conditions influences the change of
object characteristics. This problem is particularly
severe for object recognition in outdoor
environments where the variability of perceptual
conditions is extremely large [Bhanu et al., 1989,
1990].

To avoid some problems with the variability of
object characteristics under the change of perceptual
conditions, we can apply, for example, (1) domain
specific feature selection, (2) active vision, or (3)
model projection and adjustment. All of them,
however, have significant limitations. Feature
selection [Tsatsanis and Giannakis, 1992] is based on
the idea of selecting or building features that are
sensitive to a given object in a wider range of
perceptual conditions. But in practice, these
features cause larger misclassification error. The active
vision approach [Bajcsy, 1988] is based on the idea of
manipulating camera parameters to maintain the
same perceptual conditions. The problem with the
variability in resolution, for example, can be
mitigated through camera adjustment and the
application of multiscale operators. But with the
increase in the number of independently moving
objects, the vision system becomes overloaded
by camera manipulations. The problem remains
with the other conditions. Finally, model prediction
and adjustment is based on the projection of perceptual
conditions that can occur in the future. Object
models can be prepared according to these
projections. For example, the problem with pose
variability of structural objects can be practically
eliminated by the generation of aspect models
[Ikeuchi, 1987]. But these techniques do not seem
practical for other representations, in particular
for the texture recognition problem.

The above methods are applicable to the recognition
of  an  object  on  a  single  image  when  relatively  more 
time  is  given  to  perform  the  recognition  task.  But, 
they  are  not  optimal  when  objects  must  be 
recognized,  monitored  and  tracked  through  a 
sequence  of  images  and  the  characteristics  of  these 
objects  vary.  These  methods  rely  mostly  on  the  a 
priori  provided  physical,  structural,  functional  and 
behavioral  models  of  objects  and  the  sensor.  We 
agree  that  in  the  future  the  combination  of  them  can 
be  applicable  to  a  very  well  defined  and  modeled 
problem.  The  possession  of  complete  models, 
however,  is  questionable  for  most  practical 
problems  in  machine  perception  and  other  complex 
large-scale  systems. 

Most  approaches  to  object  recognition  do  not  adapt 
an  object  recognition  system  directly  to  the  dynamic 
environment:  i.e.,  they  do  not  modify  on-line  object 
models  to  the  changes  of  object  characteristics. 
These  methods  use  stationary  models  once  acquired 
during  the  training  phase,  and  they  build  a 
transformation  system  between  these  stationary 
models  and  the  input  data  acquired  from  a  dynamic 
environment. Therefore, we call such an adaptation
an indirect adaptation. Such an approach requires
that each condition influencing the change of object
characteristics is represented in the transformation
system. So, it suffers when the transformation
system is not preprogrammed to deal with a specific
perceptual condition which might not be known at
the time of system development.

In  the  next  sections,  we  present  an  alternative 
approach  to  the  object  recognition  under  variable 
perceptual  conditions  which  directly  adapts  object 
models  to  perceived  changes  in  object 
characteristics.  This  approach  is  outlined  in  Section 
2.  An  example  implementation  of  the  developed 
method  within  the  CHAMELEON  '92  system  is 
presented  in  Section  3.  Finally,  experimental 
results are presented in Section 4.

2.  Model  Evolution  Approach  to  Object 
Recognition 

The model evolution approach to object recognition
under variable perceptual conditions relies on the
dynamic modification of object models according to
the perceived change in object characteristics. It is
done by close interaction of an integrated vision and
learning system with the environment (learning
from the environment). A vision system adapts to
the changes in the environment by adapting the
object models directly rather than building data
transformation modules fitted to the stationary
models. This allows for capturing any variability of
object characteristics without knowledge about
object properties and without building complex and
dedicated modules serving the change of a given
perceptual condition. Thus, an object model can be
adapted to any combination of multiple perceptual
conditions, the combination of which creates an
infinite set of possible states. Moreover, the system
can adapt to the change in the internal state of an
object (e.g., to the change of the target heat
signature in the Automatic Target Recognition
domain).

Adaptation of object models to the perceived
changes in perceptual conditions is particularly well
suited to problems where objects have been
recognized once on an image, and they have to be
recognized, monitored or tracked on other
images or over a sequence of images acquired under
varying perceptual conditions. The application
areas include, for example, scene annotation for
navigation, autonomous surveillance, automated
target recognition, industrial inspection, and
material selection. This approach can also be
superior for understanding an action
performed by an object, e.g., in automatic damage
assessment through the analysis of change in object
state.

The proposed model evolution integrates a vision
system and a learning system working within a
closed loop over a set/sequence of images (see
Figure 1). The primary aspect of this approach is
that a system has to recognize objects on images
acquired over time. Images of such a sequence are
affected by the variability of conditions under which
objects are perceived. Object models, once acquired
through a dialog with a teacher, are then applied to
recognize objects on the next image. The
recognition effectiveness of object models is
continuously monitored and compared with the
results on the previous image(s) or with stated
minimum requirements. If this recognition
effectiveness decreases, then learning processes are
activated to improve the models' discriminating
power. While the system learns initial object
models from teacher-provided data (training
examples), thereafter the system has to update
these models automatically without teacher help. This
is done by automatic selection of new training data
and the activation of the incremental learning
processes.

The vision module selects new training data and
activates the learning processes when needed. The
modification of object models is performed by the
learning module. The learning processes
incorporate new training data in such a way that
they modify the existing object models (rather than
learning them from scratch) according to the variability
of object characteristics represented by the new training
data. This process is performed by incremental
learning, which further generalizes the object models.

Fig. 1 Architecture for model evolution integrating machine vision and
learning systems; the CHAMELEON '92 system

Related research work has been reported by
Goldfarb [1990], who introduced a theoretical
background to model evolution integrating
"Pattern Learning" of symbol formation and
recognition with Artificial Intelligence of symbol
manipulation. He suggested a Neural Net approach
to system evolution but has not yet provided an
experimental confirmation of his approach. In
the most recent work, Bobick and Bolles [1992]
provide excellent motivation for the work on model
evolution, focus on the evolution of representations
for object recognition, and indicate stability
problems in such systems. They learn object
models every time from a given image of a
sequence. However, they do not evolve already
existing models by new characteristic data, which is
the subject of our work. Our previous work
includes the definition of the learning-based
approach to model evolution [Pachowicz, 1991],
experiments with the CHAMELEON '91 semi-
autonomous system of model evolution, and the
analysis and improvement of the stability problems
in dynamic model evolution [Pachowicz, 1992].
The next section presents the CHAMELEON '92
fully autonomous model evolution system, a
system that evolves models without teacher help.


3.  System  Architecture  and 
Implementation 

The CHAMELEON '92 system, presented in
Figure 1, has been created to investigate fully
autonomous model evolution for invariant object
recognition in dynamic environments, using
texture recognition as an example. The domain of
texture was chosen because texture attribute
characteristics vary strongly with changes
in perceptual conditions (such as resolution,
lighting, positioning, and weather conditions).

3.1.  Image  data 

The experimental input data was a sequence of six
256x256 black and white images (256 gray levels
per pixel). The content of each image was
simplified: each image was composed of six
overlapping fabrics only. The images, presented in
Figure 2, were affected by gradually changing
resolution and illumination. The distance between
the camera and the textured scene was gradually
decreased to two thirds of the initial distance, and
the light source was moved along with the camera.

3.2.  Attribute  extraction 

In the first step, a single image is processed to
extract texture features (attributes). For each pixel,
a vector of attributes is extracted characterizing its
local neighborhood. We applied the modified
Laws' [1980] method of texture energy measure
[Hsiao and Sawchuk, 1989]. This three step
procedure (1) convolves an image with a given
mask, (2) applies local averaging of the absolute
responses, and (3) applies non-linear filtering to
mitigate the borderline smoothing effect of the
previously applied averaging. The output is a
vector of eight attributes corresponding to eight
different convolution masks (i.e., S3S3, R5R5,
E5L5, L5E5, E5S5, S5E5, L5S5, and S5L5).
Numeric attributes were then quantized into
subsymbolic intervals [Pachowicz, 1990]. We
have chosen this method of attribute extraction
because of its fast computation and fairly good
discriminating power. However, other more
powerful methods can be used as well.
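
Steps (1) and (2) of the procedure above can be sketched as follows.
This is only an illustration, not the CHAMELEON '92 code: the 1-D
vectors are the standard Laws kernels, while the mask ordering in
outer(), the border replication, and the 15x15 averaging window are
assumptions, and the non-linear filtering of step (3) is omitted
because the text does not specify it.

```python
# Sketch of steps (1) and (2) of a Laws-style texture energy measure.
# Pure Python for clarity; border handling and window size are assumptions.

L5 = [1, 4, 6, 4, 1]     # level kernel
E5 = [-1, -2, 0, 2, 1]   # edge kernel
S5 = [-1, 0, 2, 0, -1]   # spot kernel


def outer(column, row):
    """Build a 2-D mask from two 1-D kernels (assumed order: vertical, horizontal)."""
    return [[c * r for r in row] for c in column]


def filter_image(image, mask):
    """Step 1: apply the mask at every pixel, replicating border pixels.

    This is the correlation form; for these kernels flipping the mask
    changes at most the sign, which the absolute value in step 2 removes.
    """
    h, w = len(image), len(image[0])
    mh, mw = len(mask), len(mask[0])
    oy, ox = mh // 2, mw // 2
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for j in range(mh):
                yy = min(max(y + j - oy, 0), h - 1)
                for i in range(mw):
                    xx = min(max(x + i - ox, 0), w - 1)
                    acc += mask[j][i] * image[yy][xx]
            out[y][x] = acc
    return out


def local_energy(response, radius=7):
    """Step 2: average absolute responses over a (2*radius+1)^2 window."""
    size = 2 * radius + 1
    box = [[1.0 / (size * size)] * size for _ in range(size)]
    return filter_image([[abs(v) for v in row] for row in response], box)


def texture_attributes(image, masks, radius=7):
    """Return one energy map per mask; the attribute vector of a pixel is
    the list of values at that pixel across all maps."""
    return [local_energy(filter_image(image, m), radius) for m in masks]
```

A call such as texture_attributes(image, [outer(E5, L5), outer(L5, E5)])
yields two of the eight attribute maps; the remaining masks would be
built the same way.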

Figure 3 presents an example distribution of a texture
attribute for a given class and over all six images of
the sequence. First, analyzing attribute
distributions, we found that the distribution is
multimodal for most attributes, classes, and images.
Second, the distribution of a given attribute varies
significantly from one image to the other. The
variability of perceptual conditions causes both a
change of "shape" and a translation of the attribute
distribution. These effects deteriorate model
discriminating power when object models acquired
from one image are applied to another image.
This is why we have to adapt the vision system to a
dynamic environment by evolving its object
models.

3.3. Initial training phase

In the training phase, models of texture D(t) are
acquired from the first image of a sequence through
collaboration with a teacher. A teacher
interactively indicates small sections of texture areas
for the selection of the training data. Texture
sections are then searched randomly to extract a
given number of training data (examples) per class;
in our case, 300 examples (attribute vectors) per
class. This data is forwarded to the learning
module. The AQ14 learning program (for more
details about this learning method and the program
see [Michalski, 1983]) is then applied to learn
texture models (rule descriptions) from the provided
examples. However, other learning programs
can be applied to learn the models as well (e.g., the
ID programs learning decision trees; [Quinlan,
1986]).


Fig. 3 Example variability of attribute
distribution over images of the sequence
(for a given texture class)


Inductive learning acquires object models by
drawing inductive inference from teacher- or
environment-provided training examples. The
learning process, incorporated by the AQ14
program, is performed for each class separately.
Inductive learning applies a heuristic search
through the attribute space incorporating inductive
operators (generalization, transformation,
correction, and refinement). Inductive learning is
guided by background knowledge, which provides
information about attributes, preference criteria,
inference rules, heuristics, and program dependent
procedures. The learning goal is to find the most
preferred descriptions according to the preference
criterion. The set of training examples of one class
is called a set of positive examples. With respect to
this particular class, all other training data are
negative examples. The AQ programs find object
models that cover positive examples and no negative
examples.

3.4.  Recognition  and  segmentation 

Once object models are acquired, they are applied to
classify pixels and segment the image into class
areas. Each attribute vector x from the incoming
image is matched with the texture models. A single
match of an attribute vector with a model of a given
class produces a pair [i - classification decision; c -
belief value]. The belief value c is maximum if the
attribute vector is strictly matched (covered) by
the object model. Otherwise, the belief value
is lower than the maximum value. The belief value
is associated with the decision, indicating the
strength of the match. Since we have more than
one object model, a vector of pairs [i, c] is the
output produced by the recognition algorithm
applied to a single input attribute vector (at a given
pixel position on an image) and for all object
models (for i = 1 to 6, the number of texture
classes).
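
The matching step can be illustrated with a small sketch. The data
layout here is hypothetical: a model is a list of rules, each rule
maps an attribute index to a closed interval of quantized values, and
each rule is assumed to hold at least one condition. The flexible-match
belief (the fraction of satisfied conditions in the best rule) is an
assumption made for illustration; the text does not give the formula
used by the system.

```python
# Sketch: match one attribute vector against all class models.
# Rule layout and the flexible-match formula are illustrative assumptions.

MAX_BELIEF = 1.0


def rule_match(rule, x):
    """Fraction of the rule's interval conditions satisfied by vector x."""
    hits = sum(1 for attr, (lo, hi) in rule.items() if lo <= x[attr] <= hi)
    return hits / len(rule)


def match_model(model, x):
    """Belief that x belongs to the class described by `model`.

    A strict match (some rule fully covers x) yields MAX_BELIEF;
    otherwise the best partial match gives a lower belief.
    """
    return MAX_BELIEF * max(rule_match(rule, x) for rule in model)


def classify_vector(models, x):
    """Return the vector of (i, c) pairs, one per class model."""
    return [(i, match_model(models[i], x)) for i in sorted(models)]
```

For two toy classes, models = {1: [{0: (0, 3), 1: (5, 9)}],
2: [{0: (4, 7), 1: (5, 9)}]}, the vector [2, 6] is covered by the rule
of class 1 and satisfies one of the two conditions of class 2.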

The belief values corresponding to the same
classification decision are then locally averaged over
a 3x3 window. This averaging, called decision
filtering, is repeated for a given number of iterations
(i.e., 5 iterations). The final classification decision
is made by selecting the decision with the highest
averaged belief value. For better results, however,
this process can be replaced by a more effective but
computationally expensive relaxation method.
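
A minimal sketch of this decision filtering is given below. It assumes
one belief map per class stored as a 2-D list and, as a further
assumption, averages border pixels over the part of the 3x3 window
that lies inside the image.

```python
# Sketch: iterated 3x3 averaging of per-class belief maps, then arg-max.
# The border handling (shrunken windows) is an illustrative assumption.


def smooth3x3(belief):
    """Average each pixel's belief over its 3x3 neighborhood."""
    h, w = len(belief), len(belief[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            window = [belief[yy][xx]
                      for yy in range(max(y - 1, 0), min(y + 2, h))
                      for xx in range(max(x - 1, 0), min(x + 2, w))]
            out[y][x] = sum(window) / len(window)
    return out


def decision_filter(belief_maps, iterations=5):
    """belief_maps: dict class -> 2-D belief map. Returns a 2-D label map."""
    maps = dict(belief_maps)
    for _ in range(iterations):
        maps = {c: smooth3x3(m) for c, m in maps.items()}
    first = next(iter(maps.values()))
    h, w = len(first), len(first[0])
    return [[max(maps, key=lambda c: maps[c][y][x]) for x in range(w)]
            for y in range(h)]
```

An isolated pixel whose raw belief favors the wrong class is typically
relabeled after the first pass, because its neighbors dominate the
window average.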

3.5.  Evaluation  of  the  model 
discriminating  effectiveness 

Once an image is segmented, the evaluation process is
run to determine the model discriminating power
and to compare it with the previous results. This
may then lead to activation of the
incremental modification of object models and the
selection of new training data.

The evaluation is performed by automatically
selecting some texture areas to compute recognition
effectiveness measures; i.e., the average and the
minimum recognition rates (or belief values)
through all classes of texture. The texture areas are
found by randomly searching for uniform
patches of 15x15 pixels of the same texture class
through the entire image (see Figure 4a and 4b).
(The 15x15 image window is considered by many
researchers as the smallest window for the
distinction of a texture.)

Fig. 4 Extraction of new training data and its impact on the improvement of model discriminating
power: a) segmented and annotated image by currently available models, b) randomly
searched uniform patches, c) filtered intermediate areas carrying information about the
change in texture characteristics, d) automatically selected new training examples (for the
worst classes only), e) segmentation and annotation results for the updated texture models.


The patches of a single class are then divided into
three groups: (group_area 1) typical texture areas
(those that are recognized with the highest belief),
(group_area 2) intermediate texture areas (those
that represent change in texture characteristics),
and (group_area 3) possible noisy texture areas
(those that are mostly influenced by possible
classification and segmentation errors). The
division into these areas is done by computing the
deviation from the average belief value for each
texture patch. Only intermediate texture areas
(group_area 2) are selected for the evaluation;
they are shrunk twice to eliminate possible
negative influence of the segmentation (see Figure
4c).
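
The grouping of patches can be sketched as follows. The text states
only that the division is based on the deviation from the average
belief value; the specific cut-off used here (half a standard
deviation on either side of the mean, parameter k) is an assumption
chosen for illustration.

```python
# Sketch: split the patches of one class into typical / intermediate /
# noisy groups by deviation from the average patch belief.
# The cut-off k * standard deviation is an illustrative assumption.
from statistics import mean, pstdev


def group_patches(patch_beliefs, k=0.5):
    """patch_beliefs: average belief of each 15x15 patch of one class.

    Returns a dict mapping group name to the list of patch indices.
    """
    avg = mean(patch_beliefs)
    spread = pstdev(patch_beliefs)
    groups = {"typical": [], "intermediate": [], "noisy": []}
    for index, belief in enumerate(patch_beliefs):
        deviation = belief - avg
        if deviation > k * spread:
            groups["typical"].append(index)       # group_area 1
        elif deviation < -k * spread:
            groups["noisy"].append(index)         # group_area 3
        else:
            groups["intermediate"].append(index)  # group_area 2
    return groups
```

Only the indices returned under "intermediate" would be passed on to
the evaluation step.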

The recognition rate (see Figure 1) is then
found for each i-th texture class by the analysis of
recognition data (before they were smoothed)
within the indicated group areas only. The
recognition rate for each class is calculated by
dividing the number of correctly classified pixels
by the total number of pixels covered by the group
area. Then, the average recognition rate and the
minimum recognition rate are calculated for the set
of six texture classes.
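
The rate computation can be written out as a short sketch. It assumes
two parallel lists over the pixels inside the selected group areas:
the unsmoothed decision at each pixel and the class label of the area
the pixel belongs to. These names and the list layout are hypothetical.

```python
# Sketch: per-class recognition rates plus their average and minimum.
# `decisions` and `area_labels` are hypothetical parallel lists of labels.


def recognition_rates(decisions, area_labels, classes):
    """Rate for class c = correctly classified pixels / pixels in c's areas."""
    rates = {}
    for c in classes:
        in_area = [d for d, a in zip(decisions, area_labels) if a == c]
        rates[c] = sum(1 for d in in_area if d == c) / len(in_area)
    return rates


def average_and_minimum(rates):
    """Summary measures used to monitor model discriminating power."""
    values = list(rates.values())
    return sum(values) / len(values), min(values)
```

The sketch assumes every class has at least one pixel in its group
areas; a class without any selected area would need separate handling.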

3.6.  Autonomous  activation  of  incremental 
learning  processes 

The  activation  of  incremental  learning  processes  for 
model  modification  depends  on  the  current 
evaluation  of  the  model  recognition  results  (model 
discriminating  power)  when  compared  with  the 
results  from  the  previous  images.  If  these 
evaluation  results  deteriorate  below  a  set  or  on-line 
adjusted  threshold  level,  then  the  vision  system 
activates  the  learning  module.  In  the 
CHAMELEON  '92  system  applied  to  texture 
recognition, both threshold levels (for the average
and minimum recognition rates) have been set
during the training phase run on the first image.
These thresholds are constant through the remaining
images of the sequence, maintaining consistent stop
criteria for the control of the evolution loop.
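
The activation test reduces to a comparison against the two fixed
thresholds. In the sketch below the threshold values are supplied by
the caller, and treating a shortfall in either measure as sufficient
to activate learning is an assumption.

```python
# Sketch: decide whether to activate incremental learning.
# Triggering on either threshold is an illustrative assumption.


def needs_learning(average_rate, minimum_rate, avg_threshold, min_threshold):
    """True if the average or the minimum recognition rate falls below
    its threshold (both thresholds fixed on the first image)."""
    return average_rate < avg_threshold or minimum_rate < min_threshold
```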

3.7.  Selection  of  new  training  data 

If the learning processes are activated, then the areas
used to evaluate model discriminating power are
used to select new training data. For each class, the
data selection process groups the area pixels into the
following three files: pixels recognized by strict
match (the maximum belief value), pixels
recognized by flexible match (the closeness to the
object model), and pixels not recognized correctly.
New training data is then selected from the second
and the third group. Only a limited number of
pixels is extracted; i.e., the data is filtered to
indicate the formation of new clusters only.

A  set  of  new  training  data  X  is  then  forwarded  to 
the  learning  module.  The  number  of  new  training 
examples  depends  on  the  performance  of  a  given 
class.  A  very  limited  number  of  new  training 
examples  is  allowed  to  be  extracted  (i.e.,  up  to  20 
examples per class). For the worst performing
classes, the most training examples are extracted. For
the better performing classes, practically no new data
is extracted (see Figure 4d).
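
One way to realize such a performance-dependent quota is sketched
below. The linear rule and the target rate are assumptions made for
illustration; the text fixes only the upper limit of 20 examples per
class and the principle that worse classes receive more examples.

```python
# Sketch: number of new training examples allowed per class.
# The linear shortfall rule and `target` are illustrative assumptions.


def training_quota(rates, target=0.9, max_examples=20):
    """Map each class to a quota that grows with its shortfall below target."""
    quota = {}
    for c, rate in rates.items():
        shortfall = max(target - rate, 0.0) / target
        quota[c] = min(max_examples, round(max_examples * shortfall))
    return quota
```

With this rule a class that is not recognized at all receives the full
20 examples, while a class at or above the target receives none.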

3.8.  Incremental  model  modification 

We incorporated incremental learning to modify
the once learned object models. The incremental
learning methodology has already been
implemented within several learning programs, i.e.,
within the AQ family of learning programs
[Michalski and Larson, 1978], the ID family of
learning programs [Utgoff, 1989], the INDUCE-4
program [Bentrup, et al., 1987], and conceptual
clustering [Fisher, 1987, Gennari et al., 1989].
Incremental learning builds (modifies) object
models dynamically according to newly provided
evidence, i.e., new training data. Therefore, this
learning (model acquisition) technique was
employed by us within the CHAMELEON '92
system to modify texture models. It has also been
shown that incremental learning increases the speed
of learning processes. Unfortunately, this learning
technique can give slightly more complex models
and somewhat worse recognition effectiveness.

In  our  system,  newly  extracted  training  examples 
along  with  object  models  are  forwarded  to  the 
AQ14  incremental  learning  program.  The  results  of 
the  learning  process  are  texture  models  modified 
according  to  the  provided  new  object  characteristics; 
i.e.,  the  previous  models  are  extended  over  the 
attribute  space  to  include  new  training  examples. 


This modification of texture models is executed by
their further generalization over the attribute space.

3.9. Verification of evolved models

Model evolution works in a closed loop,
manipulating object models in order to adapt them
to the changes in the object characteristics. This
adaptation is performed in a two-loop system (see
Figure 5). The external loop adapts models to a
given image of a sequence, while the internal loop
adapts models to the selected image data
representing intermediate object characteristics.
This schema was suggested by the early
investigation of stability problems in the
CHAMELEON '91 evolution system [Pachowicz,
1992].


The model evolution is competition oriented; i.e., a
model of a given class is actually modified with
respect to the other classes. An extensive evolution
of one class can weaken the
discriminating power of another class. Therefore,
a balance between the classes the system is trained
to recognize should be kept carefully.
As model evolution progresses, the system has to
verify the effect of the evolution. Depending on this
progress, the evolution can be repeated with an
adjusted strategy and new training data.
In particular, the evolution must be continued
until the recognition effectiveness satisfies the
stop criteria.

In the CHAMELEON '92 system, this verification
is performed on the data characterizing the change
in the object characteristics. Evolved models are
applied to recognize those data. The recognition
characteristics are computed, and then evaluated. If
they do not fulfill the assumed threshold levels for the
average and minimum recognition rate, the
evolution process is repeated but with new training
data selected from the same areas. If they fulfill the
assumed thresholds, this evolution loop is broken.

Fig. 6 Experimental results of system adaptation to the last (6th) image run via the model
evolution through the preceding sequence of images

If the loop is broken, object models evolved over
indicated image areas (Figure 4c) are later verified
on the same image again. This verification repeats
all processes of texture recognition, segmentation,
and evaluation. If this evaluation does not satisfy
the threshold conditions, then the system activates
model evolution over indicated image areas as
described in previous sections. But if this
evaluation satisfies the threshold conditions, then the
system goes to the next image of a sequence. The
evaluation of the evolution processes run over the
example image presented in Figure 4a is illustrated
in Figure 4e. The impact of the model modification
by new training data shows a significant
improvement in segmentation and annotation results
for those two classes that were represented by new
training data.
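
The two-loop control flow described in this section can be summarized
in a short sketch. All callables passed in are hypothetical stand-ins
for the stages of Sections 3.4 to 3.8, and the max_rounds guard
against endless iteration is an assumption not mentioned in the text.

```python
# Sketch of the two-loop verification scheme; every callable is a
# hypothetical stand-in, and max_rounds is an added safety assumption.


def evolve_on_image(models, image, evaluate_image, evaluate_areas,
                    select_data, learn, meets_thresholds, max_rounds=10):
    """Adapt `models` to one image and return the adapted models."""
    for _ in range(max_rounds):                  # external loop: whole image
        # Recognize, segment, and evaluate on the full image.
        areas, rates = evaluate_image(models, image)
        if meets_thresholds(rates):
            break                                # proceed to the next image
        for _ in range(max_rounds):              # internal loop: selected areas
            # Evolve models with new data from the same areas, then verify.
            models = learn(models, select_data(models, areas))
            if meets_thresholds(evaluate_areas(models, areas)):
                break
    return models
```

Processing a sequence then amounts to calling evolve_on_image once per
image and carrying the returned models over to the next image.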

4.  Experimental  Results 

4.1.  Testing  data  and  methodology 

For an objective analysis of the system
performance, evolution processes must be evaluated
on sets of data different from the data used in model
evolution to modify object models. Therefore,
testing data used to measure system performance
was obtained from the same images but from
widely spread image sections, including sections
close to the borderline between different texture
areas. We indicated these sections interactively and
extracted the testing data before the evolution
processes were begun.

Since each image of a sequence is composed of six
classes of texture, six testing datasets were obtained
from each image. A single dataset contained 200
randomly selected (from indicated areas) testing
examples characteristic for a single texture class.
The testing phase was applied every time
models were modified. The recognition
characteristics were obtained when evolved texture
models were applied over and over again to the
same image, monitoring the evolution effect.

4.2.  Recognition  characteristics 

Figure 6 shows the recognition characteristics for
the last image of the sequence; i.e., when models
were applied every time to the sixth image. We
monitor (i) average recognition rate over six texture
classes, (ii) standard deviation from the average
recognition rate, (iii) misclassification rate, and (iv)
minimum recognition rate from the set of six
classes. These diagrams are complemented by six
images illustrating recognition and segmentation
results on the sixth image over the consecutive
iterations of model evolution (see Figure 7).

The experimental results show that initially learned
texture models (i.e., from the first image) did not
recognize some of the classes on the last image of
the sequence. However, the model evolution over
the next consecutive images has adapted texture
models to changing texture characteristics. The
system was able to improve its average recognition
rate from 54% to 95%, while the minimum
recognition rate was drastically improved from 0%
to 89%. The system maintained a steady decrease in
the standard deviation of the recognition rates,
improving stability of the recognition system. At
the same time, the misclassification rate was
decreasing, indicating that texture models were not
over-generalized; i.e., the competition with other
class models kept model boundaries in balance.

Fig. 7 Model evolution results presented on the last image of the sequence from Figure 2


5.  Conclusions 

The paper presented a method for object recognition
under variable perceptual conditions that evolves
object models, adapting them directly to new
visual appearances of objects. The evolution of object models
is performed by integrating vision and incremental
learning processes and maintaining the system's
continuous interaction with the environment. The
method has been implemented within the
CHAMELEON '92 experimental system and tested
successfully on a texture recognition problem under
changing resolution and lighting conditions. The
system has been trained only on the first image.
Thereafter, the system has worked autonomously
on its own. The system was able to evolve its
models over a sequence of images and to maintain
its recognition capability.

Abstract

We address projective properties of the contours of
Right generalized cylinders with a Planar, but not
necessarily straight, axis and Circular, but possibly varying
size, cross-sections (called Circular PRGCs). This class
introduces many difficulties beyond the classes
previously studied, such as straight homogeneous
generalized cylinders (SHGCs) and their special case of
surfaces of revolution (SORs), where the axis is
straight, and planar right constant generalized cylinders
(PRCGCs), where the cross-section size is constant.
Previous work on 2-D "ribbon" descriptions does relate
to Circular PRGCs. However, it has not rigorously
addressed or justified the relationship between 2D
descriptions and projection of 3D descriptions. In this work, we
derive important rigorous quasi-invariant properties of
Circular PRGCs and invariant properties for subclasses
of Circular PRGCs. We show that the derived
quasi-invariants are useful for 2D description of the projections
of such primitives and for recovery of complete 3D
object centered descriptions from the 2D contours. We
demonstrate our claims on some examples.

1  Introduction  and  previous  work 

One of the major problems in computer vision is the
recovery of shape of 3-D objects from a single 2-D
contour image. This problem, known as shape from
contour, is difficult because 2-D images contain
appearances of real 3-D objects, which are dependent
on, and hence may vary with, the viewpoint. In
mathematical terms, the problem is under-constrained due to
the loss of one dimension by the projective nature of the
image formation. Human vision does show, however,
that shape perception from such contours is largely
invariant to changes in viewpoint. Previous work, ours

* This research was supported by the Advanced Research Projects
Agency of the Department of Defense and was monitored by the Air
Force Office of Scientific Research under Contract No. F49620-90-
C-0078. The United States Government is authorized to reproduce
and distribute reprints for governmental purposes notwithstanding
any copyright notation hereon.


and others, has indicated that studying the projective
properties of the contours, to find invariant properties,
helps in scene perception in two important ways. First,
they help detect the objects in the presence of noise,
shadows and occlusion. Second, the properties provide
important constraints for recovery of their 3-D shape.
This work significantly extends the class of objects
which can be so recovered. One important distinction
from previous work is that for the new class, strict
invariants are not found but quasi-invariant properties
(defined formally later) are derived and proven to be
equally effective.

Using contours as a source of constraints for shape
has been the focus of research since the early days of
computer vision. Early work addressed polyhedral
objects using constraints on junction labelling [Clowes
1971] and face orientations [Kanade 1981, Mackworth
1973]. Subsequent efforts, such as [Gross & Boult
1990, Malik 1987, Nalwa 1989, Nevatia & Binford
1977, Ponce et al. 1989, Sato & Binford 1992, Ulupinar
& Nevatia 1990a, Ulupinar & Nevatia 1991, Zerroug &
Nevatia 1993], have addressed curved surface objects.
These objects introduce more difficulties as some of
their contours, such as limbs and cusps (where the
viewing direction is tangential to the surface), are inherently
viewpoint dependent. To obtain rigorous properties, it is
useful to study classes of objects. Of course, to be of
interest, such classes should include generic shape models
with the ability to generate a large set of everyday
objects. Generalized cylinders (GCs) [Binford 1971] are
one such adequate shape model. They have proved to be
particularly suited for structured shape description of
complex and articulated objects.

There is strong psychological evidence [Biederman
1987] that human perception of line drawings of
complex objects is influenced by perception of
arrangements of a small number of simple volumetric
primitives. Those basic primitives, called geons
(analogous to generalized cylinders), are characterized by
different cross-section shapes, axis shape and sweeps. This
indicates that it is sufficient to address shape recovery of
a small set of primitive generalized cylinders for a
computational approach to recovery of a large number of
objects in our environment. A computational approach to
the analysis and recovery of those primitives requires
the study of their projective properties and their usage
for recovering 3D shape.

A number of researchers have addressed the use of
GCs and their projective invariant properties. Nalwa
[Nalwa 1989] has proved that contours of surfaces of
revolution (SORs), under orthographic projection,
exhibit bilateral symmetry. Ponce et al. [Ponce et al. 1989]
have proved that in the (perspective) projection of
straight homogeneous generalized cylinders (SHGCs),
tangents to limbs at corresponding points intersect on a
line which is the projection of the axis and exploited this
property for detection of the projection of the axis of an
SHGC from its image contours. Ulupinar and Nevatia
[Ulupinar & Nevatia 1990a and b, Ulupinar & Nevatia
1991] have derived projective invariants of zero
Gaussian curvature (ZGC) surfaces, SHGCs and planar, right,
constant generalized cylinders (PRCGCs). For instance,
they have proved that cross-sections of SHGCs and
limbs of PRCGCs project onto "parallel symmetric"
curves under orthographic projection. They have
exploited those properties for recovering 3D shape from
perfect contours. Sato and Binford [Sato & Binford
1992] and Zerroug and Nevatia [Zerroug & Nevatia
1993] derived and used projective invariant properties
of SHGCs for solving the figure ground problem in real
image contours.

The primitives addressed by previous work have either a straight axis,
such as SORs and SHGCs, or a constant cross-section size such as PRCGCs.
Some natural objects, however, such as human and animal limbs and horns,
have a combination of curved axes and varying cross-section size; some
examples are shown in Figure 1. Departing from the previously addressed
cases of straight axis or constant cross-section, to include objects
with arbitrary 3D axes, cross-sections and sweeping functions,
introduces many new difficulties. In this work, we address the class of
GCs having curved (planar) axes with non constant, circular,
cross-sections. Following the terminology of [Shafer & Kanade 1983],
they can be called circular planar right GCs (Circular PRGCs) as the
cross-section is orthogonal to the axis.


Figure  1  Sample  Circular  PRGCs. 


Some previous efforts, such as [Brooks 1983, Nevatia & Binford 1977,
Rao & Nevatia 1989], have used ribbons (2-D counterparts of GCs) as
intuitive descriptors of the projection of curved GCs, assuming that the
2-D descriptions correspond to the projection of the 3-D descriptions
(ribbon axis and projection of 3-D axis, for example). However, the
relationship between the 2-D descriptions and the projections of the 3-D
descriptions was not rigorously addressed. It can be shown that in
general they are not the same.

We were unable to find invariant properties of general Circular PRGCs
(except for special cases), and we believe that none exist. The
non-constancy of the cross-section and the curvature of the axis affect
the contours in very complex ways. However, we have been successful in
finding quasi-invariant properties (following the terminology of
[Binford et al. 1987, Binford 1991]) that are useful for shape
description and recovery. Quasi-invariance is a generalization of
invariance. Invariant properties are properties with constant measure
with respect to a set of parametric transformations (they hold
independently of the parameters of the transformations^1). For example,
the ratio of the lengths of two (3-D) parallel segments is known to be
an orthographic invariant. Quasi-invariant properties are properties
that may not be strictly constant, but their measure varies within a
small range over a large set in the parameter space of the
transformations. For example, the previous lengths ratio is a
perspective quasi-invariant as its value is within 10% of the actual one
over 90% of the viewing sphere [Binford et al. 1987].
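
The orthographic invariance of the length ratio of parallel segments can
be checked numerically. The sketch below is only an illustration under
stated assumptions: the viewing direction is written with two angles
(alpha, beta) in the manner of equations (3) and (7) of section 2, and
the function names and the sample segments are hypothetical.

```python
import math

def image_basis(alpha, beta):
    """Orthonormal image-plane basis (u, v) for the unit viewing direction
    V = (cos b, sin b cos a, sin b sin a); both vectors are orthogonal to V."""
    u = (-math.sin(beta), math.cos(beta) * math.cos(alpha), math.cos(beta) * math.sin(alpha))
    v = (0.0, -math.sin(alpha), math.cos(alpha))
    return u, v

def projected_length(p, q, alpha, beta):
    """Length of the orthographic projection of the 3D segment pq."""
    u, v = image_basis(alpha, beta)
    d = [qi - pi for pi, qi in zip(p, q)]
    du = sum(a * b for a, b in zip(d, u))
    dv = sum(a * b for a, b in zip(d, v))
    return math.hypot(du, dv)

def length_ratio(seg_a, seg_b, alpha, beta):
    """Ratio of the projected lengths of two 3D segments for one viewpoint."""
    return projected_length(*seg_a, alpha, beta) / projected_length(*seg_b, alpha, beta)

# Two parallel segments (hypothetical data): same direction, 3D length ratio 2:5.
direction = (1.0, 2.0, -0.5)
origin_b = (1.0, -1.0, 3.0)
seg_a = ((0.0, 0.0, 0.0), tuple(2.0 * c for c in direction))
seg_b = (origin_b, tuple(o + 5.0 * c for o, c in zip(origin_b, direction)))

# The ratio stays at 2/5 for every viewpoint not parallel to the segments.
ratios = [length_ratio(seg_a, seg_b, a, b) for a, b in [(0.3, 0.7), (1.2, 2.0), (2.5, 1.0)]]
```

The ratio is constant because orthographic projection is linear, so both
projected segments are multiples of the same projected direction. Under
perspective projection this argument no longer applies, which is why the
same ratio is only a quasi-invariant there.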

In this work, we derive important quasi-invariant projective properties
of Circular PRGCs and invariant properties of their special cases, and
show their application for shape description and recovery. Our analysis
shows that a popular class of ribbons (so-called Brooks' ribbons, in the
terminology of [Ponce 1988]) provides generally consistent descriptions
(ones that correspond to projections of 3-D descriptions) of the
projections of Circular PRGCs. Our recovery method need not be told that
it is examining Circular PRGCs; rather, it contains tests that can
verify their presence. The recovery method assumes that the viewpoint is
general and that both end cross-sections of viewed Circular PRGCs are
visible.

We organize the discussion as follows. In section 2 we provide a
mathematical analysis of the projected contours of Circular PRGCs. In
section 3, we derive projective invariant properties for special
Circular PRGCs and in section 4 we derive quasi-invariant properties for
general Circular PRGCs. We also discuss the relationship of the derived
properties with previous work. In section 5, we discuss the application
of the derived quasi-invariants for 2D description of Circular PRGCs and
for rigorous recovery of complete 3D models of Circular PRGCs from their
2D contours. We also demonstrate our methods on some examples. We
conclude this paper in section 6.

^1 except perhaps on a set of measure zero in the parameter space.

2  Limbs  and  projections  of  Circular  PRGCs 

We begin by giving a formal definition of a GC, then proceed to the
analysis of limbs and projections of Circular PRGCs. Throughout the
analysis, orthographic projection will be assumed.

Definition 1: a generalized cylinder (GC) is the surface obtained by
sweeping a given cross-section curve C along a 3D (axis) curve A, while
transforming it by a function r.

For GCs where the cross-section is orthogonal to the axis (Right GCs),
the surface can be parameterized as follows (using a notation similar to
[Ponce & Chelberg 1987]):

$$P(s, \theta) = A(s) + u(s, \theta)\,(\cos\theta\,\mathbf{n} + \sin\theta\,\mathbf{b}) \qquad (1)$$

where $A(s)$ is an arclength parameterization of the axis curve A,
$u(s, \theta)$ the cross-section function and $\mathbf{n}$ and
$\mathbf{b}$ respectively the normal and binormal to the axis curve. For
homogeneous cross-sections (i.e. rigid sweeps)
$u(s, \theta) = r(s)\,\rho(\theta)$, where $r(s)$ is the scaling
function and $\rho(\theta)$ the cross-section curve. Figure 2a
illustrates this parameterization in the case of a Circular PRGC.


Figure 2 Generalized cylinder parameterization and viewing geometry


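A point on the GC surface is a limb point if the viewing direction is
orthogonal to the surface normal at that point. The surface normal at a
point $P(s, \theta)$ is given by

$$N(s, \theta) = \frac{\partial P/\partial s \times \partial P/\partial\theta}{\left|\partial P/\partial s \times \partial P/\partial\theta\right|}, \quad \text{where } P = P(s, \theta) \qquad (2)$$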


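Let $(\alpha(s), \beta(s))$ be the angular coordinates of the unit
viewing vector $V$ in the Frenet-Serret frame
$(\mathbf{t}(s), \mathbf{n}(s), \mathbf{b}(s))$ (Figure 2b), then

$$V = \cos\beta(s)\,\mathbf{t}(s) + \sin\beta(s)\cos\alpha(s)\,\mathbf{n}(s) + \sin\beta(s)\sin\alpha(s)\,\mathbf{b}(s) \qquad (3)$$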


Using the Frenet-Serret theorem [Millman & Parker 1977] relating the
above basis vectors and their derivatives and writing the orthogonality
of $N$ and $V$ yields the following limb equation:

$$\sin\beta(s)\,[1 - \kappa(s)\,\rho\,r(s)\cos\theta]\cos(\theta - \alpha(s)) - \rho\,\dot{r}(s)\cos\beta(s) = 0$$

where $\kappa(s)$ is the curvature of the axis and $\dot{r}(s)$ the
derivative of $r(s)$. Assuming $\sin\beta(s) \neq 0$ ^2, this equation
can be rewritten as:

$$[1 - \kappa(s)\,\rho\,r(s)\cos\theta]\cos(\theta - \alpha(s)) = \rho\,\dot{r}(s)\cot\beta(s) \qquad (4)$$

The limb equation usually has two solutions $\theta_i(s)$ for
$i = 1, 2$. We can derive limb equations for special cases such as SORs
and Circular PRCGCs (constant size cross-section; note the difference
with the more general Circular PRGCs where the cross-section is non
constant). For SORs, the axis is straight ($\kappa(s) = 0$) and we
obtain the SOR limb equation:

$$\cos(\theta - \alpha(s)) = \rho\,\dot{r}(s)\cot\beta(s) \qquad (5)$$

which, for every $s$, yields two solutions of opposite angular distance
from $\alpha(s)$ (Figure 3.a).
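
Equation (5) has a closed-form solution, which makes the symmetry of the
two SOR limb points about $\alpha(s)$ explicit. The following is a
minimal sketch of that solution; the function name and the default
$\rho = 1$ are assumptions made for illustration.

```python
import math

def sor_limb_angles(alpha, beta, r_dot, rho=1.0):
    """Solve cos(theta - alpha) = rho * r_dot * cot(beta), i.e. eq. (5).

    Returns the two limb angles, or None when the right-hand side lies
    outside [-1, 1] and the cross-section therefore has no limb points.
    """
    c = rho * r_dot / math.tan(beta)
    if abs(c) > 1.0:
        return None
    delta = math.acos(c)
    # The two solutions sit at opposite angular distances from alpha.
    return alpha + delta, alpha - delta

theta1, theta2 = sor_limb_angles(alpha=0.4, beta=1.0, r_dot=0.3)
no_limbs = sor_limb_angles(alpha=0.4, beta=0.1, r_dot=0.5)
```

In the second call the viewing direction is close to the axis and the
sweep is steep, so the right-hand side exceeds 1 and no limb exists for
that cross-section.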




2.1  Deriving  limbs  of  Circular  PRGCs 


Figure 3 Limb points properties: a. SOR, b. Circular PRCGC


For lack of space we omit the details of the limb derivation process as
a detailed analysis is given in [Ponce & Chelberg 1987]. We limit the
analysis to defining limb boundaries and expressing their equation. We
further limit the analysis at this stage to Circular PRGCs (i.e. zero
axis torsion and $\rho(\theta) = \rho$ and $\rho'(\theta) = 0$).


For Circular PRCGCs, the sweeping function is constant
($\dot{r}(s) = 0$) and we obtain the equation
$(1 - \kappa(s)\,\rho\,r(s)\cos\theta)\cos(\theta - \alpha(s)) = 0$,
which implies

$$\cos(\theta - \alpha(s)) = 0 \qquad (6)$$

since $\sin\beta(s) \neq 0$ and $1 - \kappa(s)\,\rho\,r(s)\cos\theta = 0$
would imply $1/\kappa(s) = R(s) \le \rho\,r(s)$, i.e. the object self
intersects, an undesirable irregularity ($R(s)$ is the radius of
curvature of the axis and $\rho\,r(s)$ the cross-section radius). Thus,
in the case of Circular PRCGCs, limb points are determined by
$\theta = \pm\pi/2 + \alpha(s)$; i.e. two diametrically opposite points
(Figure 3.b).

^2 The assumption $\sin\beta(s) \neq 0$ fails only when the viewing
direction $V$ is parallel to $\mathbf{t}$, a non general viewpoint for
which the limb equation has an infinite or zero number of solutions.

For the general case of Circular PRGCs ($\kappa(s) \neq 0$ and
$\dot{r}(s) \neq 0$), the relationship between the two limb points is
not as straightforward as in the two sub-cases discussed above. However,
in section 4, we will show that they do have a well behaved
relationship.

2.2 Deriving projections of Circular PRGCs

The orthographic projection of the (3D) Frenet-Serret frame
$(\mathbf{t}(s), \mathbf{n}(s), \mathbf{b}(s))$ on a plane orthogonal to
$V$ gives a 'moving' local 2D frame in the image for each value of $s$.
The relationship between the 3D and the 2D frames is as follows (we will
omit the argument $s$):

Let $P = t\,\mathbf{t} + n\,\mathbf{n} + b\,\mathbf{b}$ be a point
expressed in the 3D frame, then its projection, $p$, on a plane
orthogonal to $V$ is given by

$$p = (-\sin\beta\;t + \cos\beta\cos\alpha\;n + \cos\beta\sin\alpha\;b)\,\mathbf{u} + (-\sin\alpha\;n + \cos\alpha\;b)\,\mathbf{v} \qquad (7)$$

where $\mathbf{u} = \mathbf{u}(s)$ is the projection of $\mathbf{t}$ and
$\mathbf{v} = \mathbf{v}(s)$ is orthogonal to $\mathbf{u}$ following the
right hand rule. Written in vector form, local 3D coordinates
$(t, n, b)^t$ project as local 2D coordinates

$$(-\sin\beta\;t + \cos\beta\cos\alpha\;n + \cos\beta\sin\alpha\;b,\; -\sin\alpha\;n + \cos\alpha\;b)^t.$$

The projection of the axis point $A(s)$ is thus the origin of the local
2D frame.

From equations (1) and (7), a point $P(s, \theta)$ on the surface can be
shown to project as

$$p(s, \theta) = \rho\,r \begin{pmatrix} \cos\beta\cos(\theta - \alpha) \\ \sin(\theta - \alpha) \end{pmatrix} \qquad (8)$$

Without loss of generality, from this point, we will normalize the
scaling function by fixing $\rho = 1$ ^3.
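
As a consistency check, the closed form (8) can be compared with a
direct application of the projection (7) to the offset
$r(\cos\theta\,\mathbf{n} + \sin\theta\,\mathbf{b})$ of a surface point
from its axis point. The sketch below assumes $\rho = 1$; all function
names are hypothetical.

```python
import math

def image_basis(alpha, beta):
    """Image-plane vectors u and v in (t, n, b) coordinates, read off eq. (7)."""
    u = (-math.sin(beta), math.cos(beta) * math.cos(alpha), math.cos(beta) * math.sin(alpha))
    v = (0.0, -math.sin(alpha), math.cos(alpha))
    return u, v

def project_direct(theta, r, alpha, beta):
    """Apply eq. (7) to the offset r*(cos(theta) n + sin(theta) b)."""
    u, v = image_basis(alpha, beta)
    offset = (0.0, r * math.cos(theta), r * math.sin(theta))
    return (sum(a * b for a, b in zip(offset, u)),
            sum(a * b for a, b in zip(offset, v)))

def project_eq8(theta, r, alpha, beta):
    """Closed form of eq. (8) with rho = 1, in the local 2D frame."""
    return (r * math.cos(beta) * math.cos(theta - alpha), r * math.sin(theta - alpha))

def midpoint_and_length(theta1, theta2, r, alpha, beta):
    """Midpoint and length of the segment joining two projected surface points."""
    p1 = project_eq8(theta1, r, alpha, beta)
    p2 = project_eq8(theta2, r, alpha, beta)
    mid = (0.5 * (p1[0] + p2[0]), 0.5 * (p1[1] + p2[1]))
    return mid, math.hypot(p2[0] - p1[0], p2[1] - p1[1])

direct = project_direct(1.1, 0.8, 0.4, 1.0)
closed = project_eq8(1.1, 0.8, 0.4, 1.0)

# Circular PRCGC limb points from eq. (6): theta = alpha +/- pi/2.
mid, length = midpoint_and_length(0.4 + math.pi / 2, 0.4 - math.pi / 2, 0.8, 0.4, 1.0)
```

For the PRCGC limb points the midpoint falls on the origin of the local
2D frame and the segment length equals the cross-section diameter $2r$,
whatever the values of $\alpha$ and $\beta$.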

In sections 3 and 4 we will derive two types of properties. The first
one relates the projection of limb points of the same cross-section
(i.e. for the same $s$ along the axis). We call such points
corresponding limb points. Finding such relationships is useful not only
for a consistent^4 2D description but for recovery of 3D shape as well.
This has also been the focus of previous work in [Ponce et al. 1989,
Ulupinar & Nevatia 1990a, Ulupinar & Nevatia 1991]. The second type
addresses the relationship between the axis of the projection of a
Circular PRGC and the projection of its 3D axis. These latter two are
not the same in general (Figure 4).

^3 the scaling function will only change by a constant factor.


To  make  the  discussion  in  subsequent  sections  rigor¬ 
ous,  we  give  a  number  of  definitions  clarifying  our  ter¬ 
minology.  See  Figure  4. 

Definition 2: a correspondence segment is the 2D line segment joining
corresponding limb points.

Definition 3: the 2D axis (axis of the projection) is the locus of
midpoints of correspondence segments.

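From equation (8), a correspondence segment $C$, between projections of
limb points $P(s, \theta_1)$, $P(s, \theta_2)$, can be expressed in its
local 2D frame as

$$C = r \begin{pmatrix} \cos\beta\,(\cos(\theta_2 - \alpha) - \cos(\theta_1 - \alpha)) \\ \sin(\theta_2 - \alpha) - \sin(\theta_1 - \alpha) \end{pmatrix} \qquad (9)$$

and its midpoint is thus given by

$$\frac{r}{2} \begin{pmatrix} \cos\beta\,(\cos(\theta_2 - \alpha) + \cos(\theta_1 - \alpha)) \\ \sin(\theta_2 - \alpha) + \sin(\theta_1 - \alpha) \end{pmatrix} \qquad (10)$$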


Figure 4 Projection of a Circular PRGC (labels: corresponding limb
points, correspondence segment, projection of 3D axis point, 2D axis
point)


3  Projective  invariant  properties  of  special 
Circular  PRGCs 

In this section we derive projective invariant properties of the two
special cases of Circular PRGCs: SORs and Circular PRCGCs. Figure 5
illustrates those properties. We will address the general case in
section 4.

3.1  Circular  PRCGCs 

Property  1:  In  an  orthographic  projection  of  a  Circu¬ 
lar  PRCGC,  the  2D  axis  and  the  projection  of  the  3D 
axis  coincide  regardless  of  the  viewing  direction. 

Proof: from equation (6), at limb points, we have
$\cos(\theta_1 - \alpha) = \cos(\theta_2 - \alpha) = 0$ and thus
$\sin(\theta_1 - \alpha) = -\sin(\theta_2 - \alpha) = \pm 1$, for each
$s$ along the axis. Therefore, from equation (10), the 2D axis point is
given by $(0, 0)^t$, the origin of the local 2D frame which, as
discussed in section 2.2, is the projection of the axis point $A(s)$ no
matter what the viewpoint (given by $\alpha$ and $\beta$) is □.

^4 there may be many ways to describe 2D surfaces (such as different
classes of ribbons), not all of them being projections of 3D
descriptions. For example, points on the same cross-section of a 2D
ribbon are not necessarily projections of points on the same 3D
cross-section. We call a 2D description consistent when it does relate
projections of corresponding 3D points (i.e. the 2D description is the
projection of the 3D one).

Property 2: In an orthographic projection of a Circular PRCGC,
correspondence segments and tangents to the 2D axis at their midpoints
are orthogonal.

Proof: reporting the results of the previous analysis in equation (9),
we obtain the expression of the correspondence segment
$r\,(0, \pm 2)^t$. The 2D axis having been shown to be the projection of
the 3D axis, its tangent vector is the projection of the 3D tangent
vector $\mathbf{t}$. This latter, from the projection equation (7), is
given by $(-\sin\beta, 0)^t$ which is orthogonal to the previous
correspondence segment □.

Figure 5 Projective invariants for special Circular PRGCs: a. SOR,
b. Circular PRCGC. In both cases the correspondence segments are
orthogonal to the 2D axis, and the 2D axis and the projection of the 3D
axis coincide.

Note that it can be easily verified that for a Circular PRCGC, the
length of the correspondence segments (in 2-D) is constant and equal to
the (constant) diameter of the 3-D cross-section.

3.2  SORs 

The following properties are related to the bilateral symmetry property
derived by Nalwa in [Nalwa 1989] where non algebraic proofs were used.
We state our properties here and give algebraic proofs.

Property  3:  In  an  orthographic  projection  of  a  SOR, 
the  2D  axis  and  the  projection  of  the  3D  axis  coincide 
regardless  of  the  viewing  direction. 

Proof: from equation (5), we have
$\cos(\theta_1 - \alpha) = \cos(\theta_2 - \alpha)$ and thus
$\sin(\theta_1 - \alpha) + \sin(\theta_2 - \alpha) = 0$. Reporting this
in equation (10), the 2D axis point is given by
$(r/2)\,(\cos\beta\,[\cos(\theta_2 - \alpha) + \cos(\theta_1 - \alpha)],\; 0)^t$
which is always a point on the u-axis of the local 2D frame; i.e. the
direction of the projection of the tangent to the 3D axis. Note that
unlike property 1, this point does not coincide with the projection of
the 3D axis point $A(s)$. However, since the 3D axis is straight, its
projection is also straight and it is determined by the origin of the
local 2D frame (projection of $A(s)$) and the projection of the 3D
tangent; i.e. the u-axis □.

Property 4: In an orthographic projection of a SOR, correspondence
segments and tangents to the 2D axis at their midpoints are orthogonal.

Proof: from the previous proof and equation (9), the correspondence
segment is given by
$r\,(0,\; \sin(\theta_2 - \alpha) - \sin(\theta_1 - \alpha))^t$; i.e.
parallel to the v-axis of the local 2D frame. But also from the previous
proof the 2D axis is the u-axis, orthogonal to the v-axis □.

These properties say that in the image plane, the projection of a
Circular PRCGC or of a SOR can be consistently described by a ribbon
with 2D cross-section (correspondence segments) orthogonal to the (2D)
axis tangent (so-called Brooks' ribbon with right angle).

4 Quasi-invariant properties of general Circular PRGCs

In the more general case where the axis is non straight and the
cross-section non constant, the limb equation (4) can be rewritten as
follows (dropping the argument $s$ and, as previously mentioned,
normalizing the scaling function by fixing $\rho = 1$):

$$(1 - e\cos\theta)\cos(\theta - \alpha) = \dot{r}\cot\beta \qquad (11)$$

where $e = \kappa r = r/R$ ($R$ being the radius of curvature of the
axis). $e$ is a local measure of the relative thickness of the shape of
the Circular PRGC. Smaller values indicate rather elongated surfaces
(small cross-section radius compared to axis curvature radius) whereas
larger values indicate thick and curved surfaces. We will call $e$ the
thickness ratio. $\dot{r}$ is a measure of how fast the cross-section
changes its size (radius).
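
Equation (11) has no simple closed form in general, but its roots are
easy to bracket numerically. The sketch below is one possible approach
and not necessarily the solver used for the results reported here: it
samples $\theta$ over $[0, 2\pi)$, looks for sign changes and refines
each one by bisection. The function name and the sample count are
assumptions, and tangential (double) roots are not detected.

```python
import math

def limb_angles(alpha, beta, e, r_dot, samples=3600):
    """Roots theta in [0, 2*pi) of (1 - e*cos(theta))*cos(theta - alpha) = r_dot*cot(beta)."""
    rhs = r_dot / math.tan(beta)

    def f(theta):
        return (1.0 - e * math.cos(theta)) * math.cos(theta - alpha) - rhs

    roots = []
    step = 2.0 * math.pi / samples
    for k in range(samples):
        lo, hi = k * step, (k + 1) * step
        f_lo, f_hi = f(lo), f(hi)
        if f_lo == 0.0:
            roots.append(lo)  # a sample point that happens to be a root
        elif f_lo * f_hi < 0.0:
            # Sign change: refine the bracket [lo, hi] by bisection.
            for _ in range(60):
                mid = 0.5 * (lo + hi)
                f_mid = f(mid)
                if f_lo * f_mid <= 0.0:
                    hi = mid
                else:
                    lo, f_lo = mid, f_mid
            roots.append(0.5 * (lo + hi))
    return roots

sor_roots = limb_angles(alpha=0.4, beta=1.0, e=0.0, r_dot=0.3)      # straight axis
prcgc_roots = limb_angles(alpha=0.4, beta=1.0, e=0.3, r_dot=0.0)    # constant radius
general_roots = limb_angles(alpha=0.7, beta=1.2, e=0.2, r_dot=0.2)  # general case
```

With $e = 0$ the roots reduce to the symmetric SOR solutions of equation
(5), and with $\dot{r} = 0$ they reduce to $\alpha \pm \pi/2$ as in
equation (6), both taken modulo $2\pi$.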

Equation (11) indicates that two pairs of parameters affect the behavior
of the contours of a Circular PRGC:

• $(\alpha, \beta)$ corresponding to the viewing direction

• $(e, \dot{r})$ corresponding to the object parameters (local shape
measures)

For the two sub-cases of Circular PRGCs, the study was simplified by
$e = 0$ for SORs and $\dot{r} = 0$ for Circular PRCGCs (for cylinders,
the simplest Circular PRGC, both are zero). In the general case when
both are non zero, the properties discussed previously are not
invariant. This can easily be seen by expressing the tangent $T_{2D}$ to
the 2D axis in the general case (omitting the details of its
derivation):

$$T_{2D} = \begin{pmatrix} 0.5\cos\beta\,[\dot{r}(c_1 + c_2) - r(\dot\theta_1 s_1 + \dot\theta_2 s_2)] - \sin\beta\,(1 - 0.5\,\kappa r(\cos\theta_1 + \cos\theta_2)) \\ 0.5\,[\dot{r}(s_1 + s_2) + r(\dot\theta_1 c_1 + \dot\theta_2 c_2)] \end{pmatrix} \qquad (12)$$

where $c_i = \cos(\theta_i - \alpha)$, $s_i = \sin(\theta_i - \alpha)$,
$\dot\theta_i = \partial\theta_i/\partial s$ for $i = 1, 2$ (at limb
points).

Using the expression of the correspondence segment (eq. (9)), the dot
product with the 2D axis tangent is given by

$$C \cdot T_{2D} = r\,\{\,0.5\cos^2\beta\,(c_2 - c_1)[\dot{r}(c_1 + c_2) - r(\dot\theta_1 s_1 + \dot\theta_2 s_2)] - \sin\beta\cos\beta\,(c_2 - c_1)[1 - 0.5\,\kappa r(\cos\theta_1 + \cos\theta_2)] + 0.5\,(s_2 - s_1)[\dot{r}(s_1 + s_2) + r(\dot\theta_1 c_1 + \dot\theta_2 c_2)]\,\} \qquad (13)$$

This expression has been proven to be zero for SORs (where
$\kappa = 0$, $\dot\theta_1 = -\dot\theta_2$, $c_1 = c_2$ and
$s_1 = -s_2$) and for Circular PRCGCs (where $\dot{r} = 0$,
$c_1 = c_2 = 0$ and $s_1 = -s_2 = \pm 1$), in the previous analysis. In
the general case, it is non zero.

No property has been found to be projective invariant for the general
case (and we conjecture that none exist). However, we demonstrate that
the properties discussed previously are quasi-invariant properties of
general Circular PRGCs with respect to both the viewing direction
parameters $(\alpha, \beta)$ (i.e. orthographic quasi-invariant) and to
$(e, \dot{r})$ (i.e. object parameters quasi-invariant). For this we
have to:

a) define our space of observation (space of values of the parameters)

b) show that the property holds up to some small range over most of that
space

4.1  Space  of  observation 

The whole space of observation is a 4-D space
$(\alpha, \beta, e, \dot{r})$. For the same object, those parameters
vary as $s$ varies. $\alpha$ and $\beta$ can take any values on the
viewing sphere for which the limb equation admits a finite (non empty)
set of solutions; i.e. $\alpha \in [0, 2\pi)$ and
$\beta \in (0, \pi) \cup (\pi, 2\pi)$. Letting $\alpha \in [0, \pi]$ and
$\beta \in (0, \pi)$ is sufficient as the limb equation (11) is
symmetric with respect to $\pi$ for $\alpha$ and $\beta$.

The 2D subspace $(e, \dot{r})$ is also constrained. We have
$|e| \le 1$ (the cross-section radius is smaller than the radius of
curvature of the axis), otherwise as mentioned previously, the surface
self-intersects. $\dot{r}$ is also constrained, since
$|1 - e\cos\theta| \le 2$ (as $|e| \le 1$) and thus, from equation (11),
$|\dot{r}| \le 2\,|\tan\beta|$. This implies that, at each point, the
closer the viewing direction is to the axis tangent direction (i.e.
smaller $|\tan\beta|$), the smaller $|\dot{r}|$ has to be (otherwise
there would be no visible limbs). Thus, for a surface point where
$\beta = 15°$, for example, the cross-section can have limbs only if
$|\dot{r}| < 0.53$. Objects seen in daily environments, such as animal
limbs or industrial parts, appear not to have, at the same time, high
values of $e$ and $\dot{r}$; i.e. when $e$ is high $|\dot{r}|$ is small
(otherwise the thickness ratio would rapidly increase, which would cause
self-intersection) and when $|\dot{r}|$ is high $e$ is small. Shapes
with high thickness ratio and high sweeping slope self-occlude over most
viewing directions. In the analysis, values of $|\dot{r}|$ will be given
as rates of change of the cross-section radius $r$ per unit arclength
along the axis.
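
The bound on the sweep rate follows from $|\cos(\theta - \alpha)| \le 1$
and $|1 - e\cos\theta| \le 2$ in equation (11), and it can be evaluated
directly. The helper below is a small illustrative sketch; its name is
hypothetical.

```python
import math

def max_sweep_rate(beta_degrees):
    """Necessary bound 2*|tan(beta)| on |r_dot| for limbs to exist (from eq. (11))."""
    return 2.0 * abs(math.tan(math.radians(beta_degrees)))

bound_15 = max_sweep_rate(15.0)  # about 0.536
bound_45 = max_sweep_rate(45.0)  # about 2.0
```

The bound is necessary but not sufficient: it uses the worst case
$|1 - e\cos\theta| = 2$, which is reached only when $|e| = 1$.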

4.2  Quasi-invariant  properties 

Property 5: In an orthographic projection of a Circular PRGC,
orthogonality of correspondence segments and tangents to the 2D axis at
their midpoints is a quasi-invariant property (the angle they form is
"almost" right over most of the space of observation) with respect to
the viewing direction $(\alpha, \beta)$ and object parameters
$(e, \dot{r})$.

Note that this property is related to the description of the projection
of a Circular PRGC by a Brooks' ribbon with right angle.

To prove property 5 and analyze to what extent the 2D angle is "almost"
right, we analyze the behavior of the angle between a correspondence
segment and the tangent to the 2D axis for parameter values in our space
of observation. (Notice that in 3D, the segment connecting limb points
is orthogonal to the tangent to the 3D axis since the cross-section is
planar and orthogonal to the axis). Unlike an invariant property,
algebraic analysis of a quasi-invariant property is difficult. Instead,
we analyze it numerically by quantizing the space of observation and,
for each point $(\alpha, \beta, e, \dot{r})$ of the space, solving the
limb equation and deriving the projections of corresponding limb points
and the tangent to the 2D axis. Although there exists an analytical
expression for the tangent to the 2D axis (eq. (12)), it requires
knowledge of how $\theta$ varies with respect to $s$ at limb points,
which is not known in general. Instead, we have used the following
method (of which we omit the details):

for each set of parameters $(\alpha_1, \beta_1, e_1, \dot{r}_1)$ (at
some $s$) do

• select an arbitrary 3D frame
$F_1 = (\mathbf{t}_1, \mathbf{n}_1, \mathbf{b}_1)$

• solve the limb eq. (11) to obtain the two limb points $\theta_{11}$
and $\theta_{12}$

• determine $(\alpha_2, \beta_2, e_2, \dot{r}_2)$ at $s + ds$ (for a
small $ds$)

• solve eq. (11) for the second pair of limb points $\theta_{21}$ and
$\theta_{22}$ (at $s + ds$) and express their coordinates in $F_1$

• using eq. (8) determine the projection of the two pairs of points
$P(s, \theta_{11})$, $P(s, \theta_{12})$ and $P(s + ds, \theta_{21})$,
$P(s + ds, \theta_{22})$ (say $p_{11}$, $p_{12}$, $p_{21}$ and $p_{22}$)

• determine the angle between the correspondence segment given by
$p_{11}$ and $p_{12}$ and the 2D axis tangent given by the line joining
the midpoints of $p_{11}p_{12}$ and $p_{21}p_{22}$.
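
The steps above can be sketched in code as follows. Since the details of
the method are omitted here, this is a reconstruction under explicit
assumptions rather than the procedure actually used: the axis curvature
and the sweep rate are taken as locally constant over $ds$, the radius
is normalized to $r = 1$ at $s$ (so that $e = \kappa$), the axis is
expanded to second order, and the limb equation is solved by sampling
and bisection. All function names are hypothetical.

```python
import math

def limb_roots(alpha, beta, e, r_dot, samples=720):
    """Roots in [0, 2*pi) of eq. (11), found by sign changes and bisection."""
    rhs = r_dot / math.tan(beta)
    f = lambda th: (1.0 - e * math.cos(th)) * math.cos(th - alpha) - rhs
    step, roots = 2.0 * math.pi / samples, []
    for k in range(samples):
        lo, hi = k * step, (k + 1) * step
        if f(lo) * f(hi) < 0.0:
            for _ in range(60):
                mid = 0.5 * (lo + hi)
                if f(lo) * f(mid) <= 0.0:
                    hi = mid
                else:
                    lo = mid
            roots.append(0.5 * (lo + hi))
    return roots

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def orthogonality_angle(alpha, beta, e, r_dot, ds=1e-5):
    """Angle in degrees between the correspondence segment at s and the chord
    joining the 2D axis points at s and s + ds; None if limbs are missing."""
    kappa = e  # assumption: r = 1 at s, so e = kappa * r = kappa
    # Frame F1 at s is the standard basis; V, u, v stay fixed (eqs. (3), (7)).
    V = (math.cos(beta), math.sin(beta) * math.cos(alpha), math.sin(beta) * math.sin(alpha))
    u = (-math.sin(beta), math.cos(beta) * math.cos(alpha), math.cos(beta) * math.sin(alpha))
    v = (0.0, -math.sin(alpha), math.cos(alpha))
    phi = kappa * ds  # rotation of the Frenet frame about b over ds
    sections = [  # (axis point, t, n, b, radius) at s and at s + ds
        ((0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0), 1.0),
        ((ds, 0.5 * kappa * ds * ds, 0.0),
         (math.cos(phi), math.sin(phi), 0.0),
         (-math.sin(phi), math.cos(phi), 0.0),
         (0.0, 0.0, 1.0), 1.0 + r_dot * ds),
    ]
    projected = []
    for A, t, n, b, r in sections:
        beta_k = math.acos(dot(V, t))               # viewing angles in this frame
        alpha_k = math.atan2(dot(V, b), dot(V, n))
        roots = limb_roots(alpha_k, beta_k, kappa * r, r_dot)
        if len(roots) != 2:
            return None
        pts = []
        for th in roots:
            P = tuple(A[i] + r * (math.cos(th) * n[i] + math.sin(th) * b[i]) for i in range(3))
            pts.append((dot(P, u), dot(P, v)))      # orthographic projection
        projected.append(pts)
    (p11, p12), (p21, p22) = projected
    seg = (p12[0] - p11[0], p12[1] - p11[1])
    tangent = (0.5 * (p21[0] + p22[0] - p11[0] - p12[0]),
               0.5 * (p21[1] + p22[1] - p11[1] - p12[1]))
    cos_angle = dot(seg, tangent) / (math.hypot(*seg) * math.hypot(*tangent))
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_angle))))

angle_general = orthogonality_angle(alpha=1.2, beta=1.4, e=0.2, r_dot=0.2)
```

For the special cases $e = 0$ and $\dot{r} = 0$ the returned angle is
90° up to the finite-difference error, as expected from properties 4 and
2. Sweeping the four parameters over a grid and counting the points
where the angle stays within a tolerance of 90° would give percentages
of the kind reported in this section, although the exact figures depend
on the quantization and on the assumptions listed above.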

We have derived the angles in the image between correspondence segments
and 2D axis tangents over the space of observation defined by
$\alpha \in [0, \pi]$, $\beta \in (0, \pi)$, $e \le 0.5$ and
$|\dot{r}| < 0.5$ (sweep rates less than half the current cross-section
radius per unit arclength). The results show that over 84.30% of that 4D
space (excluding special values $e = 0$ or $\dot{r} = 0$; i.e. SORs and
Circular PRCGCs) the 2D angle is within 5° of 90° and for over 92.63% of
the space is within 10° of 90°. Table 1 summarizes the sizes (in
percent) of the regions on the viewing sphere where the 2D angle is
within 5° of 90° for certain values of $(e, \dot{r})$. The size is with
respect to the space region where limbs exist. (It can be seen, from
this table, that the size of the space where the property holds (2D
angle within 5° of being right) gradually decreases as both $e$ and
$\dot{r}$ take higher values). Figure 6a shows a 3D plot of the 2D angle
as a function of $(\alpha, \beta)$ for $(e, \dot{r}) = (0.2, 0.2)$ and
Figure 6b shows the corresponding display of the half viewing sphere
($(\alpha, \beta)$ sub-space). This last figure shows where the property
holds, where it does not hold and where limbs do not exist. In the
display, vertical circles correspond to constant $\beta$ values with a
5° step.


                   e = 0.1    e = 0.2    e = 0.3    e = 0.4
$\dot{r}$ = 0.1     96.72      92.44      85.50     [missing]
$\dot{r}$ = 0.2     96.86      89.44      83.62      78.55
$\dot{r}$ = 0.3     97.11      88.25      80.05      74.83
$\dot{r}$ = 0.4     97.29      87.57      78.72      72.86

Table 1 Viewing sets sizes (percent) where the 2D angle is within 5° of
90°

Figure 6 Property 5 for $(e, \dot{r}) = (0.2, 0.2)$. a. 3D plot.
b. half viewing sphere. Legend: thin lines, 2D angle within 5° of 90°;
medium and thick lines mark the remaining regions.

Notice that the sub-space where the property holds is connected and the
property is well behaved. It tends to gradually degrade for small values
of $|\tan\beta|$, that is close to regions where limbs do not exist.
Notice that a small value of $|\tan\beta|$ requires a very specific
viewpoint where the viewing direction $V$ is not only close to being in
the axis plane but also almost parallel to the 3D axis tangent; i.e. an
unlikely viewpoint. Therefore, even if $V$ is close to being in the axis
plane, the property would degrade only at points where the axis tangent
is in the direction of $V$ (a small set of points) and it would still
hold for the rest of the surface (a much larger set).

Property 6: In an orthographic projection of a Circular PRGC, the
tangent to the 2D axis and the tangent to the projection of the 3D axis
for corresponding points are "almost" parallel (within a few degrees of
each other) over a large fraction of the space of observation.

To show this property and analyze the extent to which the two 2D tangent
vectors are "almost" parallel, we have used the method discussed
previously for property 5. The results showed that the two tangents are
within 3° of each other over 94.48% of the previous space of observation
and within 5° over 96.90% of that space. Table 2 gives the sizes of the
regions on the viewing sphere where the two tangents are within 3° of
each other for some values of $(e, \dot{r})$. Figure 7a gives the 3D
plot of the angle difference as a function of $(\alpha, \beta)$ for
$(e, \dot{r}) = (0.2, 0.2)$ and Figure 7b the corresponding half viewing
sphere, such as discussed for Figure 6b. The behavior of this
quasi-invariant property is similar to the previous one. It tends to
degrade only close to regions where limbs do not exist (for unlikely
viewpoints). Notice that because the size of the regions on the viewing
sphere where limbs do not exist is mainly influenced by $\dot{r}$ (as
discussed in section 4.1), the relative size of the region where the
property holds may tend to increase as $\dot{r}$ increases (the ratio is
over a smaller region, where limbs exist at all).

Table 2 Viewing sets sizes (percent) where the tangent to the 2D axis is
within 3° of the tangent to the projection of the 3D axis (entries: 100,
100, 99.00, 96.30, 94.07; row and column labels not recoverable)

Figure 7 Property 6 for $(e, \dot{r}) = (0.2, 0.2)$. a. 3D plot.
b. half viewing sphere. Legend: thin lines, 2D angle difference is 3° or
less; thick lines, 2D angle difference is greater than 3°.

Notice that for SORs and Circular PRCGCs this property is a projective
invariant as we have proved the stronger property that the two axes
coincide. For Circular PRCGCs the tangents are the same since the axes
coincide precisely at corresponding points. For SORs, the axes are
collinear (although they do not coincide at corresponding points), thus
they have parallel tangents.

Notice also that it can be verified from equations (10) and (13), that
the properties hold exactly (perfect orthogonality and coincidence of
axes) for general Circular PRGCs for the two viewpoints, where the
viewing direction $V$ is orthogonal to the axis plane (i.e. side view,
$\cos\alpha = \cos\beta = 0$, $\theta_1 = 0$, $\theta_2 = \pi$) and
where it is in the axis plane (i.e. frontal view, $\sin\alpha = 0$,
$\cos\theta_1 = \cos\theta_2$, $\sin\theta_1 = -\sin\theta_2$).

These two properties are quasi-invariant with respect to transforms
parameterized over the 4D space of observation. Thus they are
orthographic quasi-invariants (with respect to $(\alpha, \beta)$) and
object-shape parameters quasi-invariants (with respect to
$(e, \dot{r})$). Usage of these properties is discussed in the next
section.

5 Application to shape description and recovery

In this section we discuss how the projective properties of the previous section can be used to derive constraints on shape recovery for Circular PRGCs. In section 5.1 we discuss the usage of property 5 for a consistent 2D shape description. In section 5.2 we discuss how constraints derived from both the 2D description (property 5) and property 6 can be used to recover the 3D shape of Circular PRGCs from a single image of their contours. The recovery method provides tests as to whether the contours are projections of Circular PRGCs, thus it does not make assumptions about the class of objects being viewed. At this stage, we assume that perfect contours are given, as our purpose is to address usage of the new projective quasi-invariants. Our ultimate goal is to handle real image contours. Previous experience with SHGCs shows that projective invariants help solve the figure-ground problem in complex real images [Zerroug & Nevatia 1993]. We plan to implement a similar approach for handling Circular PRGCs, using the derived quasi-invariants, in the near future.

5.1  2D  shape  description 

Property 5 indicates that the projection of a Circular PRGC can be described by a ribbon whose cross-sections (correspondence segments) are orthogonal to the (curved) axis. This type of ribbon has been addressed in the past by many researchers. Nevatia and Binford [Nevatia & Binford 1977] used it to describe complex objects from contours obtained from range data. In ACRONYM [Brooks 1983], Brooks used these ribbons in a model-based interpretation of image contours, and Rao and Nevatia [Rao & Nevatia 1989] used them to describe complex shapes from imperfect contours. In [Ponce 1988], Ponce compared those ribbons (Brooks' ribbons^) to other types of ribbons commonly used in the literature. We will call the ribbons in our analysis right ribbons.

In such previous work, right ribbons have been used as a means for 2D shape description without rigorously addressing the relationship between the obtained 2D descriptions and the 3D descriptions of viewed objects. In this discussion, we have shown that right ribbons are indeed good descriptors of the orthographic projection of Circular PRGCs. They provide a consistent 2D description by identifying corresponding limb points, projections of points on the same cross-section of the 3D shape.

^Ribbons in our analysis are a special case of Brooks' ribbons with a right angle between cross-sections and axis tangents.

Detection of right ribbons has been addressed in [Nevatia & Binford 1977, Rao & Nevatia 1989]. The original method (projection method) consists of discretizing the orientations of the axis (or equivalently of the correspondence segments) and casting regularly spaced lines. We have used a variation of that method using a quadratic B-spline representation of the contours of Circular PRGCs. Line casting is done from the extremities of each B-spline segment. The complexity of this process is O(km), where k is the number of selected orientations and m the number of B-spline segments. In the original method, m is the number of points, which is much larger than the number of B-spline segments. Local right ribbons are detected for each pair of lines intersecting a pair of curves at nearby B-splines (Figure 8). The local axis can be determined analytically given the desired extremities and their tangents $T_0$ and $T_n$. A local right ribbon is hypothesized if there exists a quadratic B-spline segment (local axis) for those extremities and the following constraint is satisfied (see Figure 8):

$\mathrm{dist}(a_i, m_i)/\mathrm{dist}(p_i, q_i) < \epsilon \;\; \forall i$, where $a_i$ is a point on the local axis, $p_i$, $q_i$ are the hypothesized corresponding points (intersections of the line orthogonal to the axis at $a_i$ with opposite B-splines) and $m_i$ is the midpoint of $p_i q_i$; i.e. the local axis should be the locus of midpoints of correspondence segments.
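
The midpoint constraint can be illustrated with a short Python sketch. It assumes that the local axis has already been sampled and that the hypothesized corresponding points are available as 2D coordinates; the function name `is_local_right_ribbon` and the default tolerance `eps` are hypothetical and are not taken from the method described here.

```python
import math

def is_local_right_ribbon(axis_pts, p_pts, q_pts, eps=0.05):
    """Check that each axis sample lies near the midpoint of its correspondence segment.

    axis_pts, p_pts, q_pts are equal-length lists of (x, y) tuples: the sampled
    local axis and the hypothesized corresponding points on the two opposite
    contours. eps is an assumed tolerance on the relative midpoint offset.
    """
    for a, p, q in zip(axis_pts, p_pts, q_pts):
        width = math.dist(p, q)
        if width == 0.0:
            return False  # degenerate correspondence segment
        mid = ((p[0] + q[0]) / 2.0, (p[1] + q[1]) / 2.0)
        # relative distance between the axis point and the segment midpoint
        if math.dist(a, mid) / width >= eps:
            return False
    return True
```

Normalizing by the segment length makes the test independent of the local width of the ribbon, which is why a single threshold can be used along the whole axis.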


Figure  8  Local  right  ribbons  detection  using 
B-splines. 

Several such local right ribbons are usually obtained between each pair of curves. To obtain a global description (of the whole surface), we group these local ribbons based on the contiguity of their sides and their orientations. Selection of the 'best' global description uses measures of continuity of orientations of correspondence segments and axes. For lack of space, we omit further details of the method as they are similar to [Rao & Nevatia 1989]. Figure 9 shows the right ribbon descriptions obtained for the two Circular PRGCs of Figure 1 using the above method.

5.2  3D  shape  recovery 

The 2D descriptions can be used to recover complete 3D object-centered descriptions from 2D contours of Circular PRGCs. To do this, we have to recover the 3D cross-section, the 3D axis and the sweeping function.

Figure 9 Resulting right ribbon descriptions for Circular PRGCs of Figure 1.

For this, we assume that the viewpoint is general and that the two extremal cross-sections are (at least partially) visible. Since the cross-section is known to be circular, recovery of the two extremal 3D cross-sections can be done by finding the orientation of the 3D circles ($C_1$ and $C_2$ in Figure 10) whose projections coincide with the visible elliptic cross-sections. The 3D circle finding method also yields the 3D orientations $n_1$ and $n_2$ of the two extremal 3D cross-sections. The orientation of the 3D axis plane is, thus, $n_a = n_1 \times n_2$. The depth of the axis plane can be arbitrarily fixed. The 3D positions of the two extremal cross-sections are automatically fixed since their centers belong to the axis plane.
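
As an illustration of this step, the following Python sketch uses the standard orthographic relation between a circle and its elliptic image (the ratio of the semi-axes equals the cosine of the slant) to produce the two candidate normals, and then forms the axis-plane normal as a cross product. This is an assumed construction, not necessarily the 3D circle finding method used here; the viewing direction is assumed to be the z axis, and the names `circle_normal_candidates` and `cross` are hypothetical.

```python
import math

def circle_normal_candidates(a, b, minor_axis_angle):
    """Two unit normals of a 3D circle whose orthographic image is the given ellipse.

    a, b: semi-major and semi-minor axis lengths of the ellipse (a >= b, a > 0).
    minor_axis_angle: image-plane direction of the minor axis, in radians.
    The viewing direction is assumed to be the z axis.
    """
    cos_slant = b / a
    sin_slant = math.sqrt(max(0.0, 1.0 - cos_slant ** 2))
    cx, cy = math.cos(minor_axis_angle), math.sin(minor_axis_angle)
    # The normal tilts along the minor axis; its sign is ambiguous from one ellipse.
    n_plus = (sin_slant * cx, sin_slant * cy, cos_slant)
    n_minus = (-sin_slant * cx, -sin_slant * cy, cos_slant)
    return n_plus, n_minus

def cross(u, v):
    """Cross product of two 3-vectors, e.g. the axis-plane normal n_1 x n_2."""
    return (u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0])
```

The two candidates correspond to the two possible solutions for each ellipse; one of them has to be selected with an additional rule such as the visibility argument given in the footnote.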


Figure  10  Recovering  3D  models  of  Circular 
PRGCs  from  2D  contours 


As previously mentioned, the 3D axis is not the backprojection of the 2D axis on the axis plane, as property 6 relates only 2D tangent orientations, not axis points. However, the 2D description, based on property 5, gives us a good approximation of corresponding points (projections of points on the same 3D cross-section; $p_1$ and $p_2$ in Figure 10). Moreover, property 6 gives us a good approximation of the orientation of the projection, say $w$, of the 3D axis tangent (3D cross-section orientation). The (unique) 3D cross-section whose projection passes through $p_1$ and $p_2$ can be determined as follows (see Figure 10):

•  Backproject $w$ on the 3D axis plane to obtain the orientation of the 3D cross-section, say $W$.

•  In 3D, rotate a reference extremal cross-section, say $C_1$, by the rotation $R(n_1, W)$ which rotates $n_1$ to $W$. Call the resulting circle $C_w$ and its projection $c_w$.

•  Find, on $c_w$, the two points whose tangents are the same as the tangents $t_1$ and $t_2$ to the projected limbs at $p_1$ and $p_2$^. Call them $q_1$ and $q_2$.

•  Since the length ratio of parallel segments is an orthographic invariant, the scale of the desired cross-section with respect to the reference one is given by $r = \mathrm{dist}(p_1, p_2)/\mathrm{dist}(q_1, q_2)$.

•  Scale $c_w$ by $r$ and translate it so that $p_1$ and $q_1$ coincide. Call the resulting cross-section $c_p$.

•  The desired 3D cross-section is the backprojection, $C_p$, of $c_p$ so that its center backprojects on the 3D axis plane. This gives us the 3D cross-sections (thus, the 3D sweep as well).

•  The 3D axis is the locus of the centers of the 3D cross-sections so recovered.

^Of the two possible solutions for each ellipse, we choose the one that makes the exterior part (outside the ribbon) point away from the viewer. In the case of a visible half ellipse, the part of the 3D circle corresponding to the visible half ellipse is made closer to the viewer than the one corresponding to the invisible (occluded) part.
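
The scaling and translation steps of the list above can be sketched in a few lines of Python. The sketch assumes that the projected reference cross-section is available as a list of 2D points and that $q_1$, $q_2$ have already been found on it; the helper names `scale_factor` and `place_cross_section` are hypothetical.

```python
import math

def scale_factor(p1, p2, q1, q2):
    """Ratio of lengths of the parallel segments p1p2 and q1q2."""
    return math.dist(p1, p2) / math.dist(q1, q2)

def place_cross_section(curve, p1, q1, r):
    """Scale a 2D curve by r about q1 and translate it so that q1 lands on p1."""
    return [(p1[0] + r * (x - q1[0]), p1[1] + r * (y - q1[1])) for x, y in curve]
```

When the segments $p_1 p_2$ and $q_1 q_2$ are parallel and equally oriented, the placed curve passes through both $p_1$ and $p_2$, which is the property the construction relies on.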

Results of the application of this 3D recovery method to the descriptions of Figure 9 are shown in Figure 11. The figure shows the 3D ruled surfaces, with cross-sections, meridians and the 3D axes, and the corresponding shaded displays, using different poses of the recovered 3D shapes.

This method produces estimated 3D shapes from the observed 2D contours, since the determined 2D correspondences are themselves estimates of the actual ones. However, the derived quasi-invariants show that the projections of the 3D correspondences and the estimated 2D correspondences are close to each other over most viewing directions. Therefore, the (unique) backprojections of those 2D correspondences are close to the actual 3D ones; i.e. the recovered 3D shapes are good approximations of the actual ones. Notice also that the method does not make assumptions about the viewed objects. It provides tests as to whether the contours are projections of Circular PRGCs. The tests involve consistency of the orientations of the extremal cross-sections as predicted by the 2D correspondences and as determined by the 3D circle finding method. The former orientations are determined by the backprojections of the 2D axis tangents, at the extremities of the surface, on the axis plane. The latter ones are the normals $n_1$ and $n_2$ mentioned previously. Other objects would produce inconsistencies in those orientations, as the extremities of the limbs would not be matched by the right ribbon finding method (we have found that non-Circular PRGCs do not satisfy properties 5 and 6 as Circular PRGCs do).


^In the projection, cross-sections and limbs are mutually tangent, and vector parallelism is an orthographic invariant.

Figure 11 Results of 3D shape recovery for the previous two Circular PRGCs.

6  Conclusion 

We have addressed projective properties of the contours of complex GC primitives (Circular PRGCs). Previous work on using projective properties of GCs for interpretation of 2D contours has addressed relatively simple primitives having either a straight axis or a constant size cross-section, where invariant properties have been derived and used for 3D shape recovery. In the case of Circular PRGCs (non-straight axis and non-constant cross-section size), we have derived rigorous quasi-invariant properties for the general case and have shown that they are invariant for special cases. Those quasi-invariants have been derived by analyzing the behavior of limb projections as a function of viewing parameters and object shape parameters. The derived quasi-invariants provide strong constraints for consistent 2D descriptions and 3D shape recovery from 2D contours, and we have demonstrated their application on some examples. We believe that the results derived in this analysis constitute important progress towards handling complex objects with non-simple primitives.

In the near future, we plan to address recovery of Circular PRGCs from real image contours. Other efforts have shown that projective invariants help solve the figure-ground problem for SHGCs in real image contours with breaks, markings, and occlusion [Zerroug & Nevatia 1993]. We believe that quasi-invariants are also a source of strong constraints for the figure-ground problem for non-simple GC primitives. Our aim is to develop a system that handles compound objects made up of several GC primitives.
Abstract

This paper describes a pair of projectivity invariants of four lines in three dimensional projective space, $\mathcal{P}^3$. Invariants are derived in both algebraic and geometric terms, and the connection between the two ways of defining the invariants is established. Since a count of the number of degrees of freedom would predict the existence of a single invariant, rather than the two that are shown to exist, an isotropy of the four lines must exist. The nature of this isotropy is investigated.

It is shown that once the epipolar geometry is known, the invariants of four lines may be computed from the images of the four lines in two distinct views with uncalibrated cameras. An example with real images is computed to show that the invariants are effective in distinguishing different geometrical configurations of lines.

1  Introduction 

Projective invariants of geometrical configurations in space have recently received much attention because of their application to vision problems ([Mundy-Zisserman-92]). Although invariants of a wide range of objects in the 3-dimensional projective space $\mathcal{P}^3$ do exist ([Abhyankar-92]), one is restricted in vision applications to considering those that may be computed from two-dimensional projections (images). For point sets and more structured geometrical objects lying in planes, many invariants exist ([Coelho-92]) which can be computed from a single view. Unfortunately, it has been shown in [Burns-92] that no invariants of arbitrary point sets in 3 dimensions may be computed from a single image. One is led either to consider constrained sets of points, or else to allow two independent views of the object. An example of the first approach is contained in [Zisserman-92], which considers solids of revolution. This paper takes the second course and considers invariants that can be derived from two views of an object. It has been recently proven in [Faugeras-92] and [Hartley-Gupta-92] that a 3-dimensional scene may be constructed up to a projectivity of space from two views with uncalibrated cameras. This allows us to compute invariants of 3-dimensional configurations from two views. Invariants of six points in space have been suggested in [Faugeras-92] and [Hartley-Gupta-92] and verified in [Hartley-93a] to be useful at distinguishing different point configurations. The present paper considers invariants of straight lines in $\mathcal{P}^3$ computable from a pair of images. Since straight lines occur commonly in man-made objects and may be effectively extracted from the image using an edge extraction algorithm, invariants of sets of lines may prove to be more useful than invariants of point sets in object recognition applications.

The invariants of lines in space cannot be computed from two views of lines only. It may be seen that virtually no information about the cameras can be derived from two views of a set of lines in space. This is because given two images of a line and two arbitrary cameras, there is always a line in space that corresponds to the two images. In other words, two images of an unknown line do not in any way constrain the cameras. This point is discussed in [Weng-92]. If on the other hand the epipolar geometry of the two views (as expressed in the essential matrix) is known, then the locations of lines may be determined up to a projectivity of $\mathcal{P}^3$ from their images in the two views. There are many ways of determining the epipolar geometry from views of points or lines in two or three images ([Higgins-81, Hartley-93b, Zisserman-Hartley-93]).

2  Line  Invariants 

In this section, invariants of lines in space will be described. It will be shown that four lines in the 3-dimensional projective space $\mathcal{P}^3$ give rise to two independent invariants under projectivity of $\mathcal{P}^3$. Two different ways of defining invariants will be described, one algebraic and one geometric.

2.1  Algebraic  Invariant  Formulation 

Consider four lines $\lambda_i$ in space. A line may be given by specifying either two points on the line or, dually, two planes that meet in the line. It does not matter in which way the lines are described. For instance, in the formulae (2) and (3) below certain invariants of lines are defined in terms of pairs of points on each line. The same formulae could be used to define invariants in which lines are represented by specifying a pair of planes that meet along the line. Since the method of determining lines in space from two views given in section 3.3 gives a representation of the line as an intersection of two planes, the latter interpretation of the formulae is most useful.

Nevertheless, in the following description of algebraic and geometric invariants of lines, lines will be represented by specifying two points, since this method seems to allow easier intuitive understanding. It should be borne in mind, however, that the dual approach could be taken with no change whatever to the algebra or geometry.

In specifying lines, each of two points on the line will be given as a 4-tuple of homogeneous coordinates, and so each line $\lambda_i$ is specified as a pair of 4-tuples

$$\lambda_i = \bigl( (a_{i1}, a_{i2}, a_{i3}, a_{i4}),\; (b_{i1}, b_{i2}, b_{i3}, b_{i4}) \bigr)$$


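Now, given two lines $\lambda_i$ and $\lambda_j$, one can form a 4 x 4 determinant, denoted by

$$|\lambda_i \lambda_j| = \det \begin{pmatrix} a_{i1} & a_{i2} & a_{i3} & a_{i4} \\ b_{i1} & b_{i2} & b_{i3} & b_{i4} \\ a_{j1} & a_{j2} & a_{j3} & a_{j4} \\ b_{j1} & b_{j2} & b_{j3} & b_{j4} \end{pmatrix} \qquad (1)$$

Finally, it is possible to define two independent invariants of the four lines by

$$I_1(\lambda_1, \lambda_2, \lambda_3, \lambda_4) = \frac{|\lambda_1 \lambda_2|\,|\lambda_3 \lambda_4|}{|\lambda_1 \lambda_3|\,|\lambda_2 \lambda_4|} \qquad (2)$$

and

$$I_2(\lambda_1, \lambda_2, \lambda_3, \lambda_4) = \frac{|\lambda_1 \lambda_2|\,|\lambda_3 \lambda_4|}{|\lambda_1 \lambda_4|\,|\lambda_2 \lambda_3|} \qquad (3)$$

It is necessary to prove that the two quantities so defined are indeed invariant under projectivities of $\mathcal{P}^3$. First, it must be demonstrated that the expressions do not depend on the specific formulation of the lines. That is, there are an infinite number of ways in which a line may be specified by designating two points lying on it, and it is necessary to demonstrate that choosing a different pair of points to specify a line does not change the value of the invariants. To this end, suppose that $(a_{i1}, a_{i2}, a_{i3}, a_{i4})^\top$ and $(b_{i1}, b_{i2}, b_{i3}, b_{i4})^\top$ are two distinct points lying on a line $\lambda_i$, and that $(a'_{i1}, a'_{i2}, a'_{i3}, a'_{i4})^\top$ and $(b'_{i1}, b'_{i2}, b'_{i3}, b'_{i4})^\top$ are another pair of points lying on the same line. Then, there exists a 2 x 2 matrix $D_i$ such that

$$\begin{pmatrix} a'_{i1} & a'_{i2} & a'_{i3} & a'_{i4} \\ b'_{i1} & b'_{i2} & b'_{i3} & b'_{i4} \end{pmatrix} = D_i \begin{pmatrix} a_{i1} & a_{i2} & a_{i3} & a_{i4} \\ b_{i1} & b_{i2} & b_{i3} & b_{i4} \end{pmatrix}$$

Consequently,

$$\begin{pmatrix} a'_{i1} & a'_{i2} & a'_{i3} & a'_{i4} \\ b'_{i1} & b'_{i2} & b'_{i3} & b'_{i4} \\ a'_{j1} & a'_{j2} & a'_{j3} & a'_{j4} \\ b'_{j1} & b'_{j2} & b'_{j3} & b'_{j4} \end{pmatrix} = \begin{pmatrix} D_i & 0 \\ 0 & D_j \end{pmatrix} \begin{pmatrix} a_{i1} & a_{i2} & a_{i3} & a_{i4} \\ b_{i1} & b_{i2} & b_{i3} & b_{i4} \\ a_{j1} & a_{j2} & a_{j3} & a_{j4} \\ b_{j1} & b_{j2} & b_{j3} & b_{j4} \end{pmatrix}$$

Taking determinants, it is seen that the net result of choosing a different representation of the lines $\lambda_i$ and $\lambda_j$ is to multiply the value of $|\lambda_i \lambda_j|$ by a factor $\det(D_i)\det(D_j)$. Since each of the lines $\lambda_i$ appears in both the numerator and denominator of the expressions (2) and (3), the factors will cancel and the values of the invariants will be unchanged.

Next, it is necessary to consider the effect of a change of projective coordinates. If $H$ is a 4 x 4 invertible matrix representing a coordinate transformation of $\mathcal{P}^3$, then it may be applied to each of the points used to designate the four lines. The result of applying this transformation is to multiply the determinant $|\lambda_i \lambda_j|$ by a factor $\det(H)$. The factors on the top and bottom cancel, leaving the values of the invariants (2) and (3) unchanged. This completes the proof that $I_1$ and $I_2$ defined by (2) and (3) are indeed projective invariants of the set of four lines.

An alternative invariant may be defined by

$$I_3(\lambda_1, \lambda_2, \lambda_3, \lambda_4) = \frac{|\lambda_1 \lambda_4|\,|\lambda_2 \lambda_3|}{|\lambda_1 \lambda_3|\,|\lambda_2 \lambda_4|} \qquad (4)$$

It is easily seen that $I_3 = I_1/I_2$. However, if $|\lambda_1 \lambda_2|$ vanishes, then both $I_1$ and $I_2$ are zero, but $I_3$ is in general non-zero. This means that $I_3$ cannot always be deduced from $I_1$ and $I_2$. A preferable way of defining the invariants of four lines is as a homogeneous vector

$$I(\lambda_1, \lambda_2, \lambda_3, \lambda_4) = \bigl( |\lambda_1 \lambda_2|\,|\lambda_3 \lambda_4| ,\; |\lambda_1 \lambda_3|\,|\lambda_2 \lambda_4| ,\; |\lambda_1 \lambda_4|\,|\lambda_2 \lambda_3| \bigr) \qquad (5)$$

Two such computed invariant values are deemed equal if they differ by a scalar factor. Note that this definition of the invariant avoids problems associated with vanishing or near-vanishing of the denominator in (2) or (3).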

The definitions of $I_1$ and $I_2$ are similar to the definition of the cross-ratio of points on a line. It is well known that for four points on a line, there is only one independent invariant. It may be asked whether $I_1$ may be obtained from $I_2$ by some simple arithmetic combination. This is not the case, as will become clearer when the connection of these algebraic invariants with geometric invariants is shown.
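
The invariance argument of this section can be checked numerically. The following Python sketch evaluates the homogeneous vector (5) with exact integer arithmetic and compares it before and after a change of representation and a projectivity. The four example lines, the matrix `H`, and all function names (`det`, `bracket`, `invariant_vector`, `same_up_to_scale`, `transform_line`, `reparametrize`) are illustrative assumptions, not part of the paper.

```python
def det(m):
    """Determinant by Laplace expansion along the first row (exact for integers)."""
    if len(m) == 1:
        return m[0][0]
    total = 0
    for c in range(len(m)):
        minor = [row[:c] + row[c + 1:] for row in m[1:]]
        total += (-1) ** c * m[0][c] * det(minor)
    return total

def bracket(line_i, line_j):
    """|lambda_i lambda_j|: the 4x4 determinant (1) of the points defining two lines."""
    return det([list(line_i[0]), list(line_i[1]), list(line_j[0]), list(line_j[1])])

def invariant_vector(l1, l2, l3, l4):
    """Homogeneous invariant vector (5) of four lines."""
    return (bracket(l1, l2) * bracket(l3, l4),
            bracket(l1, l3) * bracket(l2, l4),
            bracket(l1, l4) * bracket(l2, l3))

def same_up_to_scale(u, v):
    """True if two 3-vectors are proportional (all 2x2 cross products vanish)."""
    return all(u[i] * v[j] == u[j] * v[i] for i in range(3) for j in range(3))

def transform_line(H, line):
    """Apply a 4x4 projectivity H to both points defining a line."""
    return tuple(tuple(sum(H[r][c] * p[c] for c in range(4)) for r in range(4))
                 for p in line)

def reparametrize(line, D):
    """Describe the same line by another pair of points, using a 2x2 matrix D."""
    a, b = line
    return (tuple(D[0][0] * x + D[0][1] * y for x, y in zip(a, b)),
            tuple(D[1][0] * x + D[1][1] * y for x, y in zip(a, b)))

# Four pairwise skew example lines (an arbitrary choice for illustration).
lines = [((1, 0, 0, 0), (0, 1, 0, 0)),
         ((0, 0, 1, 0), (0, 0, 0, 1)),
         ((1, 0, 1, 0), (0, 1, 0, 1)),
         ((1, 0, 2, 0), (0, 1, 1, 3))]
H = [[1, 2, 0, 0], [0, 1, 3, 0], [0, 0, 1, 1], [1, 0, 0, 2]]  # invertible

original = invariant_vector(*lines)
moved = invariant_vector(*[transform_line(H, l) for l in lines])
print(original, moved, same_up_to_scale(original, moved))
```

Comparing the vectors by cross-multiplication rather than by dividing components mirrors the motivation for (5): the test remains well defined when one of the determinants vanishes.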

2.2  Degenerate  Cases 

The determinant $|\lambda_i \lambda_j|$ as given in (1) will vanish if and only if the four points involved are coplanar, that is, exactly when the two lines are coincident (meet in space). If all three components of the vector $I(\lambda_1, \lambda_2, \lambda_3, \lambda_4)$ given by (5) vanish, then the invariant is undefined. Enumeration of cases indicates that there are two essentially different configurations of lines in which this occurs.

1.  Three  of  the  lines  lie  in  a  plane. 

2.  One  of  the  lines  meets  all  the  other  three. 

The configuration where one line meets two of the other lines is not degenerate, but does not lead to very much useful information, since two of the components of the vector vanish. Up to scale, the remaining component may be assumed to equal 1, which means that two such configurations cannot be distinguished. In fact any two such configurations are equivalent under projectivity.

2.3  Geometric  Invariants  of  Lines 

Consider four lines $\lambda_i$ in general position (which means that no two of them are coincident) in $\mathcal{P}^3$. It will be shown that there exist exactly two further lines $\tau_1$ and $\tau_2$, called transversals, which meet each of the four lines. Once this is established, it is easy to define invariants.

The points of intersection of each of the four lines $\lambda_i$ with one of the transversals $\tau_j$ constitute a set of four points on a line in $\mathcal{P}^3$. The cross ratio of these points is an invariant of the four lines $\lambda_i$. In this way, two invariants may be defined, one for each of the two transversals.
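
For readers who want to experiment, the following Python sketch computes the cross ratio of four points on a line from their parameters along the line and checks that it is unchanged by a 1D projectivity. The particular ordering convention used in `cross_ratio` is an assumption (several equivalent conventions exist), and the function names are hypothetical.

```python
from fractions import Fraction

def cross_ratio(a, b, c, d):
    """Cross ratio (a, b; c, d) of four points given by their parameters on a line."""
    return Fraction((a - c) * (b - d), (a - d) * (b - c))

def moebius(t, m):
    """Apply the 1D projectivity t -> (m00*t + m01) / (m10*t + m11)."""
    (m00, m01), (m10, m11) = m
    return Fraction(m00 * t + m01, m10 * t + m11)

params = (0, 1, 2, 3)
m = ((2, 1), (1, 3))  # an arbitrary invertible 2x2 matrix
images = [moebius(t, m) for t in params]
print(cross_ratio(*params), cross_ratio(*images))
```

A projectivity of space restricted to a transversal acts on it as such a 1D projectivity, which is why the cross ratio of the four intersection points is an invariant of the four lines.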

Invariants may be defined in a dual manner as follows. Given a transversal, $\tau_j$, meeting each of the lines $\lambda_i$, there exists, for each $\lambda_i$, a plane denoted $\langle \tau_j, \lambda_i \rangle$, containing $\tau_j$ and $\lambda_i$. This gives rise to a set of four planes meeting in a common line $\tau_j$. The cross-ratio of this set of planes is an invariant of the lines $\lambda_i$.

It is easy to see that this dual construction does not give rise to any new invariant. Specifically, consider the cross-ratio of the four planes meeting at $\tau_1$. The cross-ratio of four planes meeting along a line is equal to the cross-ratio of the points of intersection of the planes with any other non-coincident line in space. The line $\tau_2$ is such a line. Hence, the cross ratio of the planes $\langle \tau_1, \lambda_i \rangle$ is equal to the cross-ratio of the points $\langle \tau_1, \lambda_i \rangle \cap \tau_2$, where the symbol $\cap$ denotes the point of intersection. However, plane $\langle \tau_1, \lambda_i \rangle$ meets $\tau_2$ in the point $\lambda_i \cap \tau_2$. In other words, the cross-ratio of the four planes meeting along $\tau_1$ is equal to the cross-ratio of the four points along $\tau_2$, and vice-versa.

2.4  Existence  of  Transversals 

To  prove  the  existence  of  transversals,  we  start  by 
considering  three  lines  in  space. 

Lemma 1. There exists a unique quadric surface containing three given lines $\lambda_1$, $\lambda_2$ and $\lambda_3$ in general position in $\mathcal{P}^3$.

Proof. For a reference to properties of quadric surfaces, the reader is referred to [Semple-Kneebone-52]. It is shown there that a quadric surface is a doubly ruled surface containing two families of lines A and B. Two lines from the same set A or B do not meet, whereas any two lines chosen one from each set will always meet. Assuming that the lines $\lambda_i$ lie on a quadric surface, since they do not meet, they must all come from the same family, which we assume to be A. Now consider any point $\mathbf{x}$ on the quadric surface. There is a unique line passing through $\mathbf{x}$ and belonging to the class B. This line must meet each of the lines $\lambda_i$, which belong to class A.

We are led therefore to consider the locus of all points $\mathbf{x}$ in $\mathcal{P}^3$ for which there exists a line passing through $\mathbf{x}$ meeting all the lines $\lambda_i$, $i = 1, \ldots, 3$. To this end, let $\mathbf{x} = (x, y, z, t)^\top$ be a point on this locus. For each of the lines $\lambda_i$ we may define a plane $\pi_i$ passing through $\mathbf{x}$ and $\lambda_i$. The condition that there exists a line passing through $\mathbf{x}$ meeting each $\lambda_i$ means that the three planes $\pi_i$ meet along that line.

Next, we formulate this last condition algebraically and give a method of computing the formula for the quadric surface. As before, letting $(a_{i1}, a_{i2}, a_{i3}, a_{i4})^\top$ and $(b_{i1}, b_{i2}, b_{i3}, b_{i4})^\top$ be two points on the line $\lambda_i$, the plane $\pi_i$ passing through $\mathbf{x} = (x, y, z, t)^\top$ and the line $\lambda_i$ may be computed as follows. Consider the matrix

$$\begin{pmatrix} a_{i1} & a_{i2} & a_{i3} & a_{i4} \\ b_{i1} & b_{i2} & b_{i3} & b_{i4} \\ x & y & z & t \end{pmatrix} \qquad (6)$$

The plane $\pi_i$ is given by the homogeneous vector $(p_{i1}, p_{i2}, p_{i3}, p_{i4})^\top$ where $(-1)^j p_{ij}$ is the determinant of the 3 x 3 matrix obtained by deleting the j-th column of (6). Consequently, each $p_{ij}$ is a homogeneous linear expression in $x$, $y$, $z$ and $t$. Furthermore, since the point $(x, y, z, t)^\top$ lies on this plane it follows that

$$x p_{i1} + y p_{i2} + z p_{i3} + t p_{i4} = 0 . \qquad (7)$$

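Now the fact that the three planes $\pi_i$ meet along a common line translates into the algebraic fact that the rank of the matrix

$$P = \begin{pmatrix} p_{11} & p_{12} & p_{13} & p_{14} \\ p_{21} & p_{22} & p_{23} & p_{24} \\ p_{31} & p_{32} & p_{33} & p_{34} \end{pmatrix}$$

is 2. This is equivalent to the condition

$$\det P^{(j)} = 0 \ \text{for all}\ j , \qquad (8)$$

where $P^{(j)}$ is the matrix obtained by removing the j-th column of $P$. Since each entry $p_{ij}$ of $P$ is a linear homogeneous expression in the variables $x$, $y$, $z$ and $t$, the determinant $\det(P^{(j)})$ is a cubic homogeneous polynomial. A point on the required locus must satisfy the condition $\det(P^{(j)}) = 0$ for $j = 1, \ldots, 4$. However, because of condition (7) these four equations are not independent. In particular, if $\mathbf{p}_j$ represents the j-th column of $P$, then (7) implies a relation

$$x\mathbf{p}_1 + y\mathbf{p}_2 + z\mathbf{p}_3 + t\mathbf{p}_4 = 0 .$$

Then

$$\begin{aligned}
x \det(P^{(4)}) &= x \det(\mathbf{p}_1\ \mathbf{p}_2\ \mathbf{p}_3) \\
&= \det(x\mathbf{p}_1\ \mathbf{p}_2\ \mathbf{p}_3) \\
&= \det(-y\mathbf{p}_2 - z\mathbf{p}_3 - t\mathbf{p}_4\ \ \mathbf{p}_2\ \mathbf{p}_3) \\
&= \det(-t\mathbf{p}_4\ \mathbf{p}_2\ \mathbf{p}_3) \\
&= -t \det(\mathbf{p}_2\ \mathbf{p}_3\ \mathbf{p}_4) \\
&= -t \det(P^{(1)}) .
\end{aligned} \qquad (9)$$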

This equation implies that $x$ divides $\det(P^{(1)})$ and $t$ divides $\det(P^{(4)})$. Furthermore, applying the same argument to other coordinates gives rise to an equation

$$\det(P^{(1)})/x = -\det(P^{(2)})/y = \det(P^{(3)})/z = -\det(P^{(4)})/t = R(x, y, z, t)$$

where $R(x, y, z, t)$ is some homogeneous degree-2 polynomial. Then the defining equations (8) of the locus become

$$x R(x, y, z, t) = y R(x, y, z, t) = z R(x, y, z, t) = t R(x, y, z, t) = 0 .$$

This implies that either $R(x, y, z, t) = 0$ or $x = y = z = t = 0$. The latter condition can be discounted, since $(0, 0, 0, 0)$ is not a valid set of homogeneous coordinates. Consequently, the desired locus is described by the degree-2 polynomial equation $R(x, y, z, t) = 0$, and is therefore a quadric surface. Since it is easily verified that the three original lines $\lambda_i$ lie on this surface, the proof of the lemma is complete. □
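
The construction in the proof can be tried on a concrete example. The Python sketch below builds the planes $\pi_i$ from the signed minors of (6), forms the matrix $P$, and evaluates the four determinants $\det(P^{(j)})$ at a given point. The three example lines are an arbitrary choice made for illustration (they lie on the quadric $xt - yz = 0$), and the function names are hypothetical.

```python
def det3(m):
    """Determinant of a 3x3 matrix."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)

def plane_through(a, b, x):
    """Plane through points a, b, x, from the signed 3x3 minors of the matrix (6)."""
    rows = [a, b, x]
    plane = []
    for j in range(4):  # j is the 0-based column index, so the sign is (-1)**(j + 1)
        minor = [[r[k] for k in range(4) if k != j] for r in rows]
        plane.append((-1) ** (j + 1) * det3(minor))
    return plane

def plane_matrix(lines, x):
    """The 3x4 matrix P whose rows are the planes through x and each line."""
    return [plane_through(a, b, x) for a, b in lines]

def column_minors(P):
    """[det P^(1), ..., det P^(4)]: determinants of P with one column removed."""
    return [det3([[row[k] for k in range(4) if k != j] for row in P])
            for j in range(4)]

# Three pairwise skew lines, each given by two points (illustrative choice).
three_lines = [((1, 0, 0, 0), (0, 1, 0, 0)),
               ((0, 0, 1, 0), (0, 0, 0, 1)),
               ((1, 0, 1, 0), (0, 1, 0, 1))]

off_quadric = (1, 1, 1, 2)   # x*t - y*z = 1
on_quadric = (1, 2, 2, 4)    # x*t - y*z = 0
print(column_minors(plane_matrix(three_lines, off_quadric)))
print(column_minors(plane_matrix(three_lines, on_quadric)))
```

For the point off the quadric the four determinants are non-zero and satisfy relation (9); for the point on the quadric they all vanish, in agreement with the rank condition (8).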

It  is  now  a  simple  matter  to  prove  the  existence  of 
transversals. 

Theorem 2. There exist exactly two transversals to four lines in general position in $\mathcal{P}^3$.

Proof. We choose three of the lines $\lambda_1$, $\lambda_2$ and $\lambda_3$ and construct the quadric surface $S$ that they all lie on. Let $\mathbf{x}_1$ and $\mathbf{x}_2$ be the two points of intersection of the fourth line $\lambda_4$ with the quadric surface. The construction of $S$ in Lemma 1 shows that any transversal to lines $\lambda_1$, $\lambda_2$ and $\lambda_3$ must lie on $S$. Further, the lines $\lambda_1$, $\lambda_2$ and $\lambda_3$ all belong to one of the families, A, of ruled lines on the quadric surface $S$. Let $\tau_1$ and $\tau_2$ be the lines in the other family B passing through $\mathbf{x}_1$ and $\mathbf{x}_2$. Then $\tau_1$ and $\tau_2$ are the two transversals to all four lines. □

Of course, it is possible that $\lambda_4$ does not meet the surface $S$ in any real point, or is tangent to $S$. The statement of the theorem must be interpreted as allowing complex or double solutions. In the case of four real lines in space, there are either two real transversals or two conjugate complex transversals. In the case of complex transversals, there is no conceptual difficulty in defining the invariants as in the real case. The cross-ratio of points of intersection of the lines with the two conjugate transversals will result in two invariants which are complex conjugates of each other.

Various degenerate sets of lines also allow two transversals. For instance, suppose that $\lambda_1$ and $\lambda_2$ are coincident, and so are $\lambda_3$ and $\lambda_4$. One transversal to the four lines passes through the two points of intersection of the pairs of lines. The other transversal is the line of intersection of the two planes defined by $\lambda_1$, $\lambda_2$ and by $\lambda_3$, $\lambda_4$. The cross-ratio invariant corresponding to the first transversal is zero, but the invariant corresponding to the second transversal is in general non-zero and is a useful invariant for this geometric configuration. This is similar to what happens for the algebraically defined invariants (see Section 2.1).

2.5  Independence  and  Completeness 

I shall now show that the two geometrically defined invariants are independent and together completely characterize the set of four lines up to a projectivity of $\mathcal{P}^3$.

To show independence, we start by selecting $\tau_1$ and $\tau_2$,
two arbitrary non-intersecting lines in space to serve as
transversals. Next, we mark off points $\mathbf{a}_1$, $\mathbf{a}_2$,
$\mathbf{a}_3$ and $\mathbf{a}_4$ along $\tau_1$ in such a way that
their cross-ratio is equal to any arbitrarily chosen invariant value.
Similarly, mark off along $\tau_2$ points $\mathbf{b}_1$,
$\mathbf{b}_2$, $\mathbf{b}_3$ and $\mathbf{b}_4$ having another
arbitrarily chosen cross-ratio invariant value. Now, joining
$\mathbf{a}_i$ to $\mathbf{b}_i$ for each $i$ gives a set of four
lines having the two arbitrarily chosen invariants.
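The construction above can be checked numerically. The following is an illustrative sketch only, not code from the paper. It assumes the transversals are the lines x = y = 0 and z = t = 0 (the choice made in Section 2.7), and it assumes the cross-ratio convention (p1, p2; p3, p4) = |p1 p2||p3 p4| / (|p1 p3||p2 p4|), which is not restated in this section. The names `det2`, `cross_ratio`, `four_lines` and `invariants_of` are hypothetical, and inputs are expected to be integers or `Fraction` values.

```python
from fractions import Fraction


def det2(p, q):
    """2x2 determinant of two homogeneous points on a projective line."""
    return p[0] * q[1] - p[1] * q[0]


def cross_ratio(p1, p2, p3, p4):
    """Cross-ratio of four points on a line (assumed convention, see text)."""
    return Fraction(det2(p1, p2) * det2(p3, p4), det2(p1, p3) * det2(p2, p4))


def four_lines(alpha, beta):
    """Return four lines, each as a pair (a_i, b_i) of homogeneous 4-vectors.

    a_i lies on the transversal x = y = 0, b_i on the transversal z = t = 0.
    """
    on_tau1 = [(0, 1), (alpha, 1), (1, 1), (1, 0)]  # (z, t) coordinates
    on_tau2 = [(0, 1), (beta, 1), (1, 1), (1, 0)]   # (x, y) coordinates
    a = [(0, 0, z, t) for z, t in on_tau1]
    b = [(x, y, 0, 0) for x, y in on_tau2]
    return list(zip(a, b))


def invariants_of(lines):
    """Recover the two cross-ratios from the intersection points."""
    a_params = [a[2:] for a, _ in lines]   # keep (z, t)
    b_params = [b[:2] for _, b in lines]   # keep (x, y)
    return cross_ratio(*a_params), cross_ratio(*b_params)
```

Any pair of values may be requested independently, for example `invariants_of(four_lines(Fraction(3), Fraction(-2, 5)))`, which is the content of the independence claim.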

Next, it will be shown that the two invariants completely
characterize the set of four lines up to a projectivity. To this end,
let four lines in space have two given cross-ratio invariant values
with respect to transversals $\tau_1$ and $\tau_2$ respectively. Let
the points of intersection of the four lines with $\tau_1$ be
$\mathbf{a}_1$, $\mathbf{a}_2$, $\mathbf{a}_3$ and $\mathbf{a}_4$ and
the intersection points with $\tau_2$ be $\mathbf{b}_1$,
$\mathbf{b}_2$, $\mathbf{b}_3$ and $\mathbf{b}_4$. Let a second set of
lines with the same invariants be given, with transversals $\tau'_j$
and intersection points $\mathbf{a}'_i$ and $\mathbf{b}'_i$. Our goal
is to demonstrate that there is a projectivity taking $\tau_j$ to
$\tau'_j$ for $j = 1, 2$, taking points $\mathbf{a}_i$ to
$\mathbf{a}'_i$ and $\mathbf{b}_i$ to $\mathbf{b}'_i$ for
$i = 1, \ldots, 4$. It will follow that the projectivity takes one set
of lines $\lambda_i$ onto the other set.

Choosing two points on each of $\tau_1$ and $\tau_2$, four points in
all, and two points on each of $\tau'_1$ and $\tau'_2$, a further four
points, there exists a projectivity taking the first set of four
points to the second set, and hence taking $\tau_1$ to $\tau'_1$ and
$\tau_2$ to $\tau'_2$. Suppose that this projectivity takes
$\mathbf{a}_i$ to $\mathbf{a}''_i$ and $\mathbf{b}_i$ to
$\mathbf{b}''_i$. It remains to be shown that there exists a
projectivity preserving $\tau'_1$ and $\tau'_2$ and taking
$\mathbf{a}''_i$ to $\mathbf{a}'_i$ and $\mathbf{b}''_i$ to
$\mathbf{b}'_i$. Without loss of generality it may be assumed that
$\tau'_1$ is the line $x = y = 0$ and that $\tau'_2$ is the line
$z = t = 0$. With this choice, we see that a projectivity of
$\mathcal{P}^3$ represented by a matrix of the form

$$\begin{pmatrix} H_1 & 0 \\ 0 & H_2 \end{pmatrix},$$

in which each $H_j$ is a $2 \times 2$ block, maps each $\tau'_j$ to
itself. Furthermore each $H_j$ represents a homography of the line
$\tau'_j$. Since the points $\mathbf{a}'_i$ and $\mathbf{a}''_i$ on
$\tau'_1$ have the same cross-ratio, there is a homography of
$\tau'_1$ taking $\mathbf{a}''_i$ to $\mathbf{a}'_i$ for
$i = 1, \ldots, 4$, and the same can be said for the points
$\mathbf{b}'_i$ and $\mathbf{b}''_i$ on $\tau'_2$. Hence by
independent choice of the two $2 \times 2$ matrices $H_1$ and $H_2$,
both mappings can be carried out simultaneously and the proof is
complete.

2.6  Existence  of  an  Isotropy 

Four lines in $\mathcal{P}^3$ can be represented by a total of 16
independent parameters. On the other hand, there are 15 degrees of
freedom for projectivities of $\mathcal{P}^3$. This suggests that
there should be only one invariant for four lines in space, but we
have seen that there are two. The discrepancy arises because of the
existence of an isotropy ([Mundy-92a]). To understand this, we need to
determine the subgroup of all projectivities of $\mathcal{P}^3$ that
fix four given lines. Any such projectivity will also fix the two
transversals as well as the four points of intersection of the lines
with each transversal. Since four points on each transversal are
fixed, every point on the transversal must be fixed. This shows that a
projectivity of $\mathcal{P}^3$ fixes four given lines if and only if
it fixes the two transversals pointwise. Assuming as before that the
two transversals are the lines $x = y = 0$ and $z = t = 0$, it is
easily seen that a projectivity fixes the transversals pointwise if
and only if it is represented by a matrix of the form
$\mathrm{diag}(k_1, k_1, k_2, k_2)$ where $k_1$ and $k_2$ are two
independent constants. Allowing for an arbitrary scale factor in the
matrix, this implies that there is a one-parameter subgroup of
projectivities fixing the four lines. This reduces the number of
degrees of freedom of the group action of projectivities of
$\mathcal{P}^3$ on sets of four lines in space to 14, and explains why
there are two independent invariants.
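A small numerical check of the isotropy argument may help. This is a sketch under the same assumed coordinates (transversals x = y = 0 and z = t = 0); the helper names `apply_diag`, `same_point` and `on_line` are hypothetical. It shows that diag(k1, k1, k2, k2) fixes every point of both transversals, and maps each line joining them onto itself while moving its other points when k1 and k2 differ.

```python
def apply_diag(k1, k2, p):
    """Apply the projectivity diag(k1, k1, k2, k2) to a homogeneous 4-vector."""
    x, y, z, t = p
    return (k1 * x, k1 * y, k2 * z, k2 * t)


def same_point(p, q):
    """True if non-zero homogeneous vectors p and q are proportional."""
    n = len(p)
    return all(p[i] * q[j] == p[j] * q[i]
               for i in range(n) for j in range(i + 1, n))


def on_line(p, a, b):
    """True if p lies on the line through a and b (all 3x3 minors vanish)."""
    rows = (p, a, b)

    def minor(cols):
        m = [[r[k] for k in cols] for r in rows]
        return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
                - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
                + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

    return all(minor(c) == 0
               for c in [(0, 1, 2), (0, 1, 3), (0, 2, 3), (1, 2, 3)])
```

The parameter count in the text then reads 16 - (15 - 1) = 2 independent invariants.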

2.7  Relationship of Geometric to Algebraic Invariants

The fact that for real lines the algebraic invariants defined in
Section 2.1 must be real whereas the geometric invariants may be
complex indicates that they are not the same. However, since the
geometric invariants completely determine the four lines up to
projectivity, it must be possible to determine the algebraic
invariants given the values of the geometric ones. Consider four lines
with geometric invariants $\alpha$ and $\beta$. We desire to determine
the values of the algebraic invariants given by (5). To this end, we
may assume that the transversals are the lines $x = y = 0$ and
$z = t = 0$ and that the points of intersection of the four lines with
the transversals have coordinates

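$$\begin{aligned}
\mathbf{a}_1 &= (0, 0, 0, 1)^\top \\
\mathbf{a}_2 &= (0, 0, \alpha, 1)^\top \\
\mathbf{a}_3 &= (0, 0, 1, 1)^\top \\
\mathbf{a}_4 &= (0, 0, 1, 0)^\top
\end{aligned}$$

and

$$\begin{aligned}
\mathbf{b}_1 &= (0, 1, 0, 0)^\top \\
\mathbf{b}_2 &= (\beta, 1, 0, 0)^\top \\
\mathbf{b}_3 &= (1, 1, 0, 0)^\top \\
\mathbf{b}_4 &= (1, 0, 0, 0)^\top .
\end{aligned}$$

These points have cross-ratio invariants $\alpha$ and $\beta$ on the
transversal lines $x = y = 0$ and $z = t = 0$ respectively.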

From this it is easy to compute the value of the invariant (5) to be

$$I = (\alpha\beta,\; 1,\; 1 + \alpha\beta - \alpha - \beta) . \qquad (10)$$

Hence, it is easy to compute the algebraic invariants from the
geometric ones. Similarly, given $I$, it is easy to solve (10) for
$\alpha$ and $\beta$, which indicates that the algebraic invariant (5)
is complete.
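Solving (10) amounts to finding the roots of a quadratic. Writing I = (I1, I2, I3) up to scale, (10) gives αβ = I1/I2 and α + β = 1 + I1/I2 - I3/I2, so α and β are the roots of x² - (α + β)x + αβ = 0. The sketch below illustrates this; it assumes I2 is non-zero, and the function names are hypothetical. Because (10) is symmetric in α and β, only the unordered pair is recovered.

```python
import cmath


def algebraic_from_geometric(alpha, beta):
    """Evaluate the right-hand side of (10)."""
    return (alpha * beta, 1, 1 + alpha * beta - alpha - beta)


def geometric_from_algebraic(I):
    """Solve (10) for the unordered pair {alpha, beta}.

    I is a homogeneous 3-vector whose second component is assumed non-zero.
    Complex roots are returned when the transversals are complex conjugates.
    """
    i1, i2, i3 = I
    product = i1 / i2                 # alpha * beta
    total = 1 + product - i3 / i2     # alpha + beta
    root = cmath.sqrt(total * total - 4 * product)
    return (total + root) / 2, (total - root) / 2
```

For a real invariant vector with negative discriminant the two roots are complex conjugates, which matches the earlier remark that the geometric invariants of four real lines may be a conjugate pair.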

3  Computation  of  Line  Invariants 

It will be shown in this section that invariants of lines in space
may be computed from two images with uncalibrated cameras, provided
that the epipolar correspondence is known (in the sense to be
explained below).

3.1  Camera  Models 

Nothing will be assumed about the calibration of the two cameras that
create the two images. The camera model will be expressed in terms of
a general projective transformation from three-dimensional real
projective space $\mathcal{P}^3$, known as object space, to the
two-dimensional real projective space $\mathcal{P}^2$, known as image
space. The transformation may be expressed in homogeneous coordinates
by a $3 \times 4$ matrix $P$ known as a camera matrix and the
correspondence between points in object space and image space is given
by $\mathbf{u}_i \approx P\mathbf{x}_i$ where the symbol $\approx$
means equal up to multiplication by a non-zero scalar factor.

For convenience it will be assumed that the camera placements are not
at infinity, that is, that the projections are not parallel
projections. In this case, a camera matrix may be written in the form

$$P = (M \mid -M\mathbf{t})$$

where $M$ is a $3 \times 3$ non-singular matrix and $\mathbf{t}$ is a
column vector $\mathbf{t} = (t_x, t_y, t_z)^\top$ representing the
location of the camera in object space.

3.2  The  Essential  Matrix 

Consider a set of points $\{\mathbf{x}_i\}$ as seen in two images.
The set of points $\{\mathbf{x}_i\}$ will be visible at image
locations $\{\mathbf{u}_i\}$ and $\{\mathbf{u}'_i\}$ in the two
images. In normal circumstances, the correspondence
$\{\mathbf{u}_i\} \leftrightarrow \{\mathbf{u}'_i\}$ will be known,
but the location of the original points $\{\mathbf{x}_i\}$ will be
unknown. As shown in [Higgins-81] there exists a matrix $Q$, called
the essential matrix, such that

$$\mathbf{u}_i'^{\top} Q\, \mathbf{u}_i = 0 \quad \text{for all } i . \qquad (11)$$

Given at least 8 point correspondences, the matrix $Q$ may be
computed from (11). Longuet-Higgins ([Higgins-81]) suggested a linear
solution of the equations (11). Other methods ([Horn-91, Tsai-84,
Spetsakis-92]) have been suggested relying on properties of the
essential matrix.

Although the essential matrix was originally defined for calibrated
cameras, it may also be defined for uncalibrated cameras using the
same equation (11). Methods of computing the essential matrix for
uncalibrated cameras have been suggested using point correspondences
([Faugeras-92]) or line correspondences ([Hartley-93b]).

For calibrated cameras, the essential matrix determines the camera
matrices uniquely, up to a scaled Euclidean transformation*. For
uncalibrated cameras, this is not the case. The connection between the
essential matrix and the camera matrices for uncalibrated cameras will
be explained below. For proofs of the following theorems, see
[Hartley-Gupta-92].

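Given a vector $\mathbf{t} = (t_x, t_y, t_z)^\top$ it is convenient
to introduce the skew-symmetric matrix

$$[\mathbf{t}]_\times = \begin{pmatrix} 0 & -t_z & t_y \\ t_z & 0 & -t_x \\ -t_y & t_x & 0 \end{pmatrix} . \qquad (12)$$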

Theorem 3. If $Q$ is an essential matrix corresponding to a pair of
uncalibrated cameras, then $Q$ factors as a product
$Q = R[\mathbf{t}]_\times$ for some vector $\mathbf{t}$ and
non-singular matrix $R$. Then, one possible choice of camera matrices
consistent with $Q$ is given by

$$M = (I \mid 0) , \qquad M' = (R^* \mid -R^*\mathbf{t})$$

where $R^*$ is the inverse transpose of $R$.
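The consistency claimed in Theorem 3 can be verified with exact arithmetic. The sketch below is an illustration, not the paper's implementation. It writes the non-singular factor as `R` (a naming choice of this sketch), uses the standard skew-symmetric matrix with [t]× v = t × v, builds the cameras (I | 0) and (R* | -R* t), projects an arbitrary space point into both, and evaluates the left-hand side of (11). All function names are hypothetical.

```python
from fractions import Fraction


def skew(t):
    """Skew-symmetric matrix [t]x, so that skew(t) applied to v is t x v."""
    tx, ty, tz = t
    return [[0, -tz, ty], [tz, 0, -tx], [-ty, tx, 0]]


def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]


def matvec(A, v):
    return [sum(a * x for a, x in zip(row, v)) for row in A]


def det3(A):
    return (A[0][0] * (A[1][1] * A[2][2] - A[1][2] * A[2][1])
            - A[0][1] * (A[1][0] * A[2][2] - A[1][2] * A[2][0])
            + A[0][2] * (A[1][0] * A[2][1] - A[1][1] * A[2][0]))


def inverse_transpose(A):
    """Inverse transpose of a 3x3 matrix: the cofactor matrix over det(A)."""
    d = Fraction(det3(A))
    if d == 0:
        raise ValueError("matrix is singular")
    out = [[0] * 3 for _ in range(3)]
    for i in range(3):
        for j in range(3):
            r = [k for k in range(3) if k != i]
            c = [k for k in range(3) if k != j]
            minor = (A[r[0]][c[0]] * A[r[1]][c[1]]
                     - A[r[0]][c[1]] * A[r[1]][c[0]])
            out[i][j] = (-1) ** (i + j) * minor / d
    return out


def cameras_from_factorization(R, t):
    """Camera pair (I | 0), (R* | -R* t) as in Theorem 3."""
    Rs = inverse_transpose(R)
    last = [-v for v in matvec(Rs, t)]
    M = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0]]
    M_prime = [row + [c] for row, c in zip(Rs, last)]
    return M, M_prime


def epipolar_residual(R, t, X):
    """Value of u'^T Q u for the space point X, with Q = R [t]x."""
    Q = matmul(R, skew(t))
    M, M_prime = cameras_from_factorization(R, t)
    xh = list(X) + [1]
    u = matvec(M, xh)
    u_prime = matvec(M_prime, xh)
    return sum(a * b for a, b in zip(u_prime, matvec(Q, u)))
```

The residual vanishes for every space point because u' = R*(X - t) and u = X give u'ᵀ R [t]× u = (X - t) · (t × X) = 0.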

*Strictly speaking, there are four possible solutions.

Given a pair of camera matrices and some image correspondences
$\mathbf{u}_i \leftrightarrow \mathbf{u}'_i$ it is easy to compute the
corresponding object points $\mathbf{x}_i$ by the solution of a set of
linear equations (in effect by triangulation). The pair of camera
matrices given in Theorem 3 is not necessarily the correct pair, and
hence the reconstructed set of object points will not necessarily be
correct. However, the following theorem shows that the points are
nevertheless correct up to a projectivity of $\mathcal{P}^3$.

Theorem 4. Suppose $Q$ is an essential matrix and $M$ and $M'$ are
any pair of camera matrices consistent with $Q$. Let
$\mathbf{u}_i \leftrightarrow \mathbf{u}'_i$ be corresponding points
in the two images and $\{\mathbf{x}_i\}$ be a set of object points
such that $\mathbf{u}_i \approx M\mathbf{x}_i$ and
$\mathbf{u}'_i \approx M'\mathbf{x}_i$. Now let $\hat{M}$ and
$\hat{M}'$ be a different pair of camera matrices consistent with $Q$
and let $\{\hat{\mathbf{x}}_i\}$ be the respective set of object
points. Then there is a projectivity $h$ of $\mathcal{P}^3$ taking
each $\mathbf{x}_i$ to $\hat{\mathbf{x}}_i$.

The algorithm for computing invariants may now be formulated in
broad terms as follows.

1.  Compute the essential matrix from the correspondences using any
available algorithm.

2.  Select a pair of camera matrices $M$ and $M'$ according to
Theorem 3.

3.  Reconstruct the scene geometry using the chosen camera matrices.

4.  Compute invariants of the scene.

3.3  Computing  Lines  in  Space 

To be able to compute invariants of lines in space, it is sufficient
to be able to compute the locations of the lines in $\mathcal{P}^3$
from their images in two views (step 3 of the above algorithm
outline).

Lines in the image plane are represented as 3-vectors. For instance,
a vector $\mathbf{l} = (l, m, n)^\top$ represents the line in the
plane given by the equation $lu + mv + nw = 0$. Similarly, planes in
3-dimensional space are represented in homogeneous coordinates as a
4-dimensional vector $\boldsymbol{\pi} = (p, q, r, s)^\top$.

The  relationship  between  lines  in  the  image  space 
and  the  corresponding  plane  in  object  space  is  given 
by  the  following  lemma. 

Lemma 5. Let $\lambda$ be a line in $\mathcal{P}^3$ and let the image
of $\lambda$ as taken by a camera with transformation matrix $M$ be
$\mathbf{l}$. The locus of points in $\mathcal{P}^3$ that are mapped
onto the image line $\mathbf{l}$ is a plane, $\boldsymbol{\pi}$,
passing through the camera centre and containing the line $\lambda$.
It is given by the formula $\boldsymbol{\pi} = M^\top\mathbf{l}$.

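Proof. A point $\mathbf{x}$ lies on $\boldsymbol{\pi}$ if and only if
it is mapped to a point on the line $\mathbf{l}$ by the action of the
transformation matrix. This means that $M\mathbf{x}$ lies on the line
$\mathbf{l}$, and so

$$\mathbf{l}^\top M\mathbf{x} = 0 . \qquad (13)$$

On the other hand, a point $\mathbf{x}$ lies on the plane
$\boldsymbol{\pi}$ if and only if
$\boldsymbol{\pi}^\top\mathbf{x} = 0$. Comparing this with (13) leads
to the conclusion that $\boldsymbol{\pi}^\top = \mathbf{l}^\top M$ or
$\boldsymbol{\pi} = M^\top\mathbf{l}$ as required. □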


Now, given two images $\mathbf{l}$ and $\mathbf{l}'$ of a line
$\lambda$ in space as taken by two cameras with camera matrices $M$
and $M'$, the line $\lambda$ is the intersection of the planes
$M^\top\mathbf{l}$ and $M'^\top\mathbf{l}'$.
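Lemma 5 and the remark above translate directly into a few lines of linear algebra. The following sketch is illustrative only: the camera matrices and points in the usage note are made-up values, and the function names are hypothetical. The space line is represented here as the pair of planes whose intersection it is.

```python
def cross(a, b):
    """Cross product; for image points it gives the line through them."""
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]


def project(M, x):
    """Image of the homogeneous space point x under the 3x4 camera M."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]


def back_projected_plane(M, l):
    """Plane pi = M^T l containing the camera centre and the image line l."""
    return [sum(M[r][c] * l[r] for r in range(3)) for c in range(4)]


def dot(p, x):
    return sum(a * b for a, b in zip(p, x))


def line_as_plane_pair(M1, l1, M2, l2):
    """Represent the space line as the two planes that intersect in it."""
    return back_projected_plane(M1, l1), back_projected_plane(M2, l2)
```

As a usage example, take two space points, form the image line through their projections in each view with `cross`, and confirm that both back-projected planes contain both points (`dot(plane, point) == 0`).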

4  Experimental  Results 

Three images of a pair of wooden blocks representing houses were
acquired and vertices and edges were extracted. The images are shown
in Figures 1, 2, and 3. Corresponding edges and vertices were selected
by hand from among those detected automatically. The edges and
vertices shown in Fig. 4 were chosen. There were 13 edges and 15 lines
extracted from each of the images. The dotted edges were not visible
in all images and were not chosen. Vertices are represented by numbers
and edges by letters in the figure. Because of the way edges and
vertices were found by the segmentation algorithm, the edges do not
always pass precisely through the indicated vertices, but sometimes
through a closely neighboring vertex. On other occasions, the full
edge was not detected as a single edge, but was broken into several
pieces. This is usual with most edge detection algorithms, and is a
source of error in the computation of invariants.

The essential matrices $Q_{12}$ for the first and second images and
$Q_{23}$ for the second and third images were computed from the point
matches. Compatible sets of camera matrices were computed, the
locations of the lines in $\mathcal{P}^3$ were reconstructed and the
invariants (5) were computed algebraically.

4.1  Comparison  of  Invariant  Values 

The invariant (5) is represented as a homogeneous vector. Two such
vectors are considered equivalent if they differ by a non-zero scale
factor. Because of arithmetic error and image noise, two computed
invariant values will rarely be exactly proportional. In order to
compare two such computed invariant values (perhaps when attempting to
match an object with a reference object), each homogeneous vector is
multiplied by a scale factor chosen to normalize its length to 1. This
normalization determines the vector up to multiplication by a factor
$\pm 1$. Two such normalized homogeneous vector invariants
$\mathbf{v}_1$ and $\mathbf{v}_2$ are deemed close if $\mathbf{v}_1$
is close to $\mathbf{v}_2$ or to $-\mathbf{v}_2$ using a Euclidean
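norm. Correspondingly, a metric may be defined by

$$d(\mathbf{v}_1, \mathbf{v}_2) = \left(1 - |\mathbf{v}_1 \cdot \mathbf{v}_2|\right)^{1/2} . \qquad (14)$$
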
For any $\mathbf{v}_1$ and $\mathbf{v}_2$, the distance
$d(\mathbf{v}_1, \mathbf{v}_2)$ lies between 0 and 1. A value close to
0 means a very good match, whereas values close to 1 are mismatches.
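A comparison routine along these lines might look as follows. This is a hedged sketch: it assumes the metric has the form d = (1 - |v1 · v2|)^(1/2) on unit-normalized vectors, which is one form consistent with the stated range [0, 1] and with closeness to either v2 or -v2 in the Euclidean norm (it equals the smaller of ‖v1 - v2‖ and ‖v1 + v2‖ divided by √2). The function names are hypothetical.

```python
import math


def normalize(v):
    """Scale a homogeneous vector to unit Euclidean length."""
    n = math.sqrt(sum(x * x for x in v))
    if n == 0.0:
        # An all-zero invariant is indeterminate and must be rejected.
        raise ValueError("indeterminate invariant (zero vector)")
    return [x / n for x in v]


def invariant_distance(v1, v2):
    """Distance in [0, 1] between two homogeneous invariant vectors."""
    a, b = normalize(v1), normalize(v2)
    c = abs(sum(x * y for x, y in zip(a, b)))
    # Clamp to guard against rounding pushing |a.b| slightly above 1.
    return math.sqrt(max(0.0, 1.0 - c))
```

Proportional vectors, including those differing in sign, give a distance near 0, and orthogonal vectors give 1.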


4.2  Invariants  of  4  lines 


Six sets of four lines were chosen as in the following table, which
shows the labels of the lines as given in Fig. 4.


$$\begin{array}{ll}
S_1 = \{B, C, J, K\} & S_4 = \{B, D, E, G\} \\
S_2 = \{B, G, J, N\} & S_5 = \{A, C, O, J\} \\
S_3 = \{A, B, H, I\} & S_6 = \{B, I, L, N\}
\end{array}$$

Table (15) shows the results. The $(i, j)$-th entry of the table
shows the distance according to the metric (14) between the invariant
of set $S_i$ as computed from the first image pair and that of set
$S_j$ as computed from the second image pair. The diagonal entries of
the matrix (in bold) should be close to 0.0, which indicates that the
invariants had the same value when computed from different pairs of
views.

The only bad entry in this matrix is in position (4, 4). This is
because the four lines chosen contained three coplanar lines (lines
B, D and E). This causes the value of the invariant to be
indeterminate (that is, (0, 0, 0)), and shows that such instances must
be detected and avoided. The entry in position (3, 3) also shows
instability for similar reasons.


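$$\begin{pmatrix}
\mathbf{0.013} & 0.674 & 0.303 & 0.689 & 0.643 & 0.449 \\
0.647 & \mathbf{0.034} & 0.741 & 0.838 & 0.707 & 0.222 \\
0.062 & 0.691 & \mathbf{0.229} & 0.708 & 0.708 & 0.461 \\
0.287 & 0.608 & 0.182 & \mathbf{0.890} & 0.856 & 0.384 \\
0.657 & 0.722 & 0.900 & 0.719 & \mathbf{0.003} & 0.694 \\
0.473 & 0.239 & 0.555 & 0.948 & 0.719 & \mathbf{0.033}
\end{pmatrix} \qquad (15)$$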

One concludes from this experiment that the four-line invariant is a
powerful discriminator between sets of four lines, but care must be
taken to detect and exclude degenerate and near-degenerate cases.

Abstract

It is known that a set of points in 3 dimensions is determined up to
projectivity from two views with uncalibrated cameras. It is shown in
this paper that this result may be improved by distinguishing between
points in front of and behind the camera. Any point that lies in an
image must lie in front of the camera producing that image. Using this
idea, it is shown that the scene is determined from two views up to a
more restricted class of mappings known as good projectivities, which
are precisely those projectivities that preserve the convex hull of an
object of interest. An invariant of good projectivity known as the
cheirality invariant of a set of points is defined and it is shown how
the cheirality invariant may be computed using two uncalibrated views.
As demonstrated theoretically and by experiment the cheirality
invariant may distinguish between sets of points that are projectively
equivalent (but not via a good projectivity). These results lead to
necessary and sufficient conditions for a set of corresponding pixels
in two images to be realizable as the images of a set of points in 3
dimensions.

Using similar methods, a necessary and sufficient condition is given
for the orientation of a set of points to be determined by two views.
If the perspective centres are not separated from the point set by a
plane, then the orientation of the set of points is determined from
two views.

Good  projectivities  and  the  cheirality  invariant  are 
also  defined  for  point  sets  in  a  plane,  which  allows 
these  new  methods  to  be  applied  to  images  of  planar 
objects. 

1  Introduction 

Consider a set of points $\{\mathbf{x}_i\}$ lying in a plane in space
and let $\{\mathbf{u}_i\}$ and $\{\mathbf{u}'_i\}$ be two images of
these points taken with arbitrary uncalibrated perspective (pinhole)
cameras. It is well known that the points $\mathbf{u}_i$ and
$\mathbf{u}'_i$ are related by a planar projectivity, that is, there
exists a projectivity $h$ of the plane such that
$h\mathbf{u}_i = \mathbf{u}'_i$ for all $i$. This fact has been used
for the recognition of planar objects. For instance in [Rothwell-92]
planar projective invariants were used to define indexing functions
allowing look-up of the objects in an object data-base. Since the
indexing functions are invariant under plane projectivities, they
provide the same value independent of the view of the object.

In a similar way, it has been shown in [Faugeras-92] and
[Hartley-Gupta-92] that a set of points in 3 dimensions is determined
up to a 3-dimensional projectivity by two images taken with
uncalibrated cameras. Both these papers give a constructive method for
determining the point configuration (up to projectivity). This permits
the computation of projective invariants of sets of points seen in two
views. An experimental verification of these results has been reported
in [Hartley-92] and is summarized in this paper.

The papers just cited make no distinction between points that lie in
front of the camera and those that lie behind. The specification of
the front of a camera will be termed the cheirality of the camera
(from Greek: χείρ = hand or side). It is well known that camera
cheirality is valuable in determining scene geometry for calibrated
cameras. Longuet-Higgins [Higgins-81] uses it to distinguish between
four different scene reconstructions. No systematic treatment of
cheirality of uncalibrated cameras has previously appeared, however.
Investigation of this phenomenon turns out to be quite fruitful, as is
seen in the present paper. Cheirality is valuable in distinguishing
different point sets in space, especially in allowing projectively
equivalent point sets to be distinguished.

The major results of this paper are summarized now. In Definition 4
a class of projectivities called good projectivities is defined,
consisting of those that preserve the convex hull of a set of points
of interest. In section 5 an invariant of good projectivity is
defined, the cheirality invariant. Theorem 13 strengthens the result
of [Faugeras-92, Hartley-Gupta-92] by showing that a 3-dimensional
point set is determined up to good projectivity by its image in two
uncalibrated views. This sharpening of the theorem of [Faugeras-92,
Hartley-Gupta-92] results from a consideration of the cheirality of
the cameras. In section 9 an example of computation of the cheirality
invariant for 3D point sets shows that it is useful in distinguishing
between non-equivalent point sets. In section 7 the concept of good
projectivity is applied to orientation of point sets, explaining why
some point sets allow two differently oriented reconstructions from
two views, whereas some do not. The relationship of this result to
human visual perception of 3D scenes is briefly mentioned, suggesting
that the brain accepts various interpretations of a scene differing by
good projectivities, but not by arbitrary projectivities.

2  Notation 

We will consider object space to be the 3-dimensional Euclidean space
$R^3$ and represent points in $R^3$ as 3-vectors. Similarly, image
space is the 2-dimensional Euclidean space $R^2$ and points are
represented as 2-vectors. Euclidean space $R^3$ is embedded in a
natural way in projective 3-space $\mathcal{P}^3$ by the addition of a
plane at infinity. Similarly, $R^2$ may be embedded in the projective
2-space $\mathcal{P}^2$ by the addition of a line at infinity. The
simplicity of considering projections between $\mathcal{P}^3$ and
$\mathcal{P}^2$ has led many authors to identify $\mathcal{P}^3$ and
$\mathcal{P}^2$ as the object and image spaces. This point of view
will not be followed here however, although when necessary we will
consider points in $R^3$ and $R^2$ as lying in $\mathcal{P}^3$ and
$\mathcal{P}^2$ respectively, via the natural embedding.

Vectors will be represented as bold-face lower case letters, such as
$\mathbf{x}$. Such a notation represents a column vector. The
corresponding row vector will be denoted by $\mathbf{x}^\top$. The
notation $\mathbf{x}$ usually denotes a vector in $R^3$, whereas
$\mathbf{u}$ represents a vector in $R^2$. Elements in projective
spaces $\mathcal{P}^3$ and $\mathcal{P}^2$ will be denoted with a
tilde accent. For instance, $\tilde{\mathbf{x}}$ is a homogeneous
4-vector representing an element in $\mathcal{P}^3$, and
$\tilde{\mathbf{u}}$ is a homogeneous 3-vector representing an element
of $\mathcal{P}^2$.

The notation $\approx$ represents equality of matrices or homogeneous
vectors up to an arbitrary non-zero factor. If
$\mathbf{x} = (x, y, z)^\top$ is a 3-vector representing a point in
$R^3$, then $\tilde{\mathbf{x}}$ is the vector $(x, y, z, 1)^\top$.
Similarly, if $\mathbf{u} = (u, v)^\top$, then $\tilde{\mathbf{u}}$
represents the vector $(u, v, 1)^\top$.

The notation $a \doteq b$ means that $a$ and $b$ have the same sign.
For instance $a \doteq 1$ has the same meaning as $a > 0$.

3  Projections in $\mathcal{P}^3$

A projection from $\mathcal{P}^3$ into $\mathcal{P}^2$ is represented
by a $3 \times 4$ matrix $P$, whereby a point $\tilde{\mathbf{x}}$
maps to the point $\tilde{\mathbf{u}} \approx P\tilde{\mathbf{x}}$. It
will be assumed that $P$ has rank 3. Since $P$ has 4 columns but rank
3, there is a unique point $\tilde{\mathbf{t}}$ such that
$P\tilde{\mathbf{t}} = (0, 0, 0)^\top$. In other words, the projective
transformation is undefined at the point $\tilde{\mathbf{t}}$, since
$(0, 0, 0)^\top$ is not a valid homogeneous 3-vector. The point
$\tilde{\mathbf{t}}$ will be called the perspective centre of the
camera. We will assume that the perspective centre is not a point at
infinity so we may write
$\tilde{\mathbf{t}} \approx \binom{\mathbf{t}}{1}$ where $\mathbf{t}$
is the perspective centre as a point in $R^3$.

Now, the camera matrix $P$ may be written in block form as
$P = (M \mid \mathbf{c})$ where $M$ is a $3 \times 3$ block and
$\mathbf{c}$ is a column vector. Now

$$P\tilde{\mathbf{t}} = (M \mid \mathbf{c})\binom{\mathbf{t}}{1} = M\mathbf{t} + \mathbf{c} = 0 ,$$

and so $\mathbf{c} = -M\mathbf{t}$. In future, we will write
$P = (M \mid -M\mathbf{t})$. Now since $P$ has rank 3 and
$-M\mathbf{t}$ is a linear combination of the columns of $M$, it
follows that $M$ must have rank 3. In other words, $M$ is
non-singular. Summarizing this discussion we have

Proposition 1. If $P$ is a camera transform matrix for a camera with
perspective centre not at infinity, then $P$ can be written as
$P = (M \mid -M\mathbf{t})$ where $M$ is a non-singular $3 \times 3$
matrix and $\mathbf{t}$ represents the perspective centre in $R^3$.
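Proposition 1 also gives a direct way to recover the perspective centre from a camera matrix: since the last column is c = -Mt, the centre solves Mt = -c. The sketch below does this with Cramer's rule in exact arithmetic; it is an illustration with hypothetical function names, and assumes integer or `Fraction` entries.

```python
from fractions import Fraction


def det3(A):
    return (A[0][0] * (A[1][1] * A[2][2] - A[1][2] * A[2][1])
            - A[0][1] * (A[1][0] * A[2][2] - A[1][2] * A[2][0])
            + A[0][2] * (A[1][0] * A[2][1] - A[1][1] * A[2][0]))


def perspective_centre(P):
    """Solve M t = -c for a 3x4 camera matrix P = (M | c)."""
    M = [list(row[:3]) for row in P]
    c = [row[3] for row in P]
    d = det3(M)
    if d == 0:
        raise ValueError("M is singular: perspective centre at infinity")
    t = []
    for j in range(3):
        Mj = [row[:] for row in M]
        for i in range(3):
            Mj[i][j] = -c[i]          # replace column j by -c
        t.append(Fraction(det3(Mj), d))
    return t
```

Applying `P` to the recovered centre in homogeneous form, `t + [1]`, should return the zero vector.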

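There exist points in $\mathcal{P}^3$ that are mapped to points at
infinity in the image. To find what they are, we suppose that
$\tilde{\mathbf{u}} = (u, v, 0)^\top = P\tilde{\mathbf{x}}$. Letting
$\mathbf{p}_1^\top$, $\mathbf{p}_2^\top$ and $\mathbf{p}_3^\top$ be
the rows of $P$, we see that
$\mathbf{p}_3^\top\tilde{\mathbf{x}} = 0$. In other words, a point
$\tilde{\mathbf{x}}$ in $\mathcal{P}^3$ that maps to a point at
infinity in the image must satisfy the equation
$\tilde{\mathbf{x}}^\top\mathbf{p}_3 = 0$. Looked at another way, if
$\mathbf{p}_3$ is taken as representing a plane in $\mathcal{P}^3$,
then a point $\tilde{\mathbf{x}}$ lies on the plane $\mathbf{p}_3$ if
and only if $\tilde{\mathbf{x}}^\top\mathbf{p}_3 = 0$. In other words,
the condition for $\tilde{\mathbf{x}}$ to map to a point at infinity
is the same as the condition for $\tilde{\mathbf{x}}$ to lie on the
plane $\mathbf{p}_3$. Since $P\tilde{\mathbf{t}} = 0$, we see in
particular that $\mathbf{p}_3^\top\tilde{\mathbf{t}} = 0$, and so
$\tilde{\mathbf{t}}$ lies on the plane $\mathbf{p}_3$. To summarize
this paragraph, the set of points in $\mathcal{P}^3$ mapping to a
point at infinity in the image is a plane passing through the
perspective centre and represented by $\mathbf{p}_3$, where
$\mathbf{p}_3^\top$ is the last row of $P$. This plane will be called
the meridian plane of the camera.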

Restricting now to $R^3$, consider a point $\mathbf{x}$ in space, not
lying on the meridian plane. It is projected by the camera with matrix
$P$ onto a point $\mathbf{u}$ where
$w\tilde{\mathbf{u}} = P\tilde{\mathbf{x}}$ for some scale factor $w$.
The value of $w$ will vary continuously with $\mathbf{x}$ and the set
of points where it vanishes is precisely the meridian plane. It
follows that on one side of the meridian plane $w > 0$ and on the
other side, $w < 0$. It can be shown, but is not used in this paper,
that $w$ is in fact proportional to the distance of $\mathbf{x}$ from
the meridian plane.

Any real camera can only view points on one side of the meridian
plane, those points that are “in front of” the camera. Points on the
other side will not be visible. In order to distinguish the front of
the camera from the back, a convention is necessary.

Definition 2. A camera matrix $P = (M \mid -M\mathbf{t})$ is said to
be normalized if $\det(M) > 0$. If $P$ is a normalized camera matrix,
a point $\mathbf{x}$ is said to lie in front of the camera if
$P\tilde{\mathbf{x}} = w\tilde{\mathbf{u}}$ with $w > 0$. Points
$\mathbf{x}$ for which $w < 0$ are said to be behind the camera.

Any camera matrix may be normalized by multiplying it by $-1$ if
necessary. It will always be assumed that camera matrices are
normalized. The selection of which side of the camera is the front is
simply a convention, consistent with the assumption that for a camera
with matrix $(I \mid 0)$, points with positive $z$-coordinate lie in
front of the camera. This is the usual convention in computer vision
literature, used for instance in [Higgins-81].
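The convention of Definition 2 is easy to apply in code. The following sketch (hypothetical function names, not taken from the paper) normalizes a camera matrix by the sign of det(M) and then tests the sign of w, which is the third component of the product of P with the homogeneous point.

```python
def det3(A):
    return (A[0][0] * (A[1][1] * A[2][2] - A[1][2] * A[2][1])
            - A[0][1] * (A[1][0] * A[2][2] - A[1][2] * A[2][0])
            + A[0][2] * (A[1][0] * A[2][1] - A[1][1] * A[2][0]))


def normalize_camera(P):
    """Return P or -P, whichever has det(M) > 0 for its left 3x3 block M."""
    d = det3([row[:3] for row in P])
    if d == 0:
        raise ValueError("M is singular: perspective centre at infinity")
    if d < 0:
        return [[-v for v in row] for row in P]
    return [list(row) for row in P]


def scale_factor_w(P, x):
    """w in P x~ = w u~, i.e. the last row of P applied to (x, y, z, 1)."""
    xh = list(x) + [1]
    return sum(p * v for p, v in zip(P[2], xh))


def in_front(P, x):
    """True if the point x of R^3 lies in front of the camera P."""
    return scale_factor_w(normalize_camera(P), x) > 0
```

With the camera (I | 0), for example, this reports points with positive z-coordinate as in front, and the answer is unchanged if the matrix is first multiplied by -1.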

The  following  statement  expresses  the  fact  that  a 
camera  sees  only  those  points  that  lie  in  front  of  it. 

Proposition 3. A point $\mathbf{x}$ in $R^3$ is mapped to a point
$\mathbf{u}$ in $R^2$ by a camera with normalized matrix $P$ if and
only if $w\tilde{\mathbf{u}} = P\tilde{\mathbf{x}}$ for some constant
$w > 0$.

4  Good  Projectivities

A subset $B$ of $R^n$ is called convex if the line segment joining
any two points in $B$ also lies entirely within $B$. The convex hull
of $B$, denoted $\overline{B}$, is the smallest convex set containing
$B$.

Definition 4. Let $B$ be a subset of $R^n$ and let $h$ be a
projectivity of $\mathcal{P}^n$. The projectivity $h$ is said to be a
“good projectivity” with respect to the set $B$ if $h^{-1}(L_\infty)$
does not meet $\overline{B}$, where $L_\infty$ is the plane (or line)
at infinity.

A good projectivity with respect to $B$ is precisely one that
preserves the convex hull of $B$. It may be verified that if $h$ is a
good projectivity with respect to $B$, then $h^{-1}$ is a good
projectivity with respect to $h(B)$. Details are omitted for the sake
of brevity. We will be considering sets of points $\{\mathbf{u}_i\}$
and $\{\mathbf{u}'_i\}$ that correspond via a projectivity. When we
speak of the projectivity being good, we will mean good with respect
to the set $\{\mathbf{u}_i\}$.

An alternative characterization of good projectivities is given in
the following theorem.

Theorem 5. A projectivity $h : \mathcal{P}^n \to \mathcal{P}^n$
represented by a matrix $H$ is good with respect to a set
$B = \{\mathbf{u}_i\} \subset R^n - h^{-1}(L_\infty)$ if and only if
there exist constants $w_i$, all of the same sign, such that
$H\tilde{\mathbf{u}}_i = w_i\tilde{\mathbf{u}}'_i$.

Proof. To prove the forward implication, we assume that $h$ is a good
projectivity. By definition, constants $w_i$ exist such that
$H\tilde{\mathbf{u}}_i = w_i\tilde{\mathbf{u}}'_i$. What needs proof
is that they all have the same sign. The value of $w$ in the mapping
$w\tilde{\mathbf{u}}' = H\tilde{\mathbf{u}}$ is a continuous function
of the point $\mathbf{u}$. If $w_i > 0$ for some point $\mathbf{u}_i$,
and $w_j < 0$ for another point $\mathbf{u}_j$, then there must be
some point $\mathbf{u}_\infty$ on the line segment joining
$\mathbf{u}_i$ to $\mathbf{u}_j$ for which $w = 0$. This means that
$h(\mathbf{u}_\infty)$ lies on the line at infinity, contrary to
hypothesis.

Next,  to  prove  the  converse,  we  assume  that  there 
exist  such  constants  w,  all  of  the  same  sign.  Let  5  be 
the  subset  of  B"  consisting  of  all  points  u  satisfying 
the  condition  Hu  =  u;u'  such  that  w  has  the  same  sign 
as  all  u;,-.  The  set  S  contains  B  and  it  is  clear  that 

5  n  /»“*(Lrc)  =  0.  All  that  remains  to  show  is  that 

5  is  convex,  for  then  5  must  contain  the  convex  hull 
of  B.  If  iij  and  u.  are  points  in  S  with  corresponding 
constants  Wi  and  Wj ,  then  any  intermediate  point  u 
between  iij  and  u^-  must  have  iv  value  intermediate 
between  u’,  and  wj .  Consecpiently,  the  value  of  iv  must 
have  the  same  sign  as  and  iVj ,  and  so  ii  lies  >n  S 
also.  This  completes  the  proof.  □ 
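
The test in Theorem 5 is easy to mechanise. The following is a minimal sketch, assuming NumPy and assuming that points are given in inhomogeneous coordinates and lifted to homogeneous vectors with last coordinate 1; the function names and the tolerance `tol` are illustrative choices, not part of the original text.

```python
import numpy as np

def projectivity_weights(H, points):
    """Return the constants w_i in H @ (u_i, 1) = w_i * (u'_i, 1)."""
    pts = np.asarray(points, dtype=float)
    homog = np.hstack([pts, np.ones((len(pts), 1))])
    # The last coordinate of H @ u_i is exactly w_i when u'_i is scaled
    # to have last coordinate 1.
    return (homog @ np.asarray(H, dtype=float).T)[:, -1]

def is_good_projectivity(H, points, tol=1e-12):
    """Theorem 5 test: all w_i non-zero and of the same sign."""
    w = projectivity_weights(H, points)
    return bool(np.all(w > tol) or np.all(w < -tol))
```

An affine map always passes this test (every $w_i$ equals the bottom-right entry of H), whereas a matrix whose last row changes sign across the point set is reported as not good.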

As  just  noted,  if  a  projectivity  is  not  good,  then 
there  are  points  in  the  convex  hull  for  which  w  equals 
nought (0). For this reason, a projectivity that is not
good will be called "naughty"*.

This theorem gives an effective method of identifying
good projectivities. The question remains whether
good projectivities form a useful class. This question
will be answered by the following theorem.

*This terminology was suggested to me by David Forsyth.


Theorem 6. If B is a point set in a plane (the "object
plane") in $\mathcal{R}^3$ lying entirely in front of a projective
camera, then the mapping from the object plane to the
image plane defined by the camera is a good projectivity
with respect to B.

Proof. That there is a projectivity h mapping the object
plane to the image plane is well known. What
is to be proven is that the projectivity is good with
respect to B. Let L be the line in which the meridian
plane of the camera meets the object plane. Since B
lies entirely in front of the camera, L does not meet
the convex hull of B. However, by definition of the
meridian plane $L = h^{-1}(L_\infty)$, where $L_\infty$ is the line
at infinity in the image plane. Therefore, h is a good
projectivity with respect to B. □

As an example, Fig. 1 shows an image of a comb
and the image resampled according to a naughty projectivity.
Most people will agree that the resampled
image is unlike any view of a comb seen by camera
or human eye. Nevertheless, the two images are projectively
equivalent and will have the same projective
invariants.

Note that if points $u_i$ are visible in an image, then
the corresponding object points must lie in front of the
camera. Applying Theorem 6 to a sequence of imaging
operations (for instance, a picture of a picture of
a picture, etc.), it follows that the original and final
images in the sequence are connected by a planar projectivity
which is good with respect to any point set
in the object plane visible in the final image.

Similarly, if two images are taken of a set of points
$\{x_i\}$ in a plane, $u_i$ and $u'_i$ being corresponding points
in the two images, then there is a good projectivity
(with respect to the $u_i$) mapping each $u_i$ to $u'_i$, and so
Theorem 5 applies, yielding the following proposition.

Proposition 7. If $\{u_i\}$ and $\{u'_i\}$ are corresponding
points in two views of a set of object points $\{x_i\}$ lying
in a plane, then there is a matrix H representing a
planar projectivity such that $H u_i = w_i u'_i$ and all $w_i$
have the same sign.

This fact was pointed out to me by Charles Rothwell
(private communication) and served as a starting
point for the current investigation. Rothwell derived
this result using the methods of [Sparr-92].

5  An  integer  valued  invariant 

Given a set of $N \ge n+2$ points $\{u_i\}$, $i = 1, \ldots, N$ in
$\mathcal{R}^n$, it is possible to define an invariant of good projectivity
as follows. Let $e_1, \ldots, e_{n+2}$ be points in $\mathcal{R}^n$ such
that $\{e_i\}$ form a canonical projective basis for $\mathcal{P}^n$. For
n = 2, the points $(0,0)^T$, $(1,0)^T$, $(0,1)^T$ and $(1,1)^T$
will do. Assume that the points $u_i$ are numbered in
such a way that the first n+2 of them are in general position
(meaning that no n+1 of them lie in a codimension 1
hyperplane). In this case, there exists a projectivity
g (not necessarily good) such that $g(u_i) = e_i$ for
$i = 1, \ldots, n+2$. Now, for each $i = 1, \ldots, N$ we define
a value $\eta_i$ as follows. If $g(u_i)$ lies on the plane at infinity,
we set $\eta_i = 0$. Otherwise, there exists a further point $e_i$
such that $g(u_i) = e_i$. If g is represented by a matrix
G, then $\eta_i$ is defined by the equation $G u_i = \eta_i e_i$. We
show that, except for possible simultaneous negation,
the values $\mathrm{sign}(\eta_i)$ are an invariant of good projectivity.
Here $\mathrm{sign}(\eta_i)$ is defined to equal 1, -1 or 0
depending on whether $\eta_i$ is positive, negative or zero
respectively. The invariant value is of course dependent
on the choice of canonical basis $\{e_i\}$.

To prove the invariance, suppose that h is a
good projectivity with respect to points $\{u_i\}$ and let
$h(u_i) = u'_i$. Consider the projectivity g' defined by
$g'(u'_i) = e_i$ for $i = 1, \ldots, n+2$. Values $\eta'_i$ may be defined
as before in terms of the projectivity g'. On the
other hand, values $w_i$ may be defined in terms of the
projectivity h mapping each $u_i$ to $u'_i$ as in Theorem
5.

Since h and $g'^{-1} \circ g$ agree on a set of basis points,
it follows that $h = g'^{-1} \circ g$. Consequently, $w_i = \eta_i / \eta'_i$
up to a non-zero scale factor common to all i.
However, under the assumption that h is a good projectivity,
all the $w_i$ have the same sign, and so, for all
i, we have $\eta_i \doteq \epsilon \eta'_i$, where $\epsilon = \pm 1$. In other words, the
set of values $\mathrm{sign}(\eta_i)$ are an invariant under good projectivity,
except for possible simultaneous negation.

It is possible to code the values $\eta_i$ into a single
number according to the formula

$$\chi(u_1, u_2, \ldots, u_N) = \sum_{i=1}^{N} \mathrm{sign}(\eta_i)\, 3^{i-1} \qquad (1)$$

The value $\chi(u_i)$ is an invariant under good projectivity
of the ordered set of points $u_i$. It will be called the
cheirality invariant of the points.
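
The construction of this section can be sketched in code for the planar case n = 2. This is an illustrative sketch only, assuming NumPy: the projectivity g is estimated here with a standard direct linear transform on the first four points, which is one convenient choice and not necessarily how the computation was done in the original work. The function names and the tolerance `tol` are hypothetical. Because G is only determined up to scale, the returned value is defined up to sign, matching the "simultaneous negation" caveat above.

```python
import numpy as np

# Canonical projective basis for the plane, as suggested in the text for n = 2.
CANONICAL_BASIS = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])

def to_homogeneous(points):
    pts = np.asarray(points, dtype=float)
    return np.hstack([pts, np.ones((len(pts), 1))])

def apply_projectivity(H, points):
    """Map inhomogeneous 2D points through the 3x3 matrix H."""
    mapped = to_homogeneous(points) @ np.asarray(H, dtype=float).T
    return mapped[:, :2] / mapped[:, 2:3]

def basis_projectivity(points):
    """Matrix G (up to scale) taking the first four points to the basis."""
    rows = []
    for (x, y), (ex, ey) in zip(points[:4], CANONICAL_BASIS):
        u = np.array([x, y, 1.0])
        # Two linear constraints per correspondence: G u is parallel to (ex, ey, 1).
        rows.append(np.concatenate([-u, np.zeros(3), ex * u]))
        rows.append(np.concatenate([np.zeros(3), -u, ey * u]))
    _, _, vt = np.linalg.svd(np.array(rows))
    return vt[-1].reshape(3, 3)  # null vector of the 8x9 system

def cheirality_invariant(points, tol=1e-9):
    """Equation (1): sum of sign(eta_i) * 3**(i-1), defined up to sign."""
    pts = np.asarray(points, dtype=float)
    G = basis_projectivity(pts)
    # eta_i is the last coordinate of G u_i, since e_i has last coordinate 1.
    eta = to_homogeneous(pts) @ G[2]
    signs = np.where(np.abs(eta) < tol, 0.0, np.sign(eta))
    return int(sum(int(s) * 3 ** i for i, s in enumerate(signs)))
```

Comparing `abs(cheirality_invariant(points))` before and after a projectivity gives a quick check: the value should be unchanged when the projectivity is good with respect to the points, and will generally change when it is naughty.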

6  Three  dimensional  point  sets 

We now consider three-dimensional point sets. The
question that will be addressed is: "Under what conditions
can points $u_i$ and $u'_i$ in two views be the images
of a three dimensional point set $x_i$ corresponding
to two arbitrary uncalibrated cameras?". One well-known
necessary condition ([Higgins-81]) is the epipolar
constraint, $u'^T_i Q u_i = 0$ for all i and some rank-two
matrix Q. We will ignore the effects of noise, so that
the epipolar constraint equation will be assumed to
hold exactly. The question is whether this is also a
sufficient condition. The answer is no.
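
As a concrete illustration of the epipolar constraint, the sketch below builds a synthetic pair of cameras and checks that $u'^T_i Q u_i$ vanishes. It assumes NumPy and, purely for illustration, cameras of the form P = (I | 0) and P' = (R | t), for which Q = [t]_x R is a standard closed form; the helper names are hypothetical and nothing here is taken from the cited algorithms for estimating Q from image data.

```python
import numpy as np

def skew(t):
    """Cross-product matrix [t]_x, so that skew(t) @ a equals np.cross(t, a)."""
    tx, ty, tz = t
    return np.array([[0.0, -tz, ty], [tz, 0.0, -tx], [-ty, tx, 0.0]])

def q_from_relative_pose(R, t):
    """Q for the camera pair P = (I | 0), P' = (R | t) (illustrative setup)."""
    return skew(t) @ np.asarray(R, dtype=float)

def project(P, X):
    """Project 3D points X (N x 3); returns image points with last coordinate 1."""
    Xh = np.hstack([X, np.ones((len(X), 1))])
    x = Xh @ P.T
    return x / x[:, 2:3]

def epipolar_residuals(Q, u, u_prime):
    """Residuals u'_i^T Q u_i, one per correspondence."""
    return np.einsum("ij,jk,ik->i", u_prime, Q, u)
```

With noise-free projections the residuals are zero up to rounding, and Q has rank two, as the text assumes.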

It will be assumed that there are sufficient points for
the matrix Q to be determined unambiguously, that is
at least 7 ([Hartley-92]) or 8 ([Higgins-81]) points. Under
these conditions as shown in [Hartley-Gupta-92]
and [Faugeras-92] it is possible to determine the location
of points $x_i$ and cameras P and P' such that
$u_i \approx P x_i$ and $u'_i \approx P' x_i$, and furthermore, the choice
is unique up to projectivity of $\mathcal{P}^3$. Assuming that
none of the reconstructed points $x_i$ is at infinity, we
can write

$$w_i u_i = P x_i, \qquad w'_i u'_i = P' x_i \qquad (2)$$

If all the $w_i$ and $w'_i$ are positive, then according to
Proposition 7 the points $x_i$ map to points $u_i$ and $u'_i$
in the two images. Normally, this will not be the case.
It is possible, however, that another choice of P, P'
and $x_i$ exists with the desired property.


We introduce some new terminology. A triplet
$(Q, \{u_i\}, \{u'_i\})$ is called an epipolar configuration if Q
is a rank 2 matrix satisfying the epipolar constraint
equation $u'^T_i Q u_i = 0$ for all i. A weak realization of
$(Q, \{u_i\}, \{u'_i\})$ is a triplet $(P, P', \{x_i\})$, where P and
P' are a choice of normalized camera matrices corresponding
to the essential matrix Q and the points
$\{x_i\}$ are object points satisfying the equations (2) for
each i. A strong realization is such a triplet satisfying
the additional condition that all the $w_i$ and $w'_i$ are
positive. The triplet $(Q, \{u_i\}, \{u'_i\})$ is called a feasible
configuration if a strong realization exists.

The  following  lemma  sets  notation  and  derives  a 
basic  technical  result. 


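Lemma 8. Let $(P, P', \{x_i\})$ and $(\bar{P}, \bar{P}', \{\bar{x}_i\})$ be
two weak realizations of a feasible configuration
$(Q, \{u_i\}, \{u'_i\})$. There exists a 4 x 4 matrix H such
that $P \approx \bar{P} H$, $P' \approx \bar{P}' H$ and $x_i \approx H^{-1} \bar{x}_i$. Assume
that $P$, $P'$, $\bar{P}$ and $\bar{P}'$ are normalized and let constants
$\epsilon$, $\eta_i$, $w_i$ and $\bar{w}_i$ be defined by the equations

$$\begin{aligned}
P &= \epsilon \bar{P} H \\
x_i &= \eta_i H^{-1} \bar{x}_i \\
w_i u_i &= P x_i \\
\bar{w}_i u_i &= \bar{P} \bar{x}_i
\end{aligned} \qquad (3)$$

Then $w_i \bar{w}_i \epsilon \eta_i \doteq 1$.

If constants $w'_i$, $\bar{w}'_i$ and $\epsilon'$ are defined in a similar
way then $w'_i \bar{w}'_i \epsilon' \eta_i \doteq 1$.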

Proof. The existence of the matrix H is proven in
[Hartley-Gupta-92]. Now,

$$\begin{aligned}
w_i u_i &= P x_i \\
        &= \epsilon \eta_i \bar{P} H H^{-1} \bar{x}_i \\
        &= \epsilon \eta_i \bar{w}_i u_i
\end{aligned}$$

whence $w_i = \epsilon \eta_i \bar{w}_i$. Multiplying each side of this equation
by $w_i$ gives the required result. The proof for the
primed quantities is of course the same. □

A  further  useful  technical  result  follows. 

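Lemma 9. Let H be the matrix

$$H = \begin{pmatrix} I & 0 \\ k v^T & k \end{pmatrix}$$

Then with the notation used in Lemma 8, $\epsilon \doteq \hat{v}^T \hat{t}$,
$\epsilon' \doteq \hat{v}^T \hat{t}'$ and for each i, $\eta_i \doteq k \hat{v}^T \hat{x}_i$, where t and t'
are the perspective centres of P and P', and a hat denotes
the homogeneous 4-vector with last coordinate 1, so that
$\hat{v}^T \hat{t} = v^T t + 1$. (Remember
that $\doteq$ denotes equality of sign.)

Proof. One verifies that

$$H^{-1} = \begin{pmatrix} I & 0 \\ -v^T & 1/k \end{pmatrix}$$

Let $P = (M \mid -Mt)$ with $\det(M) > 0$ and $\bar{P} = (\bar{M} \mid -\bar{M}\bar{t})$
with $\det(\bar{M}) > 0$. Then from $\epsilon \bar{P} = P H^{-1}$ it
follows that $\epsilon \bar{M} = M(I + t v^T)$. Taking determinants
and signs gives

$$\epsilon \doteq \det(I + t v^T) = 1 + v^T t = \hat{v}^T \hat{t}$$

as required. The same proof holds for $\epsilon'$.

From (3) we have $H x_i = \eta_i \bar{x}_i$. Multiplying this
out and considering only the last component yields
$\eta_i = k(v^T x_i + 1) = k \hat{v}^T \hat{x}_i$ as required. □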

Applying  Lemma  8  to  the  case  where  one  of  the 
realizations  is  a  strong  realization  leads  to  a  necessary 
and  sufficient  condition  for  an  epipolar  configuration 
to  be  feasible. 

Theorem 10. Let $(P, P', \{x_i\})$ be any weak realization
of an epipolar configuration $(Q, \{u_i\}, \{u'_i\})$ and
let $w_i$ and $w'_i$ be defined as in (2). There exists a
strong realization $(\bar{P}, \bar{P}', \{\bar{x}_i\})$ of $(Q, \{u_i\}, \{u'_i\})$ if and
only if $w_i w'_i$ has the same sign for all i.

Proof. We begin by proving the only if part of this theorem,
and apply Lemma 8 to the case where $(\bar{P}, \bar{P}', \{\bar{x}_i\})$ is a
strong realization. In this case, $\bar{w}_i \doteq 1$ and so $w_i \eta_i \epsilon \doteq 1$.
Similarly, $w'_i \eta_i \epsilon' \doteq 1$. Therefore $w_i w'_i \eta_i^2 \epsilon \epsilon' \doteq 1$,
from which it follows that $w_i w'_i \doteq \epsilon \epsilon'$, which is constant
for all i.

Now, we turn to prove the converse. Let $X^+$ be
the set of points $x_i$ such that $w_i > 0$ and let $X^-$ be
the set of points such that $w_i < 0$. The sets $X^+$ and
$X^-$ are separated by the meridian planes of each of
the cameras. Now, we seek a plane that separates $X^-$
from $X^+$ and satisfies the additional condition that
the perspective centres of the two cameras lie on the
same side of the plane if $w_i w'_i > 0$ for all i, or on
opposite sides of the plane if $w_i w'_i < 0$ for all i. Such
a plane can easily be found by slightly displacing the
meridian plane of one of the cameras¹.

Let this separating plane be represented by a 4-vector
v. The condition that both perspective centres
t and t' lie on the same or opposite sides of the plane
may be written as $v^T t \doteq \kappa$ and $v^T t' \doteq \kappa w_i w'_i$, where
$\kappa$ is some non-zero value and $\mathrm{sign}(w_i w'_i)$ is a constant
for all i by hypothesis. The condition that the plane v
separates $X^-$ from $X^+$ may be written as $v^T x_i \doteq \xi w_i$
for some constant $\xi$. (Here t, t' and $x_i$ are written as
homogeneous 4-vectors with last coordinate 1.) Now, let H
be the 4 x 4 matrix whose first three rows are $(I \mid 0)$ and
whose last row is $\kappa \xi v^T$.

Then according to Lemma 9, $\epsilon \doteq v^T t \doteq \kappa$,
$\epsilon' \doteq v^T t' \doteq \kappa w_i w'_i$ and $\eta_i \doteq \kappa \xi v^T x_i \doteq \kappa \xi^2 w_i$. Now substituting
into the equation $w_i \bar{w}_i \epsilon \eta_i \doteq 1$ from Lemma 8
yields $w_i \bar{w}_i \kappa^2 \xi^2 w_i \doteq 1$, from which it follows that
$\bar{w}_i \doteq 1$ as required. Similarly, from the equation
$w'_i \bar{w}'_i \epsilon' \eta_i \doteq 1$ we derive $w'_i \bar{w}'_i \kappa^2 \xi^2 w_i^2 w'_i \doteq 1$, from
which it follows that $\bar{w}'_i \doteq 1$. This shows that
$(\bar{P}, \bar{P}', \{\bar{x}_i\})$ is a strong realization as required. □

¹For this construction to work, it is necessary to make
the additional assumption that the point set $\{u_i\}$ is bounded
in the image plane. This assumption will be true for any reasonable
pinhole camera, which can not have an image of infinite
extent.

Since the epipolar configuration derived from two
images of a real scene must have a strong realization,
this theorem gives a necessary and sufficient condition
for a set of image correspondences to be realizable as
a three dimensional scene. Theorem 10 is illustrated
in Fig 2.
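
The criterion of Theorem 10 can be checked directly from any weak realization. The sketch below assumes NumPy, takes the points as homogeneous 4-vectors, and assumes that "normalized" means only that each camera matrix is scaled so the determinant of its left 3 x 3 block is positive, which is the property used in the proofs above; the function names and the tolerance are illustrative.

```python
import numpy as np

def normalize_camera(P):
    """Flip the sign of P if needed so that det of the left 3x3 block is positive."""
    P = np.asarray(P, dtype=float)
    return P if np.linalg.det(P[:, :3]) > 0 else -P

def depths(P, X):
    """Constants w_i in w_i u_i = P x_i, with x_i and u_i scaled to last coordinate 1."""
    X = np.asarray(X, dtype=float)
    Xh = X / X[:, 3:4]
    return (Xh @ normalize_camera(P).T)[:, 2]

def is_feasible(P, P_prime, X, tol=1e-12):
    """Theorem 10 test: w_i * w'_i has the same (non-zero) sign for all i."""
    prod = depths(P, X) * depths(P_prime, X)
    return bool(np.all(prod > tol) or np.all(prod < -tol))
```

A useful sanity check is that the answer does not change when a realization is distorted by an arbitrary 3D projectivity H (replace P by P H^{-1} and each x_i by H x_i), even though the individual w_i may no longer all be positive.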

For  planar  object  sets,  Theorem  6  established  the 
existence  of  a  good  projectivity  between  the  object 
plane  and  the  image  plane.  For  non-planar  objects 
seen  in  two  views,  strong  realizations  of  the  epipolar 
configuration  take  the  role  played  by  sets  of  image 
points  in  the  two  dimensional  case. 

Theorem 11. Let $(Q, \{u_i\}, \{u'_i\})$ be an epipolar configuration
and let $(P, P', \{x_i\})$ and $(\bar{P}, \bar{P}', \{\bar{x}_i\})$ be two
separate strong realizations of the configuration. Then
the projectivity mapping each point $x_i$ to $\bar{x}_i$ is good.

Proof. With notation as in (3), $w_i \doteq \bar{w}_i \doteq 1$, and
hence from Lemma 8, $\eta_i \epsilon \doteq 1$, which means that all
$\eta_i$ have the same sign. Therefore, by Theorem 5, H is
a good projectivity. □

The particular case where one of the two realizations
is the "correct" realization is of interest. It is
the analogue in three dimensions of Theorem 6.

Corollary 12. If $\{x_i\}$ are points in $\mathcal{R}^3$, image coordinates
$\{u_i\}$ and $\{u'_i\}$ are corresponding image points
in two uncalibrated views, Q is the essential matrix
derived from the image correspondences $u_i \leftrightarrow u'_i$
and $(\bar{P}, \bar{P}', \{\bar{x}_i\})$ is a strong realization of the triple
$(Q, \{u_i\}, \{u'_i\})$, then there is a good projectivity taking
each $x_i$ to $\bar{x}_i$.

From this corollary, we can deduce one of the main
results of this paper.

Theorem 13. Let $(P, P', \{x_i\})$ and $(\bar{P}, \bar{P}', \{\bar{x}_i\})$ be
two different reconstructions of 3D scene geometry derived
as strong realizations of possibly different epipolar
configurations corresponding to possibly different
pairs of images of a 3D point set. Then there is a
good projectivity mapping each point $x_i$ to $\bar{x}_i$.

What this theorem is saying is that if a point set in
$\mathcal{R}^3$ is reconstructed as a strong realization from two
separate pairs of views, then the two results are the
same up to a good projectivity.

Proof. By Corollary 12 there exist good projectivities
mapping each of the sets of reconstructed points $\{x_i\}$
and $\{\bar{x}_i\}$ to the actual 3D locations of the points. The
result follows by composing one of these projectivities
with the inverse of the other. □

7 Orientation

We now consider the question of image orientation.
A mapping h from $\mathcal{R}^n$ to itself is called orientation-preserving
at a point x if the Jacobian of h has positive
determinant at x. Otherwise h is called orientation-reversing.
Reflection of points of $\mathcal{R}^n$ with respect to
a hyperplane (that is, mirror imaging) is an example
of an orientation-reversing mapping. A projectivity h
from $\mathcal{P}^n$ to itself restricts to a mapping from
$\mathcal{R}^n - h^{-1}(L_\infty)$ to $\mathcal{R}^n$, where $L_\infty$ is the hyperplane (line,
plane) at infinity. Consider the case n = 3 and let H
be a 4 x 4 matrix representing the projectivity h. We
wish to determine at which points x in $\mathcal{R}^3 - h^{-1}(L_\infty)$
the map h is orientation-preserving. It may be verified
(quite easily using Mathematica [Wolfram-88]) that if
$H x = w x'$ and J is the Jacobian of h evaluated at x,
then $\det(J) = \det(H)/w^4$. This gives the following
result.

Proposition 14. A projectivity h of $\mathcal{P}^3$ represented by
a matrix H is orientation-preserving at any point in
$\mathcal{R}^3 - h^{-1}(L_\infty)$ if and only if $\det(H) > 0$.

Of course, the concept of orientability may be extended
to the whole of $\mathcal{P}^3$ and it may be shown that h
is orientation-preserving on the whole of $\mathcal{P}^3$ if and only
if $\det(H) > 0$. The essential feature here is that as a
topological manifold, $\mathcal{P}^3$ is orientable. The situation
is somewhat different for $\mathcal{P}^2$, which is not orientable
as a topological space. In this case, with notation
similar to that used above, it may be verified that
$\det(J) = \det(H)/w^3$. Therefore, the following proposition
is true.

Proposition 15. A projectivity h of $\mathcal{P}^2$ is orientation-preserving
at a point u in $\mathcal{R}^2 - h^{-1}(L_\infty)$ if and only
if $w \det(H) > 0$, where $H u = w u'$.
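
The two Jacobian formulas above are instances of $\det(J) = \det(H)/w^{n+1}$, and they can be checked numerically without a computer algebra system. The following sketch assumes NumPy and uses central finite differences; the function names and the step size `eps` are illustrative choices.

```python
import numpy as np

def apply_projectivity(H, x):
    """Map an inhomogeneous point x through the (n+1) x (n+1) matrix H."""
    y = np.asarray(H, dtype=float) @ np.append(x, 1.0)
    return y[:-1] / y[-1]

def numeric_jacobian_det(H, x, eps=1e-6):
    """Determinant of the Jacobian of x -> h(x), by central differences."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    J = np.zeros((n, n))
    for k in range(n):
        step = np.zeros(n)
        step[k] = eps
        J[:, k] = (apply_projectivity(H, x + step)
                   - apply_projectivity(H, x - step)) / (2 * eps)
    return np.linalg.det(J)

def formula_jacobian_det(H, x):
    """Closed form det(H) / w**(n+1), where w is the last coordinate of H (x, 1)."""
    H = np.asarray(H, dtype=float)
    w = (H @ np.append(x, 1.0))[-1]
    return np.linalg.det(H) / w ** (len(x) + 1)
```

For n = 2 the exponent is odd, so the sign of the result depends on w. Evaluating `formula_jacobian_det` on either side of the line $h^{-1}(L_\infty)$ shows the orientation flipping, which is the content of Proposition 15.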

This theorem allows us to strengthen the statement of
Theorem 5 somewhat.

Corollary 16. If h is a good projectivity of $\mathcal{P}^2$ with
respect to a set of points $\{u_i\}$ in $\mathcal{R}^2$, then h is either
orientation-preserving or orientation-reversing at all
points $u_i$. Suppose the matrix H corresponding to h is
normalized to have positive determinant (by possible
multiplication by -1) and let $H u_i = w_i u'_i$. Then h is
orientation-preserving if and only if $w_i > 0$ for all i.

An example where Corollary 16 applies is in the case
where two images of a planar object are taken from
the same side of the object plane. In this case, an
orientation-preserving good projectivity will exist between
the two images. Consequently, all the $w_i$ defined
with respect to a matrix H will be positive, provided
that H is normalized to have positive determinant.

The situation in 3 dimensions is rather more involved
and more interesting. Two sets of points $\{x_i\}$
and $\{\bar{x}_i\}$ that correspond via a good projectivity are
said to be oppositely oriented if the projectivity is
orientation-reversing. This definition extends also to
two strong realizations $(P, P', \{x_i\})$ and $(\bar{P}, \bar{P}', \{\bar{x}_i\})$
of a common epipolar configuration $(Q, \{u_i\}, \{u'_i\})$,
since in view of Theorem 11 the point sets are related
via a good projectivity. Whether or not oppositely oriented
strong realizations exist depends on the imaging
geometry. Common experience provides some clues
here. In particular a stereo pair may be viewed by
presenting one image to one eye and the other image
to the other eye. If this is done correctly, then the
brain perceives a 3-D reconstruction of the scene (a
strong realization of the image pair). If, however, the
two images are swapped and presented to the opposite
eyes, then the perspective will be reversed: hills become
valleys and vice versa. In effect, the brain is able
to compute two oppositely oriented reconstructions of
the image pair. It seems, therefore, that in certain circumstances,
two oppositely oriented realizations of an
image pair exist. It may be surprising to discover that
this is not always the case, as is shown in the following
theorem.

Theorem 17. Let $(Q, \{u_i\}, \{u'_i\})$ be an epipolar configuration
and let $(P, P', \{x_i\})$ be a strong realization of
$(Q, \{u_i\}, \{u'_i\})$. There exists a different oppositely
oriented strong realization $(\bar{P}, \bar{P}', \{\bar{x}_i\})$ if and only if
there exists a plane in $\mathcal{R}^3$ such that the perspective
centres of both cameras P and P' lie on one side of
the plane, and the points $x_i$ lie on the other side.

Before proving this theorem, we need a lemma.

Lemma 18. Let $(P, P', \{x_i\})$ be a strong realization
of an epipolar configuration $(Q, \{u_i\}, \{u'_i\})$. Then
there exists a similarly oriented strong realization
$(\bar{P}, \bar{P}', \{\bar{x}_i\})$ for which $\bar{P} = (I \mid 0)$.

Proof. Suppose $P = (M \mid -Mt)$, with $\det(M) > 0$.
Then multiplication by the matrix

$$\begin{pmatrix} M^{-1} & t \\ 0 & 1 \end{pmatrix}$$

transforms P to the required form. Furthermore,
the corresponding transformation of the points $x_i$ is an
orientation-preserving good projectivity. □

Now,  we  may  prove  the  theorem. 

Proof. (Theorem 17) In light of Lemma 18 it may be
assumed that $P$ and $\bar{P}$ are both of the form $(I \mid 0)$,
because an oppositely oriented pair of realizations
exist if and only if an oppositely oriented pair exist
satisfying this additional condition.

Let us assume that such an oppositely oriented
pair of strong realizations exists and H represents the
orientation-reversing good projectivity relating them.
We define $\epsilon$, $\epsilon'$ and $\eta_i$ as in (3). If necessary, H may
be multiplied by a constant so that $\epsilon = 1$. Since
$w_i \doteq w'_i \doteq \bar{w}_i \doteq \bar{w}'_i \doteq 1$, it follows from Lemma 8
that $\eta_i \doteq 1$ for all i and $\epsilon' = 1$. From the equation
$(I \mid 0) H = (I \mid 0)$ the form of H may be deduced:

$$H = \begin{pmatrix} I & 0 \\ k v^T & k \end{pmatrix}$$

for some 3-vector v and, since H is orientation-reversing,
$k \doteq -1$.

Now, according to Lemma 9, $\eta_i \doteq k \hat{v}^T \hat{x}_i$, where a hat
denotes the homogeneous 4-vector with last coordinate 1, and since
$\eta_i \doteq 1$ and $k \doteq -1$ it follows that $\hat{v}^T \hat{x}_i \doteq -1$. This
condition may be interpreted as meaning that all the
$x_i$ lie on one side of the plane defined by v.

On the other hand, by applying Lemma 9, we get
$\hat{v}^T \hat{t} \doteq \epsilon \doteq 1$ and $\hat{v}^T \hat{t}' \doteq \epsilon' \doteq 1$. These equations
mean that t and t' lie on the opposite side of the plane
v from all the points $x_i$. This completes the only if
part of the proof.

The converse may be proven by working backwards
through this proof. Assuming the existence of a separating
plane v, one constructs the orientation-reversing
matrix H as above and verifies that the resulting
$(\bar{P}, \bar{P}', \{\bar{x}_i\})$ is a strong realization. □

Note  that  the  existence  of  such  a  separating  plane 
as  described  in  Theorem  17  may  be  checked  using  any 
strong  realization. 

8  3D  cheirality  invariants 

The cheirality invariant of a set of points may be
computed from two views by constructing a strong
realization of the epipolar configuration and then invoking
Theorem 13. If in addition each pair of views
is discovered to satisfy the condition of Theorem 17
then the orientation of the set of points with respect
to a canonical basis gives a further invariant.

In general, finding a strong realization involves substantial
computation. It is therefore convenient to be
able to compute the cheirality invariant of a set of
points from a weak realization. This may be done
using the following theorem.

Theorem 19. Suppose $(P, P', \{x_i\})$ is a weak realization
of an epipolar configuration $(Q, \{u_i\}, \{u'_i\})$ and
let constants $\eta_i$ be defined for each $x_i$ as in the definition
of the cheiral invariant. Suppose that $P x_i = w_i u_i$
and define $\tilde{\eta}_i = w_i \eta_i$; then $\sum_i \mathrm{sign}(\tilde{\eta}_i)\, 3^{i-1}$
is the cheiral invariant of a strong realization of
$(Q, \{u_i\}, \{u'_i\})$.

Details of the proof will not be given. It is simply a
matter of considering the composition of two projectivities:
from the strong realization to the weak realization
and from the weak realization to the canonical
frame.

9  Experimental  results 

In considering real images of 3-D configurations it
is necessary to take into account the effects of noise.
In particular, because of measurement inaccuracies, it
will (virtually) never be the case that a point $x_i$ in a
strong realization will map by chance exactly onto the
plane at infinity under the mapping to the canonical
basis. For this reason, in practical experiments I have
preferred to define the cheiral invariant by interpreting
the values $\eta_i$ as bits of a binary integer: $\eta_i > 0$ corresponds
to a 1 bit and $\eta_i < 0$ to a 0 bit. In some cases,
a value of $\eta_i$ will lie so close to 0 that variations due to noise
can swap its sign. For robust evaluation of a cheiral
invariant value, it is necessary to select a noise model
and determine how errors in the input data affect the
sign of each $\eta_i$. An investigation of noise propagation
is underway with the purpose of assigning a computed
error bound to each value $\eta_i$. The developed methods
have not been implemented at present, so in the following
discussion, noise effects are ignored.
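
The binary coding described above can be sketched as follows. The bit ordering is not stated in the text, so this sketch assumes that point i contributes bit i-1 (mirroring the power $3^{i-1}$ in equation (1)); the function name is hypothetical.

```python
def binary_cheirality(etas):
    """Pack the signs of eta_1, ..., eta_N into an integer, one bit per point.

    Assumption: eta_i > 0 sets bit i-1 (least significant bit first);
    eta_i < 0 leaves the bit at 0. Exact zeros are not expected with noisy data.
    """
    code = 0
    for i, eta in enumerate(etas):
        if eta > 0:
            code |= 1 << i
    return code
```

With six points the result lies between 0 and 63. Under this coding, a simultaneous negation of all $\eta_i$ complements every bit, so some convention for fixing the overall sign is needed before values from different computations are compared.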

In [Hartley-93] projective invariants of 3D point
sets were discussed. As an experiment in that paper,
a set of images of some model houses were acquired.
Figures 1, 2 and 3 of [Hartley-93] show the
three images. Corresponding vertices were selected by
hand from among those detected automatically. The
13 vertices used are shown in [Hartley-93], Fig 4.

Six sets of six points were chosen as in the following
table which shows the indices of the points as given in
[Hartley-93], Fig 4. (Entries marked ? are missing from
the source.)

S1 = {1, 2, 3, 6, 9, 10}
S2 = {2, 4, 6, 8, 10, 12}
S3 = {1, 3, 5, 7, ?, 11}
S4 = {?, 2, 3, ?, 7, 8}
S5 = {1, 4, 7, 10, 13, 12}
S6 = {2, 5, 8, 11, 12, 13}

From image correspondences in the first pair of
views ([Hartley-93], Figs 1 and 2) the essential matrix
Q was found and a weak realization $(P, P', \{x_i\})$
was computed. For each of the six sets of indices
shown above a complete projective invariant of the
points $\{x_i\}$ was computed by mapping the first five
points onto a canonical basis. The coordinates of the
mapped sixth point constitute a projective invariant
of the set of six points.

This computation was repeated with a different pair
of views ([Hartley-93], Figs 2 and 3). Theory predicts
that the invariants should have the same value when
computed from different views, and should distinguish
between non-equivalent point sets.

Table (4) shows the comparison of the computed
invariant values. (Entries marked ? are missing or
illegible in the source.)

        S1      S2      S3      S4      S5      S6
S1      ?       0.970   0.975   0.619   0.847   0.823
S2      0.995   0.015   0.064   0.841   0.252   0.548
S3      0.967   0.066   0.013   0.863   0.276   0.516
S4      0.617   0.830   0.873   0.016   0.704   0.752
S5      0.861   0.238   0.289   0.708   0.005   0.590
S6      0.828   0.544   0.519   0.719   0.574   0.0??

(4)

The (i, j)-th entry of the table shows the distance
according to an appropriate metric between the invariant
of set $S_i$ as computed from the first image
pair with that of set $S_j$ as computed from the second
image pair. The diagonal entries of the matrix (in
bold) should be close to 0.0, which indicates that the
invariants had the same value when computed from
different pairs of views.

Although  the  projective  invariants  computed  here 
are  quite  effective  at  discriminating  between  different 
point  sets,  indicated  by  the  fact  that  most  off-diagonal 
entries  are  not  close  to  zero,  entries  (2, 3)  and  (3, 2)  are 
small  indicating  that  the  point  sets  numbered  2  and  3 
are  close  to  being  equivalent  up  to  projectivity. 

Next, the cheirality invariants for each of the point
sets were computed from the weak realization using
the method described here. The computed values for
each of the six point sets were as follows: $\chi(S_1) = 28$,
$\chi(S_2) = 3$, $\chi(S_3) = 59$, $\chi(S_4) = 60$, $\chi(S_5) = 21$,
$\chi(S_6) = 27$. As expected these invariant values
were the same whether computed using the first pair
of views or the second pair. Note that the cheirality
invariant clearly distinguishes point sets 2 and 3. In
fact, all six point sets are distinguished.

Reordering: Although there are no invariants of
projectivity for 5 points in $\mathcal{P}^3$, the cheirality invariant
is defined. In order to estimate its effectiveness
for distinguishing different configurations the following
experiment was carried out. Five points in $\mathcal{P}^3$ were
selected and the cheirality invariant computed for all
permutations of the five points. The result was that
10 different invariant values were found (out of 16 possible),
each one occurring 12 times. It may be seen
that this will be true whichever 5 points are selected
(though the invariant values will be different). In
short, there is about one chance in 10 that two sets
of five arbitrarily selected points will have the same
cheirality.

When this experiment was carried out with 6 points
arbitrarily chosen the results were seen to vary according
to the particular configuration of the points. For
various choices of points it was seen that the probability
of getting a chance match for arbitrary permutations
of the point set is about one chance in 20 or
30.

Conclusions: These results show that the cheirality
invariant is quite effective at distinguishing between
arbitrary sets of points. Given the relative ease
with which the cheirality invariant may be computed,
it may be extremely useful in grouping points. In addition,
it may conveniently be used as an indexing
function in an object recognition system. It has been
demonstrated that the cheirality invariant gives supplementary
information that is not available in projective
invariants. As a theoretical tool, the cheirality
invariants provide conditions under which image point
matches may be realized by real point configurations.

Figure 1. At the left a comb. At the right a naughty projection of the comb.


Weak Realizations of
Feasible  Configurations 


Weak  Realizations  of 
Infeasible  Configurations 


Figure  2.  Each  camera  is  shown  symbolically  as  a  line  representing  the  meridian  plane  and  an  arrow  indicating 
the  direction  of  the  front  of  the  camera.  Each  diagram  represents  a  weak  realization  of  an  epipolar  configuration. 
The  two  top  configurations  of  points  and  cameras  satisfy  the  condition  of  Theorem  10  and  may  be  converted  to 
strong  realizations.  The  two  lower  configurations  do  not,  and  hence  can  not  be  weak  realizations  of  a  real  scene. 




Efficient  Recognition  of  Rotationally  Symmetric  Surface  and 
Straight  Homogeneous  Generalized  Cylinders 


Jane Liu*

GE Corporate Research and Development
Schenectady, NY 12301

David Forsyth

Department of Computer Science
University of Iowa


Joe Mundy

GE Corporate Research and Development
Schenectady, NY 12301

Andrew Zisserman
Department of Engineering Science
Oxford University


Charlie Rothwell

Department of Engineering Science
Oxford University


Abstract 


It is known that rotationally symmetric surfaces
can be recognized from their outlines alone, using
cross-ratios of bitangent intersections. This paper
demonstrates a successful implementation of this technique,
using a novel bitangent finder, that works on
images of real scenes. The technique is shown to extend
to the case of straight homogeneous generalised
cylinders, and a dual construction for computing further
invariants from outlines is displayed.

1. This paper is about recognizing SHGC's and rotationally
symmetric objects, using outlines obtained
from a single view by an uncalibrated camera
at an unknown viewing position.

2. This paper demonstrates a system that works,
built using theory described in another paper
(Proc. ECCV'92). It shows how this theory naturally
extends from rotationally symmetric objects
to SHGC's. It then demonstrates constructions
that yield further information about the surface,
based on the dual of a surface.

3. One section briefly describes material already
published, for background information only, and
some of the paragraphs in the introduction appear
in a paper submitted to the International
Conference on Computer Vision; all other material
is original.

1  Introduction 

Outlines are a potentially important source of information
about the objects in a scene because image
edges appear at most outline points, and image edges
can be computed reasonably reliably. This potential
has not been realised in the case of curved surfaces, because
the outline of a curved surface is extremely hard
to interpret. This paper demonstrates how a useful
range of descriptions for rotationally symmetric surfaces
and for straight homogeneous generalised cylinders
can be determined from an outline in a single
image, using simple geometric arguments. These descriptions
are invariant to camera position, orientation
and calibration.

*Work at GE was supported in part by DARPA under
Contract No. MDA972-91-C-0053.

1.1  Description  for  recognition 

A number of recent papers have shown how projective
or affine invariants can be used to index a model,
and thereby avoid searching a model base (e.g. [10,
47, 49, 21]). To be used for indexing, a function must:

1. be computable from image outline information
alone,

2. ideally have different values for different
objects, and

3. be unaffected by object pose and intrinsic parameters
of the camera.

Functions with these properties have the same value
for any view of a given object, and so can be used
to index into a model base without search; by abuse
of terminology, we call such functions "indexing functions".
Indexing functions have been demonstrated
for polyhedra [37] (derived, as always, from a single,
unknown view), and for rotationally symmetric surfaces
[11] (again, computed from an outline in a single,
unknown view). A companion paper [12] proves
the remarkable fact that a single outline yields all the
projective geometry of an algebraic surface, and so
demonstrates that all the projective invariants of an
algebraic surface can, in principle, be computed from a
single image. In this paper, we demonstrate successful
indexing for a range of images of real scenes using
the constructions of [11]. We show that these constructions
also yield indexing functions for straight homogeneous
generalised cylinders, and we demonstrate
how further indexing functions may be obtained using
the concept of the dual surface.

In a typical recognition system for planar objects,
projective invariants^1 are computed for a range of geometric
primitives in the image. If the values of these
invariants match the values of the invariants for a
known model, we have good evidence that the image
features are within a camera transformation of
the model features. Object models consist of sets of
invariant values and are therefore relatively sparse,
meaning that hypothesis verification is required to
confirm a model match. However, no searching of
the model base is required because the hypothesised
object's identity is determined by the invariant descriptors
measured. As a result, systems with relatively
large model bases can be constructed^2. Systems
of this sort have been demonstrated for plane
objects in a number of papers [10, 40, 49, 21, 42,
48].

This paper concentrates on the more difficult problem
of recognising curved surfaces from a single outline.
Previous approaches include attempts to extend
line labelling [13, 22], the development of constraint-based
systems [4], and the study of how the topology
of a surface's outline changes as it is viewed from different
points, formalised into a structure known as an
aspect graph (for example, [18, 19, 27, 33, 34]). Aspect
graphs can be extremely complicated for even
simple curved surfaces; some examples appear in [33,
34]. Recently, there have been attempts to represent
the system of outlines of a curved surface as a linear
combination of some small number of outlines (see, for
example, [1, 2]). This approach is represented as providing
an approximation sufficiently accurate for some
purposes, although it cannot capture all the complexities
that the aspect graph does.

Another area that has been extensively studied
is the relationship between the differential geometry
of the outline and of the surface, both for single
images [19, 23] and for motion sequences [14,
(), 40]. It is generally accepted that the problem
of recognising a surface from its outline alone is intractable
if the surface is constrained only to be
smooth, or piecewise smooth, as in this case significant
changes can be made to the surface geometry
without affecting the outline from a given viewpoint.
As a result, an important part of the problem involves
constructing as large a class of surfaces as possible
that can either be directly recognised, or usefully
constrained, from their outline alone. Dhome
et al. showed that for a class of rotationally symmetric
surface, object pose could be recovered for a
known, calibrated camera, and incorporated this fact
into a recognition scheme [7], which was later extended
to include straight homogeneous generalised
cylinders [8]. Relationships between sections of the
outline of a straight homogeneous generalised cylinder
have been widely studied, and are known to yield a variety
of surface parameters in orthographic views [28,
44, 45].

^1 A clear introduction to applying invariant theory in
computer vision appears in [i-l].

^2 Current systems using indexing functions have model-bases
containing of the order of thirty objects.

In the case of plane objects, indexing functions are
easy to compute, because viewing a plane curve from
an arbitrary focal point induces an action of the projective
group on the curve. Constructing indexing
functions for three-dimensional objects is challenging,
because it is more difficult to ensure that these functions
are computable from outline information alone,
as changing viewing position no longer induces a group
action on the outline.

1.2  The  outline  and  its  geometry 

Throughout the paper, we assume an idealized pinhole
camera. These cameras possess a focal point and
an image plane. For each point in space, there is a line
through that point and the focal point; the point in
space appears in the image as the intersection of this
line with the image plane. Figure 1 illustrates such a
camera.

It is easy to see that if the focal point is fixed and
the image plane is moved, the resulting distortion of
the image is a collineation^3. In what follows, it is assumed
that neither the position of the image plane
with respect to the focal point nor the size and aspect
ratio of the pixels on the camera plane is known^4, so
that the image presented to the algorithm is within
some arbitrary collineation of the "correct" image. In
this abstract model, the image plane makes no contribution
to the geometry, and its position in space
is ignored. Notice that an orthographic view occurs
when the pinhole is "at infinity".

The outline of a surface is a plane curve in the image,
which itself is the projection of a space curve,
known as a contour generator^5. The contour generator
is given by those points on the surface where the
surface turns away from the image plane; formally, the
ray through the focal point to the surface is tangent
to the surface. As a result, at an outline point, if the
relevant surface patch is visible, nearby pixels in the
image will see vastly different points on the surface,
and so outline points usually have sharp changes in
image brightness associated with them. Figure 1 illustrates
these concepts.

^3 A collineation is a continuous, one-to-one map taking the
projective plane to the projective plane which maps lines to
lines; any collineation is a plane projective transformation.

^4 These quantities can be measured with varying degrees of
difficulty; they do not appear to be particularly stable when
cameras are moved, shaken or dropped, however.

^5 There are a number of widely used terms for both curves,
and no standard terminology has yet emerged.

^6 We ignore cusps in the image outline in what follows.

Figure 1: The outline and contour generator of a
curved object, viewed from a perspective camera.

1.3 Indexing rotationally symmetric objects

It is shown in [11] that:

• Lemma: Except where the image outline cusps^6,
a plane tangent to the surface at a point on the
contour generator (by definition, such a plane
passes through the focal point) projects to a line
tangent to the surface outline, and conversely, a
line tangent to the outline is the image of a plane
tangent to the surface at the corresponding point
on the contour generator.

• Corollary 1: A line tangent to the outline at two
distinct points is the image of a plane through
the focal point and tangent to the surface at two
distinct points, both on the contour generator.

• Corollary 2: The intersection of two lines bitangent
to the outline is a point, which is the image
of the intersection of the two bitangent planes
represented by the lines.

• Lemma: For a rotationally symmetric surface,
the envelope of the bitangent planes must be a
right circular cone, or a cylinder (a cone with vertex
at infinity).

• Lemma: The vertices of the cones bitangent to
a rotationally symmetric surface must lie on the
axis (by symmetry), and so are collinear. The
vertices of the bitangent cones appear in the image
as the intersections of a pair of lines bitangent
to the outline; the image of the axis of the surface
is a line passing through these bitangent intersections.

• Indexing theorem: Cross-ratios of corresponding
image bitangent lines measure projective invariants
of the surface. These projective invariants
are cross-ratios of vertices of the bitangent
cones which project to the bitangent lines. These
invariants are determined from the outline alone.
Furthermore, these image intersections can be
used to construct the image of the axis of a rotationally
symmetric surface from its outline.

Thus, cross-ratios of intersection points of corresponding
bitangent lines yield indexing functions for rotationally
symmetric surfaces. Note, in particular, that
these cross-ratios are invariant to camera calibration,
and so can be used with an unknown camera. In section
2, we show how these cross-ratios can be computed
reliably from image data, and demonstrate a
simple recognition system using these cross-ratios; in
section 3, we show that these cross-ratios can be used
to index straight homogeneous generalised cylinders,
and in section 4, we show a body of mathematical techniques
that can be used to construct further indexing
functions for rotationally symmetric surfaces and for
straight homogeneous generalised cylinders.
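
The invariance claimed above can be checked numerically. The following is a minimal sketch, not the system described in this paper: it assumes that four bitangent intersection points are described by a scalar coordinate along the image of the axis, and it models a change of view as a 1D projectivity t -> (p t + q) / (r t + s). The function names, the sample coordinates and the coefficients are illustrative assumptions.

```python
from fractions import Fraction


def cross_ratio(a, b, c, d):
    """Cross-ratio (a, b; c, d) of four collinear points, each given by a
    scalar coordinate along their common line."""
    return ((a - c) * (b - d)) / ((a - d) * (b - c))


def projective_map_1d(t, p, q, r, s):
    """Image of coordinate t under the 1D projectivity t -> (p*t + q) / (r*t + s)."""
    return (p * t + q) / (r * t + s)


# Hypothetical positions of four bitangent intersections along the axis image.
points = [Fraction(0), Fraction(1), Fraction(3), Fraction(7)]

# Hypothetical change of view; p*s - q*r = 9 is non-zero, so the map is invertible.
coeffs = (Fraction(2), Fraction(1), Fraction(1), Fraction(5))
mapped = [projective_map_1d(t, *coeffs) for t in points]

before = cross_ratio(*points)
after = cross_ratio(*mapped)
print(before, after)
```

Exact rational arithmetic is used so that the two values can be compared for equality; with measured image data the comparison would have to tolerate noise.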

2 A recognition system using cross-ratios

A recognition system using cross-ratios as indexing
functions works as follows:

• cross-ratios are constructed for corresponding
pairs of bitangents in an image;

• these cross-ratios are used as keys to a hash
table that contains the correspondence between
surfaces and cross-ratios, to yield recognition hypotheses;

• the recognition hypotheses are tallied, verified
and accepted or rejected.

In our existing system, we do not verify recognition
hypotheses, as edge-based verification for curved surfaces
is difficult without pose information, which is not
available. The system's model-base contains three surfaces,
and the system assumes that there is only one
surface in each image to simplify the computation of
corresponding bitangents.

The main step is computing cross-ratios from outlines.
This process requires that:

1. all bitangents to the outline be found, and

2. corresponding bitangents be identified and intersected.

2.1  Finding  bitangent  lines  to  a  curve 

A tangent line can be represented by Hough transformation
as $l(\theta, \gamma)$, where $\theta$ is the orientation of the
line and $\gamma$ is the distance from the image center to
the line, as shown in figure 2. Any line in the image
can be mapped into a particular cell in the Hough
transform table by its location and its orientation. If
tangent lines derived from two different points fall into
the same cell in the Hough transformation table, then
those tangent lines are a bitangent line. The process
of finding bitangents proceeds, therefore, by:

1. computing the tangents to the curve and Hough
transforming these lines;

2. checking the Hough transformed system for cells
containing more than one line, which are bitangents.

Figure 2: A Hough transformation table at the right is
created with $\theta$ and $\gamma$ indices. A tangent line to curve
C is represented by $\theta$ and $\gamma$ and stored in the cell $(\theta, \gamma)$
in the Hough transformation table.


2.1.1 Computing and Hough transforming
tangents

Tangent lines are computed using an eigenvector line-fitting
method [9]. As shown in the left half of figure 3,
$l$ is the best eigenvector fitting line based on the 7
points around $P_i$. In the experiments, we used an 11
point neighbourhood. The tangent line at $P_i$ is the
line passing through $P_i$ and parallel to $l$, as labelled in
the figure.
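
The tangent computation can be sketched as follows. This is an illustration under stated assumptions rather than the method of [9]: it takes "eigenvector line fitting" to mean the direction of the principal eigenvector of the 2x2 scatter matrix of the neighbourhood, uses the closed-form angle for a symmetric 2x2 matrix, and truncates the neighbourhood at the ends of an open curve. The names `fit_tangent_direction`, `tangent_at` and `half_width` are hypothetical.

```python
import math


def fit_tangent_direction(points):
    """Orientation (radians, in [0, pi)) of the principal eigenvector of the
    2x2 scatter matrix of the given (x, y) points."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points)
    syy = sum((y - my) ** 2 for _, y in points)
    sxy = sum((x - mx) * (y - my) for x, y in points)
    # Closed form for the major-axis angle of a symmetric 2x2 matrix.
    theta = 0.5 * math.atan2(2.0 * sxy, sxx - syy)
    return theta % math.pi


def tangent_at(curve, i, half_width=3):
    """Tangent at curve[i]: the line through curve[i] parallel to the line
    fitted to the 2*half_width + 1 points around it (fewer near the ends)."""
    lo = max(0, i - half_width)
    hi = min(len(curve), i + half_width + 1)
    return curve[i], fit_tangent_direction(curve[lo:hi])


# Samples of the straight line y = 2x + 1; the fitted direction should be atan(2).
curve = [(t, 2 * t + 1) for t in range(10)]
point, theta = tangent_at(curve, 5)
print(point, theta)
```

Setting `half_width=5` would give the 11 point neighbourhood mentioned in the text.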

When the curve has high curvature, the $\theta$ and $\gamma$
values of consecutive tangent lines can be quite different
because of the sample spacing, with the result
that two consecutive tangents can be marked in cells
some way apart in the Hough space. As a result, bitangent
lines can be missed, because high-curvature
segments of curves can lead to widely scattered points
in the Hough space, which may not intersect properly
(see the right half of figure 3 for an example).
The solution to this problem is to interpolate between
points in the Hough space, using either a linear or
quadratic interpolate, depending on the variation in
$\theta$ (for our experiments we used a quadratic interpolate
if the variation in $\theta$ exceeded 6°, and otherwise a
linear interpolate). This strategy leads to continuous
curves in the Hough space, and is successful in finding
bitangents.
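
The cell-marking and interpolation steps can be sketched as below. This is a simplified illustration, not the bitangent finder used in the experiments: it uses linear interpolation only (the quadratic case is omitted), ignores wrap-around in $\theta$, and reports a cell as a bitangent candidate when it is reached from curve samples whose indices differ by at least `min_gap`. The cell sizes, `min_gap` and the synthetic tangent parameters are all assumptions chosen for the example.

```python
import math
from collections import defaultdict


def hough_cell(theta, gamma, cell_theta=1.0, cell_gamma=1.0):
    """Index of the Hough table cell containing the line (theta, gamma)."""
    return (math.floor(theta / cell_theta), math.floor(gamma / cell_gamma))


def bitangent_candidates(tangents, cell_theta=1.0, cell_gamma=1.0, min_gap=2):
    """Mark the cells visited by the linearly interpolated path through the
    tangent parameters, and return the cells reached from well-separated
    samples of the curve, with the indices of those samples."""
    hits = defaultdict(set)
    for k in range(len(tangents) - 1):
        (t0, g0), (t1, g1) = tangents[k], tangents[k + 1]
        span = max(abs(t1 - t0) / cell_theta, abs(g1 - g0) / cell_gamma)
        steps = 2 * max(1, math.ceil(span))  # at least two marks per cell crossed
        for j in range(steps + 1):
            s = j / steps
            cell = hough_cell(t0 + (t1 - t0) * s, g0 + (g1 - g0) * s,
                              cell_theta, cell_gamma)
            # Attribute each interpolated mark to the nearer curve sample.
            hits[cell].add(k if s < 0.5 else k + 1)
    return {cell: sorted(idx) for cell, idx in hits.items()
            if max(idx) - min(idx) >= min_gap}


# Synthetic (theta, gamma) values for consecutive samples; samples 2 and 7
# share the same tangent line, so they should be reported together.
tangents = [(0.5, 0.5), (5.5, 3.0), (10.5, 5.5), (15.5, 8.0), (20.5, 8.0),
            (20.5, 3.0), (15.5, 3.0), (10.5, 5.5), (5.5, 8.0)]
found = bitangent_candidates(tangents)
print(found)
```

Because interpolation makes the Hough path continuous, cells adjacent to the true crossing may also be reported; a practical finder would merge such neighbouring candidates.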

2.2 Determining corresponding bitangents

Once all bitangents have been found, it is necessary
to determine which pairs of bitangents correspond (i.e.
both come from the same cone of bitangents). This
problem can be solved by exploiting the following remarkable
symmetry property of rotationally symmetric
surfaces:

    Theorem: There is a non-trivial plane projectivity
    which maps the outline of a rotationally
    symmetric surface to itself. The contour
    generators corresponding to each half
    are, in general, space curves, and are related
    by a mirror symmetry in space.

Figure 3: In the left half of the figure, an eigenvector
fitting line $l$ is constructed from seven points
$P_{i-3}, P_{i-2}, P_{i-1}, P_i, P_{i+1}, P_{i+2}, P_{i+3}$, and the tangent
line at point $P_i$ is the line which passes through $P_i$ and
is parallel to $l$. In the right half, $l$ is a bitangent line
which is missed because $l$ is not booked in the Hough
transformation table while scanning $P_i$ and $P_{i+1}$.

In effect, this theorem is a stronger way of stating
that the outline of a rotationally symmetric surface
can be separated into two sides, which are related by
a plane projectivity. To see that the two sides of the
contour in the image are projectively equivalent, for
an arbitrary view, construct the plane containing the
axis of the surface and the focal point. The surface
then has a mirror symmetry in this plane, as does the
cone of rays through the focal point and tangent to
the surface. This cone yields the outline when it is
intersected with the image plane.

If the image plane is perpendicular to the plane of
symmetry, then the outline has a mirror symmetry;
but the outline in any other image plane is within a
projective map, say $T$, of this outline (by construction,
with the focal point as the centre of projection), and
so we can construct a non-trivial projective mapping
that takes the outline to itself as $P = T \circ M \circ T^{-1}$,
where $M$ is a mirror symmetry. Since by construction
$T$ is a projectivity, and $M$ is a projectivity (it can be
given as $\mathrm{diag}[1, -1, 1]$), $P$ is a projectivity.
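
As a short worked check, assume image points are written in homogeneous coordinates $x = (x_1, x_2, x_3)^T$ and that $M = \mathrm{diag}[1, -1, 1]$ acts on them by matrix multiplication (the coordinate convention is an assumption; the text only gives the matrix). Then $P$ is an involution, and its fixed points are the images under $T$ of the fixed points of $M$:

```latex
\begin{aligned}
P^2 &= (T M T^{-1})(T M T^{-1}) = T M^2 T^{-1} = T I T^{-1} = I, \\
M x \sim x &\iff x_2 = 0 \ \text{ or } \ x \sim (0, 1, 0)^T, \\
P (T x) &= T M x \sim T x \iff M x \sim x .
\end{aligned}
```

So $P$ fixes every point of the line obtained by applying $T$ to the line $x_2 = 0$, which is the mirror axis of the symmetric outline, together with one isolated point off that line.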

This delivers a uniform method for determining
points lying on the projection of the 3D symmetry
axis. Any projectively covariant construction^7 in the
particular (symmetric) image plane which generates
points on the image axis can be used in any image.
Examples include:

1. Find corresponding pairs of distinguished points
on each side of the outline, say a corresponding to
a', b corresponding to b'. Then the lines ab', a'b
intersect on the symmetry axis. Appropriate distinguished
points are covariants such as points
of contact of bitangents and inflections.

2. Determine the projective transformation that
maps the contour to itself, and find the points
that are fixed by this transformation. These
points will form the projection of the axis. This
construction might well be the best method (in a
LMS sense), but has not yet been implemented.

^7 By this, we mean that we would obtain the same result if
we were to perform the construction in one frame, and then
project the result to a new frame, or if we were to perform the
construction in the new frame on a projection of the original
curves; constructions with this property are based around incidence
and counting properties. For example, a tangent line is a
covariant construction.

In practice, we use approach 1 (above) with distinguished
points derived from a bitangent's contact
with the curve. Any pair of corresponding bitangents
then generates two points on the axis image: one by
the intersection of the bitangents (i.e. the lines ab
and a'b'), the other by the cross-construction above
(i.e. the lines ab', a'b). This is a simple and successful
construction. Note that the order of the points of tangency
on each bitangent can be given with reference
to their intersection point and so is uniquely defined.
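
The two axis points generated by a pair of corresponding bitangents can be computed with homogeneous coordinates, where both the line through two points and the intersection of two lines are cross products. The sketch below is illustrative only: the helper names are hypothetical, and the points of tangency are hand-picked to be mirror-symmetric about the line x = 0, so that both constructed points should fall on that line.

```python
from fractions import Fraction


def cross(u, v):
    """Cross product of two homogeneous 3-vectors: the line through two
    points, or the intersection point of two lines."""
    return (u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0])


def to_affine(p):
    """Exact affine coordinates of a finite homogeneous point (x, y, w)."""
    return (Fraction(p[0], p[2]), Fraction(p[1], p[2]))


def axis_points(a, b, a_prime, b_prime):
    """Intersection of the bitangents ab and a'b', and the
    cross-construction point where ab' meets a'b."""
    bitangent_meet = cross(cross(a, b), cross(a_prime, b_prime))
    cross_meet = cross(cross(a, b_prime), cross(a_prime, b))
    return bitangent_meet, cross_meet


# Hypothetical points of tangency, mirror-symmetric about the line x = 0.
a, b = (1, 0, 1), (2, 3, 1)
a_prime, b_prime = (-1, 0, 1), (-2, 3, 1)

bitangent_meet, cross_meet = axis_points(a, b, a_prime, b_prime)
axis_pts = [to_affine(bitangent_meet), to_affine(cross_meet)]
print(axis_pts)
```

Since the construction uses only incidence, applying any plane projectivity to the four input points moves the two output points by the same projectivity, which is the covariance property described in the text.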

Now, select any two bitangent lines in the image.
We give a vote to the line c through both their intersection
and cross-construction points. The total number of votes for
the correct image of the central axis, n, is the number
of distinguished bitangent cones constructed by the
shape of the object. The total number of votes for
each incorrect image of the central axis clearly must
be 1 or small if the surface is not degenerate, and so
the line with the maximum number of votes is the image
of the real central axis. This voting system is refined
further by noting that, for real views, it is extremely
hard to arrange the camera such that corresponding
bitangent lines are more than a few degrees away from
parallel. Currently, pairs of bitangents where these
lines are more than 4° off parallel do not contribute
to the vote.


2.3  Results 

In a total of 15 test images, the correct object was
identified in each case. Recognition proceeded by computing
all possible cross-ratios of bitangent intersections
from an image, rounding these values to a single
digit, and using them as a key to a hash table, which
was preloaded with the names of the surfaces, using
cross-ratios computed from one image of each surface.
Tables 1-3 show the details of the returns from
the hash table for a range of different images of different
objects, and table 4 shows the data collated. In
particular, for the stand and the doorknob, a number
of cross-ratios could be computed from each image,
and the final identification was made by voting for
the object with the greatest number of returns. Note
that the technique described shows a degree of
robustness, as surfaces are correctly identified despite
the differing number of cross-ratios computed for each
image as a result of noise-related difficulties in obtaining
all bitangents.
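
The indexing and voting procedure just described can be sketched as follows. This is not the code used for the experiments: "rounding to a single digit" is read here as rounding to one decimal place, which is an assumption, and the surface names and cross-ratio values are invented for illustration, not taken from the tables.

```python
from collections import Counter, defaultdict


def key(cross_ratio):
    """Hash key for a cross-ratio (assumed here: one decimal place)."""
    return round(cross_ratio, 1)


def build_table(model_cross_ratios):
    """Preload the table from the cross-ratios of one view of each surface."""
    table = defaultdict(set)
    for name, values in model_cross_ratios.items():
        for value in values:
            table[key(value)].add(name)
    return table


def identify(table, image_cross_ratios):
    """Tally every label returned by the table and return the tally together
    with the label receiving the most votes (None if nothing is returned)."""
    votes = Counter()
    for value in image_cross_ratios:
        for name in table.get(key(value), ()):
            votes[name] += 1
    winner = votes.most_common(1)[0][0] if votes else None
    return votes, winner


# Invented model values for three surfaces, and invented image measurements.
models = {"lamp": [1.32, 2.07], "stand": [1.31, 1.78, 3.42], "doorknob": [2.52]}
table = build_table(models)
votes, winner = identify(table, [1.29, 1.81, 3.44, 0.9])
print(votes, winner)
```

In this example one key returns two labels (an ambiguous return) and one measurement returns nothing, yet the vote still singles out one surface. A known weakness of fixed rounding is that two nearly equal values on either side of a bin boundary receive different keys.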


Table 1: Hash-table returns for four views of a doorknob,
using a hash table preloaded using a fifth view of
that surface, and views of two other surfaces. Note that
the surface is clearly identified in each case by choosing
the return with the most votes. The columns show the
label returned from the hash table; alternatives are the
correct label, a number of labels including the correct
label, a collection that does not include the correct label,
and nothing at all.

Table 2: Hash-table returns for five views of a lamp,
using a hash table preloaded using a sixth view of that
surface, and views of two other surfaces. Note that the
surface is clearly identified in each case by choosing the
return with the most votes. The columns show the label
returned from the hash table; alternatives are the
correct label, a number of labels including the correct
label, a collection that does not include the correct
label, and nothing at all.

Table 3: Hash-table returns for six views of a stand,
using a hash table preloaded using a seventh view of
that surface, and views of two other surfaces. Note
that the surface is clearly identified in each case by
choosing the return with the most votes. The columns
show the label returned from the hash table; alternatives
are the correct label, a number of labels including
the correct label, a collection that does not include the
correct label, and nothing at all.

Table 4: Composite results of indexing surfaces from a
range of views, showing the surfaces identified by a return
from a hash table, indexed by invariants computed from
image information. Note that in each case, the vast
majority of returns from the hash table either uniquely
identify the correct surface, or contain the correct surface
as an option in an ambiguous return. To identify surfaces,
all returns are taken as votes for the surfaces returned,
and the surface receiving the maximum number of votes is
accepted. In no views, of a total of 15, was the final
identification incorrect.


Figure 4: Typical images of real rotationally symmetric
objects, used to obtain the recognition results; the
top figure shows the knob, the lower figure shows the
stand.


3 Indexing straight homogeneous generalised
cylinders

A straight homogeneous generalised cylinder
(SHGC) can be defined as a surface that, in some Euclidean
frame, can be parametrised as:

$(f_1(t)\,g_1(s),\ f_1(t)\,g_2(s),\ f_2(t))$

Thus, in the appropriate frame, the sections of this
surface corresponding to planes $z = \mathrm{constant}$ are uniformly
scaled copies of the plane curve $(g_1(s), g_2(s))$.
As a result, in this frame the $z$-coordinate axis forms
an "axis" for the surface, which has a similar role to
the axis of a rotationally symmetric surface.

Figure 5: This figure shows all the constructed bitangent
lines and the outlines of the images of three
samples: a lamp (top), a knob and a candle stand
(bottom).

Figure 6: Plane sections of a straight homogeneous
generalised cylinder, illustrating meridians and parallels.

Now consider the family of planes through this axis;
an arbitrary plane from this family is given by $ax +
by = 0$, for some $a$, $b$. In coordinates in this plane, the
intersection between the surface and the plane can be
given by:

$(\lambda f_1(t),\ f_2(t))$

where $\lambda$ is a function of $s$ (figure 6). In particular,
only $\lambda$ changes as we move from plane to plane in the
family. We have:

    Lemma: The envelope of the family of
    planes tangent to the surface along a curve of
    fixed $t$ (a "parallel") is a cone or a cylinder.

The lemma is proven by noting that every tangent
plane in this family intersects the $z$-axis in the same
point; this, in turn, is proven by showing that the
$y$-intercept of a line tangent to a curve of the form
$(\lambda f_1(t), f_2(t))$ is the same for any $\lambda \neq 0$. Note that
the cones or cylinders are also SHGC's, with the $z$-axis
as their "axis" and with the same cross-section as the
surface.
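
The intercept computation behind the lemma can be written out explicitly. The following worked step is a sketch that assumes $f_1$ and $f_2$ are differentiable and that $f_1'(t) \neq 0$ at the parallel considered; the line parameter $u$ is introduced here for illustration only.

```latex
\begin{aligned}
\text{tangent line at } t:&\quad
  \bigl(\lambda f_1(t) + u\,\lambda f_1'(t),\ \ f_2(t) + u\,f_2'(t)\bigr), \\
\text{first coordinate zero:}&\quad
  \lambda\,\bigl(f_1(t) + u f_1'(t)\bigr) = 0
  \ \Longrightarrow\ u = -\frac{f_1(t)}{f_1'(t)} \quad (\lambda \neq 0), \\
\text{intercept with the axis:}&\quad
  f_2(t) - \frac{f_1(t)\,f_2'(t)}{f_1'(t)} .
\end{aligned}
```

The intercept does not involve $\lambda$, so every meridian plane gives the same point on the axis for a fixed $t$. When $f_1'(t) = 0$ the tangent lines are parallel to the axis, which corresponds to the cylinder case of the lemma.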

From  this  lemma,  we  have  immediately: 

    Lemma: Families of planes bitangent to
    SHGC's form cones, with their vertices on
    the axis of the surface.

Thus, we can construct indexing functions for SHGC's
in exactly the same way as we constructed indexing
functions for rotationally symmetric surfaces.

4  Using  the  dual  to  construct  further 
indexing  functions 

The  previous  constructions  have  been  shown  to 
yield  indexing  functions  for  rotationally  symmetric 
surfaces,  which  we  have  shown  have  genuine  value  for 
identifying  the  surface.  Simple  constructions  like  bi¬ 
tangent  cones  appear  to  yield  no  further  invariants: 
for  that,  we  must  pa.ss  to  the  dual  of  the  surface. 

There  is  a  natural  duality  between  points  in  space 
and  planes  in  space;  a  point  is  given  by  four  homoge¬ 
neous  coordinates,  and  so  is  a  plane.  This  duality  can 
be  extended  to  the  case  of  surfaces,  where  the  dual  of 
a  surface  is  defined  to  be  the  object  given  by  the  col¬ 
lection  of  points  dual  to  the  surface's  tangent  planes. 
For  example: 

•  The dual of a plane is a point.

•  The dual of a cone is a plane curve: to see this,
note that the planes tangent to a cone all pass
through its vertex, and hence all satisfy a single
linear equation. Thus, all the points on the dual
must satisfy a single linear equation, and so the
dual must be a plane curve.

•  The dual of a quadric surface given by the equation
$x^T Q x = 0$ is a quadric surface given by the
equation $x^T Q^{-1} x = 0$.
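The quadric case can be checked numerically. The following is a minimal sketch, not part of the original paper: the ellipsoid semi-axes and sample points are arbitrary illustrative values. It forms the tangent plane $\pi = Qx$ at points of the quadric and evaluates the dual equation $\pi^T Q^{-1} \pi$, which should vanish.

```python
import numpy as np

# Ellipsoid x^2/a^2 + y^2/b^2 + z^2/c^2 = 1 written as x^T Q x = 0 in
# homogeneous coordinates (semi-axes are arbitrary illustrative values).
a, b, c = 2.0, 1.0, 0.5
Q = np.diag([1 / a**2, 1 / b**2, 1 / c**2, -1.0])

# Sample points on the ellipsoid.
u, v = np.meshgrid(np.linspace(0.1, 3.0, 7), np.linspace(0.2, 6.0, 9))
u, v = u.ravel(), v.ravel()
X = np.stack([a * np.sin(u) * np.cos(v),
              b * np.sin(u) * np.sin(v),
              c * np.cos(u),
              np.ones_like(u)], axis=1)

# The tangent plane at x has homogeneous coordinates pi = Q x.
planes = X @ Q  # Q is symmetric, so row i equals (Q x_i)^T

# Residuals of the point equation and of the dual (plane) equation.
on_surface = np.einsum("ij,jk,ik->i", X, Q, X)
on_dual = np.einsum("ij,jk,ik->i", planes, np.linalg.inv(Q), planes)
```

Both residual arrays are zero to rounding error, which is the statement that the tangent planes of the quadric form the quadric with matrix $Q^{-1}$ in the dual space.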

The dual has the following important property:

Theorem: The outline of a surface in a perspective
view is equivalent to a plane section
of the dual of that surface, where the sectioning
plane is the plane dual to the focal
point.
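As a numerical illustration of the theorem, the sketch below uses a unit sphere and a focal point on the z-axis; both are assumed example values, not taken from the paper. The planes tangent to the sphere along its contour generator all pass through the focal point, so as points of the dual space they satisfy one linear equation in addition to the dual quadric equation, which makes them a plane section of the dual.

```python
import numpy as np

Q = np.diag([1.0, 1.0, 1.0, -1.0])   # unit sphere, x^T Q x = 0
d = 3.0                               # assumed distance of the focal point
f = np.array([0.0, 0.0, d, 1.0])      # focal point in homogeneous coordinates

# For this configuration the contour generator is the circle z = 1/d.
theta = np.linspace(0.0, 2.0 * np.pi, 40, endpoint=False)
r = np.sqrt(1.0 - 1.0 / d**2)
X = np.stack([r * np.cos(theta),
              r * np.sin(theta),
              np.full_like(theta, 1.0 / d),
              np.ones_like(theta)], axis=1)

planes = X @ Q                        # tangent planes along the contour generator

# Each plane contains the focal point: a single linear relation on the dual.
linear_relation = planes @ f
# Each plane is also a point of the dual quadric.
dual_relation = np.einsum("ij,jk,ik->i", planes, np.linalg.inv(Q), planes)
```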

A version of this result has been well-known amongst
geometers for a long time [h], but it does not appear to
have diffused into the vision community to date. The
result follows from the observation that the outline is,
essentially, formed by the family of planes tangent to
the surface and passing through the focal point. Thus,
it is given by a family of planes tangent to the surface,
and satisfying a single, linear relation. In turn, this
means that the family is a plane section of the dual
surface. We list below how taking duals affects a range
of geometric concepts:

Define a projective rotationally symmetric (PRS)
surface to be a surface that is within a 3D projectivity
of a rotationally symmetric surface. Then the following
is easily proven:

Theorem: The dual of a PRS surface is a
PRS surface. In particular, if M is a meridian
of the original PRS surface S taken in the
frame in which it is rotationally symmetric,
then the dual of S is within a 3D projectivity
of a rotationally symmetric surface whose
meridian is M.

4.1  Example:  proving  existing  results 

Here we rederive the results of section l.il using
the concept of a dual surface. Notice that because
we are dealing with a rotationally symmetric surface,
the bitangents form cones: the duals of these cones
are then plane curves in the dual space, where the
plane the curve lies on is dual to the vertex of the cone
(from above). Thus, there is a system of distinguished
planes in the dual space, where the dual surface has
self-intersections (dual to bitangents). All these planes
have the further property that they are drawn from a
single pencil of planes (the points to which they are
dual are collinear). Since a pencil of planes is a
one-parameter family of planes, parametrized by a line,
these planes have a meaningful cross-ratio.

In a plane section of the dual, any self-intersections
that cross the plane will be obvious as
self-intersections of the section (figure 8). Note that for
some sectioning planes, the singularities of the dual
do not appear in the section, and this corresponds
to those awkward viewing positions where the outline
of a rotationally symmetric surface does not have
bitangents - for example, a view down the axis. If
we construct the lines connecting corresponding
singularities, we obtain lines drawn from a pencil: but
the cross-ratio of these lines is equivalent to the
cross-ratio of the planes, and so is a projective invariant of
the surface, that is invariant to choice of sectioning
plane, as long as it can be observed.
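The invariance of the cross-ratio of a pencil of lines can be illustrated with a short sketch. The helper `cross_ratio`, the four lines and the projectivity `H` below are illustrative assumptions, not the authors' implementation. Lines of a pencil are treated as homogeneous 3-vectors spanned by two of their members, and the returned value is one of the six equivalent orderings of the cross-ratio.

```python
import numpy as np

def cross_ratio(x1, x2, x3, x4):
    """Cross-ratio of four members of a one-parameter linear family
    (collinear points or lines of a pencil) given as homogeneous 3-vectors."""
    basis = np.stack([x1, x2], axis=1)                     # 3x2 basis of the family
    (a3, b3), *_ = np.linalg.lstsq(basis, x3, rcond=None)  # x3 = a3*x1 + b3*x2
    (a4, b4), *_ = np.linalg.lstsq(basis, x4, rcond=None)  # x4 = a4*x1 + b4*x2
    return (b3 * a4) / (a3 * b4)

# Four lines through the common point c (arbitrary example values).
c = np.array([1.0, 2.0, 1.0])
others = [np.array([0.0, 0.0, 1.0]), np.array([3.0, 1.0, 1.0]),
          np.array([2.0, 5.0, 1.0]), np.array([-1.0, 4.0, 1.0])]
lines = [np.cross(c, q) for q in others]

# An arbitrary plane projectivity; points map by H, lines by inverse-transpose.
H = np.array([[1.0, 0.2, 0.5],
              [-0.1, 0.9, 1.0],
              [0.05, 0.02, 1.0]])
H_lines = np.linalg.inv(H).T
mapped = [H_lines @ l for l in lines]

before = cross_ratio(*lines)
after = cross_ratio(*mapped)
```

The two values agree, which is why a cross-ratio measured on lines in one plane section carries over to any other section in which the singularities are visible.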

4.2  A new invariant

Figure 7: This figure shows a cut-away version of a
rotationally symmetric surface; note the inflections and
bitangents of the meridian.

Figure 8: This figure shows a cut-away version of the
dual of the surface in figure 7. Notice the cuspidal
edges, corresponding to inflections of the meridian,
and the self-intersections, corresponding to bitangent
cones: all these singularities lie on parallels.

Most of this discussion is based around the following
useful lemma, which is dual to that giving a projectivity
from the outline of a rotationally symmetric
surface to itself:

Lemma: There is a non-trivial, plane projectivity
that takes any arbitrary plane section
of a PRS to itself.

To prove this, we work in a frame in which the surface
is, in fact, rotationally symmetric. In this frame,
construct a plane through the axis of the surface at
right angles to the sectioning plane. The constructed
plane yields a plane in space about which the surface
and the sectioning plane are symmetric, and so yields a
line in the sectioning plane about which the plane section
has a symmetry. In turn, this symmetry (which is
a projectivity of space, say, I) yields a map that takes
the section to itself in the same way that the symmetry
of an outline from a special viewing position yields a
map that takes the outline to itself (section 2.2), that
is, a map of the form $P^{-1} I P$, where P and I are
projectivities of space.

Given that a surface is a PRS, its dual is a PRS too;
now consider the dual surface, which is PRS, and let us
work in the frame in which this dual surface is rotationally
symmetric, with the z-axis as its axis. For
a general surface, the dual will have self-intersections
which are circles lying on planes of constant z, and so
we can construct a quadric which contains any three
of these circles. In our frame, the quadric will have
the equation:

$x^2 + y^2 - (a z^2 + c) = 0$

It is easy to see that this quadric exists, is unique for
three distinct circles and can be constructed; furthermore,
any intersections between the quadric and the
dual surface will be circles, again lying on planes of
constant z. Since the quadric is uniquely defined by
incidence relations alone, the construction is projectively
covariant, and so any cross-ratios incorporating
this quadric's intersection planes will be projective
invariants.

To show that these invariants can be measured
from a single, unknown image, we need to show that
they can be measured from a single plane section of
the dual. For this, work in a frame in which the
plane section's symmetry is expressed as diag[-1, 1, 1];
then any intersection between a rotationally symmetric
quadric and the surface, that passes through three
singularities, must appear in the section in the form
$x^2 - a y^2 - b y - c = 0$. Three incidence conditions
determine this curve exactly, and since the result is
unique, it must be a section of the unique quadric
passing through the corresponding circles. Thus, the
intersection points between this curve and the plane
section correspond to intersection points between the
dual and the quadric, and we are done.
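The three incidence conditions amount to a 3x3 linear system in (a, b, c). The sketch below is an illustration only: the function name and the singularity coordinates are hypothetical, and the section is assumed to be already in the frame where its symmetry is diag[-1, 1, 1].

```python
import numpy as np

def fit_section_curve(points):
    """Coefficients (a, b, c) of x^2 - a*y^2 - b*y - c = 0 passing through
    three points (x_i, y_i) with distinct y_i."""
    pts = np.asarray(points, dtype=float)
    x, y = pts[:, 0], pts[:, 1]
    # Each point gives one linear equation: a*y^2 + b*y + c = x^2.
    M = np.stack([y**2, y, np.ones_like(y)], axis=1)
    return np.linalg.solve(M, x**2)

# Hypothetical singularities on one side of the section.
singularities = [(1.0, 0.0), (1.5, 1.0), (1.2, 2.5)]
a, b, c = fit_section_curve(singularities)

def residual(x, y):
    """Value of x^2 - a*y^2 - b*y - c; zero on the fitted curve."""
    return x**2 - a * y**2 - b * y - c
```

Because the curve depends on x only through $x^2$, it also passes through the mirrored singularities (-x_i, y_i) on the other side of the symmetry line, so six points are interpolated by three conditions.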

Returning to non-dual space, the dual of the
quadric intersecting the dual surface is again a PRS
quadric, but here tangent to the original surface at the
two circles of inflections, and tangent to the bitangent
cone (again a circle of contact). The projection of this
quadric in the image is the unique conic tangent at
corresponding inflections and bitangent lines on both
"sides" of the outline. New invariants can then be
generated from this conic. For example, corresponding
bitangents between this conic and the outline intersect
on the axis in the same manner as bitangents
of the outline.

These invariants can be constructed from image
data, by taking the dual of the outline, which will
have the features described, and performing the
constructions described on that dual. It is not yet
known whether more efficient algorithms exist, nor is
it known how many independent invariants can be
obtained in this way.

The lemma certainly allows many invariants to be
constructed, either by constructing higher degree
interpolants and using them in the same way the quadric
was used, or by noting that, for any plane section of
the dual, if the section is in the frame in which its
symmetry is of the form diag[-1, 1, 1], then points on the
outline with the same y value correspond to the same
parallel. If the parallel can be identified from plane
section to plane section, for example, by the presence
of a surface marking, a change in colour, or an incidence
property (similar to those above), it can be used
to generate cross-ratios. Thus, a rather full invariant
description of a rotationally symmetric surface is
possible from a single outline.

Figure 9: A plane section of a dual surface, showing
three singularities on each side, with a quadric of the
form $x^2 - a y^2 - b y - c = 0$ superimposed. This is
the unique quadric of this form that passes through
the singularities; it generates two further intersection
planes, which are shown, to yield a cross-ratio.

5  Discussion 

We have demonstrated that rotationally symmetric
surfaces can be successfully indexed using bitangents
computed automatically from image edges by a bitangent
finding algorithm which we have described. We
have shown that this approach can be extended to
recognise straight homogeneous generalised cylinders,
and we have demonstrated techniques for constructing
further indexing functions for rotationally symmetric
surfaces.

Abstract

We  address  the  problem  of  reconstructing  3D 
space in a projective framework from two views,
and  the  problem  of  artificially  generating  novel 
views  of  the  scene  from  two  given  views.  We 
show  that  with  the  correspondences  coming 
from  four  non-coplanar  points  in  the  scene  and 
the  corresponding  epipoles,  one  can  define  and 
reconstruct  (using  simple  linear  methods)  a 
projective  invariant  that  can  be  used  later  to 
reconstruct  the  projective  or  affine  structure  of 
the  scene,  or  directly  to  generate  novel  views 
of  the  scene.  The  derivation  has  the  advantage 
that  the  viewing  transformation  matrix  need 
not  be  recovered  in  the  course  of  computations 
(i.e.,  we  compute  structure  without  motion). 

1  Introduction 

This paper presents a study on the geometric relation
between objects and their views (perspective and orthographic)
geared towards developing tools with applications
to 3D reconstruction and visual recognition. For
this purpose we define a new projective invariant that
can be computed from image measurements across two
views (four corresponding points and the epipoles) using
simple linear methods. The invariant is then used for
reconstructing the 3D scene in projective or affine space,
and for generating novel views of the scene/object
directly, without going through projective coordinates
and camera transformation.

We adopt the projective framework for representing
3D space as was also done recently by [6, 19, 11]. In
a projective framework the scene is represented with
respect to a frame of reference of five points whose locations
in space are unknown and can assume arbitrary general
configurations in 3D projective space [29]. This allows us
to work in a framework that does not make a distinction
between orthographic and perspective views and does
not require internal camera calibration, i.e., the internal
camera parameters are folded into the camera
transformations.

Related to 3D reconstruction is the application to visual
recognition. The alignment approach to recognition
[9, 18, 27, 13, 28] is based on the notion that the geometric
relation between objects and their images can
be used to create an equivalence class of images of an
object of interest. This approach can be realized by
storing a small number of "model" views (two, for example)
and with the help of corresponding points between
the model views and any novel input view, the object is
"re-projected" onto the novel viewing position. Recognition
is achieved if the re-projected image is successfully
matched against the input image. We refer to the problem
of predicting a novel view from a set of model views
using a limited number of corresponding points, as the
problem of re-projection.

The problem of re-projection can in principle be dealt
with via 3D reconstruction of shape and camera motion.
For purposes of stability, however, it is worthwhile
exploring more direct tools for achieving re-projection.
Most of the current tools available for this purpose
assume orthographic projection [28, 14, 22]. The method
of epipolar line intersection is a possibility for achieving
re-projection under perspective [3, 23] but
is singular for certain viewing transformations. For example,
numerical instabilities arise when the centers of
projection of the three cameras are nearly collinear, or
equivalently, when the object rotates around nearly the
same axis for all views. The re-projection method introduced
in this paper is not based on an epipolar intersection,
but rather is based directly on the relative structure
of the object, and does not suffer from any singularities,
a finding that implies greater stability in the presence of
noise.

We derive a geometric invariant defined by a single
cross ratio along a ray cutting through the frame of reference.
The invariant can be used later to recover homogeneous
coordinates if desired, or used directly to achieve
re-projection onto a third view. The derivation has the
advantage that the viewing transformation need not be
recovered in the course of the computations (only the
projections due to two faces of the tetrahedron of reference).
The geometric construction we use requires the
projections of four scene reference points onto two views,
and as the fifth reference point we use the camera's center
of projection via the epipoles. The epipoles are used
both as a fifth corresponding pair and as a means for
determining correspondences due to projections of various
faces of the tetrahedron of reference.

Part of this work originally appeared in [24] describing
the geometric invariant and its application to re-projection,
and was derived independently of [6, 19, 11].
The later stage of reconstructing homogeneous coordinates
given the recovered invariant is inspired by the
work of [6].

2  Projective  Framework  and  Related 
Work 

In a projective framework the location of an object point
is measured relative to a frame of reference of five points
(a tetrahedron and a unit point) whose positions in space
are unknown and which are allowed to map onto any general
configuration of five points in 3D projective space.
It is not difficult to show [25] that the space of images we
can get out of this framework consists of no more than
perspective and orthographic images of the scene, and images
of images of the scene, produced by a pin-hole camera
in which the camera's coordinate frame is allowed to
undergo arbitrary affine transformations in space.

The projective framework enlarges the equivalence
class of images of an object compared to the metric
framework, but in return does not require internal camera
calibration and does not make a distinction between
orthographic and perspective projections. The internal
camera parameters (focal length, principal point and image
coordinates scale factors) are folded into the affine
transformation of the camera coordinate frame ([20], for
example) and, therefore, can assume arbitrary values
(which can also change from one view to another).
Orthographic images are included in this framework because
any of the reference points (including the COP)
can be anywhere in 3D projective space. These features
of the projective framework imply greater stability in
the presence of noise compared to the metric framework
(see [1, 5, 26, 4, 23] for discussions on the performance of
metric structure-from-motion in the presence of noise).

Projective space can be represented by homogeneous
or non-homogeneous coordinates. In a non-homogeneous
representation a point P is represented by three cross
ratios along three axes of the tetrahedron of reference
(see Figure 1). A homogeneous representation
is a tetrad (x, y, z, t) of coordinates which is typically
realized by assigning the standard coordinates
(0,0,0,1), (1,0,0,0), (0,1,0,0), (0,0,1,0), (1,1,1,1) to
the vertices of the tetrahedron O, U, V, W and the unit
point T, respectively (see Figure 2). For example, the
points with t = 0 are on the plane UVW, and the projection
of P via O is the point with coordinates (x, y, z, 0)
(i.e., orthographic projection in coordinate space). In
general, any ordered set of four numbers, not all zero,
determines uniquely a point in space.
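The assignment of standard coordinates to five reference points can be made concrete with a small sketch. The function names and the numerical positions of O, U, V, W, T below are assumptions chosen for illustration; any five points with no four coplanar would do. The matrix built here maps frame coordinates to the ambient homogeneous coordinates, so solving against it yields (x, y, z, t).

```python
import numpy as np

def frame_matrix(O, U, V, W, T):
    """Matrix M whose inverse sends U, V, W, O, T (homogeneous 4-vectors) to
    (1,0,0,0), (0,1,0,0), (0,0,1,0), (0,0,0,1), (1,1,1,1), up to scale."""
    B = np.stack([U, V, W, O], axis=1).astype(float)
    # Column weights chosen so that the unit point T gets coordinates (1,1,1,1).
    weights = np.linalg.solve(B, np.asarray(T, dtype=float))
    return B * weights            # scales column j by weights[j]

def frame_coordinates(M, X):
    """Frame coordinates (x, y, z, t) of X, scaled so the largest entry is 1."""
    c = np.linalg.solve(M, np.asarray(X, dtype=float))
    return c / c[np.argmax(np.abs(c))]

# Assumed reference points (ordinary 3D positions with a trailing 1).
O, U, V, W = [0, 0, 0, 1], [2, 0, 0, 1], [0, 3, 0, 1], [0, 0, 1, 1]
T = [1, 1, 2, 1]
M = frame_matrix(O, U, V, W, T)

# Projection of P via O onto the plane UVW: set t = 0 in frame coordinates.
P = [1, 2, 3, 1]
c = frame_coordinates(M, P)
projected = M @ np.array([c[0], c[1], c[2], 0.0])
```

Mapping `projected` back to the ambient space shows that it lies on the line OP and in the plane UVW, which is the sense in which dropping t acts as an orthographic projection in coordinate space.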

A geometric reconstruction of non-homogeneous coordinates
was recently proposed by Mohr et al. [19].
The authors use the projections of five scene reference
points and the epipolar geometry (the "Essential" Matrix
of [16] which is found by matching eight points) to
determine the projections of the various stages of the
construction needed to determine the three cross ratios
for each point. The construction is elaborate and instead
the authors propose and implement a direct non-linear
algorithm for recovering the camera transformations between
the scene and the two views.

Figure 1: A non-homogeneous representation of space.
The points O, U, V, W define the tetrahedron of reference.
The point $P_x$ is at the intersection of the plane
PVW with the x-axis (the line OU). The point $T_x$ is
similarly constructed by replacing P with the unit point
T (not shown in the drawing). The x coordinate of P
is defined as the cross ratio of O, $T_x$, $P_x$, U (see [29], pp.
191).

Faugeras [6] proposes a linear algorithm for recovering
the camera transformations and the homogeneous coordinates.
The projections of five scene reference points are
used to determine each camera transformation matrix
up to one unknown parameter (a camera transformation
has 11 parameters and the correspondence between the
reference points and their projections adds five more
unknowns, but produces 15 linear equations). The epipoles
are then used as a sixth corresponding pair to fully
determine (projectively speaking) the camera transformations.
Once the camera transformations are recovered it
becomes a simple matter to recover the homogeneous coordinates
of any scene point whose projections in both
views are known. Faugeras then considers the case of
having four corresponding points instead of five. In that
case the camera transformations are recovered up to four
unknown parameters. Once these parameters are set
(arbitrarily), then affine reconstruction becomes possible.

In our framework we do not recover the camera
transformation matrices in order to achieve reconstruction.
Instead we regard the camera's center as part of the
projective reference frame, making it necessary to use only
four corresponding points coming from the scene. This
still enables a projective reconstruction and, in addition,
an affine reconstruction in case the scene undergoes
only affine transformations in space.

In the next section we derive the projective structure
invariant and show how it can be computed given
projections of four scene reference points (four corresponding
points) and the corresponding epipoles. Section 4
describes the method by which 3D reconstruction
is achieved given the recovered invariant. Section 5
describes two schemes for achieving re-projection, one
using the invariant directly, and the other using the
reconstructed structure. Section 6 briefly goes over two
schemes for recovering the epipoles. Finally, Section 7
shows computer simulations intended to test the robustness
of the schemes against noise in image correspondences.

Figure 2: Homogeneous coordinates in space. If P is
any point not on a face of the tetrahedron of reference,
there exist four numbers x, y, z, t, all different
from zero, such that the projections of P from the
four vertices (1,0,0,0), (0,1,0,0), (0,0,1,0), (0,0,0,1)
onto their respective opposite faces are
(0, y, z, t), (x, 0, z, t), (x, y, 0, t), (x, y, z, 0)
(see [29], pp. 194-195).


3  The  Projective  Structure  Invariant 

Let the tetrahedron of reference consist of four scene
points $P_1, \ldots, P_4$ and let the fifth reference point be the
camera's COP, denoted by $O$. Let $P$ be an arbitrary
point of interest, and consider the ray from $O$ to $P$. As
illustrated in Figure 3, the ray $OP$ intersects the two
faces $P_1P_2P_3$ and $P_2P_3P_4$ at $\tilde{P}$ and $\hat{P}$, respectively. We
define our projective structure invariant as the cross ratio
of $P, \tilde{P}, \hat{P}, O$, denoted by $\alpha_p$:

$$\alpha_p = \langle P, \tilde{P}, \hat{P}, O \rangle = \frac{\tilde{P} - O}{P - \tilde{P}} \cdot \frac{P - \hat{P}}{\hat{P} - O},$$

where distances are measured along the ray $OP$. We will
use $\alpha_p$ for reconstructing the homogeneous coordinates
of $P$ and for re-projecting $P$ onto novel views, but first
we describe the way $\alpha_p$ can be computed from image
measurements alone.

Figure 3: Projective structure of a scene point $P$ is
defined with respect to four reference points $P_1, \ldots, P_4$
and the center of projection $O$ of the first camera position.
The camera's center serves as the unit point in
the projective frame of reference instead of a fifth scene
point. The cross ratio, denoted by $\alpha_p$, of the four points
$P, \tilde{P}, \hat{P}, O$ uniquely fixes $P$ with respect to the frame of
reference. The cross ratio can be computed from the
projections of $P, \tilde{P}, \hat{P}, O$ onto the second image plane.
The projection of $O$ is the epipole $v'$, which can be computed
from eight corresponding points [6]; the other projections
$\tilde{p}', \hat{p}'$ can be recovered using the projections of
the four reference points and the corresponding epipoles
$v, v'$. Finally, since $\alpha_p$ is invariant it can be used for
re-projection onto a third view and for reconstructing the
projective structure of the scene.

In the first view all points along the ray $OP$ project
onto a single point, denoted by $p$, in the image plane.
Because internal camera parameters are folded into the
affine component of camera motion, we can assign $p =
(x, y, 1)$ where $(x, y)$ are the observed image coordinates
with respect to some image origin (say the geometric
center of the image plane). Consider next a second view
of the scene. The points $P, \tilde{P}, \hat{P}, O$ project onto generally
distinct points, denoted by $p', \tilde{p}', \hat{p}', v'$, which are also
collinear. Because the two tetrads of points are projectively
related, we have

$$\alpha_p = \langle P, \tilde{P}, \hat{P}, O \rangle = \langle p', \tilde{p}', \hat{p}', v' \rangle,$$

and therefore the structure invariant $\alpha_p$ can be computed
from the projections onto the second view. The projection
of $O$ onto the second view is the epipole $v'$, and similarly
the projection of $O'$ (the COP of the second camera
position) onto the first view defines the other epipole $v$,
and therefore $v$ and $v'$ are corresponding points. We assume
for now that the epipoles are known, and we will
address the problem of finding them later (Section 6).
The point $p'$ is given to us (as we assume that correspondences
between the two views have been established,
as for example by [22, 23, 2]), and we can assign the coordinates
$p' = (x', y', 1)$, where $(x', y')$ are the observed
image coordinates with respect to an arbitrary image
origin. What is left is to recover the points $\tilde{p}'$ and $\hat{p}'$.

In order to determine $\tilde{p}'$ and $\hat{p}'$ we must recover the
projective transformations due to the two faces $P_1P_2P_3$
and $P_2P_3P_4$, respectively. This can be done by identifying
four coplanar points on each of the two faces,
but instead we can make use of the epipoles again. For
example, we can use the projections of $P_1, P_2, P_3$ onto
both views and the corresponding epipoles to uniquely
recover the 2D projective transformation $A$, that when
applied to $p$ will produce $\tilde{p}'$ up to a scale factor. This
is expressed in the following proposition:

Proposition 1 A projective transformation, $A$, which
is determined from three arbitrary, non-collinear, corresponding
points and the corresponding epipoles, is a
projective transformation of the plane passing through
the three object points which project onto the corresponding
image points. The transformation $A$ is an induced
epipolar transformation, i.e., the ray $Ap$ intersects the
epipolar line $p'v'$ for any arbitrary image point $p$ and its
corresponding point $p'$.

Proof: Let $p_j \leftrightarrow p'_j$, $j = 1, 2, 3$, be three arbitrary
corresponding points, and let $v$ and $v'$ denote the two
epipoles. First note that the four points $p_j$ and $v$ and
the corresponding points $p'_j$, $v'$ are the projections of four
coplanar points in the scene. The reason is that the plane
defined by the three object points $P_1, P_2, P_3$ intersects
the line $OO'$ connecting the two centers of projection
at a point, regular or ideal. That point projects onto
both epipoles. The transformation $A$, therefore, is a projective
transformation of the plane $P_1P_2P_3$. Note that
$A$ is uniquely determined provided that no three of the
four points are collinear.

Let $\rho\tilde{p}' = Ap$ for some arbitrary point $p$. Because lines
are projective invariants, any point along the epipolar
line $pv$ must project onto the epipolar line $p'v'$. Hence,
$A$ is an induced epipolar transformation. □

Given the epipoles, therefore, we need just three
points to determine the correspondences of all other
points lying on the plane passing through the
three corresponding object points. The transformation
(collineation) $A$ of the face $P_1P_2P_3$ is determined from
the following equations:

$A p_j = \rho_j p'_j, \quad j = 1, 2, 3$

$A v = \rho v',$

where $\rho, \rho_j$ are unknown scalars, and $A_{3,3} = 1$. One
can eliminate $\rho, \rho_j$ from the equations and solve for the
matrix $A$ from the three corresponding points and the
corresponding epipoles. This leads to a linear system of
eight equations (for more details see appendices in [20,
23]). Similarly, we can solve for the matrix $E$ accounting
for the projection of the face $P_2P_3P_4$ from the equations
below:

$E p_j = \rho_j p'_j, \quad j = 2, 3, 4$

$E v = \rho v'.$
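The elimination of the scale factors can be sketched as follows. This is an illustrative implementation under stated assumptions, not the authors' code: the function name `solve_collineation` is hypothetical, all image points (including the epipoles) are assumed to be given in the normalised form (x, y, 1), and the test matrix and points are made-up values. Each correspondence contributes two linear equations in the eight unknown entries of the matrix once the last entry is fixed to 1.

```python
import numpy as np

def solve_collineation(src, dst):
    """3x3 matrix A with A[2, 2] = 1 such that A (x, y, 1) ~ (x', y', 1) for
    four correspondences; the fourth pair plays the role of the epipoles."""
    M, rhs = [], []
    for (x, y), (xp, yp) in zip(src, dst):
        # x' * (a31*x + a32*y + 1) = a11*x + a12*y + a13, and likewise for y'.
        M.append([x, y, 1, 0, 0, 0, -xp * x, -xp * y])
        rhs.append(xp)
        M.append([0, 0, 0, x, y, 1, -yp * x, -yp * y])
        rhs.append(yp)
    a = np.linalg.solve(np.array(M, dtype=float), np.array(rhs, dtype=float))
    return np.append(a, 1.0).reshape(3, 3)

# Made-up collineation and points (no three of the four are collinear).
A_true = np.array([[1.1, 0.1, 0.3],
                   [-0.2, 0.9, 0.5],
                   [0.01, 0.02, 1.0]])
src = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (2.0, 3.0)]
dst = []
for x, y in src:
    q = A_true @ np.array([x, y, 1.0])
    dst.append((q[0] / q[2], q[1] / q[2]))

A = solve_collineation(src, dst)
```

The normalisation $A_{3,3} = 1$ fails when the true collineation has a zero in that entry or when an epipole lies at infinity; a null-space solution of the homogeneous system is one way around that.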

If we set $\tilde{p}' = Ap$ and $\hat{p}' = Ep$ (note that $\tilde{p}'$ and $\hat{p}'$ are
somewhere along the rays $O'\tilde{P}$ and $O'\hat{P}$, respectively),
then the cross ratio $\alpha_p$ can be computed using the linear
combination of rays result known in projective geometry
([10], for example) as follows: we represent $\tilde{p}'$ and $\hat{p}'$ as
linear combinations of $v'$ and $p'$,

$\rho \tilde{p}' = v' + k p'$

$\mu \hat{p}' = v' + k' p',$

then $\alpha_p = k/k'$ (note that $\rho$ and $k$ are fully determined,
and so are $\mu$ and $k'$). Note that we have made use of
the epipoles twice in our derivations. First, because
of having $O$ as one of our reference points, which by
definition brings the epipoles into the picture. Second,
the epipoles were used in order to determine the image
correspondences due to two faces of the tetrahedron of
reference. Without the epipoles we would have needed
an extra point on each face, hence losing some generality
because some of the reference points would have
been coplanar. The computations for recovering $\alpha_p$ are
simple and linear, and for convenience are summarized
below:

1: Recover the transformation $A$ that satisfies $\rho v' = Av$
and $\rho_j p'_j = A p_j$, $j = 1, 2, 3$. Similarly, recover the
transformation $E$ that satisfies $\rho v' = Ev$ and
$\rho_j p'_j = E p_j$, $j = 2, 3, 4$.

2: Compute $\alpha_p$ as the cross ratio of $p', Ap, Ep, v'$, for
all points $p$.
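The two steps can be tried on synthetic data. The sketch below is an illustration under several assumptions that are not in the paper: the scene points and the 3x4 camera matrices are made-up values used only to synthesise image points and epipoles, the helper names are hypothetical, and the collineations are found as SVD null vectors rather than with the normalisation $A_{3,3} = 1$. The invariant is evaluated as $k/k'$ from the linear combinations $Ap \sim v' + k p'$ and $Ep \sim v' + k' p'$.

```python
import numpy as np

def collineation(src, dst):
    """3x3 matrix H (up to scale) with dst_i ~ H src_i for four pairs of
    homogeneous 3-vectors, taken as the null vector of the 8x9 system."""
    rows = []
    for x, y in zip(src, dst):
        rows.append(np.concatenate([np.zeros(3), -y[2] * x, y[1] * x]))
        rows.append(np.concatenate([y[2] * x, np.zeros(3), -y[0] * x]))
    return np.linalg.svd(np.array(rows))[2][-1].reshape(3, 3)

def combination_ratio(q, v2, p2):
    """k such that q ~ v2 + k * p2 for collinear homogeneous points."""
    (a, b), *_ = np.linalg.lstsq(np.stack([v2, p2], axis=1), q, rcond=None)
    return b / a

def structure_invariant(M1, M2, refs, P):
    """alpha_p of scene point P relative to reference points refs and the
    center of camera M1, computed from image data in views M1 and M2."""
    h = lambda X: np.append(X, 1.0)
    img1 = [M1 @ h(X) for X in refs]
    img2 = [M2 @ h(X) for X in refs]
    v1 = M1 @ np.linalg.svd(M2)[2][-1]   # epipole in view 1 (image of O')
    v2 = M2 @ np.linalg.svd(M1)[2][-1]   # epipole in view 2 (image of O)
    A = collineation(img1[0:3] + [v1], img2[0:3] + [v2])   # face P1 P2 P3
    E = collineation(img1[1:4] + [v1], img2[1:4] + [v2])   # face P2 P3 P4
    p1, p2 = M1 @ h(P), M2 @ h(P)
    return combination_ratio(A @ p1, v2, p2) / combination_ratio(E @ p1, v2, p2)

# Made-up scene and cameras; the first camera center O is the origin.
refs = [np.array([0.0, 0.0, 4.0]), np.array([1.0, 0.0, 5.0]),
        np.array([0.0, 1.0, 6.0]), np.array([1.0, 1.0, 3.0])]
P = np.array([0.3, 0.4, 5.0])
M1 = np.hstack([np.eye(3), np.zeros((3, 1))])
M2 = np.array([[1.0, 0.0, 0.2, -1.0], [0.0, 1.0, 0.0, 0.2], [-0.2, 0.0, 1.0, 0.3]])
M3 = np.array([[1.0, 0.1, 0.0, 0.5], [-0.1, 1.0, 0.1, -1.0], [0.0, -0.1, 1.0, 0.2]])

alpha_12 = structure_invariant(M1, M2, refs, P)
alpha_13 = structure_invariant(M1, M3, refs, P)

def ray_parameter(a, b, c, X):
    """s such that s * X lies on the plane through a, b, c (ray from origin)."""
    n = np.cross(b - a, c - a)
    return (n @ a) / (n @ X)

# Direct evaluation of the cross ratio along the ray O P in 3D.
s_tilde = ray_parameter(*refs[0:3], P)
s_hat = ray_parameter(*refs[1:4], P)
alpha_3d = s_tilde / (1 - s_tilde) * (1 - s_hat) / s_hat
```

The value obtained from views 1 and 2 agrees with the one obtained from views 1 and 3 and with the cross ratio measured directly along the ray, without either camera matrix being recovered from the image data.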

One can easily see how the projective invariant can be
used to re-project the scene onto a third view. Simply
perform Step 1 between the first and novel view
(only four corresponding points and the corresponding
epipoles are required). For any fifth point $p$, its corresponding
point $p''$ in the third image can be found via
$\alpha_p$ that has been recovered from the correspondence between
$p$ and $p'$ (three points on the epipolar line and the
cross ratio uniquely determine the fourth point $p''$). We
will discuss re-projection and 3D reconstruction in more
detail later, but before doing that it may be worthwhile
to consider the situation of orthographic projection.
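Recovering the fourth point from three collinear points and the cross ratio is a small linear computation. The sketch below assumes the convention $a \sim v + k\,p''$, $e \sim v + k'\,p''$ with $\alpha = k/k'$, where $a$ and $e$ stand for the points obtained by applying the two collineations of Step 1 to $p$ and $v$ is the epipole in the novel view; the function name and the numbers are hypothetical.

```python
import numpy as np

def reproject(a, e, v, alpha):
    """Homogeneous point p'' on the line through a, e, v such that
    a ~ v + k p'', e ~ v + k' p'' and k / k' = alpha (alpha != 1)."""
    # Write e = c1 * v + c2 * a; the scales of a and e drop out of the result.
    (c1, c2), *_ = np.linalg.lstsq(np.stack([v, a], axis=1), e, rcond=None)
    return c1 * v + c2 * (1.0 - alpha) * a

# Synthetic check with made-up values and arbitrary scale factors.
v = np.array([0.5, -1.0, 0.2])
p_true = np.array([0.3, 0.1, 1.0])
k, k_prime = 2.0, 0.5
a = 3.0 * (v + k * p_true)
e = -0.7 * (v + k_prime * p_true)

p_rec = reproject(a, e, v, k / k_prime)
p_rec = p_rec / p_rec[2]          # normalise to (x, y, 1)
```

When $\alpha = 1$ the two face points coincide in the image and the construction degenerates, so the fourth point is not determined by this formula.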

As mentioned previously, it is a property of the
projective framework that orthographic projection becomes
a particular case that does not require special treatment,
because the reference frame can map onto any
configuration, including the case where O is at infinity.
Within the proposed geometric construction there are
two points worth mentioning regarding the case of
orthographic projection. First, the invariant $\alpha_p$ remains
fixed under any projective transformation of the second
image plane (the view on which $\alpha_p$ is computed). In
particular, the projection onto the second view can be
orthographic (cross ratios are well defined for parallel
rays as well). Second, consider the case when the first
view is orthographic, i.e., O is at infinity. In this case $\alpha_p$
turns into an affine structure invariant:

$$\alpha_p = \frac{P - \hat{P}}{P - \tilde{P}}$$

As a result, the projective invariant is defined and
recovered under both orthographic and perspective
projections. Therefore, in addition to enabling the use of
uncalibrated cameras, we have the property (associated
with the projective framework and not with the particular
algorithm we proposed) that the size of the field of view
is no longer an issue, as it is in a metric framework [1, 5].

Figure 4: Reconstructing homogeneous coordinates of P
(see text).

We next show how to reconstruct the homogeneous
coordinate representation of the scene given that we have
recovered $\alpha_p$. Taken together, the central result is that
we can recover projective structure without recovering
the camera transforms, using only four corresponding
points and the corresponding epipoles.

4  Reconstructing  Homogeneous 
Coordinates 

Given the invariant structure $\alpha_p$ we can easily
reconstruct the homogeneous coordinates (X, Y, Z, T) of any
fifth object point P (it is actually a sixth point overall,
counting O, but it is the fifth object point). We first assign
the standard projective coordinates to our frame of reference
as follows: the coordinates (1, 1, 1, 1) are assigned to O
(the COP of the first camera position), and the coordinates
(0, 0, 1, 0), (0, 1, 0, 0), (0, 0, 0, 1) and (1, 0, 0, 0) are
assigned to the four reference points $P_1$, $P_2$, $P_3$ and $P_4$,
respectively (see Figure 4).

In this choice of coordinate system we have that $\tilde{P}$ =
(0, y, z, t) and $\hat{P}$ = (x, y, 0, t). Note also that the
projection of $P_4$ onto the plane $P_1P_2P_3$ is the point with
coordinates (0, 1, 1, 1). In order to recover $\tilde{P}$ we map the image
plane onto the plane $P_1P_2P_3$ by solving for the projective
transformation B that is determined by the following four
correspondences. Let $e_1, \ldots, e_4$ be the vectors
(0, 1, 0), (1, 0, 0), (0, 0, 1), (1, 1, 1). The correspondences
$p_j \leftrightarrow e_j$, j = 1, ..., 4, fully determine the projective
transformation B, i.e., $Bp_j = \rho_j e_j$. We can therefore
set the coordinates of $\tilde{P}$:

$$(y, z, t)^{\top} \cong Bp, \quad \text{i.e.,} \quad \tilde{P} \cong (0, y, z, t).$$

In a similar fashion we can recover $\hat{P}$, and with the
knowledge of $\alpha_p$ we can determine the coordinates of P. We
can also do that in a simpler way without recovering $\hat{P}$,
as follows. We know that

$$\mu \hat{P} = O + s' P,$$

$$\rho \tilde{P} = O + s P,$$

and $\alpha_p = s/s'$. Because the third coordinate of $\hat{P}$ is always
zero, we have $s' = -1/Z$. Thus,

$$\rho \tilde{P} = O - \frac{\alpha_p}{Z} P.$$

We have arrived at the following result:

Theorem 1 In the case where the locations of the epipoles
are known, four corresponding points, coming from
four non-coplanar points in space, are sufficient for
computing the 3D homogeneous projective coordinates for
all other points in space projecting onto corresponding
points in both views. In the case where the scene is undergoing
an affine transformation in space, the reconstructed
scene is related to the true one by some unknown affine
transformation.

Note that the assignment of standard coordinates to
the frame of reference is an arbitrary choice of
representation and therefore, in the general case, the
reconstructed structure is unique up to an unknown
projective transformation of the scene. When the scene
undergoes only affine transformations in space, the
COP can have fixed coordinates in space while allowing
the remaining basis points $P_1, \ldots, P_4$ to have any
arbitrary representation in projective space. Because the
COP is part of the reference frame, it is always assigned
the same coordinates regardless of the viewing position
from which we choose to reconstruct the scene.
Therefore, the reconstructed scene, using the algorithm
described above, will be unique up to an unknown affine
transformation in space, and not a general projective
transformation. For convenience one can projectively
transform the reconstructed coordinates (X, Y, Z, T) to
(X, Y, Z, X + Y + Z + T), which ensures that the fourth
coordinate is non-zero.

In comparison with Faugeras’ [6] results, the bottom
line is the same, i.e., with four corresponding points
and the corresponding epipoles we can achieve 3D
reconstruction of projective or affine space. The approach
and the reconstruction algorithm are different, mainly
because we go about the reconstruction process directly,
without first recovering the camera transformation
matrices; instead we first recover a geometric invariant $\alpha_p$,
which can then be used to reconstruct the homogeneous
coordinates. Faugeras first goes through full
reconstruction of the camera transformations using five
corresponding points and the corresponding epipoles. In the
case of four corresponding points (and the corresponding
epipoles), Faugeras shows that the camera
transformation can be recovered up to four unknown parameters.
Once these parameters are set (arbitrarily),
reconstruction follows directly, and if one uses the same setting
of the four parameters when reconstructing the scene
from different view-points, then the reconstructions are
only an affine transformation away from each other. In
our case, instead of fixing four parameters in the camera
transformation from the scene to the first view, we fix
the coordinates of the COP by making it part of
the reference frame.

We next discuss the use of these results (the projective
invariant or the reconstructed scene) for obtaining
re-projection onto a third view.

5  Achieving  Re-projection 

Considering the two views we worked with so far as
“model” views of an object of interest, we can use the
projective invariant $\alpha_p$ or the homogeneous coordinates
to re-project the object onto any novel view given a small
number of corresponding points across the three views.

First, consider the use of $\alpha_p$ to achieve re-projection.
Assume we have four corresponding points across the
three views, $p_j \leftrightarrow p'_j \leftrightarrow p''_j$, j = 1, ..., 4, and the
epipoles $v$, $v'$ between the two model views and $u$, $u''$
between the first model view and the novel view. From
the correspondences $p_j \leftrightarrow p''_j$, j = 1, 2, 3, and $u \leftrightarrow u''$
we recover the collineation B, and similarly from the
correspondences $p_j \leftrightarrow p''_j$, j = 2, 3, 4, and $u \leftrightarrow u''$ we
recover the collineation D. Then, for any corresponding
points $p \leftrightarrow p'$, the third correspondence $p''$ can be
recovered from the cross ratio $\alpha_p$ (computed from the
two model views) and the three points $Bp$, $Dp$, $u''$.

An alternative method is to first reconstruct the
homogeneous coordinates of all points of interest from the
two model views (by using four corresponding points and
the corresponding epipoles). We then need only six
corresponding points between the first model view and the
novel view in order to recover the camera transformation
matrix T from the scene onto the novel view:

$$\rho_j p''_j = T P_j, \quad j = 1, \ldots, 6.$$

Note that we have 11 unknowns for T and 6 more
unknowns for $\rho_j$, but we have 18 linear equations. Then,
for any point p for which we have recovered homogeneous
coordinates of the corresponding scene point P, we can
recover the projection of P onto the novel view by

$$\rho p'' = T P.$$

This method, although less direct than the previous one,
does not require the epipoles between the first model
view and the novel view (which requires eight
corresponding points), and therefore achieves re-projection
with fewer corresponding points with the novel view.

For completeness we next review two methods for
recovering epipoles from point correspondences between
two views. Both methods are linear: one requires
correspondences coming from six points, four of which are
assumed to be coplanar, and the second method requires
eight general correspondences.

6  Recovering  the  Epipoles 

The problem of recovering the epipoles is well known
and several approaches have been suggested in the past
[17, 21, 15, 8, 12, 7].

In general, the epipoles can be recovered from six
points [15] (four of which are assumed to be coplanar),
seven points (non-linear algorithm, see [8]), or eight
points [6]. The basic idea behind the six point method is
that the ray connecting the COP of the first camera
position O and any object point P projects onto an epipolar
line in the second image, and therefore the epipole can be
found by intersecting two epipolar lines (see Figure 5).
Given six points $P_1, \ldots, P_6$ where $P_1, \ldots, P_4$ are
coplanar and $P_5$, $P_6$ are out of that plane, first recover the
projective transformation A that satisfies $\rho_j p'_j = A p_j$,
j = 1, ..., 4; then the epipoles $v'$ and $v$ are obtained as
follows:

$$v' = (p'_5 \times A p_5) \times (p'_6 \times A p_6),$$

$$v = (p_5 \times A^{-1} p'_5) \times (p_6 \times A^{-1} p'_6).$$

Figure 5: The geometry of locating the left epipole using
two points out of the reference plane.

Note that the epipoles are represented as rays with
respect to the camera centers, and therefore the case of
parallel epipolar lines leads to a ray parallel to the
image plane (third coordinate vanishes).
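The six-point construction can be sketched as follows. The sketch assumes the plane collineation A has already been recovered from the four coplanar points and is given as a 3x3 matrix; the function name and argument layout are hypothetical. Each off-plane correspondence contributes the epipolar line through $p'$ and $Ap$, and the epipole is the intersection of the two lines.

```python
def cross(a, b):
    """Cross product of two 3-vectors (line through two points, or
    intersection point of two lines, in homogeneous coordinates)."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def matvec(M, x):
    """Apply a 3x3 matrix (sequence of rows) to a 3-vector."""
    return tuple(sum(m * xi for m, xi in zip(row, x)) for row in M)


def epipole_second_view(A, pair5, pair6):
    """Return v' (up to scale) from two off-plane correspondences.

    Each pair is (p, p_prime) with homogeneous image coordinates.
    """
    p5, p5_prime = pair5
    p6, p6_prime = pair6
    line5 = cross(p5_prime, matvec(A, p5))  # epipolar line through p'_5 and A p_5
    line6 = cross(p6_prime, matvec(A, p6))  # epipolar line through p'_6 and A p_6
    return cross(line5, line6)              # intersection of the two lines
```

A simple way to exercise the function is a pure translation t = (1, 1, 1) with the reference plane at infinity, in which case A is the identity and the epipole should be proportional to t. With scene points (1, 0, 2) and (0, 1, 3) the result is (-4, -4, -4), which is t up to scale. The epipole $v$ in the first view follows by exchanging the roles of the views and using the inverse of A.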

The basic idea behind the eight point method [6] is
that since epipolar lines in both images are projectively
related, the epipolar geometry may be represented
as a 2D correlation matrix. Let F be an epipolar
transformation, i.e., $Fl = \rho l'$, where $l = v \times p$ and $l' = v' \times p'$
are corresponding epipolar lines. We can rewrite the
projective relation of epipolar lines using the matrix form of
cross-products:

$$F(v \times p) = F[v]p = \rho l',$$

where $[v]$ is a skew symmetric matrix (and hence has
rank 2). From the point/line incidence property we have
that $p' \cdot l' = 0$ and therefore $p'^{\top} F[v] p = 0$, or $p'^{\top} H p = 0$,
where $H = F[v]$. The matrix H is a 2D correlation
(i.e., maps points onto lines) and is also known as the
“essential” matrix introduced by [16], and is of rank 2.
One can recover H (up to a scale factor) directly from
eight corresponding points, or by using a principal
components approach if more than eight points are available.
Finally, it is easy to see that $Hv = 0$, and therefore the
epipole $v$ can be uniquely recovered (up to a scale factor).
Note that the determinant of the first principal minor of
H vanishes in the case where $v$ is an ideal point, i.e.,
$h_{11}h_{22} - h_{12}h_{21} = 0$. In that case, the x, y components
of $v$ can be recovered (up to a scale factor) from the third
row of H.
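To make the linearity of the eight point method explicit, the constraint can be written out entry by entry. The normalization of the image points to third coordinate 1 and the entry names $h_{ij}$ are assumptions made here for illustration.

```latex
% One correspondence p = (x, y, 1)^T, p' = (x', y', 1)^T gives one
% homogeneous linear equation in the nine entries of H:
\[
p'^{\top} H p
  = x'x\,h_{11} + x'y\,h_{12} + x'\,h_{13}
  + y'x\,h_{21} + y'y\,h_{22} + y'\,h_{23}
  + x\,h_{31} + y\,h_{32} + h_{33} = 0 .
\]
% Eight such equations determine the nine entries up to a common scale.
% If v = (v_1, v_2, 0)^T is an ideal point, the third row of Hv = 0 reads
\[
h_{31} v_1 + h_{32} v_2 = 0
\quad\Longrightarrow\quad
(v_1, v_2) \propto (h_{32}, -h_{31}).
\]
```

Stacking one such row per correspondence gives an 8 x 9 homogeneous system whose null vector holds the entries of H.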


7  Computer  Simulation 

We ran computer simulations to test the robustness of
the re-projection method under various types of noise.
Instead of measuring the error due to reconstruction we
measured the errors due to re-projection onto a third
view. The assumption is that the performance of
the system (reconstruction and re-projection) largely
depends on the quality of $\alpha_p$, so we may as well
observe noise effects on re-projection. We tested the
system using both schemes for recovering the epipoles. In
general, the 8-point scheme is significantly more
sensitive to noise, and in practice additional corresponding
points are required to achieve reasonable recovery of the
epipoles. The experiments we describe below use the
6-point scheme for recovering the epipoles. Because the
6-point scheme requires that four of the corresponding
points be projected from four coplanar points in space,
it is of special interest to see how the method behaves
under conditions that violate this assumption, and under
noise conditions in general.

The object we used for the experiment consists of 26
points in space arranged in the following manner: 14
points are on a plane (reference plane) ortho-parallel to
the image plane, and 12 points are out of the reference
plane. The reference plane is located two focal lengths
away from the center of projection (focal length is set
to 50 units). The depth of out-of-plane points varies
randomly between 10 and 25 units away from the
reference plane. The x, y coordinates of all points, except the
points $P_1, \ldots, P_6$, vary randomly between 0 and 240. The
points $P_1, \ldots, P_6$ have x, y coordinates that place these
points all around the object (clustering these points
together will inevitably contribute to instability).

We applied the following camera motion. The first
view is simply a perspective projection of the object.
The second view is a result of rotating the object around
the point (128, 128, 100) with an axis of rotation
described by the unit vector (0.14, 0.7, 0.7) by an angle of
29 degrees, followed by a perspective projection (note
that rotation about a point in space is equivalent to
rotation about the center of projection followed by
translation). The third (novel) view is constructed in a
similar manner with a rotation around the unit vector
(0.7, 0.7, 0.14) by an angle of 17 degrees.
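The view generation described above can be sketched as follows. This is only an illustration of the stated setup, not the authors' simulation code: the rotation uses the Rodrigues formula, the axis is re-normalized because the quoted vectors are unit length only to two decimals, and the projection model (x, y) = f (X/Z, Y/Z) with f = 50 is an assumption consistent with the reference plane lying at z = 100.

```python
import math


def rotate_about_point(P, center, axis, degrees):
    """Rotate 3D point P about an axis through `center` (Rodrigues formula)."""
    norm = math.sqrt(sum(a * a for a in axis))
    k = [a / norm for a in axis]                 # unit rotation axis
    v = [P[i] - center[i] for i in range(3)]     # point relative to the pivot
    theta = math.radians(degrees)
    c, s = math.cos(theta), math.sin(theta)
    k_cross_v = (k[1] * v[2] - k[2] * v[1],
                 k[2] * v[0] - k[0] * v[2],
                 k[0] * v[1] - k[1] * v[0])
    k_dot_v = sum(k[i] * v[i] for i in range(3))
    return tuple(center[i] + v[i] * c + k_cross_v[i] * s
                 + k[i] * k_dot_v * (1.0 - c) for i in range(3))


def project(P, focal=50.0):
    """Perspective projection onto the image plane (assumed pinhole model)."""
    return (focal * P[0] / P[2], focal * P[1] / P[2])


# Second and novel views of one scene point, using the values quoted in the text.
center = (128, 128, 100)
P = (10, 20, 110)
second_view = project(rotate_about_point(P, center, (0.14, 0.7, 0.7), 29))
novel_view = project(rotate_about_point(P, center, (0.7, 0.7, 0.14), 17))
```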

We conducted three types of experiments. The first
experiment tested the stability under the situation where
$P_1, \ldots, P_4$ are non-coplanar object points. The second
experiment tested stability under random noise added
to all image points in all views, and the third
experiment tested stability under the situation that less noise
is added to the six points than to other points.


7.1  Testing  Deviation  from  Coplanarity 

In this experiment we investigated the effect of
translating $P_1$ along the optical axis (of the first camera position)
from its initial position on the reference plane (z = 100)
to the farthest depth position (z = 125), in increments
of one unit at a time. The experiment was conducted
using several objects of the type described above (the
six points were fixed, the remaining points were assigned
random positions in space in different trials), undergoing
the same motion described above. The effect of depth
translation to the level z = 125 on the location of $p_1$ is a
shift of 0.93 pixels, on $p'_1$ is 1.58 pixels, and on the
location of $p''_1$ is 3.26 pixels. Depth translation is therefore
equivalent to perturbing the location of the projections
of $P_1$ by various degrees (depending on the 3D motion
parameters).

Figure 6 shows the average pixel error in re-projection
over the entire range of depth translation. The average
pixel error was measured as the average of deviations
from the re-projected point to the actual location of the
corresponding point in the novel view, taken over all
points. Figure 6 also displays the result of re-projection
for the case where $P_1$ is at z = 125. The average error
is 1.31, and the maximal error (the point with the most
deviation) is 7.1 pixels. The alignment between the
re-projected image and the novel image is, for the most
part, fairly accurate.
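The error measure described above can be written down directly. The sketch assumes image points are given as (x, y) pairs and that "deviation" means Euclidean distance in pixels; the function name is hypothetical.

```python
import math


def reprojection_errors(reprojected, actual):
    """Return (average, maximal) Euclidean deviation in pixels between
    re-projected points and their actual locations in the novel view."""
    deviations = [math.hypot(r[0] - a[0], r[1] - a[1])
                  for r, a in zip(reprojected, actual)]
    return sum(deviations) / len(deviations), max(deviations)
```

For example, two points with deviations of 0 and 5 pixels give an average error of 2.5 and a maximal error of 5.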

7.2  Situation  of  Random  Noise  to  all  Image 
Locations 

We next add random noise to all image points in all
three views ($P_1$ is set back to the reference plane). This
experiment was done repeatedly over various degrees of
noise and over several objects. The results shown here
have noise levels between 0-1 pixels randomly added to
the x and y coordinates separately. The maximal
perturbation is therefore $\sqrt{2}$, and because the direction of
perturbation is random, the maximal error in relative
location is double, i.e., 2.8 pixels. Figure 7 shows the
average pixel errors over 10 trials (one particular object,
the same camera motion as before). The average error
fluctuates around 1.6 pixels. Also shown is the result
of re-projection on a typical trial with average error of
1.05 pixels, and maximal error of 5.41 pixels. The match
between the re-projected image and the novel image is
relatively good considering the amount of noise added.

7.3  Random  Noise  Case  2 

A more realistic situation occurs when the magnitude
of noise associated with the six points used for setting
the construction (epipoles and projections of the
tetrahedron of reference) is much lower than the noise associated
with other points, because we are interested in
tracking points of interest that are often associated with
distinct intensity structure (such as the tip of the eye in a
picture of a face). Correlation methods, for instance, are
known to perform much better on such locations than
on areas having smooth intensity change, or areas where
the change in intensity is one-dimensional. We therefore
applied a level of 0-0.3 perturbation to the x and y
coordinates of the six points, and a level of 0-1 to all other
points (as before). The results are shown in Figure 8.
The average pixel error over 10 trials fluctuates around
0.5 pixels, and the re-projection shown for a typical trial
(average error 0.52, maximal error 1.61) is in relatively
good correspondence with the novel view. With larger
perturbations at a range of 0-2, the algorithm behaves
proportionally well, i.e., the average error over 10 trials
is 1.37.

Figure 6: Deviation from coplanarity: average pixel error due to translation of $P_1$ along the optical axis from z = 100
to z = 125, by increments of one unit. The result of re-projection (overlay of re-projected image and novel image)
for the case z = 125. The average error is 1.31 and the maximal error is 7.1.

Figure 7: Random noise added to all image points, over all views, for 10 trials. Average pixel error fluctuates around
1.6 pixels. The result of re-projection on a typical trial with average error of 1.05 pixels, and maximal error of 5.41
pixels.

8  Summary 

We have described new techniques for two related
problems: the problem of recovering structure from point
matches, and the problem of visual recognition via
alignment (the problem of re-projection). Our approach was
based on recovering a geometric projective invariant that
can then be used for both purposes: reconstruction and
re-projection.

The key distinct features of our approach are as follows.
First, the role played by the center of projection and the
epipoles. Second, the approach is primarily geometrically
motivated, with the definition of a new invariant
which then drives the applications of reconstruction
and re-projection. Third, shape reconstruction
and re-projection are achieved without going through
the computations of the camera transformation matrices
(e.g., structure without motion). The overall
features of the approach (shared with [6, 19, 11]) are that
the system treats orthographic and perspective
projections alike, and internal camera parameters are folded
into the projection matrices, thereby allowing for views
to be taken by uncalibrated cameras.

The structure invariant was recovered from four
point matches arising from the projections of four
non-coplanar object points, and the epipoles. The epipoles
played a double role: first, the corresponding epipoles
served as the projection of a fifth point in space, thereby
allowing us to have a projective frame of reference while
observing only four point matches from the scene.
Second, with the epipoles we could determine the
projections of various faces of the tetrahedron of reference, a
task that otherwise would have required observing point
matches coming from four coplanar points on each face.
We then described two applications for which the
invariant can be used. First, we have shown that with the
invariant we can achieve projective or affine
reconstruction of the scene. Second, re-projection onto a third view
was shown possible using the invariant directly without
going through an explicit reconstruction of projective
structure.

Finally, the algorithms for reconstruction require
eight corresponding points, or six assuming four of them
come from coplanar points in the scene. For
re-projection, the result is that the more we recover about
the scene and the camera transformation, the fewer point
matches are needed. We have seen that if projective
structure is recovered, then only six point matches with
the novel view are required for linear re-projection (via
recovery of the camera transform matrix). If the
projective invariant is used instead, then eight point matches
are required.

Figure 8: Random noise added to non-privileged image points, over all views, for 10 trials. Average pixel error
fluctuates around 0.5 pixels. The result of re-projection on a typical trial with average error of 0.52 pixels, and
maximal error of 1.61 pixels.

Abstract

A multilevel energy environment has been
developed that simultaneously performs
delineation, representation and classification of
two-dimensional objects by using a global
optimisation technique. The energy environment
supports a novel multipolar shape representation
which allows the delineation and
representation tasks to be viewed as a single operation.

The delineator acts as a hypothesis generator
for the multipolar representation, which uses
minimum description length tests to determine
whether to establish new polar centers. The
polar representations at these centers are
compared with a database of such representations
in order to identify pieces of objects. This
method is more robust than conventional
multistaged approaches to object recognition
because it incorporates all the information about
the objects into a single decision process.

1  Introduction 

This paper presents a novel approach to two-dimensional
object recognition, which entails the solution of three
sub-problems: delineation, representation and
classification (DRC). Due to the difficulty of solving this problem
in its totality, conventional object recognition systems
address each of the sub-problems separately, and solve
them sequentially. For example, a typical system might
first extract the contour of a region (delineation); then
polygonally approximate the contour (representation);
and finally match pieces of the polygon to polygonal
approximations of objects in a database (classification).
Much work has been done on each of these sub-problems
and the relevant literature is extensive. Attempts to
concentrate on a particular stage always involve (often
implicitly) assumptions regarding the quality of the
available input information. These assumptions are often too
strong and may lead to unpredictable errors in the
recognition process.

A well-known theoretical framework describes the
kinds of difficulties encountered by these conventional
systems. Each of the steps in the DRC process can be
thought of as a sequential stage in a communication
system [8]. The output of each stage depends only on the
input to it from the previous stage. Important
information may be lost through this communication process.
For example, in the typical approach to DRC described
above, the delineator “sees” only the input image and
its performance is dependent on the extraction
parameters. The edge information passed on to the representor
is generally incomplete and may contain false alarms,
thus potentially propagating errors of the first and
second kinds. The representor creates its polygonal
approximation of the contour based only on this edge map;
this process can introduce its own errors into the
system. These compounded errors are then carried on to
the classification stage, where the final decisions of the
recognition process are made.

Energy function (EF) based approaches are of interest
in the DRC problem because of their ability to
incorporate many different objective functions into a single cost
to be minimized. These approaches include a paradigm
called “snakes” first proposed by Kass, Witkin and
Terzopoulos [10]. This is an active contour model approach
which uses controlled continuity splines. The EF
consists of a linear combination of three components which
attract the snake to edges, lines and terminations. The
first component attempts to minimize a cost relating to
the image data, for example, to maximize edge strength
or stereo disparity. The next component minimizes
internal energy and has an effect on how flexible the snake
is allowed to be. These components are integrated over
the contour; hence, local influences are propagated
globally throughout the contour. Other components allow
external influences or volumetric information [2] to
affect the delineation process. The optimization process is
driven by an iterative solution of the Euler equations; it
can be trapped at local minima and is therefore sensitive
to initialization. The process is also computationally
expensive [12]. Moreover, it does not provide a mechanism
for incorporating knowledge about the expected region
shapes into the delineation process; in other words, it
does not integrate the processes of delineation and
classification.

Another EF approach is typified by the minimum
description length (MDL) algorithm due to Leclerc [11].
Here the function involves two components: a
“language” component which represents chain code
descriptions of region boundaries, and a “noise” component
which measures the degree of variability of the pixel
intensities within the regions. The function is applied
globally (i.e. to the entire image), starting with the
partition in which each pixel is a separate region, and
employs a graduated nonconvexity algorithm [1] to drive the
description length to a near global minimum. This
approach is computationally expensive and provides only
a global mechanism (the non-convexity parameter) for
guiding the search. Fua and Hanson [6, 7] presented an
MDL-based DRC technique using a high-level
component which incorporated chain code models for object
shapes. Due to its use of a global approach, their
algorithm could not easily perform local description
adjustments because these adjustments had little effect on the
global cost.

Another possible framework for addressing the DRC
problem has emerged from the work of Geman and
Geman [9]. By pointing out the duality between Markov
Random Fields (MRFs), which are defined by
conditional spatial probabilities, and Gibbs distributions,
which are determined by an EF, Geman and Geman
provided a domain within which sub-processes could operate
and communicate. Moreover, the resulting algorithms
are naturally parallelizable, and also contain a natural
mechanism for hypothesis generation, namely the Gibbs
sampler.
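For reference, the Gibbs distribution associated with an energy function is usually written as below. The symbols ($\omega$ for a configuration of the field, $U$ for the energy, $V_c$ for the clique potentials, $T$ for the temperature and $Z$ for the normalizing constant) are the standard textbook notation and are an assumption here, since this section does not define them.

```latex
\[
P(\omega) = \frac{1}{Z}\,\exp\!\left(-\frac{U(\omega)}{T}\right),
\qquad
U(\omega) = \sum_{c \in \mathcal{C}} V_c(\omega),
\qquad
Z = \sum_{\omega} \exp\!\left(-\frac{U(\omega)}{T}\right).
\]
```

Low-energy configurations are the most probable ones, which is why minimizing the EF corresponds to finding the most probable state of the MRF.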

Geman and Geman employed a MAP estimation
paradigm (which, together with some assumptions about
the noise model, resulted in the EF format) to perform
image restoration by simulated annealing, using a
two-level MRF. At the first level, MRF sites were pixel
intensities with spatially homogeneous clique potentials;
the second level introduced a line process which
eliminated the cliques at suspected contour locations. The
line process did not provide an easy way of
incorporating information about arbitrary region shapes. Also, the
computational cost of implementing such an MRF, even
for a 64 × 64 image, was prohibitive.

A  more  efficient  way  of  using  the  Gemans’  approach 
was  introduced  in  a  paper  by  Friedland  and  Adam  [4]. 
Here,  a  one-dimensional  cyclic  MRF  was  proposed, 
where  the  MRF  sites  were  radii  emanating  from  a  given 
center.  This  had  the  effect  of  reducing  the  size  of  the 
MRF  by  at  least  two  orders  of  magnitude.  The  paper 
presented  an  approach  to  determining  cavity  boundaries 
in echocardiograms using an EF whose components represented
edge strength, contour smoothness, and cavity
volume.  These  factors  were  considered  simultaneously 
since  they  all  resided  in  the  same  EF. 

This  paper  presents  an  integrated  approach  to  DRC 
based  on  EF  minimization.  Section  2  describes  a  DRC 
algorithm  for  what  we  call  “compact”  2-D  objects.  A 
compact  object  is  star-shaped,  i.e.  its  entire  contour  is 
visible  from  an  interior  point;  in  addition,  we  require 
that  the  distances  from  this  point  to  the  contour  of  the 
object  are  not  highly  variable,  so  that  the  contour  can  be 
represented  in  polar  form  by  a  smoothly  varying  radius 
function. Compact object DRC is performed using a 1-D
cyclic MRF which provides a framework for a polar
object representation, and in which delineation and classification
are performed by appropriate EF components.
Thus local and global criteria are combined in the same
EF, which results in a single-stage recognition process.
The  optimization  method  used  is  simulated  annealing, 
which  provides  the  means  of  driving  the  energy  to  its 
lowest  value,  and  hence  leads  to  an  MRF  state  with  the 
highest  probability. 

Sections  3  and  4  extend  the  approach  to  “compound” 
objects,  which  are  unions  of  small  numbers  of  compact 
objects;  here  the  contour  visibility  requirement  is  relaxed 
and  the  objects  are  allowed  to  have  deep  concavities  and 
lobes. For this class of objects we use a “multipolar” representation,
in which segments of the object’s contour are
represented by radius functions defined over sectors that
emanate from a set of centers, as described in Section 3.
As we shall see in Section 4, this allows us to perform
DRC using a set of sector MRFs. It also overcomes the
occlusion problems from which the single-polar representation
suffers.

2  Compact  Object  DRC 

2.1  The  1-D  Cyclic  MRF 

Let $R = (r_1, \ldots, r_n)$ be a vector of discrete random variables
$r_i$, $1 \le i \le n$, which represent radii emanating from a given
center, and let $\omega = (\omega_1, \ldots, \omega_n)$, a vector
of radius lengths, define a possible configuration
$(r_1 = \omega_1, r_2 = \omega_2, \ldots, r_n = \omega_n)$ of $R$; the set $\Omega$ of such
$\omega$'s is the MRF’s sample space. A 1DCMRF is then
defined by

$$P(R = \omega) > 0 \quad \forall \omega \in \Omega \qquad (1)$$

$$P(r_i = \omega_i \mid r_j = \omega_j;\ j \ne i) = P(r_i = \omega_i \mid r_j = \omega_j;\ j \in N_i) \qquad (2)$$

$$\forall i \in (1, 2, \ldots, n);\ \forall \omega \in \Omega,$$

where $N_i$ is a neighborhood of $r_i$. For simplicity, we
take $N_i$ to be $(i - 1, i + 1)$ and we use equispaced radii
at angular intervals of $\Delta\theta = 2\pi/n$. Since we are dealing
with a closed contour, $r_{n+1} = r_1$ and the neighborhoods
are defined modulo $n$. We also assume that the $\omega_i$ take
on discrete values in a bounded range.

Figure 1 illustrates a 1DCMRF, its sites (radii), its
sample space (radius values), its center location and its
neighborhood system. Note that an MRF configuration
$R = \omega$ defines the polar representation of a contour
relative to the given center.
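
As a minimal sketch of how a configuration defines a contour, the following Python fragment converts a list of radius values into Cartesian contour points and returns the cyclic neighbors of a site. The function names and the choice of measuring angles from the positive x-axis are assumptions made for illustration; they are not taken from the original implementation.

```python
import math

def polar_to_contour(center, radii):
    """Turn a cyclic radius configuration into Cartesian contour points.

    Radius i is assumed to point in direction i * 2*pi/n, measured from
    the positive x-axis (an illustrative convention).
    """
    cx, cy = center
    n = len(radii)
    dtheta = 2.0 * math.pi / n
    return [(cx + r * math.cos(i * dtheta), cy + r * math.sin(i * dtheta))
            for i, r in enumerate(radii)]

def neighbors(i, n):
    """Indices of the two neighbors of site i, taken modulo n (closed contour)."""
    return ((i - 1) % n, (i + 1) % n)
```

For example, `neighbors(0, 4)` yields the sites 3 and 1, which reflects the closed-contour condition that the neighborhoods are defined modulo n.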

The 1DCMRF is initialized by choosing a center,
which is assumed to lie somewhere inside the object’s
contour. (The problem of object detection is not addressed
here; if an object has been detected but its location
is known only approximately, it may be necessary
to try a set of initial centers in order to ensure that at
least one of them lies inside the object.) Each of the n
radii emanating from the center is assigned a randomly
chosen initial value in the bounded range.

Our implementation of the 1DCMRF allows the center
to shift. At given steps in the optimization process we
calculate the centroid of the current 1DCMRF configuration
$\omega$. This centroid becomes the new center. Using
this new center the 1DCMRF is regenerated; $\omega$ is transformed
into $\omega'$, defined relative to the new center. This
ability to shift the center eliminates the need to choose
a “good” center initially. As we shall see (Figure 2), the
shifting process tends to eventually shift the center to a
position close to the object’s centroid. As a result, the
polar representations of similar objects tend to be similar;
thus it is meaningful to use these representations for
classification purposes.
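
The center shift can be sketched as two steps: compute the centroid of the current contour, then re-measure the radii from that point. The Python fragment below is an illustration under stated assumptions: the contour is treated as a polygon through the radius endpoints, the centroid is the polygon's area centroid (the text does not say which centroid is used), and the new radii are found by intersecting rays with the polygon edges. All function names are hypothetical.

```python
import math

def polygon_centroid(points):
    """Area centroid of a simple polygon (shoelace formula)."""
    area2 = cx = cy = 0.0
    n = len(points)
    for i in range(n):
        x0, y0 = points[i]
        x1, y1 = points[(i + 1) % n]
        cross = x0 * y1 - x1 * y0
        area2 += cross                  # twice the signed area
        cx += (x0 + x1) * cross
        cy += (y0 + y1) * cross
    return (cx / (3.0 * area2), cy / (3.0 * area2))

def resample_radii(points, center, n):
    """Re-express a star-shaped polygon as n equispaced radii about `center`."""
    cx, cy = center
    radii = []
    for k in range(n):
        theta = 2.0 * math.pi * k / n
        dx, dy = math.cos(theta), math.sin(theta)
        best = 0.0
        for i in range(len(points)):
            px, py = points[i]
            qx, qy = points[(i + 1) % len(points)]
            ex, ey = qx - px, qy - py           # edge direction
            wx, wy = px - cx, py - cy           # center -> edge start
            denom = dx * ey - dy * ex
            if abs(denom) < 1e-12:              # ray parallel to edge
                continue
            t = (wx * ey - wy * ex) / denom     # distance along the ray
            u = (wx * dy - wy * dx) / denom     # position along the edge
            if t > 0.0 and -1e-9 <= u <= 1.0 + 1e-9:
                best = max(best, t)
        radii.append(best)
    return radii
```

Calling `resample_radii(points, polygon_centroid(points), n)` gives the transformed configuration relative to the shifted center.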


2.2  The  Energy  Function 

Creating an EF framework for compact object DRC involves
defining energy components to perform the appropriate
subtasks. Our EF consists of two parts: a
Low Level (LL), consisting of locally computed quantities
that are used to delineate the object’s contour; and
a High Level (HL), which matches the polar representation
of the contour to a database of polar contour models
to perform classification. Symbolically,

$$E(\omega) = W_1 \times E_{LL}(\omega) + W_2 \times E_{HL}(\omega) \qquad (3)$$

where $W_1$, $W_2$ are weights and $E_{LL}$, $E_{HL}$ are the LL and
HL energy components, respectively. The weight $W_2$
in Equation (3) is allowed to vary in the course of the
optimization; its value depends on the difference between
the current polar representation and its best-matching
model in the database, as described in Section 2.2.2.


2.2.1  The  Low  Level 

The LL component of the EF favors a delineation of
the object that maximizes contour smoothness and edge
sharpness. These properties depend on very localized
regions along the contour. Edge sharpness is measured
along each radius individually, while contour smoothness
is measured by comparing the values of each radius $r_i$
and its neighbors $r_{i-1}$ and $r_{i+1}$.

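The contour (non)smoothness measure is

$$E_{smoothness}(\omega) = \frac{1}{n} \sum_{i=1}^{n} \left| \omega_i - 0.5 \times (\omega_{i-1} + \omega_{i+1}) \right| \qquad (4)$$

where $\omega_{i-1}$, $\omega_i$, $\omega_{i+1}$ are consecutive radius values. The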

edge sharpness measure along each radius is a difference
of average gray levels; thus

$$E_{step}(\omega) = \frac{1}{nM} \sum_{i=1}^{n} \left( \sum_{k=1}^{M} g(\omega_i + k) - \sum_{k=0}^{M-1} g(\omega_i - k) \right) \qquad (5)$$


Figure  2:  An  example  of  our  approach  to  compact  object 
DRC. 


where $g(\cdot)$ is the gray level at a given point. These two
measures are combined linearly into the LL component
of the EF:

$$E_{LL}(\omega) = a_1 \times E_{smoothness}(\omega) + a_2 \times \left( 1 - \frac{E_{step}(\omega)}{MAXGRAY} \right) \qquad (6)$$

where $a_1$, $a_2$ are weights and MAXGRAY is the gray
level range in the image.
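
The following Python sketch shows one way the LL component could be evaluated. It is an illustration only: the equations are damaged in the source, so the exact normalizations are assumptions. Here smoothness is taken as the mean absolute deviation of each radius from the average of its two neighbors, and edge sharpness as the absolute difference between the mean gray level of M samples just outside the contour and M samples at or inside it. The callback `gray(i, rho)`, which returns the gray level at distance `rho` along radius `i`, is a hypothetical helper.

```python
def smoothness(w):
    """Mean absolute deviation of each radius from its neighbors' average."""
    n = len(w)
    return sum(abs(w[i] - 0.5 * (w[i - 1] + w[(i + 1) % n]))
               for i in range(n)) / n

def step(w, gray, M):
    """Mean absolute difference of average gray levels across the contour.

    gray(i, rho) is an assumed callback giving the gray level at
    distance rho along radius i.
    """
    total = 0.0
    for i, r in enumerate(w):
        outside = sum(gray(i, r + k) for k in range(1, M + 1)) / M
        inside = sum(gray(i, r - k) for k in range(0, M)) / M
        total += abs(outside - inside)
    return total / len(w)

def e_ll(w, gray, M, a1, a2, maxgray):
    """Linear combination of smoothness and (inverted) edge sharpness."""
    return a1 * smoothness(w) + a2 * (1.0 - step(w, gray, M) / maxgray)
```

With a bright disc of radius 5 on a dark background, a constant configuration at radius 5 gives a lower LL energy than one at radius 3, because only the former straddles the edge.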

The LL can also be used to introduce information
about the contrast between the object and the background.
If we have prior knowledge about what the contrast
of the edges should be, edges with improper contrasts
can be ignored in Equation (5), allowing the LL
to consider relevant edges only.

The LL component of the EF is assigned a constant
weight $W_1$ throughout the optimization. Its initial role
is to construct an initial approximation of the object’s
contour; later, when the HL component acquires more
weight (see Section 2.2.2), the LL component ensures
that the HL does not “hallucinate” matches that are not
corroborated by the image data.


2.2.2  The  High  Level 

The HL component is responsible for object classification.
Its contribution to the optimization process consists
of two parts: a fast process, which actively participates
in the optimization at every iteration, and a slow
process which, at regular iteration intervals, selects an
object model that best matches the current MRF configuration
(i.e., the current polar contour representation)
and adjusts the value of the weight $W_2$. Specifically,

•  At every $k$-th step (for some preselected $k$) of the iterative
optimization process, we compare the current
MRF configuration $R = \omega$ with a database of polar
contour models $S^{obj,\theta}$, where obj is an object and
$\theta$ is the object’s orientation. The models are stored
in normalized form, i.e. each of them is divided by
the mean of its radius values, and are rescaled to $\bar{\omega}$,
the mean of $\omega$, for matching purposes.


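In the matching process, the match error is defined
by

$$Er^{obj,\theta} = \frac{1}{n} \sum_{i=1}^{n} \frac{\left| \omega_i - \bar{\omega} \times s_i^{obj,\theta} \right|}{\bar{\omega} \times s_i^{obj,\theta}} \qquad (7)$$

where $s_i^{obj,\theta}$ is the $i$-th radius value of the model
$S^{obj,\theta}$. The process selects the model $S^{obj,\theta}$ that
has the lowest $Er$ value. This error value can be
used to define a halting criterion by comparing it to
a cutoff threshold.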


The weight $W_2$ assigned to $E_{HL}$ is inversely dependent
on the lowest $Er$ value:

$$W_2 = a_3 \times \max_{obj,\theta} \left( 1 - Er^{obj,\theta} \right). \qquad (8)$$


The  updated  HL  component  is  the  sum  of  the  quantities 


$E_{HL}(\omega_i)$ =


(9) 


The weight W2 controls the influence of the selected
object model on the optimization. The smaller the error,
the larger W2 becomes relative to W1. Thus the HL
component  acquires  more  weight  as  its  confidence  in  its 
match  increases.  Since  all  the  models  in  the  database 
are  compact,  any  of  them  helps  the  LL  attain  contour 
smoothness  in  the  initial  stages  of  the  optimization.  As 
the  optimization  approaches  the  global  minimum  of  the 
energy,  and  the  MRF  configuration  approaches  one  of 
the  models  in  the  database,  W2  grows.  As  a  result  the 
HL  component  becomes  more  dominant  and  speeds  up 
the  final  convergence  of  the  optimization. 
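
The slow part of the HL can be sketched in Python as follows. This is a hedged illustration: the match error is assumed to be the mean relative deviation between the current radii and a normalized model rescaled to the current mean radius, which is one reading of Equation (7); the dictionary layout of the model database and all function names are hypothetical.

```python
def match_error(w, model):
    """Mean relative deviation between radii w and a normalized model.

    `model` is assumed to have mean radius 1 and the same number of
    radii as w; it is rescaled to the mean of w before comparison.
    """
    n = len(w)
    mean_w = sum(w) / n
    return sum(abs(w[i] - mean_w * model[i]) / (mean_w * model[i])
               for i in range(n)) / n

def best_model(w, models):
    """Return ((obj, theta), error) for the lowest-error model.

    `models` maps (obj, theta) keys to normalized radius lists.
    """
    errors = {key: match_error(w, s) for key, s in models.items()}
    key = min(errors, key=errors.get)
    return key, errors[key]

def hl_weight(a3, lowest_error):
    """Weight W2 grows as the best match error shrinks, as in Equation (8)."""
    return a3 * (1.0 - lowest_error)
```

A perfect match gives an error of 0 and therefore the largest weight, `a3`; a poor match keeps the HL contribution small so that the LL dominates.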

Note  that  the  HL  is  an  active  participant  in  the  EF, 
not merely a postprocess. This gives the system top-down
qualities, a very important point. Also, by providing
a halting criterion, the HL allows a result of “don’t
know” to be obtained.


2.3  The  Compact  Object  DRC  Algorithm 

The algorithm is initialized by choosing a center and establishing
a 1DCMRF $R$ by choosing $n$ radii emanating
from this center at angular intervals of $2\pi/n$. A random
initial guess configuration $R = \omega_0$ is assigned to the
1DCMRF so that an initial Gibbs density, which corresponds
to the probability density of each site $r_i \in R$, can
be calculated (see [4]). Then the optimization process is
allowed to begin. At every $k$-th iteration of the process,
the center is shifted to the centroid of $\omega$. A model which
best matches the resulting configuration is selected from
the database. This model is subsequently used in the HL
component of the EF. The Gibbs densities are also redetermined.
This process continues until either a closely
matching model has been found, the halting criterion
has been satisfied, or an upper bound on the number of
iterations has been reached.
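
The fast optimization step can be illustrated with a small Gibbs-sampler annealing loop in Python. This is a sketch under assumptions, not the authors' implementation: the geometric cooling schedule, the parameter defaults, and the function names are all hypothetical, and the energy function is supplied by the caller. The periodic center shift and model selection described above are omitted for brevity.

```python
import math
import random

def gibbs_update(w, i, values, energy, temperature, rng):
    """Resample site i from its conditional Gibbs distribution."""
    energies = []
    for v in values:
        candidate = w[:i] + [v] + w[i + 1:]
        energies.append(energy(candidate))
    e_min = min(energies)               # subtract the minimum for stability
    weights = [math.exp(-(e - e_min) / temperature) for e in energies]
    w[i] = rng.choices(values, weights=weights, k=1)[0]

def anneal(w0, values, energy, t0=1.0, cooling=0.9, sweeps=50, seed=0):
    """Simulated annealing with an assumed geometric cooling schedule.

    Returns the lowest-energy configuration visited and its energy.
    """
    rng = random.Random(seed)
    w = list(w0)
    best, best_e = list(w), energy(w)
    temperature = t0
    for _ in range(sweeps):
        for i in range(len(w)):
            gibbs_update(w, i, values, energy, temperature, rng)
        e = energy(w)
        if e < best_e:
            best, best_e = list(w), e
        temperature *= cooling
    return best, best_e
```

Because the sampler occasionally accepts higher-energy values while the temperature is high, the process can leave a local minimum, which is the behavior the experimental section relies on.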

It  is  important  to  stress  that  the  process  uses  the  same 
EF  system  throughout,  though  the  EF  itself  undergoes 
changes  as  the  process  proceeds. 

2.4  Experimental  Results 

In this section we demonstrate some of the characteristics
of the compact DRC algorithm through an example,
using an image of a tank obtained by a forward-looking
infra-red (FLIR) sensor. Figure 2 shows the MRF
configuration at every tenth iteration. The initial image
window (85 by 85 pixels) is shown on the upper left; the
location of the center is marked with a small +. Note
that throughout the process, the + remains in the center
of the window; as the center shifts, the window shifts
with it.

The images have had their gray scales compressed
from [0,255] to [60,195] for display purposes. The white
contour (gray level 255) superimposed on each image
shows the current MRF configuration. The black contour
(gray level 0) shows the current best matching
model, scaled by $\bar{\omega}$ and rotated by $\theta$. The number in
the top left corner of each image designates this object,
and the number in the upper right corner represents its
degree of match, $1 - Er$. The two lower numbers are the
contrast between the object and the background (left)
and the iteration number (right).

We  see  that  in  the  first  70  iterations  the  algorithm 
thinks it has detected a tank (model no. 1 or 2) in the
lower  left  part  of  the  target.  At  iteration  80,  the  LL 
begins  to  express  its  “doubts”  about  the  validity  of  this 
erroneous  target  identification.  As  the  MRF  begins  to 
acquire  the  correct  target  (no.  0),  center  shifting  allows 
the  MRF  to  generate  configurations  which  increasingly 
resemble  the  correct  result.  This  was  possible  because 
simulated  annealing  allowed  the  process  to  escape  the 
local  minimum  in  which  it  was  initially  trapped. 

More details about our MRF approach to compact object
DRC, and many additional examples, can be found
in [5]. It should be pointed out that major occlusions
cannot be handled by this approach, because they introduce
major distortions in the polar representation,
which interfere with finding correct model matches. In
the following sections we will introduce a more flexible
approach.




3  The  Multipolar  Representation 

3.1  Introduction

In this section a novel multipolar shape representation
scheme, MPR, is presented. MPR generalizes the polar
representation (PR) used in Section 2, which consists of
a center $(x_0, y_0)$ and $n$ radii spaced at angular intervals
of $2\pi/n$, to a form which allows a shape to be represented
by many “centers” $(x_i, y_i)$, $i \in (1, 2, \ldots, N)$, each with $n_i$
radii, spaced over an angular sector at angular intervals
of $2\pi/n$, which define a polar representation of a segment
of the shape’s contour.

Because of its multiple centers, MPR is far less sensitive
to occlusion than PR: matches to unoccluded
contour segments can still be found. At the same time
MPR shares many of PR’s strengths. Its shape descriptions
are relatively concise and are invariant to scaling,
translation and rotation. Also, like PR, MPR is highly
compatible with MRF environments.

Several stages are involved in creating an MPR. Initially,
a PR of the shape is created. Next, contour curvature
extrema are detected. Due to the discrete nature
of the representation an approximating “sag function” is
used to estimate the curvature, and extrema of this function
are found (see Section 3.2). Segments of the contour
that contain extrema define candidate MPR centers. Internal
centers, i.e. centers which are inside the contour,
are defined for contour segments that contain curvature
maxima, and external centers, lying outside the contour,
for segments that contain minima (i.e., negative maxima).

To ensure a compact representation, a candidate MPR
undergoes a minimum description length (MDL) test,
and is accepted only if it is more compact than the original
PR. This process is then repeated for each of the
centers in the MPR until no further creation of new centers
is accepted. To avoid fragmentation, a lower bound
is imposed on the number of radii associated with each
center.

3.2  Contour Segmentation

A simple function which we call the “sag function” is
used as an estimate of the curvature of the contour.
The sag at $r_i$ is calculated using $r_i(x,y)$, $r_{i-1}(x,y)$ and
$r_{i+1}(x,y)$, the Cartesian coordinates of the endpoints of
radii $r_i$, $r_{i-1}$ and $r_{i+1}$, where $r_{i-1}$ and $r_{i+1}$ are the left
and right neighbor radii of $r_i$. (If $r_i$ is the first or last radius
of its contour segment, one of its neighbors is taken
to be the last or first radius of the adjacent segment.) Let
$d_2(r_i)$ be the length of the chord $r_{i-1}(x,y)\,r_{i+1}(x,y)$, and
let $d_1(r_i)$ be the length of the perpendicular from $r_i(x,y)$
to the chord (see Figure 3). $Sag(r_i)$ is then defined as

$$Sag(r_i) = \begin{cases} d_1 / d_2 & \text{if } |d_1 / d_2| > \epsilon \\ 0 & \text{otherwise} \end{cases} \qquad (10)$$

where $\epsilon$ is a small quantity used to eliminate spurious extrema
along straight contour segments. They can also be
eliminated by smoothing the contour before calculating
the sag.

Figure 3:  The sag function.

Although the sag function is not a true approximation
to curvature (being scale invariant), it does have
the following properties:

•  For a straight line, where the curvature is zero, $d_1$
is also zero, so the sag is zero.

•  At curvature extrema $d_1$ is maximized, while $d_2$ is
minimized; hence the sag function has maxima at
the same locations as the curvature, provided the
radii are spaced closely enough to capture the contour’s
behavior.
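
The two properties above can be checked with a small Python sketch of the sag computation. It assumes a signed perpendicular distance so that convex and concave points receive opposite signs; the sign convention (positive for convex points on a counterclockwise contour), the threshold test on the ratio, and the default `eps` are illustrative assumptions rather than details confirmed by the source.

```python
import math

def sag(p_prev, p, p_next, eps=1e-3):
    """Signed ratio of perpendicular offset d1 to chord length d2.

    Assumes the contour is traversed counterclockwise, so that convex
    points give a positive value and concave points a negative one.
    Ratios with magnitude at most eps are set to zero.
    """
    cx, cy = p_next[0] - p_prev[0], p_next[1] - p_prev[1]   # chord vector
    d2 = math.hypot(cx, cy)
    if d2 == 0.0:
        return 0.0
    px, py = p[0] - p_prev[0], p[1] - p_prev[1]
    d1 = (cy * px - cx * py) / d2       # signed distance from p to the chord
    s = d1 / d2
    return s if abs(s) > eps else 0.0
```

Scaling all three points by the same factor leaves the result unchanged, which is the scale invariance noted in the text.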

After the sag function has been calculated at each radius
$r_i$, its significant extrema are extracted using the
following heuristics:

•  A maximum/minimum location must have a positive/negative
sag value.

•  The  nearest  neighbors  of  a  maximum/minimum 
must  also  have  positive/negative  or  zero  sag  values. 

•  The  next  nearest  neighbors  must  have  sag  values 
that  have  at  most  a  given  difference  from  the  values 
at  the  nearest  neighbors. 

Once these extrema are identified the contour is partitioned
into segments. A partition point r* is chosen
between each pair of consecutive extrema as follows: If
the extrema are of opposite type, r* is chosen at the location
of the most significant sag zero-crossing. If they
are of the same type, r* is chosen at the location of the
most significant instance of sag of the opposite type.

For each contour segment (r*-extremum-r*), we estimate
the coordinates $(x_{cf}, y_{cf})$ of the center of a circular-arc
fit to the segment. Note that if the extremum is
positive (the segment is convex), the center will be internal
to the contour, while if it is negative (a concave
segment), the center will be external.

This method of choosing candidate centers has a number
of benefits. A representation based on a partition of
the object’s contour into segments, each containing a significant
curvature extremum, is quasi-invariant, i.e., it
does  not  change  greatly  under  modest  aspect  changes. 
Thus  the  MPR  does  not  vary  significantly  if  aspect 
changes  are  small.  Furthermore,  occlusion  nearly  always 
involves  a  situation  where  a  convex  part  is  missing;  thus 
a  representation  which  breaks  the  contour  into  convex 
and  concave  segments  provides  a  natural  framework  for 
dealing  with  occlusion. 

The contour segment is now represented in a polar
coordinate system centered at $(x_{cf}, y_{cf})$. The angular
support of the segment relative to this center is tested for
a systematic counterclockwise (for internal centers) or
clockwise (for external centers) progression. A portion of
the segment that violates this condition is merged with a
neighboring segment of opposite type. If such a neighbor
does not exist, a new candidate center of opposite type
(external or internal) is defined for that portion.

3.3  The  MDL  Criterion 

The number of bits required to encode a given realization
of an i.i.d. Gaussian random vector $(z_1, \ldots, z_m)$ is given
by

$$\frac{m}{2} \log(2\pi\sigma^2) + \frac{1}{2\sigma^2} \sum_{j=1}^{m} (z_j - \mu)^2 \qquad (11)$$

where $\mu$ is the mean of the vector, $\sigma^2$ is its variance, and
the units are bits.

Using  Equation  (11),  we  can  compute  the  number  of 
bits  needed  to  describe  a  candidate  center  as  follows: 

•  The coordinates of the center: 2 × 9 bits (assuming
a 512 × 512 image).

•  The directions of the left and right boundary radii:
2 × 5 bits (assuming 32 radii per center).

•  The mean radius length $\mu$: 6 bits (assuming that
the range of radius values is 64).

•  The number of bits needed to encode the variations
from $\mu$: $\frac{m}{2} \log(2\pi\sigma^2) + \frac{1}{2\sigma^2} \sum_{j=1}^{m} (l_j - \mu)^2$ (assuming
the center has $m$ radii of lengths $l_1, \ldots, l_m$ which
have i.i.d. Gaussian distributions around $\mu$).

Therefore approximately

$$b_i = 34 + \frac{n_i}{2} \log(2\pi\sigma^2) + \frac{1}{2\sigma^2} \sum_{j=1}^{n_i} (l_j - \mu_i)^2 \qquad (12)$$

bits are needed for a center that has $n_i$ radii with mean
length $\mu_i$. The variance $\sigma^2$ is assumed to be the same for
all centers; it needs to be determined from observation.

The MDL test used to determine whether the candidate
centers actually become new centers is then

$$\sum_{i=1}^{K} b_i < b_0 \qquad (13)$$

where $K$ is the number of candidate centers, $b_i$ are these
centers’ bit values, and $b_0$ is the bit value for the original
center.
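
The test can be sketched in Python as below. This is a hedged illustration: the fixed 34-bit header follows the itemized counts above (18 + 10 + 6), the logarithm is taken as the natural logarithm exactly as the formula is written, and the function names and the toy radius lists are hypothetical. In practice each candidate center's radii would be re-measured from that center.

```python
import math

def center_bits(lengths, sigma2, header_bits=34):
    """Description length of one center per Equation (12).

    lengths: radius lengths associated with the center.
    sigma2:  common variance, assumed known from observation.
    """
    m = len(lengths)
    mu = sum(lengths) / m
    residual = sum((l - mu) ** 2 for l in lengths)
    return (header_bits
            + 0.5 * m * math.log(2.0 * math.pi * sigma2)
            + residual / (2.0 * sigma2))

def accept_partition(candidate_lengths, original_lengths, sigma2):
    """MDL test of Equation (13): accept if the candidates are cheaper."""
    total = sum(center_bits(l, sigma2) for l in candidate_lengths)
    return total < center_bits(original_lengths, sigma2)
```

A contour whose radii cluster around two different lengths is cheaper to describe with two centers, whereas splitting an already uniform contour only adds header cost and is rejected.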

3.4  Examples 

In this section we present a few simple examples of
MPRs. In the figures, angular regions of support are
indicated by black lines that run from each contour segment’s
endpoints to the center associated with the segment.
All the examples are initialized by a PR. If the
MDL test is met the PR is partitioned (repeatedly, if
necessary) into an MPR. The total description length in
bits of the PR/MPR, as calculated for each center by
Equation (12), appears in the lower right-hand corner of
each image, rounded to the nearest integer.

Figure 4 shows a four-lobed synthetic object. Here,
although the PR adequately describes the contour, the
MPR has a shorter description length than the PR. It
also explicitly tells us that we have a four-lobed object
and provides descriptions of the lobes. Figures 5 and 6
show MPRs of two real objects: a tank taken from a
FLIR image and a leaf.

Figure 4:  MPR example: four-lobed object.

Figure 5:  MPR example: FLIR tank.

Figure 6:  MPR example: leaf.

These examples illustrate how the MPR representation
provides a partition of an object into simple parts.
The MDL test and the heuristic method of choosing significant
extrema prevent excessive fractionation of the
object. The result is an efficient and plausible representation.
Numerous additional examples, and further
details about MPR construction, can be found in [3].

4  Compound  Object  DRC 

4.1  Introduction 

The PR used for DRC in Section 2 changes significantly
if a significant part of the object is missing, e.g. occluded,
because the radii of the partial object are all measured
from a different centroid. As a result the classification
process will be ineffective even if the delineation is performed
correctly. This problem applies to all global
shape recognition techniques, i.e. techniques in which
recognition is based on properties of the entire shape.
Because of this, when dealing with potentially occluded
objects, shape recognition techniques which examine segments
of the contour are usually employed.

In this section we describe an MRF-based DRC system
which makes use of the MPR described in Section 3. The
MRF structure consists of a set of radial MRFs, defined
over angular sectors emanating from a set of centers;
the configurations of these MRFs define polar representations
of segments of the object’s contour. This MPR-MRF
structure can handle compound objects, parts of
which may be occluded.

4.2  The MPR-MRF

The compound DRC process is initialized, as in Section 2,
by choosing a single center $(x_0, y_0)$ inside the object
and constructing a 1DCMRF. The initial stage of the
algorithm compares the configurations of this MRF with
PR models in the database to see if a good match can be
found. If not, the algorithm proceeds to the next level of
MPR partitioning and attempts to find matches between
the configurations of the sector MRFs and parts of MPR
models, i.e. polar representations of contour segments of
database objects. At each sector MRF for which a match
is found (as defined by a convergence criterion), the optimization
process is terminated and the sector’s center
is not considered for further MPR partitioning.

Formally, the MPR-MRF $\Xi$ consists of a set of sectors
$\xi \in \Xi$ emanating from centers $(x_\xi, y_\xi)$. Each sector $\xi$
consists of radial sites $r_i^\xi \in R^\xi$. As in Section 2.1, each
radial site is a discrete random variable with values in a
bounded range. The radii are spaced at regular angular
intervals of $2\pi/n$, and the neighbors of radius $i$ are $i - 1$
and $i + 1$. Note that these sector MRFs are no longer
cyclic.

To  allow  the  sectors  to  align  themselves  relative  to 
contour  segments  and  provide  initialization-independent 
polar  representations  of  the  segments,  the  MPR-MRF 
allows  the  centers  to  shift,  as  in  Section  2.1.  Ideally,  the 
angular  supports  of  the  centers  should  just  “cover”  the 
entire contour. However, when a center shifts, its angular
support changes, possibly resulting in incomplete or
overlapping coverage of the contour. The center shifting
algorithm tests for this and initializes additional radial
sites, or deletes redundant sites, as needed. If this
results in a center having its number of radial sites reduced
below the fragmentation threshold, that center is
eliminated  and  its  sites  are  assigned  to  nearby  centers. 

4.3  The Energy Function

The MPR-MRF is responsible for the representation portion
of the DRC task, while the delineation and classification
portions are the responsibility of the EF. As in
Section 2.2, the EF is defined symbolically as

$$E(\omega) = \sum_{\xi \in \Xi} \left[ W_1^\xi \times E_{LL}^\xi(\omega) + W_2^\xi \times E_{HL}^\xi(\omega) \right] \qquad (14)$$

where $\xi$ is the index representing the MPR-MRF sectors;
$E_{LL}^\xi$, $E_{HL}^\xi$ are the LL and HL components for
each sector; and $W_1^\xi$, $W_2^\xi$ are weights associated with
each sector.

4.3.1  The  Low  Level 

The  LL  components  of  the  EF  enable  the  system  to 
perform  object  delineation.  Their  definitions  are  similar 
to  those  in  Equations  (4-6)  of  Section  2.2.1. 

4.3.2  The  High  Level 

The role of the HL is to periodically classify the contour
segments found by the LL delineator. Each sector
$\xi$ finds the best match between its current MRF configuration
and a compatible (see below) contour segment
model (i.e., part of an MPR model) in the database.
The HL periodically resets the weight $W_2^\xi$ to reflect the
quality of this best match. Sectors $\xi$ for which the best
match is very good have high $W_2^\xi$ values, so that the HL
has high weight in the optimization process, while centers
with only poor matches have low $W_2^\xi$ values, so that
the LL dominates.

The set of compatible contour segment models for sector
$\xi$ is defined by

$$models_\xi = \{ \forall k : (|v_\xi - v_k| < thr_v) \wedge (type_k = type_\xi) \}, \qquad (15)$$

where $k$ is a segment model; type is internal or external; $v$
is a unit vector in the direction of the middle radius of the
sector (or segment model); and $thr_v$ is a threshold which
depends on how much orientational freedom is expected
in the objects to be recognized.

The match between $\xi$ and model $k$ ($k \in models_\xi$) is
defined as


^  1-J- 


if $m_{\xi,k} > thr_m$;  (16)


0  otherwise, 

where $n_\xi$ is the number of radii in $\xi$, $\bar{\omega}^\xi$ is their average
length, and $thr_m$ is a threshold below which the match
is considered inconclusive. The $\omega_i^\xi$ are angularly aligned
with their counterparts in model $k$ by making the middle radius
of $k$ coincide with the middle radius of $\xi$ or with one of
its neighbor radii. Radii of $k$ that fall outside sector $\xi$
are ignored, while radii of $\xi$ that fall outside $k$'s sector
are regarded as having corresponding radii of $k$ that have
zero length.

Let model $k_\xi$ give rise to the maximum (nonzero) value
of $m_{\xi,k}$. The updated HL component is then defined by


=  min 


k- j 
SIZE 


(17) 


where SIZE is a threshold, and the $i$'s in the numerator
are offset if necessary to allow for the angular alignment.
The weight assigned to this component is

$$W_2^\xi = a_{HL} \max_k (m_{\xi,k}) \qquad (18)$$


4.3.3  Object  Classification 

Let $\Xi'$ be the set of sectors $\xi$ for which best matching
models $k_\xi$ have been found, and let $n' = \sum_{\xi \in \Xi'} n_\xi$ be the
total angular size of these sectors. We assign the weight
$W_\xi = n_\xi / n'$ to each of these sectors, and zero weights
to the remaining sectors. This weighting scheme gives
higher weights to classified sectors that have larger angular
supports, since these sectors contain more contour
information.

Figure 7:  The original objects.


The evidence for a database object is then defined as

$$\sum_{\xi \in \Xi'} W_\xi \times \max_l \{ \tilde{m}_{k_\xi, l} \}, \qquad (19)$$

where the max is taken over all the contour segment
models $l$ that are parts of MPR models for obj in the
database, and $\tilde{m}$ is a match measure between pairs of
models, defined as in Equation (17). Evidently, if we are
actually dealing with a database object, the evidence for
that object should be very high, and it should be relatively
high even if the object is partly occluded, as long
as significant segments of its contour are not occluded.
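
The weighting scheme can be illustrated with a short Python sketch. It assumes that the evidence is the weighted sum, over classified sectors, of each sector's best match to a segment of the object in question, which is one reading of Equation (19); the dictionary-based interface and the function names are hypothetical.

```python
def sector_weights(sizes):
    """Weight each classified sector by its share of the total angular size.

    sizes: dict mapping a sector id to its number of radii.
    """
    total = sum(sizes.values())
    return {sector: n / total for sector, n in sizes.items()}

def evidence(sizes, best_match):
    """Weighted sum of per-sector best match values for one database object.

    best_match: dict mapping a sector id to the best match value (0..1)
    between that sector's model and any segment of the object.
    """
    weights = sector_weights(sizes)
    return sum(weights[sector] * best_match[sector] for sector in sizes)
```

With two classified sectors of 12 and 4 radii and match values 1.0 and 0.5, the evidence is 0.75 × 1.0 + 0.25 × 0.5 = 0.875, so the larger sector dominates the decision.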

4.4  The  Compound  Object  DRC  Algorithm 

The MPR-MRF DRC algorithm consists of three processes:
a fast process, which performs simulated annealing
optimization on all MPR-MRF sites, optimizing the
EF defined by Equation (14), and two slow processes:
one which detects matches, updates the HL weights, and
updates center locations, and one which performs contour
partitioning and new center creation.

The  algorithm  is  initialized  by  establishing  a  single¬ 
center  MPR-MRF  R°  =  w”  which  is  equivalent  to  a 
IDCMRF.  This  IDCMRF  configuration  is  used  to  de¬ 
termine  the  initial  Gibbs  density.  At  this  early  stage 
the  HL  component  performs  matching  only  between  the 
MRF  configuration  and  the  PR  models  in  the  database. 

In general, let the set of MPR-MRF centers be $\Xi$. For
all $\xi \in \Xi$, the fast process iteratively visits each radial
site $r_j$ of $\xi$ and attempts to modify its configuration.
This is done by generating a candidate configuration using
the Gibbs sampler, evaluating its energy using Equation (14),
and deciding whether to accept the transition.

At  every  10th  iteration  of  the  fast  process,  the  first 
slow  process  updates  each  center’s  location  to  the  cen¬ 
troid  of  its  current  sector  MRF  configuration,  finds  the 
best matches for these configurations, updates the sectors’
W2 values, and recalculates the centers’ Gibbs samplers.
Sector MRFs whose match values meet a convergence
criterion are declared “converged” and the fast
process  is  no  longer  applied  to  them.  The  evidence  for 
each  database  object  is  checked  at  the  end  of  this  stage, 
using  Equation  (19),  and  compared  to  a  threshold  to 
determine  if  an  object  match  has  been  found. 
Figure 8: MPR-MRF database.


The  second  slow  process  partitions  the  contour  seg¬ 
ments  of  the  non-converged  sector  MRFs  and  creates 
new  centers,  if  the  MDL  criterion  is  satisfied;  this  is 
done  every  100th  iteration.  In  our  experiments,  we  lim¬ 
ited  the  total  number  of  iterations  to  290,  i.e.  the  second 
slow  process  was  applied  only  twice. 
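
The interleaving of the fast and slow processes can be sketched as a simple schedule. This is only an illustration of the timing described above: it assumes iterations are counted from 1, it ignores early termination when all sector MRFs converge, and the names `run_schedule`, `slow_match`, and `slow_partition` are hypothetical.

```python
def run_schedule(max_iterations=290, match_period=10, partition_period=100):
    """Count how often each process would run under the stated periods."""
    counts = {"fast": 0, "slow_match": 0, "slow_partition": 0}
    for iteration in range(1, max_iterations + 1):
        # Fast process: one simulated-annealing sweep over all sites.
        counts["fast"] += 1
        # First slow process: matching, weight and center updates.
        if iteration % match_period == 0:
            counts["slow_match"] += 1
        # Second slow process: contour partitioning and new centers.
        if iteration % partition_period == 0:
            counts["slow_partition"] += 1
    return counts


counts = run_schedule()
```

With the limit of 290 iterations, the second slow process fires at iterations 100 and 200 only, which agrees with the statement that it was applied twice.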

4.5  The  Model  Database 

Our model database was constructed from a set of six
“cutout” objects on a uniform background which were
sampled and digitized. The use of these artificial objects
was necessary because it provided a domain in which
distortion and occlusion could be controlled. The original
objects are shown in Figure 7. They are (from left
to right, top to bottom): tank1, tank2, truck1, truck2,
apc1, and apc2.

An MPR-MRF delineation algorithm similar to the
recognition algorithm was used to create PR and MPR
representations of the objects starting from various initial
center locations. This sometimes gave rise to more
than one MPR for a given object, because the sparse set
of radii sometimes failed to detect small features of the
contour, resulting in variations in the contour segments.
The database consisted of the PR and MPR representations
of all the objects. Figure 8 shows all the MPRs in
the database. These MPRs all correspond to plausible
descriptions of the objects. Due to the low resolution
(30-50 pixels across an object), the possible number of
variations is small.

Figure 9: Distorted versions of the original objects.

4.6  Experimental  Results 

The  compound  DRC  algorithm  was  tested  on  a  set  of 
real  FLIR  images  of  non-occluded  targets  resembling  the 
database  objects,  as  well  as  on  images  of  both  occluded 
and  non-occluded  distorted  versions  of  the  objects  (ex¬ 
amples  of  the  distorted  versions  are  shown  in  Figure  9). 
As  an  example,  Table  1  and  Figure  10  show  results  for 
a  distorted  version  of  apc2.  The  evidence  column  in¬ 
dicates  the  total  evidence  for  each  object.  The  num¬ 
bered  columns  list  the  evidence  contributions  on  a  sec¬ 
tor  by  sector  basis.  Figure  10  shows  the  optimization 
process  for  this  example  from  initialization  to  termina¬ 
tion  (which  occurred  at  150  iterations  because  all  sec¬ 
tor  MRFs  converged).  Iterations  00  through  90  show 
the  development  of  the  PR.  At  iteration  100,  the  black 
dots  show  the  radius  endpoints,  and  the  white  dots  show 
the  locations  of  significant  sag  extrema.  The  MPR  con¬ 
structed  using  these  extrema  reduced  the  description 
length  from  1016  to  283  bits  (see  iteration  110)  and  was 
accepted.  It  has  three  internal  centers  (two  of  which  co¬ 
incide)  and  one  external  center.  During  the  succeeding 
40  iterations,  the  sector  MRFs  all  converged  to  config¬ 
urations  that  were  good  matches  to  database  models. 
When  a  sector  MRF  has  converged  to  match  part  of  an 
MPR  model  for  a  database  object,  the  contour  segment 
represented  by  this  model  is  displayed  in  white,  together 
with  the  number  of  the  object  (see  iterations  140  and 
150). 

Table 1: MPR-MRF evidence for a distorted “apc2”.

Index  Target    Evidence  1     2     3     4     5
12     apc2.1    .834      .445  .045  .000  .120  .224
9      truck2.3  .743      .382  .000  .206  .154  -
0      tank1.1   .082      .000  .082  .000  .000  .000

Figure 10: MPR-MRF results for a distorted “apc2”.

The next group of figures and tables shows results obtained
for three non-occluded targets in real FLIR images,
using the same model database. Each figure shows
only iterations 00, 90 (with the PR configuration and
best matching PR model), and 290 (with the MPR configuration
and best matching parts of MPR models).
Figure 11 and Table 2 give results for a FLIR truck.
Although the database does not include this truck, the
process finds very high evidence for truck1, and little
evidence for any other database object. The PR classification
was apc2; however, once the MPR-MRF was
initialized, this choice was discarded in favor of the more
correct solution, truck1. The results are similar, though
less clearcut, for a second truck, as shown in Figure 12
and Table 3. In Figure 13 and Table 4, results for a
FLIR tank image are presented. Tank1.1 and tank2.2
are the contenders; the actual tank has a turret that is
more similar to tank2’s, and a body that is slightly more
similar to tank1’s.

Not  surprisingly,  the  results  for  occluded  objects  tend 
to  be  more  ambiguous,  because  the  visible  parts  of  such 
an object often resemble more than one database model.


Table 2: MPR-MRF evidence for the first truck in a FLIR image.

Figure 11: MPR-MRF results for the first truck in a FLIR image.

Figure 12: MPR-MRF results for the second truck in a FLIR image.

Figure 13: MPR-MRF results for a tank in a FLIR image.

Figure 14: MPR-MRF results for an occluded “truck2”.


Figure  14  shows  an  example  in  which  there  is  relatively 
little  ambiguity  (see  Table  5),  in  spite  of  the  fact  that 
over  half  of  the  object  is  occluded.  Many  other  exam¬ 
ples,  as  well  as  further  details  about  the  MPR-MRF  al¬ 
gorithm,  can  be  found  in  [3]. 

5  Concluding  remarks 

The  MPR-MRF  system  described  here  provides  a  pow¬ 
erful  framework  for  integrated  DRC.  The  EF  environ¬ 
ment  allows  versatility  in  integrating  the  DRC  modules. 
Delineation  and  classification  are  handled  by  appropri¬ 
ate  components  of  the  EF,  while  the  MPR  provides  the 
representation.  The  system  poses  the  DRC  task  as  an 
optimization  problem  and  achieves  a  near-global  opti¬ 
mum  using  simulated  annealing.  The  experimental  re¬ 
sults  demonstrate  the  ability  of  the  integrated  approach 
to identify a variety of objects under conditions of occlusion
and distortion. Due to the low resolution, the
number of MPR representations is limited, resulting in
modest-sized databases and correspondingly small search
costs. The approach is also efficient because of the small
sizes of the sector MRFs and their configuration spaces.
The average processing time for our implementation of
the system on a SPARCstation IPX was on the order of
20 seconds.

Table 3: MPR-MRF evidence for the second truck in a FLIR image.

Table 4: MPR-MRF evidence for a tank in a FLIR image.

Table 5: MPR-MRF evidence for an occluded “truck2”.

Though  the  results  obtained  so  far  are  quite  encour¬ 
aging,  the  system  is  still  a  prototype  and  many  improve¬ 
ments  could  be  made  in  it,  as  regards  both  further  DRC 
integration  and  more  efficient  implementation.  The  abil¬ 
ity  to  obtain  real  FLIR  data  representing  various  con¬ 
trolled  situations  would  greatly  enhance  developmental 
capabilities. 

The EF could easily be modified to incorporate physical
information about the expected objects and background,
e.g. textural information about the background,
or  the  expected  contrast  between  the  objects  and  the 
background  (for  FLIR,  this  depends  on  the  time  of  day). 
The  latter  modification  would  provide  a  further  basis  for 
differentiating  between  real  targets  and  decoys. 

The  system  could  be  extended  to  handle  compound 
objects  in  which  the  parts  have  different  gray  level 
ranges;  for  example,  an  MPR-MRF  could  first  delineate 
the  brightest  parts,  and  HL  information  could  then  di¬ 
rect  the  establishment  of  additional  centers  to  delineate 
other  parts.  Still  another  extension  would  involve  mod¬ 
eling  (simple  or  compound)  ribbon-like  objects,  using 
pieces  of  medial  axis  instead  of  centers,  and  “radii”  per¬ 
pendicular  to  these  axes. 
The system could be generalised to 2.5D for range data
and  to  3D  for  CT  images.  Another  possible  extension 
involves  using  time-varying  models  to  perform  DRC  on 
sequences  of  images. 

Abstract

Geometric  invariants  are  shape  descriptors  that 
remain unchanged under geometric transformations
such as projection, or change of viewpoint.
A new method of obtaining local projective
and affine invariants is developed and
implemented for real images. Being local, these
invariants are much less sensitive to occlusion
than global invariants. The computation of invariants
is based on a canonical method. This
consists  of  defining  a  canonical  coordinate  sys¬ 
tem  using  intrinsic  properties  of  the  shape,  in¬ 
dependently  of  the  given  coordinate  system. 
Since  this  canonical  system  is  independent  of 
the  original  one,  it  is  invariant  and  all  quanti¬ 
ties  defined  in  it  are  invariant.  The  method  is 
applied  without  the  use  of  a  curve  parameter 
by  fitting  an  implicit  polynomial  to  a  general 
curve  in  a  neighborhood  of  each  curve  point. 
Several  configurations  are  treated;  a  general 
curve  without  any  correspondence,  and  curves 
with  known  correspondences  of  one  or  two  fea¬ 
ture  points  or  lines.  Experimental  results  for 
real  2D  objects  in  3D  space  are  presented. 

1  Introduction 

Geometric  invariants  are  shape  descriptors  which  remain 
invariant  under  geometrical  transformations  such  as  pro¬ 
jection  or  viewpoint  change.  They  are  important  in  ob¬ 
ject  recognition  because  they  enable  us  to  obtain  a  signa¬ 
ture  of  an  object  which  is  independent  of  external  factors 
such  as  the  viewpoint.  In  this  paper  we  treat  projective 
(viewpoint)  and  affine  invariants  in  various  geometrical 
configurations. 

The  subject  of  invariants  has  recently  increased  in  im¬ 
portance  and  recognition  in  the  vision  community.  Pro¬ 
jective  invariants  were  a  very  active  mathematical  sub¬ 
ject  in  the  latter  half  of  the  19th  century.  However,  in 
vision only one projective invariant, the cross ratio [Duda
and  Hart  1973],  was  used  until  recently. 

Projective  invariants  of  curves  and  surfaces  were  first 
introduced  in  vision  by  the  second  author  [Weiss  1988]. 
In  that  paper  we  reviewed  both  algebraic  and  differential 
methods  for  obtaining  invariants  and  pointed  out  their 
usefulness  for  object  recognition. 

One  can  distinguish  between  two  kinds  of  invariants: 
global  and  local.  Global  invariants  describe  a  shape  as  a 
whole  so  they  require  knowledge  about  the  whole  shape. 
Examples  are  the  moment  invariants  often  used  in  the 
Euclidean  case.  Global  (algebraic)  projective  invariants 
were  described  in  [Weiss  1988].  They  have  been  applied 
successfully  by  Forsyth  et  al.  [1990,  1991]  to  industrial 
objects.  Like  any  global  descriptors,  these  quantities  are 
quite  sensitive  to  occlusion.  Local  (differential)  invari¬ 
ants  are  more  immune  to  this  problem.  They  have  been 
treated in [Weiss 1988, 1991]. So-called “mixed” invariants
were developed by Van Gool et al. [1990], Barrett
et al. [1990] and Bruckstein et al. [1991]. In this paper
we  develop  both  local  and  “mixed”  invariants  using  a 
new  approach  that  is  simpler  and  more  robust  to  noise 
than  previous  methods. 

Local  invariants  can  be  defined  at  each  point  of  a 
shape,  and  can  be  used  to  obtain  a  “signature”  of  that 
shape. In the Euclidean case, for instance, it is common
to plot the curvature against the arclength, both
of  which  are  local  Euclidean  invariants.  Such  plots  or 
“signatures”  of  curves  can  then  be  matched  even  if  part 
of  a  curve  is  missing  due  to  occlusion.  We  obtain  such 
signatures  in  the  projective  and  affine  cases. 
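
To make the Euclidean signature idea concrete, here is a minimal Python sketch that samples a curve, estimates arclength and curvature at each interior sample, and can be used to check that the resulting (arclength, curvature) pairs do not change under rotation and translation. It is an illustration only: the unsigned three-point (Menger) curvature is a stand-in chosen for simplicity, not the estimator used in this paper, and all function names are hypothetical.

```python
import math


def menger_curvature(p, q, r):
    """Unsigned curvature of the circle through three points (0 if collinear)."""
    a, b, c = math.dist(p, q), math.dist(q, r), math.dist(p, r)
    if a * b * c == 0.0:
        return 0.0
    cross = (q[0] - p[0]) * (r[1] - p[1]) - (q[1] - p[1]) * (r[0] - p[0])
    return 2.0 * abs(cross) / (a * b * c)


def signature(points):
    """List of (arclength, curvature) pairs at the interior samples."""
    result = []
    arclength = 0.0
    for i in range(1, len(points) - 1):
        arclength += math.dist(points[i - 1], points[i])
        kappa = menger_curvature(points[i - 1], points[i], points[i + 1])
        result.append((arclength, kappa))
    return result


def rigid_motion(points, angle, tx, ty):
    """Rotate by `angle` (radians) and translate by (tx, ty)."""
    c, s = math.cos(angle), math.sin(angle)
    return [(c * x - s * y + tx, s * x + c * y + ty) for x, y in points]


# Part of an ellipse, and the same arc seen after a Euclidean motion.
arc = [(3.0 * math.cos(0.1 * k), math.sin(0.1 * k)) for k in range(40)]
sig_original = signature(arc)
sig_moved = signature(rigid_motion(arc, 0.7, 5.0, -2.0))
```

The two signatures agree up to floating-point error, and a signature computed from only part of the arc is a sub-segment of the full one (up to an offset in arclength), which is what makes matching under occlusion possible.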

One  can  build  an  object  recognition  system  that  uses 
invariant  “signatures”  of  curves,  rather  than  the  curves 
themselves,  for  matching.  Therefore  the  matching  does 
not  require  a  search  for  the  correct  point  of  view.  This 
is possible because of a general completeness property.

The  completeness  property  of  differential  invariants 
can  be  described  as  follows.  Given  a  plane  curve  and  a 
transformation  group,  there  are  two  independent  invari¬ 
ants  of  the  transformations  at  each  point  of  the  curve. 
These  invariant  functions  contain  all  the  information 
about  the  curve,  except  for  the  transformation  to  which 
they  are  invariant.  Accordingly,  given  two  invariants  for 
each  curve  point,  we  can  reconstruct  the  original  curve 
up to a transformation belonging to the group.

More accurately, the following theorem holds [Guggenheimer
1963, p. 144]: All differential invariants of a
(transitive)  transformation  in  the  plane  are  functions  of 
two  invariants  of  the  lowest  order  and  their  derivatives. 
Thus,  given  a  curve,  one  can  find  a  corresponding  invari¬ 
ant  curve,  which  we  call  its  signature,  that  describes  it 
uniquely,  except  for  the  relevant  transformation. 

This  method  applies  to  all  kinds  of  local  invari¬ 
ants.  Projective  and  affine  invariant  signatures  are  used 
in  [Weiss  1992]  (with  an  explicit  method)  and  [Weiss 
1992b].  At  each  point  of  the  given  curve  we  calculate  two 
invariants, $I_1, I_2$. We plot these numbers as a point in
an  “invariant  plane”  whose  coordinates  represent  invari¬ 
ants.  In  effect,  we  plot  one  invariant  against  the  other. 
In  this  way  the  given  curve  maps  into  an  invariant  signa¬ 
ture  curve  in  the  invariant  plane.  The  signature  uniquely 
identifies  the  curve  regardless  of  the  point  of  view. 

Global  invariants  are  often  associated  with  algebraic 
methods  and  require  no  differentiation  (although  inte¬ 
gration  may  be  used  for  finding  momenta).  Local  invari¬ 
ants  involve  some  form  of  differentiation.  Larger  trans¬ 
formation  groups  need  higher  orders  of  differentiation; 
projective  invariants  need  a  higher  order  of  differentia¬ 
tion  than  affine,  which  in  turn  need  a  higher  order  than 
Euclidean. 

We  deal  here  mainly  with  curves.  General  curves  can 
be  treated  in  several  ways.  Two  main  camps  exist  among 
geometers:  those  who  favor  an  explicit  representation 
and  those  who  prefer  an  implicit  one.  In  the  explicit 
method  a  curve  is  represented  as  functions  of  some  pa¬ 
rameter  along  the  curve,  e.g.  x(t),  y(t).  In  the  implicit 
approach a curve is represented by a relation f(x, y) = 0,
without  a  parameter.  The  advantage  of  the  implicit  ap¬ 
proach  is  that  it  does  not  require  introduction  of  a  pa¬ 
rameter,  which  is  not  in  fact  part  of  the  geometry  of  the 
curve  itself.  The  relation  between  x  and  y  is  sufficient  to 
completely  characterize  the  curve.  The  explicit  method 
makes  it  easier  to  obtain  closed  form  formulas  for  general 
curves. 
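
A small example may make the distinction concrete. The sketch below describes the unit circle explicitly with two different, arbitrary parameters and implicitly with a single relation; the circle and the function names are chosen purely for illustration and do not come from the paper.

```python
import math


def explicit_circle(t):
    """Explicit form: the point (x(t), y(t)) for parameter value t."""
    return (math.cos(t), math.sin(t))


def reparametrized_circle(u):
    """The same curve traced with a different, arbitrary parameter (t = u**3)."""
    return explicit_circle(u ** 3)


def implicit_circle(x, y):
    """Implicit form: f(x, y) = 0 on the curve, with no parameter involved."""
    return x * x + y * y - 1.0


def max_implicit_residual(curve, params):
    """Largest |f(x, y)| over points generated by an explicit description."""
    return max(abs(implicit_circle(*curve(p))) for p in params)


params = [0.1 * k for k in range(20)]
residual_t = max_implicit_residual(explicit_circle, params)
residual_u = max_implicit_residual(reparametrized_circle, params)
```

Both explicit descriptions generate points that satisfy the same implicit relation, even though the same parameter value lands on different points of the circle in the two cases. This is the sense in which the parameter is not part of the geometry of the curve.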

In  finding  invariants,  the  parameter  is  undesirable  for 
the  following  reasons.  The  essence  of  finding  invariants 
is  the  elimination  of  unknowns  from  the  system,  such 
as  the  unknown  quantities  describing  the  point  of  view. 
The  parameter  is  also  in  general  unknown  since  it  can 
be  chosen  in  an  arbitrary  way.  It  has  to  be  eliminated 
so  that  the  invariants  will  not  depend  on  it.  The  more 
unknowns  we  have  to  eliminate,  the  more  information  we 
have  to  extract  from  the  image,  which  translates  in  the 
explicit  method  to  higher,  and  less  reliable,  derivatives. 

Another  reason  to  avoid  the  parameter  is  the  qual¬ 
ity  of  the  fitting.  In  fitting  a  curve  to  data  points,  we 
make  the  assumption  that  the  average  squared  distance 
is  minimal,  and  the  problem  arises  of  how  to  estimate 
the  distance  of  a  point  from  a  general  curve.  In  the 
explicit method, the minimized functions are x(t), y(t),
measuring  distances  parallel  to  the  x,y  axes.  These  dis¬ 
tances  are  very  unstable  when  the  curves  themselves  are 
almost  parallel  to  the  axes,  and  can  introduce  substan¬ 
tial  errors.  We  also  have  to  obtain  two  fitted  functions 
x(t), y(t) rather than one. An implicit fit minimizes the
distance  (roughly)  perpendicular  to  the  curves,  and  it 
thus  seems  more  natural.  It  eliminates  the  parameter 


before  it  enters  the  invariant  expressions  and  adds  to  an 
accumulation  of  errors.  In  addition,  the  explicit  method 
assumes  the  existence  of  some  ordering  among  the  data 
points  so  that  a  parameter  can  be  assigned  to  them, 
which  is  not  always  possible. 
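
As an illustration of fitting without a parameter or point ordering, the sketch below fits an implicit conic to unordered points by linear least squares. Several simplifications are assumptions of this sketch and not the paper's procedure: a conic is used instead of a cubic or quartic, the constant term is normalized to -1 (so conics through the origin cannot be represented), and the residual minimized is the algebraic one, which only approximates the perpendicular distance. The function names are hypothetical.

```python
import math


def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for k in range(col, n + 1):
                m[r][k] -= factor * m[col][k]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][k] * x[k] for k in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_conic(points):
    """Least-squares fit of a x^2 + b x y + c y^2 + d x + e y = 1.

    Returns [a, b, c, d, e]; the order of the points does not matter.
    """
    rows = [[x * x, x * y, y * y, x, y] for x, y in points]
    # Normal equations (A^T A) coeffs = A^T 1.
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(5)] for i in range(5)]
    atb = [sum(r[i] for r in rows) for i in range(5)]
    return solve_linear(ata, atb)


# Samples of the ellipse x^2/4 + y^2 = 1, deliberately listed out of order.
angles = [2.0 * math.pi * k / 12 for k in range(12)]
samples = [(2.0 * math.cos(t), math.sin(t)) for t in angles]
samples.reverse()
coefficients = fit_conic(samples)
```

For these samples the fitted coefficients are close to a = 0.25, c = 1 and b = d = e = 0, and the same result is obtained for any ordering of the points.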

These  considerations  are  especially  important  in  the 
projective case in which there is no natural parameter
such as arclength. But even in the Euclidean and affine
cases, which do admit a natural arclength, it needs to
be obtained from the image and the same robustness
considerations  apply.  The  implicit  method  avoids  the 
parameter  altogether  and  thus  increases  robustness. 

Most  previous  work  on  local  invariants  [Wilczynski 
1906]  was  done  using  the  explicit  approach.  An  im¬ 
plicit  approach  was  used  by  Halphen  [1880]  but  it  did 
not  provide  all  the  invariants  and  was  cumbersome  to 
implement.  We  present  here  a  simple  way  of  deriving  lo¬ 
cal  invariants  in  the  implicit  approach,  without  a  curve 
parameter. The approach is based on transforming the
shape  to  a  canonical  (intrinsic)  system  of  coordinates, 
rather  than  obtaining  closed  form  formulas  for  the  in¬ 
variants. 

The  canonical  approach  has  another  advantage  for  our 
purposes.  A  problem  that  arises  in  finding  invariants  is 
fitting  a  curve  to  the  data  points  in  an  invariant  way,  i.e. 
the  fitting  method  has  to  be  invariant  before  the  curve 
invariants  can  make  sense.  This  is  particularly  true  if 
the  fitting  error  arises  mainly  not  from  random  noise 
but  from  the  shape  itself,  for  instance  when  trying  to 
fit  a  conic  to  a  polygon.  In  previous  methods  [Forsyth 
1991]  invariant  fitting  could  be  done  only  in  the  affine 
case.  The  canonical  method  presented  here  is  capable  of 
obtaining  an  invariant  fit  in  the  general  case. 

Several  kinds  of  situations  will  be  treated  here.  The 
first  involves  general  plane  curves  without  any  correspon¬ 
dence  information.  These  require  the  highest  number  of 
derivatives  so  their  signatures  are  the  hardest  to  obtain. 
Next,  shapes  consisting  of  a  curve  and  one  known  fea¬ 
ture  point  will  be  treated.  For  the  feature  (or  reference) 
point  it  is  assumed  that  a  correspondence  can  be  estab¬ 
lished  between  two  images.  This  enables  us  to  eliminate 
some  of  the  transformation  parameters  and  reduce  the 
amount  of  information  needed  from  the  curve  itself,  i.e. 
the  orders  of  the  derivatives.  Using  a  curve  and  two 
reference  points  reduces  this  amount  even  further.  The 
authors  mentioned  earlier  treated  these  situations  with 
the  explicit  approach,  using  derivatives  with  respect  to 
a  curve  parameter.  We  will  treat  them  here  without  a 
parameter.  We  will  also  treat  curves  with  reference  lines, 
which  have  not  been  previously  treated,  to  our  knowl¬ 
edge. 

2  Finding  Local  Invariants — A 
Canonical  Approach 

In principle, one can find invariants of a curve f by one of
the  known  methods.  However,  these  invariants  will  not 
be  local;  they  will  depend  on  the  window  size  and  the 
curve  order.  In  addition,  the  common  methods  are  very 
cumbersome  for  high  order  curves  and  require  the  use  of 
a  symbolic  manipulation  program.  Their  robustness  is 
questionable.

Here  we  obtain  these  local  invariants  in  a  quite  sim¬ 
ple  and  intuitive  way.  The  basic  idea  is  to  transform 
our  coordinate  system  to  a  canonical  one,  i.e.  a  standard 
system  which  is  defined  by  the  intrinsic  characteristics 
of  the  shape  itself.  Since  this  system  is  intrinsic,  all 
quantities  measured  in  it  are  independent  of  the  initial 
system and are therefore invariants. One can give a simple
example as follows: Given an image of a rod, we can
calculate  its  length,  which  is  a  Euclidean  invariant,  by 
applying  the  formula  for  Euclidean  distance.  An  alter¬ 
native  approach  is  to  transform  the  coordinate  system 
into  a  canonical  one,  in  which  the  rod  lies  along  the  x 
axis  and  the  origin  is  at  one  end  of  the  rod.  Then  the 
x coordinate of the other end is the rod’s length. We
see  that  by  moving  to  a  canonical  system  we  have  ob¬ 
tained  the  invariant  length  without  an  explicit  formula. 
The  canonical  system  was  determined  by  the  properties 
of  the  shape  rather  than  by  some  external  factors. 
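
The rod example can be written out in a few lines of Python. The sketch assumes the two endpoints of the rod are given in an arbitrary image coordinate system; the function name and the sample coordinates are hypothetical.

```python
import math


def to_canonical(p0, p1, point):
    """Express `point` in the frame with origin p0 and x axis along p0 -> p1."""
    angle = math.atan2(p1[1] - p0[1], p1[0] - p0[0])
    dx, dy = point[0] - p0[0], point[1] - p0[1]
    c, s = math.cos(angle), math.sin(angle)
    # Translate to p0, then rotate by -angle.
    return (c * dx + s * dy, -s * dx + c * dy)


rod_start, rod_end = (1.0, 2.0), (4.0, 6.0)
far_end = to_canonical(rod_start, rod_end, rod_end)
```

In the canonical frame the far end of the rod lies on the x axis, and its x coordinate (here 5, up to rounding) is the Euclidean length, obtained without writing the distance formula explicitly.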

An  important  differential  example  is  finding  Euclidean 
invariants  of  curves.  We  can  move  the  coordinate  system 
so  that  the  x  axis  is  tangent  to  the  curve  at  some  point 
that  we  choose  on  it,  i.e.  y'  =  0  there.  The  second  deriva¬ 
tive  y"  at  this  point  is  now  equal  to  the  curvature  and 
is  invariant  since  we  obtain  the  same  canonical  system 
regardless  of  which  system  we  started  with.  We  see  that 
by  determining  some  of  the  properties  of  the  system,  the 
others  are  also  determined  and  become  invariant. 
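
The statement that y" equals the curvature in the canonical system follows from the standard curvature formula for a curve written as a graph y(x). The short derivation below is added for illustration and uses only that textbook formula.

```latex
\kappa = \frac{y''}{\left(1 + y'^{2}\right)^{3/2}}
\qquad\Longrightarrow\qquad
\kappa\,\big|_{\,y' = 0} = y'' .
```

Fixing y' = 0 by the choice of axes removes the denominator, so the remaining second derivative is itself the invariant.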

We  generalize  this  approach  to  larger  transformation 
groups.  In  general,  the  factors  in  a  transformation  can 
be  eliminated  by  using  the  same  kind  of  transformation, 
with the same number of factors, to go over to the canonical
coordinates. The Euclidean invariants can be obtained
by using a Euclidean transformation to obtain a
Euclidean  canonical  system,  etc. 

The  general  projective  transformation  can  be  decom¬ 
posed  into  simpler  transformations:  translation,  rota¬ 
tion,  skewing,  scaling  (making  up  the  affine  group),  tilt 
and  slant.  We  will  use  these  to  canonize  the  coordi¬ 
nates  step  by  step.  At  each  step  some  of  the  viewpoint 
parameters  will  be  eliminated  until  we  are  left  with  a 
coordinate  system  independent  of  the  original  viewpoint 
and  defined  by  the  shape  itself. 

There are two basic requirements that the canonization
process has to meet: it has to be invariant, i.e.
produce  a  result  that  is  independent  of  the  original  sys¬ 
tem,  and  it  has  to  be  local,  i.e.  independent  of  the  exact 
fitting  details  such  as  the  window  size  or  fitted  curve. 

The Euclidean example above meets these requirements.
The requirement of tangency is an invariant one,
because  the  tangency  property  is  unchanged  under  a  pro¬ 
jective  transformation.  The  locality  requirement  is  also 
met,  because  the  tangency  means  that  the  first  deriva¬ 
tive  dy/dx  vanishes.  A  derivative  is  a  local  property  and 
is  independent  of  the  size  of  the  window  in  which  it  was 
calculated.  It  is  also  independent  of  exactly  what  curve 
was  fitted  (as  long  as  the  fit  is  good),  because  any  fit¬ 
ted  function  can  be  expanded  in  a  Taylor  series  with  the 
same  first  derivatives. 

For  the  Euclidean  case  we  used  the  tangent  to  obtain 
a  canonization  process  that  met  our  requirements  of  in¬ 


variance  and  locality.  We  can  generalize  the  method  by 
using  an  osculating  curve,  which  is  a  generalization  of  the 
tangent.  A  tangent  is  a  line  having  at  least  two  points 
in common with the curve in an infinitesimal neighborhood,
i.e. two “points of contact”. This can be expressed
as  a  condition  on  the  first  derivative.  Similarly,  a  higher 
order  osculating  curve  has  more  (independent)  contact 
points,  and  the  condition  on  the  derivatives  can  be  writ¬ 
ten  as 

$$\frac{d^i}{dt^i}\left(f^*(x,y) - f(x,y)\right) = 0, \quad i = 0, \ldots, n \qquad (1)$$

with f* being the osculating curve, f the given curve,
and  n  the  order  of  contact.  Since  the  derivatives  van¬ 
ish,  this  condition  is  invariant  to  the  parameter  t.  (We 
will  derive  the  osculating  curve  without  this  parame¬ 
ter.)  Since  it  has  a  geometric  interpretation  with  points 
of  contact,  the  condition  is  also  projectively  invariant. 
And  since  it  is  expressed  as  derivatives,  it  is  also  local. 
Thus  all  the  independence  requirements  set  forth  earlier 
are  met.  (The  derivatives  will  be  calculated  analytically 
from f.)

In  the  following  sections  we  will  use  an  osculating  im¬ 
plicit  curve  f*  satisfying  the  above  condition.  This  curve 
will  be  chosen  as  the  simplest  one  that  meets  our  needs; 
its  shape  is  thus  known.  Thus  it  will  be  easier  to  handle 
than the original f which can be any function that fits.
According  to  our  needs  we  find  either  a  cubic  or  a  conic 
which  osculates  our  fitted  curve.  We  then  transform  the 
coordinates  so  that  this  cubic  or  conic  takes  on  a  partic¬ 
ularly  simple,  predetermined  form,  i.e.  we  eliminate  all 
its  coefficients.  In  this  new  (canonical)  system  all  quan¬ 
tities  are  invariants  and  we  pick  the  ones  that  best  suit 
our  needs. 

We  will  describe  the  correspondenceless  case  in  full 
and  summarize  the  other  cases. 

3  Local  Projective  Invariants  Without 
Correspondence 

We  use  the  osculating  curve  method  to  eliminate  all  the 
projective  unknowns  and  obtain  two  local  invariants  at 
any  curve  point.  The  outline  of  our  method  is  as  follows: 

•  Repeat  the  following  steps  for  each  pixel  that  be¬ 
longs  to  the  curve  to  obtain  two  independent  in¬ 
variants  at  that  point  of  the  curve: 

-  Define  a  window  around  the  pixel  and  fit  an 
implicit  polynomial  curve  to  it,  say  a  cubic  or  a 
quartic. All the following stages are performed
analytically. 

-  Derive  a  canonical,  intrinsic  coordinate  sys¬ 
tem  based  invariantly  on  the  properties  of  the 
shape  itself,  independently  of  the  given  coor¬ 
dinate  system.  By  doing  so  we  eliminate  all 
the  unknown  quantities  of  the  original  system 
(the  viewpoint).  To  accomplish  this:  define  an 
“auxiliary  curve”  which  osculates  the  original 
fitted  curve  with  a  known  order  of  contact.  The 
canonical  system  is  defined  so  that  in  it  the  os¬ 
culating  curve  has  a  particularly  simple,  prede¬ 
termined  form. 
-  Transform the original fitted curve to this new
system.  Since  the  system  is  canonical,  all  shape 
descriptors  defined  in  it  are  independent  of  the 
original  coordinate  system  and  are  therefore  in¬ 
variants.  Pick  two  invariants  that  are  indepen¬ 
dent  of  the  window  size  or  the  order  of  the  fitted 
curve,  and  depend  only  on  the  shape  itself. 

•  Plot  one  invariant  against  the  other  to  obtain  an  in¬ 
variant  signature  curve.  This  is  based  on  the  com¬ 
pleteness  property  discussed  above. 

•  If  an  invariant  fit  is  needed,  we  repeat  the  previous 
steps,  i.e.  redo  the  curve  fitting  in  the  new  canonical 
system,  and  iterate  until  convergence. 

In  the  following  sections  we  describe  the  above  steps 
in  more  detail. 

3.1  Curve  fitting 

The  method  described  above  involves  fitting  an  implicit 
curve  to  the  available  data  points.  To  do  so  we  have  to 
determine  parameters  such  as  the  order  of  the  curve  and 
the  window  size. 

To  determine  the  curve  order,  we  need  to  know  the 
minimum  number  of  coefficients  needed,  or  the  amount 
of  information  that  needs  to  be  obtained  from  the  image. 
To  find  invariants,  we  have  to  eliminate  the  information 
in  the  image  which  is  specific  to  the  coordinate  system. 
For  example,  given  a  pencil  that  can  move  or  rotate  on 
a  table,  the  position  of  the  pencil  and  its  orientation 
are  not  invariant  but  its  length  is  a  Euclidean  invariant. 
Given  the  coordinates,  say  of  the  ends  of  the  pencil,  we 
can  eliminate  the  position  and  orientation  and  calculate 
the  length.  Thus  from  the  four  measured  coordinates 
we  have  eliminated  the  three  Euclidean  transformation 
parameters and found one invariant.
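The pencil argument can be checked numerically. The following is a minimal sketch, not part of the original implementation; the function names `pencil_length` and `move` and the sample coordinates are illustrative assumptions.

```python
import math

def pencil_length(p, q):
    """Length of the segment pq: the single Euclidean invariant."""
    return math.hypot(q[0] - p[0], q[1] - p[1])

def move(point, angle, tx, ty):
    """Rotate a point by `angle` radians about the origin, then translate."""
    x, y = point
    c, s = math.cos(angle), math.sin(angle)
    return (c * x - s * y + tx, s * x + c * y + ty)

# Four measured coordinates (the two end points of the pencil).
p, q = (1.0, 2.0), (4.0, 6.0)
# The same pencil after an arbitrary motion (three unknown parameters).
p_moved = move(p, 0.7, -3.0, 5.0)
q_moved = move(q, 0.7, -3.0, 5.0)

length_before = pencil_length(p, q)
length_after = pencil_length(p_moved, q_moved)
```

Position and orientation change under the motion, while the two lengths agree up to rounding.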

Similar  arguments  apply  to  other  transformations.  In 
the projective case, we want to eliminate eight parameters
of the transformation, so the number of coefficients
to  be  obtained  from  the  image  should  exceed  eight.  Since 
we  need  two  independent  invariants  at  each  pixel,  we 
need  ten  independent  quantities.  A  cubic  has  nine  coef¬ 
ficients,  but  we  also  have  the  position  of  the  point  on  the 
cubic  for  a  total  of  ten  quantities.  Thus  it  is  sufficient 
from  purely  geometrical  considerations  to  fit  a  cubic  to 
our  data.  However,  other  considerations  push  us  towards 
a  higher  order  curve. 

We can see here an advantage over the explicit method
that requires differentiation of $x(t)$, $y(t)$ with respect to
the curve parameter $t$. The elimination argument above
applies to this unknown parameter, i.e. this parameter
has to be eliminated along with the coordinates, so that
the invariants will be independent of it. This increases
the amount of data that needs to be extracted from the
image, e.g. the orders of the derivatives. In Wilczynski's
method, the eighth derivatives of both $x$ and $y$ were
needed, a total of 18 quantities. This reduced the
reliability of the invariants. Thus avoiding the parameter
from the outset reduces the number of quantities we need
to obtain from the image and improves reliability.

Regarding the window size, we have found [Weiss 1991]
that the wider the window, the more reliable the fitting
becomes. This is especially important in our case because
of the relatively large number of independent quantities
that we need to obtain, at least in the hardest case
(the projective correspondenceless case). To maintain
the accuracy of the fit in a wide window, a higher order
curve has to be fitted. This prompts us to use quartic or
higher curves, even though a cubic has enough parameters.
The increased number of parameters needed for
the higher order curves is not a problem because they do
not all need to be independent; at most ten independent
ones are needed.

In practice we have found it convenient to restrict ourselves
to fourth order (quartic) curves although higher
orders may be worth investigating. In the sequel we will
deal with the fitted quartic

$$f(x,y) = a_0 + a_1 x + a_2 y + a_3 x^2 + a_4 xy + a_5 y^2 + a_6 x^3 + a_7 x^2 y + a_8 x y^2 + a_9 y^3 + a_{10} x^4 + a_{11} x^3 y + a_{12} x^2 y^2 + a_{13} x y^3 + a_{14} y^4 = 0 \tag{2}$$

with the cubic being the special case in which coefficients
$a_{10}, \ldots, a_{14}$ vanish.

Once the curve order and window size have been chosen,
the fitting itself can be done by standard methods.
Simple least squares fitting is quite ill conditioned because
of the relatively large number of unknowns. The
SVD (Singular Value Decomposition) method [Press et
al. 1986] is very successful in overcoming this problem
and we obtain a quite reliable fit.
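To make the fitting setup concrete, the sketch below builds one row of the linear system per data point, using the monomial ordering of the quartic (2). It is only an illustration under stated assumptions: the names `monomials` and `residuals` are hypothetical, and the SVD step itself (taking the singular vector with the smallest singular value of the stacked rows) is left to a linear algebra library and is not shown.

```python
def monomials(x, y, order=4):
    """Monomials x^i y^j ordered by total degree, as in eq. (2):
    1, x, y, x^2, xy, y^2, x^3, x^2 y, x y^2, y^3, ..."""
    return [x ** (deg - j) * y ** j
            for deg in range(order + 1)
            for j in range(deg + 1)]

def residuals(coeffs, points, order=4):
    """Algebraic residual f(x, y) of each data point for given coefficients."""
    return [sum(a * m for a, m in zip(coeffs, monomials(x, y, order)))
            for x, y in points]

# Example: the unit circle x^2 + y^2 - 1 = 0 written as a (degenerate) quartic.
circle = [0.0] * 15
circle[0], circle[3], circle[5] = -1.0, 1.0, 1.0
points = [(1.0, 0.0), (0.0, 1.0), (0.6, 0.8)]
design_matrix = [monomials(x, y) for x, y in points]
res = residuals(circle, points)
```

Each row of `design_matrix` has 15 entries, one per coefficient $a_0, \ldots, a_{14}$; a fit seeks the coefficient vector that makes all residuals as small as possible.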

We have thus obtained a local algebraic (parameterless)
representation for the data. We will now find its
invariants (analytically).

3.2  Deriving  a  canonical,  intrinsic  coordinate 
system 

3.2.1  Euclidean  canonization 

First we detail the Euclidean canonization stage. As
a convention, we denote the new coordinates after each
canonization step by $\bar{x}, \bar{y}$ and drop the bars before going
to the next step, and similarly for other quantities.

The first step is translation, moving the origin to our
curve point. Our pixel $x_0, y_0$ does not necessarily lie on
the fitted curve but it is close to it. Thus, we find a point
$x_0, y_0^*$ which does lie on the curve, i.e. we solve eq. (2) for
$y_0^*$, given $x_0$. This is easy to do with Newton's method
because $y_0$ is a close initial guess. We now translate
the origin to $x_0, y_0^*$. (We could simplify the solution by
first translating so that $x_0 = 0$ and then solving for $y_0^*$.)
We drop the star from $y_0^*$. We now transform the curve
coefficients to the new system and obtain new $\bar{a}_i$. This
is done by expressing the old coordinates in terms of the
new, $x = \bar{x} + x_0$, substituting in eq. (2) and rearranging.
In this new system we have $\bar{a}_0 = 0$, which can be seen by
simply substituting $(0,0)$ in eq. (1).
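The translation step can be sketched as follows. This is an illustrative example rather than the original code: the polynomial is stored as a dictionary `{(i, j): coefficient}` for the term $x^i y^j$, and the names `snap_to_curve` and `translate` are assumptions.

```python
from math import comb

def f(poly, x, y):
    """Evaluate a polynomial stored as {(i, j): coeff} for x^i y^j."""
    return sum(c * x**i * y**j for (i, j), c in poly.items())

def df_dy(poly, x, y):
    """Partial derivative of the polynomial with respect to y."""
    return sum(c * j * x**i * y**(j - 1)
               for (i, j), c in poly.items() if j > 0)

def snap_to_curve(poly, x0, y0, tol=1e-12, max_iter=50):
    """Newton's method in y at fixed x0, starting from the pixel value y0."""
    y = y0
    for _ in range(max_iter):
        step = f(poly, x0, y) / df_dy(poly, x0, y)
        y -= step
        if abs(step) < tol:
            break
    return y

def translate(poly, x0, y0):
    """Coefficients of f(xb + x0, yb + y0) as a polynomial in (xb, yb)."""
    out = {}
    for (i, j), c in poly.items():
        for p in range(i + 1):
            for q in range(j + 1):
                term = c * comb(i, p) * x0**(i - p) * comb(j, q) * y0**(j - q)
                out[(p, q)] = out.get((p, q), 0.0) + term
    return out

# Example curve: the unit circle, with a pixel slightly off the curve.
circle = {(0, 0): -1.0, (2, 0): 1.0, (0, 2): 1.0}
y_star = snap_to_curve(circle, 0.6, 0.79)
shifted = translate(circle, 0.6, y_star)
```

After the shift the constant term `shifted[(0, 0)]` vanishes, which is the condition $\bar{a}_0 = 0$.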

The next step is to rotate the coordinates so that the
$x$ axis is tangent to the curve. It is easy to see that
in the rotated system we must have $\bar{a}_1 = 0$ (because
$\partial f(x,y)/\partial x = 0$). To satisfy this condition we again
express the old coordinates in terms of the new, with
the rotation factor $u_r$:

$$x = \bar{x} + u_r \bar{y}, \qquad y = \bar{y} - u_r \bar{x} \tag{3}$$

Now $a_1$ is transformed to

$$\bar{a}_1 = a_1 - u_r a_2$$

To make this vanish we have to rotate by the amount

$$u_r = a_1 / a_2$$

Since translation and rotation make up the Euclidean
transformations, we have reached a Euclidean canonical
system. All quantities defined in it are Euclidean
invariants. The curvature at $x_0$ is now simply the second
derivative, $d^2 y/dx^2$. The arclength is $|dx|$ since $dy = 0$.
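A small sketch of the rotation step follows, assuming the same dictionary representation `{(i, j): coefficient}` for polynomials; the helper names (`poly_mul`, `substitute`, `rotate_to_tangent`) are hypothetical. It substitutes eq. (3) into the curve and confirms that the coefficient of $\bar{x}$ vanishes.

```python
def poly_mul(p, q):
    """Product of two polynomials stored as {(i, j): coeff}."""
    out = {}
    for (i1, j1), c1 in p.items():
        for (i2, j2), c2 in q.items():
            key = (i1 + i2, j1 + j2)
            out[key] = out.get(key, 0.0) + c1 * c2
    return out

def poly_pow(p, n):
    """n-th power of a polynomial (n >= 0)."""
    out = {(0, 0): 1.0}
    for _ in range(n):
        out = poly_mul(out, p)
    return out

def substitute(poly, x_expr, y_expr):
    """Replace x and y in `poly` by polynomials in the new coordinates."""
    out = {}
    for (i, j), c in poly.items():
        term = poly_mul(poly_pow(x_expr, i), poly_pow(y_expr, j))
        for key, v in term.items():
            out[key] = out.get(key, 0.0) + c * v
    return out

def rotate_to_tangent(poly):
    """Apply eq. (3) with u_r = a_1 / a_2 so that the new a_1 vanishes."""
    u_r = poly[(1, 0)] / poly[(0, 1)]
    x_expr = {(1, 0): 1.0, (0, 1): u_r}    # x = xb + u_r * yb
    y_expr = {(1, 0): -u_r, (0, 1): 1.0}   # y = yb - u_r * xb
    return u_r, substitute(poly, x_expr, y_expr)

# Example: a curve through the origin, 0.5 x + 2 y + x^2 + x y + 3 y^2 = 0.
curve = {(1, 0): 0.5, (0, 1): 2.0, (2, 0): 1.0, (1, 1): 1.0, (0, 2): 3.0}
u_r, rotated = rotate_to_tangent(curve)
```

Here `u_r` is 0.25, and the new linear coefficients are $\bar{a}_1 = a_1 - u_r a_2 = 0$ and $\bar{a}_2 = a_2 + u_r a_1$.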

3.2.2  Eliminating  the  projective  unknowns 
Of the eight parameters of the general projectivity we
have already eliminated three by translation and rotation,
so our osculating curve should have five coefficients,
while passing through the origin and being tangent to the
$x$ axis. Following [Halphen 1880] we choose the “nodal
cubic”

$$f^* = c_0 x^3 + c_1 y^3 + c_2 x y^2 + c_3 x^2 y + c_4 y^2 + xy = 0 \tag{4}$$

This curve intersects itself at the origin so it has two
tangents there, one lying along the $x$ axis. The other
tangent is called the “projective normal” [Lane 1942].
Our treatment of the nodal cubic differs from Halphen's
and yields the full range of invariants. (We also had the
advantage of a symbolic manipulation program.)

Our goal is now to transform the coordinates so that
this nodal cubic takes on the simple coefficient-free form

$$x^3 + y^3 + xy = 0 \tag{5}$$

It is known [Bronshtein 1985] as the folium (leaf) of
Descartes (Figure 1). In a nutshell, we obtain it as follows.
We skew the coordinates so that the projective
normal becomes perpendicular to the $x$ axis, thus
providing a canonical $y$ axis. This eliminates $c_4$. We scale
the axes to eliminate $c_0, c_1$, obtaining an affine canonical
system with new $c_2, c_3$. These are now affine invariants.
We tilt and slant to eliminate them too, obtaining the
projective canonical system.

Figure 1: Osculating nodal cubic (left), folium of
Descartes (right).

We now find the nodal cubic $f^*$ using the osculation
condition, i.e. the equality of the first $n$ derivatives of
$f$ and $f^*$, eq. (1). The first derivative (and the zeroth)
vanish because of the tangency to the $x$ axis. To determine
the five coefficients $c_i$ we need five more derivatives
to be equal, i.e. up to the sixth one. The condition of
equal derivatives ensures the locality of the treatment
and also its invariance, as discussed earlier.

To go further, we need to calculate the derivatives
$d^n y/dx^n$ of the fitted curve. This is done analytically
from $f(x, y)$. To do it we use the fact that all the derivatives
of $f$ vanish, since $f$ vanishes identically (eq. 1). The
first derivative, for example, is

$$\frac{df}{dx} = \frac{\partial f}{\partial x} + \frac{\partial f}{\partial y}\frac{dy}{dx} = 0$$

This is a linear equation for $dy/dx$. It is superfluous
because we have already demanded its vanishing (tangency).
However, each successive differentiation gives
one linear equation for one higher $y^{(n)}$ in terms of lower
derivatives. The calculation is tedious and we used a
symbolic manipulation program to calculate the required
derivatives in terms of the $a_i$.

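Setting $a_2 = 1$ and denoting $d_n = \frac{1}{n!}\frac{d^n y}{dx^n}(0)$ we have

$$d_2 = -a_3 \tag{6}$$

$$d_3 = -a_6 - d_2 a_4 \tag{7}$$

$$d_4 = -a_{10} - d_2 a_7 - d_2^2 a_5 - d_3 a_4 \tag{8}$$

$$d_5 = -d_2 a_{11} - d_2^2 a_8 - d_3 a_7 - 2 d_2 d_3 a_5 - a_4 d_4 \tag{9}$$

$$d_6 = -d_2^2 a_{12} - d_3 a_{11} - d_2^3 a_9 - 2 d_2 d_3 a_8 - d_4 a_7 - a_4 d_5 + (-2 d_2 d_4 - d_3^2) a_5 \tag{10}$$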
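The recurrences (6)-(10) translate directly into code. The sketch below is illustrative; the function name `taylor_coefficients` and the list layout `a[0..14]` (coefficients of the quartic (2), already in the Euclidean canonical system with `a[0] = a[1] = 0` and `a[2] = 1`) are assumptions.

```python
def taylor_coefficients(a):
    """Return (d2, ..., d6) from the quartic coefficients a[0..14].

    Assumes the Euclidean canonical system: a[0] = a[1] = 0, a[2] = 1.
    """
    d2 = -a[3]
    d3 = -a[6] - d2 * a[4]
    d4 = -a[10] - d2 * a[7] - d2**2 * a[5] - d3 * a[4]
    d5 = (-d2 * a[11] - d2**2 * a[8] - d3 * a[7]
          - 2 * d2 * d3 * a[5] - a[4] * d4)
    d6 = (-d2**2 * a[12] - d3 * a[11] - d2**3 * a[9] - 2 * d2 * d3 * a[8]
          - d4 * a[7] - a[4] * d5 + (-2 * d2 * d4 - d3**2) * a[5])
    return d2, d3, d4, d5, d6

# Check 1: y + x^2 + x y = 0, i.e. y = -x^2 / (1 + x) = -x^2 + x^3 - x^4 + ...
a = [0] * 15
a[2], a[3], a[4] = 1, 1, 1
d_check1 = taylor_coefficients(a)

# Check 2: y + x^2 + y^2 = 0, i.e. y = -x^2 - x^4 - 2 x^6 - ...
b = [0] * 15
b[2], b[3], b[5] = 1, 1, 1
d_check2 = taylor_coefficients(b)
```

Both checks use curves whose power series is known in closed form, so the output can be compared term by term.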

Given these derivatives we find the coefficients $c_n$ of
the nodal cubic (4) as follows. We write the nodal cubic
as

$$y(x) = \sum_{n \ge 0} d_n x^n$$

and substitute it in the cubic expression, eq. (4). Collecting
terms with the same power $x^n$ we obtain five
equations for the five $c_i$ in terms of the $d_n$. Their solution
is


co=-d2  (11) 

•^3-  dld,-3did3dt+3d^dl 

dld,-3d^d3d4+3d3d’ 

r-  did4+diC-*d,dt-3di)+l0d3did4-5di 


Having found the coefficients $c_i$, we proceed to eliminate
them. First, we orthogonalize the axes, i.e. skew
the system so that the two nodal tangents become
perpendicular. This will eliminate the $c_4$ term in the nodal
cubic (4). Our skewing transformation is

$$x = \bar{x} + u_s y$$

with $u_s = -c_4$ being the skewing factor; $y$ remains
unchanged. Substituting the above equation in the cubic (4)
and rearranging we obtain new coefficients

$$\bar{c}_1 = -c_0 c_4^3 + c_3 c_4^2 - c_2 c_4 + c_1 \tag{16}$$

$$\bar{c}_2 = 3 c_0 c_4^2 - 2 c_3 c_4 + c_2 \tag{17}$$

$$\bar{c}_3 = c_3 - 3 c_0 c_4 \tag{18}$$

We again drop the bars from $c_i$ and $x$.
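The skewing step can be verified numerically: evaluating the original nodal cubic at the skewed point must give the same value as evaluating the cubic with the new coefficients at the unskewed point. The following sketch assumes the coefficient order $(c_0, c_1, c_2, c_3, c_4)$; the function names are illustrative.

```python
def nodal_cubic(c, x, y):
    """Value of the nodal cubic (4) for coefficients c = (c0, c1, c2, c3, c4)."""
    c0, c1, c2, c3, c4 = c
    return (c0 * x**3 + c1 * y**3 + c2 * x * y**2
            + c3 * x**2 * y + c4 * y**2 + x * y)

def skew(c):
    """Skewing factor u_s = -c4 and the new coefficients, eqs. (16)-(18)."""
    c0, c1, c2, c3, c4 = c
    u_s = -c4
    c1_new = -c0 * c4**3 + c3 * c4**2 - c2 * c4 + c1
    c2_new = 3 * c0 * c4**2 - 2 * c3 * c4 + c2
    c3_new = c3 - 3 * c0 * c4
    return u_s, (c0, c1_new, c2_new, c3_new, 0.0)

c = (1.5, -0.4, 0.8, 0.3, 0.6)
u_s, c_skewed = skew(c)

xb, y = 0.7, -1.1                          # a test point in the skewed system
value_old = nodal_cubic(c, xb + u_s * y, y)
value_new = nodal_cubic(c_skewed, xb, y)
```

The two values agree, and the $y^2$ coefficient of the skewed cubic is zero, so the second nodal tangent now lies along the $y$ axis.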

One advantage of the orthogonalization is that it
makes it possible to decouple the next transformations,
i.e. the slantings and scalings in the $x$ and $y$ directions.
We can now proceed with these transformations in any
order to eliminate the remaining $c_i$.

We next scale the axes with the scaling factors $s_x, s_y$:

$$x = \bar{x}/s_x, \qquad y = \bar{y}/s_y \tag{19}$$

where $s_x$ and $s_y$ are chosen so that the coefficients of
$\bar{x}^3$ and $\bar{y}^3$ become unity. Substituting this in
the orthogonalized cubic we obtain

$$\bar{x}^3 + \bar{y}^3 + \bar{c}_2 \bar{x}\bar{y}^2 + \bar{c}_3 \bar{x}^2\bar{y} + \bar{x}\bar{y} = 0 \tag{20}$$

with

$$\bar{c}_2 = \frac{c_2}{s_y}, \qquad \bar{c}_3 = \frac{c_3}{s_x}$$

These quantities are local affine invariants because we
have reached an affine canonical system. We have used
all possible affine transformations (translation, rotation,
skewing, scaling) to eliminate all possible affine
transformation factors and arrive at the above form of the
cubic, so the remaining coefficients are uniquely defined
regardless of which system we started with.

A projective canonical system is obtained by eliminating
the last two coefficients using slants, which are
purely projective, in the $x$ and $y$ directions. To do this,
we drop the bars from the last cubic form (20), and substitute
$x, y$ in terms of the projective canonical $\bar{x}, \bar{y}$:

$$x = \frac{\bar{x}}{1 + \sigma_x \bar{x} + \sigma_y \bar{y}}, \qquad y = \frac{\bar{y}}{1 + \sigma_x \bar{x} + \sigma_y \bar{y}} \tag{21}$$

with the $x$- and $y$-slant factors

$$\sigma_x = -c_3, \qquad \sigma_y = -c_2$$

This finally brings us to Descartes' folium, eq. (5).
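The scaling and slanting steps can be checked numerically. In the sketch below the explicit scale factors $s_x = (c_0^2 c_1)^{1/3}$ and $s_y = (c_0 c_1^2)^{1/3}$ are an assumption: they are one choice derived here from the requirement that substituting eq. (19) into the orthogonalized cubic yields unit coefficients for the cubic terms in eq. (20). All function names are hypothetical.

```python
import math

def cbrt(v):
    """Real cube root that also handles negative arguments."""
    return math.copysign(abs(v) ** (1.0 / 3.0), v)

def affine_canonical(c0, c1, c2, c3):
    """Scale factors (assumed form) and the affine invariants of eq. (20)."""
    s_x = cbrt(c0 * c0 * c1)
    s_y = cbrt(c0 * c1 * c1)
    return s_x, s_y, c2 / s_y, c3 / s_x

def slant(xb, yb, c2, c3):
    """Eq. (21) with sigma_x = -c3, sigma_y = -c2; also returns the denominator."""
    den = 1.0 - c3 * xb - c2 * yb
    return xb / den, yb / den, den

def cubic(c0, c1, c2, c3, x, y):
    """Orthogonalized nodal cubic (no y^2 term)."""
    return c0 * x**3 + c1 * y**3 + c2 * x * y**2 + c3 * x**2 * y + x * y

c0, c1, c2, c3 = 2.0, -0.5, 0.3, 1.2
s_x, s_y, c2_bar, c3_bar = affine_canonical(c0, c1, c2, c3)
xb, yb = 0.4, -0.7

# Scaling: the cubic in (x, y) equals the canonical cubic (20) up to a factor.
lhs_scale = s_x * s_y * cubic(c0, c1, c2, c3, xb / s_x, yb / s_y)
rhs_scale = cubic(1.0, 1.0, c2_bar, c3_bar, xb, yb)

# Slanting: the affine canonical cubic maps to the folium (5).
x, y, den = slant(xb, yb, c2_bar, c3_bar)
lhs_slant = den**3 * cubic(1.0, 1.0, c2_bar, c3_bar, x, y)
rhs_slant = cubic(1.0, 1.0, 0.0, 0.0, xb, yb)
```

The factors `s_x * s_y` and `den**3` only clear denominators; they do not change the zero set of the curve.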

This  concludes  the  elimination  of  the  cubic  coefficients 
and  brings  us  to  the  projective  canonical  system.  This 
system  was  defined  invariantly  by  intrinsic  properties  of 
the  curve  such  as  the  shape  of  the  osculating  nodal  cubic, 
which  is  independent  of  the  original  coordinate  system. 

3.3  Projective  invariants 

We now have an invariant canonical system and affine
invariants, but still no projective invariants. To obtain
them, we transform the original fitted curve $f$, eq. (1),
to our canonical system. We collect all the transformations
that were performed during the canonization process.
We have already translated and rotated $f$ (with
the factors $x_0, y_0, u_r$); we now perform the rest of the
transformations making up the projectivity (with factors
$u_s, s_x, s_y, \sigma_x, \sigma_y$) on $f$. The coefficients of $f$ transform to
new ones $\bar{a}_i$, which are now all invariants because they
represent a fitted curve defined in the invariant system.
The only remaining question is how to select functions
of the invariants $\bar{a}_i$ which best suit our needs.

As mentioned before, the condition of locality dictates
that we use derivatives of the curve rather than some
arbitrary functions of the $\bar{a}_i$. The first six derivatives at
$x_0$ are already determined by the canonization process
(as $d_0, \ldots, d_6 = 0, 0, -1, 0, 0, 1, 0$). Thus we need the
seventh and eighth derivatives. These can be obtained
in this particular system similarly to eqs. (6)-(10). With
the above values of $d_n$ we have (dropping the bars)

$$I_1 = d_7 = a_{13} - a_7 + 2 a_5 \tag{22}$$

$$I_2 = d_8 = -a_{14} - a_{11} + 2 a_8 - a_4 d_7 \tag{23}$$

These quantities are our local projective invariants.
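As a cross-check of the invariants (22)-(23), the Taylor coefficients of $y(x)$ can also be computed directly from the implicit quartic by fixed-point iteration on truncated power series. This is a sketch under stated assumptions, not the symbolic procedure used originally: the coefficient list `a[0..14]` follows eq. (2) with `a[0] = a[1] = 0`, `a[2] = 1`, and the test curve is constructed so that $d_0, \ldots, d_6$ take the canonical values.

```python
# Exponents (i, j) of x^i y^j for a_0..a_14, in the order of eq. (2).
EXPONENTS = [(deg - j, j) for deg in range(5) for j in range(deg + 1)]

def series_mul(p, q, n):
    """Product of two power series in x, truncated after x^n."""
    out = [0.0] * (n + 1)
    for i, pi in enumerate(p):
        for j, qj in enumerate(q):
            if i + j <= n:
                out[i + j] += pi * qj
    return out

def taylor_of_implicit(a, n):
    """d_0..d_n of y(x) on the quartic, assuming a[0]=a[1]=0 and a[2]=1."""
    y = [0.0] * (n + 1)
    for _ in range(n + 1):              # each pass fixes at least one more order
        ypow = [[1.0] + [0.0] * n]      # y^0, ..., y^4 as truncated series
        for _k in range(4):
            ypow.append(series_mul(ypow[-1], y, n))
        new_y = [0.0] * (n + 1)
        for k, (i, j) in enumerate(EXPONENTS):
            if k == 2:
                continue                # the a_2 * y term is kept on the left
            for m, coeff in enumerate(ypow[j]):
                if m + i <= n:
                    new_y[m + i] -= a[k] * coeff
        y = new_y
    return y

def projective_invariants(a):
    """Eqs. (22)-(23), valid in the projective canonical system."""
    i1 = a[13] - a[7] + 2 * a[5]
    i2 = -a[14] - a[11] + 2 * a[8] - a[4] * i1
    return i1, i2

# A quartic built to satisfy d_0..d_6 = 0, 0, -1, 0, 0, 1, 0.
a = [0.0] * 15
a[2], a[3] = 1.0, 1.0
a[4], a[5], a[7], a[8], a[9], a[13], a[14] = 0.5, 0.3, 0.2, -0.4, 0.1, 0.7, -0.2
a[6] = a[4]
a[10] = a[7] - a[5]
a[11] = a[8] + 1.0
a[12] = a[9] - a[4]

d = taylor_of_implicit(a, 8)
i1, i2 = projective_invariants(a)
```

The series coefficients `d[7]` and `d[8]` should agree with `i1` and `i2`.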

In conclusion, we started with a curve fitted to data
points around $x_0, y_0$, and after a series of transformations
of this curve we arrived at local invariants which are
independent of the fitting details or the point of view.
We can repeat the process for other points to obtain an
invariant signature. No correspondence is needed.

3.4  Experimental  implementation 

The above method was implemented to extract local
invariants from a set of real images. Each image was
processed to obtain a contour curve for the relevant object,
using standard techniques of edge detection and thinning.
We used a window about 50 pixels wide around
each contour point and fitted an implicit curve to it,
minimizing the square distances with SVD. The coefficients
of this fitted curve were used to calculate the invariants.

Figure  2  shows  two  views  of  a  hanger.  Effects  of 
perspective  distortion  can  be  seen.  Figure  3  shows  the 
hanger  under  partial  occlusion.  Figures  4,  5,  and  6  show 
the  local  invariants  for  the  three  hanger  images.  A  good 
match  of  the  signatures  is  obtained.  A  check  for  a  match 
is  demonstrated  in  Figure  7.  The  match  is  between  the 
hangers  in  Figure  4  and  Figure  6,  where  it  is  partially 
occluded.  It  can  be  seen  that  the  occlusion  does  not 
prevent  us  from  obtaining  a  good  signature.  We  should 
mention  that  symmetry  helps  in  getting  a  full  signature 
for  the  hanger.  For  asymmetric  objects  only  part  of  the 
signature  is  obtained. 

Figure  8  shows  a  different  object,  a  coat  rack,  from 
two  different  viewpoints.  We  used  the  parts  of  the  rack 
on  which  coats  hang.  These  parts  are  somewhat  similar 
in character to the hanger (under projectivity).
Accordingly, the signature has some similarity to the previous
one  but  it  is  different  enough  to  distinguish  the  hanger 
from  the  coat  rack. 

The  invariant  signatures  are  presented  (one  on  top  of 
the  other)  in  Figure  9.  The  local  invariants  obtained 
from  the  coat  rack  (Figure  8)  are  compared  with  those 
of  the  first  hanger  image  (Figure  2).  The  result  of  this 
comparison  is  presented  in  Figure  10. 

4  Local  Invariants  With  Some 
Correspondence 

While the previous process does not require correspondence,
it leads to fitting rather high order curves, which
may be sensitive to noise. This problem is discussed in
[Weiss 1991] and it is shown that one way of overcoming
it is to use a wide window.

Figure 2: Two views of a hanger.

Another approach to increasing robustness is to use
some reference features, e.g. points or lines, for which
the correspondence is known. For example, a silhouette
of an airplane can contain both curved parts and straight
lines. We can use this information to eliminate some of
the parameters of the projective or affine transformation;
fewer curve descriptors will be needed for the elimination
of the remaining ones. Invariants involving both derivatives
and reference points were found by Barrett et al.
[1990] and Van Gool et al. [1990]. However, they still
use a curve parameter $t$ which also has to be eliminated,
and this reduces the robustness of their method.


Figure  3:  The  hanger  under  partial  occlusion. 


Figure 4: The invariant signature of the first hanger image.


Figure  5:  The  invariant  signature  of  the  second  hanger 
image. 


The “parameterless” method described above is perfectly
suited for this situation, and again leads to a reduction
in the number of quantities needed from the image
and to increased reliability. Here we use a canonical
method similar to that used in the correspondenceless
case in order to find local invariants while avoiding the
curve parameter. This makes the method more robust
as there are fewer unknowns to eliminate. In addition, as
in the previous case, the canonical frame makes it possible
to obtain an invariant fit using an iterative process,
which should increase the robustness further.

The first stage is similar to the previous case: fit a
high order curve over some window around some $x_0, y_0$
and then translate and rotate until the origin is at $x_0, y_0$
and the $x$ axis is tangent to the curve. We need a smaller
window than before and a lower order curve because we
need only lower derivatives.

Again we obtain an auxiliary osculating curve that will
help us find the canonical system. However, we do not
need the nodal cubic; a conic, with three parameters,
will suffice in all cases:

$$f^* = c(x, y) = c_0 x^2 + c_1 y^2 + c_2 xy + y = 0 \tag{24}$$

The exact process of finding the conic and canonising
differs in each case. However, the principles of invariance
and locality must be maintained. In the following

Figure 10: The invariant signature of the second object
(the coat rack) presented on top of the signature of the
first object (the hanger).


• A Curve and Two Feature Points:

This case requires only the second derivative to determine
the osculating conic, rather than the fourth
as before. We first find the conic that osculates
the fitted curve with second order contact, and
also passes through the two reference points. This
uniquely determines the conic. We then find the line
that passes through the two reference points. This
brings us to the same situation as before, namely a
conic plus a line, but with two fewer derivatives.

• A Curve and Two Feature Lines:

This case requires only the second derivative to determine
the osculating conic, rather than the fourth
as before. We first find the conic that osculates the
fitted curve with second order contact, and is also
tangent to the two reference lines. We then find the
intersection point of the reference lines. This brings
us to the case of a conic plus a point that we dealt
with before, but with two fewer derivatives.

• A Curve, a Point and a Line:

As before we require that the conic osculate the fitted
curve up to second order contact. In addition
we require that the reference line be polar to the reference
point w.r.t. the conic. This provides enough
conditions to determine the conic. Achieving this
brings us again to the situation of a conic plus a
point, to be canonized as before, again with two
fewer derivatives.

In what follows we describe the above processes in
detail, and also give experimental results for some of the
cases.


4.1 Transforming to a Euclidean canonical
system

In all of the above processes the reference points and
lines need to be transformed to the Euclidean canonical
system. For a feature point $x_1, y_1$ the transformation is

$$\bar{x}_1 = (x_1 - x_0 - u_r (y_1 - y_0))/(1 + u_r^2) \tag{25}$$

$$\bar{y}_1 = (y_1 - y_0 + u_r (x_1 - x_0))/(1 + u_r^2) \tag{26}$$


Figure  11:  On  the  left,  an  osculating  conic.  On  the  right, 
the  canonical  conic  and  point. 


(This involves the inverse of the rotation of the curve $f$,
eq. (3), because points transform with the inverse of the
curve transformation.)

The reference (feature) line $b_0 + b_1 x + b_2 y$ is translated
and rotated as

$$\bar{b}_0 = b_0 + b_1 x_0 + b_2 y_0 \tag{27}$$

$$\bar{b}_1 = b_1 - u_r b_2 \tag{28}$$

$$\bar{b}_2 = b_2 + u_r b_1 \tag{29}$$

We again drop the bars from all quantities.
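Equations (25)-(29) can be exercised together: the value of the line expression at a point should be unchanged when both the point and the line are moved to the Euclidean canonical system. The following is a minimal sketch with hypothetical function names and sample values.

```python
def to_canonical_point(x1, y1, x0, y0, u_r):
    """Eqs. (25)-(26): a feature point in the Euclidean canonical system."""
    s = 1.0 + u_r * u_r
    xb = (x1 - x0 - u_r * (y1 - y0)) / s
    yb = (y1 - y0 + u_r * (x1 - x0)) / s
    return xb, yb

def to_canonical_line(b0, b1, b2, x0, y0, u_r):
    """Eqs. (27)-(29): a feature line b0 + b1 x + b2 y in the same system."""
    return (b0 + b1 * x0 + b2 * y0, b1 - u_r * b2, b2 + u_r * b1)

def line_value(line, point):
    """Value of b0 + b1 x + b2 y at a point (zero if the point is on the line)."""
    b0, b1, b2 = line
    return b0 + b1 * point[0] + b2 * point[1]

x0, y0, u_r = 0.5, -0.25, 0.3            # curve point and rotation factor
line = (-5.0, 1.0, 2.0)
on_line, off_line = (1.0, 2.0), (3.0, 4.0)

line_c = to_canonical_line(*line, x0, y0, u_r)
on_c = to_canonical_point(*on_line, x0, y0, u_r)
off_c = to_canonical_point(*off_line, x0, y0, u_r)
```

Incidence is preserved: a point on the line stays on the transformed line, and the value at a point off the line is unchanged.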


Figure  12:  The  relation  between  a  polar  line  and  a  point. 

Figure 6: The invariant signature of the occluded hanger
image.

Figure 7: The invariant signature of the occluded hanger
image presented on top of the signature of the
unoccluded hanger.

we will briefly describe the processes for the different
possible combinations. Each known feature point or line
reduces the number of derivatives needed by two,
because it eliminates two transformation factors.

• A Curve and One Feature Point:

We draw a line joining the given reference point
to the curve point $x_0, y_0$ (Figure 11). This
is obviously a projectively invariant operation. We
use this line as our new $y$ axis. As before we skew
the system so this line becomes perpendicular to $x$.
We thus obtain an orthogonal system which we can
scale and slant as before.

To do this, we obtain an osculating conic to our
fitted curve $f$. We need only fourth order contact,
rather than sixth as before.

After fitting the conic, our goal will be to transform
to a canonical system in which the conic is a unit
parabola $x^2 + y = 0$, and the distance between the
curve point and the reference point is unity (right
hand side of Figure 11).

• A Curve and One Feature Line:

We convert to the previous case by finding the polar
point of the given line with respect to the osculating
conic. Polarity of a point and a line is an invariant
relation. Given a point, we can draw from it two
tangents to the conic, creating two points at which
these tangents touch the conic. The line joining
these two points is the polar line of the given point
w.r.t. the conic (Figure 12).

The conic is found in the same way as in the previous
case, requiring osculation in the fourth derivatives.
After obtaining the polar point in a Euclidean
canonical system, we are in the same situation as in
the previous case, having a conic and a point, and
we can proceed to find invariants as before.

Figure 8: Two views of a coat rack.


Figure  9:  The  invariant  signatures  of  the  coat  rack.  The 
signatures  are  presented  one  on  top  of  the  other. 

4.2 A curve and one feature point

We first find the first four derivatives of $f$ using eqs. (6)-
(10). From these we find the coefficients of the osculating
conic by the same method we used for the nodal cubic.
The result is

$$c_0 = -d_2 \tag{30}$$

$$c_1 = -(d_2 d_4 - d_3^2)/d_2^3 \tag{31}$$

$$c_2 = -d_3/d_2 \tag{32}$$

To orthogonalize the system, we want to obtain $\bar{x}_1 = 0$.
This is achieved by skewing (eq. 19) with the skewing
factor $u_s = x_1/y_1$. The orthogonalization changes the
conic coefficients to

$$\bar{c}_1 = c_1 + c_0 u_s^2 + c_2 u_s \tag{33}$$

$$\bar{c}_2 = c_2 + 2 c_0 u_s \tag{34}$$

We drop the bars from the $c_i$. The reference point coordinates
are now $(0, y_1)$.

For the affine case we only need scaling, eq. (19).

It is easy to obtain a distance of unity between the
origin and the reference point by scaling the $y$ axis with
$s_y = \pm 1/y_1$. (The sign is taken to be the same as the sign
of $c_0$.) Scaling in the $x$ direction is done by requiring
$c_0 = 1$, which is achieved by $s_x = \sqrt{c_0 s_y}$. Substituting the
scaling transformation (19) in the conic (24) we obtain
(dropping bars)

$$x^2 + \frac{c_1}{s_y} y^2 + \frac{c_2}{s_x} xy + y = 0$$

The two remaining coefficients, $c_1/s_y$ and $c_2/s_x$, are
affine invariants. (The conic here is not a unit parabola
but has these two invariant coefficients.)

For projective invariants, we first have to slant the
shape in the $x$ and $y$ directions. (This has to be done
before scaling.) The terms containing $c_1$ and $c_2$ are
eliminated using the transformation (21) with the $x$, $y$ slant
factors being $\sigma_x = -c_2$, $\sigma_y = -c_1$.

As in the affine case we use the reference point for
scaling, but now its distance has changed because of the
slant. The new distance is now $\bar{y}_1 = y_1/(1 - \sigma_y y_1)$.
We want to scale $y$ so that this distance is unity, so
$s_y = \pm 1/\bar{y}_1$ (again with the sign of $c_0$).

At this point the conic is reduced to $c_0 x^2 + y/s_y = 0$.
To obtain a unit parabola and get rid of $c_0$ we scale in
the $x$-direction with $s_x = \sqrt{c_0 s_y}$.

We have thus obtained the projective canonical system.
To obtain the invariants, we have to transform
the original fitted curve $f$ to this system. Again all the
transformed $a_i$ are invariants, but we need the ones that
are local in nature and independent of the fitting details,
namely derivatives. Since we have used up the first four
derivatives we need the fifth and sixth (two less than in
the correspondenceless case). To obtain them we substitute
in the expressions (6)-(10) the canonical values
$d_0, \ldots, d_4 = 0, 0, -1, 0, 0$ and obtain

$$I_1 = d_5 = a_{11} - a_8 \tag{35}$$

$$I_2 = d_6 = a_9 - a_{12} - a_4 d_5 \tag{36}$$

These are our local projective invariants.
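The affine part of this case can be sketched in a few lines: the osculating conic from eqs. (30)-(32), the skew of eqs. (33)-(34), and the two affine invariants. The function names are hypothetical, and the scale factor `s_x = sqrt(c0 * s_y)` is taken from the requirement that the $x^2$ coefficient become unity.

```python
import math

def osculating_conic(d2, d3, d4):
    """Eqs. (30)-(32): conic (24) with fourth order contact."""
    c0 = -d2
    c1 = -(d2 * d4 - d3**2) / d2**3
    c2 = -d3 / d2
    return c0, c1, c2

def orthogonalize(c, ref):
    """Eqs. (33)-(34): skew with u_s = x1 / y1 so the reference point is (0, y1)."""
    c0, c1, c2 = c
    x1, y1 = ref
    u_s = x1 / y1
    return (c0, c1 + c0 * u_s**2 + c2 * u_s, c2 + 2 * c0 * u_s), (0.0, y1)

def affine_invariants(c, ref):
    """The invariants c1 / s_y and c2 / s_x after scaling."""
    c0, c1, c2 = c
    s_y = math.copysign(1.0 / abs(ref[1]), c0)   # sign of s_y follows c0
    s_x = math.sqrt(c0 * s_y)
    return c1 / s_y, c2 / s_x

# Example: x^2 + y^2 + y = 0 has the series y = -x^2 - x^4 - ...
conic = osculating_conic(-1.0, 0.0, -1.0)
conic_orth, ref_orth = orthogonalize(conic, (1.0, 2.0))
inv_1, inv_2 = affine_invariants(conic_orth, ref_orth)
```

For this example the conic is recovered as `(1.0, 1.0, 0.0)` and the reference point `(1, 2)` is skewed onto the $y$ axis.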


4.3  A  curve  and  one  feature  line 

The conic is found in the same way as in the previous
case, requiring osculation in the fourth derivatives. The
polar line is found as follows:

Given a point $x_1^h$ in homogeneous coordinates, we can
write the coefficients $b_i$ of its polar line with respect to
a homogeneous conic

$$C = c_0 (x^h)^2 + c_1 (y^h)^2 + c_2 x^h y^h + y^h z^h = 0$$

$$b_0 = \left.\frac{\partial C}{\partial z^h}\right|_{x_1^h} = y_1^h \tag{37}$$

$$b_1 = \left.\frac{\partial C}{\partial x^h}\right|_{x_1^h} = 2 c_0 x_1^h + c_2 y_1^h \tag{38}$$

$$b_2 = \left.\frac{\partial C}{\partial y^h}\right|_{x_1^h} = 2 c_1 y_1^h + c_2 x_1^h + z_1^h \tag{39}$$

($C$ is first differentiated and then the point coordinates
$x_1^h$ are substituted in the right hand side.) In our case we
know the line $b_i$ and the conic $C$ in the above equation,
so we have a set of linear equations for the point $x_1^h$.
Solving these equations we obtain

$$x_1^h = -b_1 + c_2 b_0 \tag{40}$$

$$y_1^h = -2 c_0 b_0 \tag{41}$$

$$z_1^h = b_1 c_2 - 2 c_0 b_2 + (4 c_0 c_1 - c_2^2) b_0 \tag{42}$$

Going back to regular coordinates we have the polar
point in our Euclidean canonical system

$$x_1 = x_1^h / z_1^h, \qquad y_1 = y_1^h / z_1^h$$

We are now in the same situation as in the previous
case, having a conic and a point in a Euclidean canonical
system, and we can proceed to find invariants as before.
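A round trip illustrates the pole and polar relation: compute the polar line of a point with eqs. (37)-(39), then recover the point from the line with eqs. (40)-(42). The sketch below uses hypothetical function names and sample values, with the conic given as $(c_0, c_1, c_2)$ and the line as $(b_0, b_1, b_2)$.

```python
def polar_line(c, point_h):
    """Eqs. (37)-(39): polar line (b0, b1, b2) of a homogeneous point (x, y, z)."""
    c0, c1, c2 = c
    x, y, z = point_h
    return (y, 2 * c0 * x + c2 * y, 2 * c1 * y + c2 * x + z)

def polar_point(c, line):
    """Eqs. (40)-(42): polar point of a line, returned in regular coordinates."""
    c0, c1, c2 = c
    b0, b1, b2 = line
    xh = -b1 + c2 * b0
    yh = -2 * c0 * b0
    zh = b1 * c2 - 2 * c0 * b2 + (4 * c0 * c1 - c2**2) * b0
    return xh / zh, yh / zh

conic = (1.0, 0.5, 0.2)
point = (0.3, 2.0)

line = polar_line(conic, (point[0], point[1], 1.0))
recovered = polar_point(conic, line)

# A line is only defined up to scale; rescaling it must not move the point.
scaled_line = tuple(3.0 * b for b in line)
recovered_scaled = polar_point(conic, scaled_line)
```

Because eqs. (40)-(42) are homogeneous in the $b_i$, the recovered point does not depend on the overall scale of the line coefficients.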

4.4  A  curve  and  two  feature  points 

We need here the formula for the conic coefficients in
terms of the second derivative and the reference points:

$$c_0 = -d_2 \tag{43}$$

$$c_1 = \frac{c_0 (x_1 x_2^2 y_1 - x_1^2 x_2 y_2) - y_1 (x_2 y_2 - x_1 y_2)}{x_2 y_1^2 y_2 - x_1 y_1 y_2^2} \tag{44}$$

$$c_2 = \frac{-c_0 (x_2^2 y_1^2 - x_1^2 y_2^2) + y_1 y_2^2 - y_1^2 y_2}{x_2 y_1^2 y_2 - x_1 y_1 y_2^2} \tag{45}$$

where $x_1, y_1$, $x_2, y_2$ are the reference point coordinates
in the Euclidean canonical system.

Here (43) is the same condition on $c_0$ as in all previous
cases, and (44)-(45) are obtained by substituting the
reference points in the conic (24) and solving for $c_1$ and
$c_2$.

The affine invariants are calculated from the $c_i$ as in
the previous case. The projective invariants are now
the third and fourth derivatives, two lower than before.
Substituting $d_2 = -1$ in eq. (6) we obtain

$$d_3 = -a_6 + a_4 \tag{46}$$

$$d_4 = -a_{10} + a_7 - a_5 - a_4 d_3 \tag{47}$$

which are our local projective invariants.
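The conic through two reference points can be computed by solving the two linear equations obtained from substituting the points into the conic (24). The sketch below writes out that solution (by Cramer's rule) and checks that both points lie on the resulting conic; the function names and sample values are illustrative assumptions.

```python
def conic_through_points(d2, p1, p2):
    """Conic (24) with c0 = -d2 that passes through reference points p1, p2."""
    c0 = -d2
    (x1, y1), (x2, y2) = p1, p2
    den = x2 * y1**2 * y2 - x1 * y1 * y2**2
    c1 = (c0 * (x1 * x2**2 * y1 - x1**2 * x2 * y2)
          - y1 * (x2 * y2 - x1 * y2)) / den
    c2 = (-c0 * (x2**2 * y1**2 - x1**2 * y2**2)
          + y1 * y2**2 - y1**2 * y2) / den
    return c0, c1, c2

def conic_value(c, point):
    """Value of c0 x^2 + c1 y^2 + c2 x y + y at a point."""
    c0, c1, c2 = c
    x, y = point
    return c0 * x**2 + c1 * y**2 + c2 * x * y + y

p1, p2 = (1.0, 2.0), (-1.0, 3.0)
conic = conic_through_points(-0.5, p1, p2)
value_1 = conic_value(conic, p1)
value_2 = conic_value(conic, p2)
```

The denominator vanishes when the two reference points are collinear with the origin (or one of them lies on the $x$ axis), in which case the conic is not determined this way.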

4.5 A curve and two feature lines

The only new thing here is finding the conic. The tangents
to a conic satisfy the equations of the “line conic”,
which is the dual of a regular conic. When representing
the conic in matrix notation, the line conic matrix is the
inverse of the point conic matrix. The inverse matrix
of (24) is

$$\begin{pmatrix} 1 & 0 & -c_2 \\ 0 & 0 & 2 c_0 \\ -c_2 & 2 c_0 & c_2^2 - 4 c_0 c_1 \end{pmatrix}$$

$c_0$ is determined as before by the second derivative $d_2$.
The reference lines satisfy the equations $b C^{-1} b^t = 0$,
from which $c_1$ and $c_2$ can be found. We obtain the conic

$$c_0 = -d_2 \tag{48}$$

+  2c(42W’  - 

- 

b\  —  2c3io6i  -b  Acobohi  -1-  e^bl 

where  are  the  coefficients  of  the  two  reference  lines 
in  the  Euclidean  canonical  system. 
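As an illustration of the tangency conditions, the sketch below solves the two line conic equations for $c_1$ and $c_2$. The closed form used here is derived in this sketch from the matrix above, with the line vector ordered as $(b_1, b_2, b_0)$; it is an assumption and is not guaranteed to match the printed form of the original solution. All names are hypothetical.

```python
def line_conic_form(c, line):
    """b C^{-1} b^t for the matrix above, with line = (b0, b1, b2)."""
    c0, c1, c2 = c
    b0, b1, b2 = line
    return (b1 * b1 - 2 * c2 * b0 * b1 + 4 * c0 * b0 * b2
            + (c2 * c2 - 4 * c0 * c1) * b0 * b0)

def conic_tangent_to_lines(d2, line_a, line_b):
    """Conic (24) with c0 = -d2 that is tangent to both reference lines."""
    c0 = -d2
    b0, b1, b2 = line_a
    p0, p1, p2 = line_b
    # Eliminate c1 between the two tangency equations to get c2 ...
    num = b1 * b1 * p0 * p0 - p1 * p1 * b0 * b0 \
        + 4 * c0 * b0 * p0 * (b2 * p0 - p2 * b0)
    den = 2 * b0 * p0 * (b1 * p0 - p1 * b0)
    c2 = num / den
    # ... then solve the first tangency equation for c1.
    c1 = (b1 * b1 - 2 * c2 * b0 * b1 + 4 * c0 * b0 * b2
          + c2 * c2 * b0 * b0) / (4 * c0 * b0 * b0)
    return c0, c1, c2

# Known case: tangents to the parabola y + x^2 = 0 at x = 1 and x = -2.
parabola = conic_tangent_to_lines(-1.0, (-1.0, 2.0, 1.0), (-4.0, -4.0, 1.0))

# General case: two arbitrary lines.
line_a, line_b = (1.0, 0.5, 2.0), (-2.0, 1.0, 0.3)
conic = conic_tangent_to_lines(-0.8, line_a, line_b)
check_a = line_conic_form(conic, line_a)
check_b = line_conic_form(conic, line_b)
```

The solution requires $b_0 \neq 0$ for both lines, i.e. neither reference line may pass through the curve point at the origin.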


Figure 13: The invariant signature of the coat rack image.


Figure  14:  The  invariant  signatures  of  the  images  of  the 
hanger  and  the  coat  rack  presented  on  top  of  each  other. 


Figure  15:  The  affine  invariant  signatures  of  the  hanger 
and  the  coat  rack. 


Experiments 

The  images  of  the  hanger  and  the  coat  rack  were  used 
to  derive  local  signatures  using  two  feature  lines.  The 
signature obtained from the coat rack image is presented
in  Figure  13.  A  comparison  of  the  two  signatures  for  the 
hanger  and  the  coat  rack  is  presented  in  Figure  14. 

The  curve  and  two  feature  lines  method  was  used  to 
achieve  affine  invariants  for  the  same  objects.  The  re¬ 
sults  of  the  invariant  computation  are  presented  in  Fig¬ 
ure  15.  A  comparison  of  the  invariants  of  the  two  objects 
is  shown  in  Figure  16. 

4.6  A  curve,  a  point  and  a  line 

We require that the conic osculate the fitted curve up
to second order contact. In addition we require that the
reference line be polar to the reference point w.r.t. the
conic. This provides enough conditions to determine the
conic. Achieving this will bring us again to the situation
of a conic plus a point, to be canonised as before, again
with two fewer derivatives.

As before, the osculation condition leads to $c_0 = -d_2$.
Setting the homogeneous coordinate of the reference point to 1,
the first of the polar equations (37) leads to $y_1 = b_0$, and
the line coefficients have to be normalised so that this equation
is satisfied. This leads to the substitution

$$b_1 \leftarrow b_1 y_1 / b_0, \qquad b_2 \leftarrow b_2 y_1 / b_0$$

Figure 16: The affine invariant signatures of the hanger
and coat rack images presented on top of each other.

The remaining two polar equations are then

$$2c_0 x_1 + c_2 y_1 = b_1, \qquad 2c_1 y_1 + c_2 x_1 + 1 = b_2$$

which are satisfied by the conic coefficients

$$c_1 = \left((b_2 - 1)\,y_1 + 2c_0 x_1^2 - b_1 x_1\right) / (2y_1^2) \qquad (51)$$

$$c_2 = -(2c_0 x_1 - b_1) / y_1 \qquad (52)$$

The affine and projective invariants are calculated as in
the previous two cases.
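
The pole and polar computation can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the reference point is (x1, y1), the reference line is a tuple (b1, b2, b0), and the function name is hypothetical.

```python
def conic_from_pole_polar(c0, point, line):
    """Return (c1, c2) so that `line` is the polar of `point` w.r.t. the conic.

    The line is first rescaled so that its third coefficient equals y1,
    which is the normalisation used before equations (51) and (52).
    """
    x1, y1 = point
    b1, b2, b0 = line
    scale = y1 / b0
    b1, b2 = b1 * scale, b2 * scale
    c2 = -(2 * c0 * x1 - b1) / y1                      # equation (52)
    c1 = ((b2 - 1) * y1 + 2 * c0 * x1 ** 2 - b1 * x1) / (2 * y1 ** 2)  # equation (51)
    return c1, c2
```

Substituting the returned values back into the two remaining polar equations is a quick consistency check on the result.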

5  Conclusions 

We  have  presented  a  method  for  finding  local  projective 
and  affine  invariants,  and  have  applied  it  to  real  images. 
The  method  consists  of  defining  a  canonical  coordinate 
system  using  intrinsic  properties  of  the  shape,  indepen¬ 
dently  of  the  given  coordinate  system.  Since  this  canoni¬ 
cal  system  is  independent  of  the  original  one,  it  is  invari¬ 
ant  and  all  quantities  defined  in  it  are  invariant.  The 
method  is  general  and  can  be  used  locally  or  globally, 
implicitly  or  explicitly.  We  have  applied  the  method  to 
find  local  invariants  of  a  general  curve  without  any  corre¬ 
spondence,  and  of  curves  with  known  correspondences  of 
one  or  two  feature  points  or  lines.  Experimental  results 
for  both  cases  are  presented. 

Our  method  combines  the  advantages  of  the  algebraic 
and  differential  methods.  The  differential  method,  being 
local,  is  much  less  sensitive  to  occlusion.  The  algebraic 
(implicit)  method  has  an  advantage  in  robustness  be¬ 
cause  it  does  not  need  to  eliminate  an  unknown  curve  pa¬ 
rameter.  Our  experiments  with  real  images  have  shown 
that,  by  using  our  local  implicit  method,  we  can  find 
an  invariant  signature  which  is  both  insensitive  to  occlu¬ 
sion  and  relatively  reliable.  We  have  also  demonstrated 
that  these  signatures,  while  unchanged  under  changes  in 
viewpoint,  do  differ  for  images  of  even  slightly  different 
objects.  Thus  they  have  enough  descriptive  power  to  dis¬ 
tinguish  between  many  different  kinds  of  objects.  There¬ 
fore  they  can  be  used  in  an  automated  object  recognition 
system that can distinguish and identify objects regardless
of the point of view from which they are observed.

Abstract

This paper describes an object-oriented machine vision
software system for industrial part inspection and
manufacturing process parameter understanding and
monitoring being developed at the GE Corporate Research
& Development Center. Key concepts in this system have
been adapted from defense-related image understanding
research, including geometric constraint programming and
object-oriented design in machine vision. We describe the
implementation of a working system for part inspection
from X-ray images. We outline some new results in the
area of dimension (parameter) estimation in the presence
of measurement uncertainty using Bayesian techniques to
compute part dimension distributions from image sequences.
We indicate some new directions for our work which address
central issues in application-driven computer vision.

Keywords: object-oriented methodology, constraint
processing, deformable templates, automated 2D image
analysis, dual-use technology.

1  Introduction 

This paper describes an on-going effort at GE Corporate
Research & Development to transfer recent advances in
model-based vision and object recognition to industrial
machine vision applications. Industrial part design and
manufacture offers a wealth of opportunities to address
many of the central issues in computer vision research,
including defect classification problems (feature
enhancement and signal processing), part tolerancing and
measurement error analysis (modelling using constraints
with uncertainty), object-oriented design for machine
vision software (IP standards), multi-modality image
analysis and fusion, and performance assessment of image
analysis algorithms (inspection/recognition system
accuracy). We have focussed on two of these themes,
geometric constraint processing and object-oriented
machine vision software, and are applying them to build
an X-ray inspection software system for industrial parts.
We have also begun to look at part tolerancing in the
presence of measurement error as a step toward
understanding manufacture process variability from images.
Both of these topics are discussed in the paper.

1.1  Inspection  and  Continuous  Product 
Improvement 

The traditional model for industrial part quality
assessment involves two steps: in-process and post-process
or final inspection. Most machine vision inspection systems
built in the past 10 years have been built along these
lines. One problem with validating such systems is that
flaw occurrence ratios can be very low, and hence the
practical advantages of using an automated system
(repeatability, accuracy, speed) are not readily realized
until after a long validation period.

The notion that part geometry can be used to help guide
industrial image interpretation has been around for some
time. Early "geometry-driven" approaches attempted to use
an actual copy of the part itself as a reference or
"golden template" [Decker, 1983]. The immediate objection
is that the specific part may not represent the ideal
dimensions. Further, a major problem is how to interpret
differences between the reference part and the part to be
inspected. These differences can arise from irrelevant
variations in intensity caused by illumination, uniformity
or shadows. Even if the image acquisition process can be
controlled, there will be unavoidable part-to-part
variations which naturally arise from the manufacturing
process itself, but are irrelevant to the quality of the
part. Although the part reference approach has proven
highly successful in the case of VLSI photolithographic
mask inspection [Huang, 1983; Okamoto et al., 1984], it is
difficult to see how to extend this simple approach to the
inspection of more complex, three-dimensional, manufactured
parts without introducing some structure defining various
regions and boundaries of the part geometry.

An alternative approach that became popular in the 1980's
was the idea of using computer-aided-design (CAD) models
to derive the necessary information to automatically
program visual inspection [West et al., 1991; Chen and
Mulgaonkar, 1991]. The advantage of this approach is that
the geometry of the object to be inspected and its
tolerances can be specified by the CAD model and used to
derive optimum lighting and viewing configurations as well
as provide context for the application of image analysis
processes.

On the other hand, the CAD approach has not yet been
broadly successful because images result from complex
physical phenomena, such as specular reflection and mutual
illumination. A more significant problem limiting the use
of CAD models is that the actual manufactured parts may
differ significantly from the idealized model. During
product development a part design can change rapidly to
accommodate the realities of manufacturing processes and
the original CAD representation can quickly become
obsolete. Finally, for curved objects, the derivation of
tolerance offset surfaces is quite complex and, at the very
least, requires the solution of high degree polynomial
equations [Farouki].

During the 1980's, manufacturing industries attempted to
introduce processes for continuous product improvement or
total quality. The idea behind this was to achieve
incremental improvements in product design and manufacture
by monitoring and controlling critical manufacturing
process parameters throughout part production and to move
away from the one or two step inspection model to multiple
step (i.e. near continuous) inspection. Although relatively
simple, specialized vision-based monitoring devices have
been successfully demonstrated, these are not often
cost-effective solutions in the long run and rapidly become
out-dated as manufacturing procedures change.

It is clear that the key to long term acceptance of
industrial vision systems for part quality assessment is
the introduction of generic methodologies that can be
readily re-programmed to new but similar part geometries
and can also cope with both inherent imaging distortions
and part geometry variations. To address these issues we
have chosen constraint templates as a medium for industrial
part inspection and understanding/monitoring manufacturing
process parameters, together with IU standards as a way to
standardize concepts and software protocols across
applications and imaging modalities (currently x-ray,
ultrasound, optical).

We have currently reduced our ideas to practice in the
form of an X-ray image analysis system for part inspection.
This system, which is called the Image Interpretation
Foundations (I²F) system, is described in detail below.

2  Geometric  Constraint  Templates 

The focus in inspection and part monitoring is on the
geometry of a part. Our analysis is therefore largely based
on geometry-centered representations which we call
templates.

In our approach a template provides the context for

•  monitoring geometric measurements across batches,

•  shape-based material characterization,

•  nominal and variational part geometry extraction
(manufactured part tolerancing), and

•  intensity signal verification (flaw analysis).


Figure 1: Geometry hierarchy.


We represent critical part geometry by a constraint model
which is defined by an inspector during inspection template
construction. These template-defined relationships provide
a flexible structure which can adapt to manufacturing
process variations while at the same time insisting on the
primary constraints required for a good quality part. We
accommodate these part-to-part variations by solving each
time for an instance of the template which satisfies all of
the specified constraints, while at the same time
accommodating the observed image features which define the
actual part geometry. The statistics of incidental
variations can be accumulated over a large number of parts.
The resulting parameter distributions can then be used to
define normal ranges of part geometry. Since the template
is designed directly from inspection images, the template
features are consistent with both part geometry and
inherent imaging distortions.

Our current geometry hierarchy is shown in Figure 1 and
includes 2D/3D points, as well as representations for 2D/3D
curves and surfaces which are commonly used in
manufacturing and design. A geometric description is built
up from a set of primitive geometric classes such as point,
line and conic.

The primitives are embedded in a topological network which
defines various connection relations between the
primitives. For example, a set of line segments may be
joined at common endpoints to form a polygonal chain.
General geometric relations or constraints may be defined
between the primitives such as parallel, equidistant, and
tangent continuity. The latter condition ensures that two
curve segments have equal tangents at a common intersection
point. It is possible to build up quite complex and
flexible models from these primitives and relations.

Except for scalar measures such as length and angle, each
geometric entity is represented by a configuration, which
is an affine transformation matrix representing the
translation, rotation, and scaling from the local
coordinate frame $(X, Y)$ of the shape to the image frame
$(x, y)$. Symbolically, in two dimensions, a configuration
has 3 parameters, location, orientation and size, each
represented by a 2D vector of variables.

The location of a shape is described by
$\mathbf{l} = (l_x, l_y)^T$. This location is usually the
center or the origin of the local frame of the primitive
shape.

The orientation of a shape in the plane is described by an
angle $\theta$, or by a unit vector
$\mathbf{o} = (o_x, o_y)^T = (\cos\theta, \sin\theta)^T$.
The latter is used to avoid trigonometric functions and use
only polynomial functions of integer powers.

The size of a shape like an ellipse is represented by a
vector having 2 scale factors, $(k_x, k_y)$, along the
major and minor axes $(a, b)^T$. To avoid division of
polynomials, the inverse of the size is represented:
$\mathbf{k} = (k_x, k_y)^T = (a^{-1}, b^{-1})^T$.

The description of the primitives and the geometric
relationship between primitives is then represented in a
uniform manner as a system of polynomials in the
configuration and constraint parameters. We have found that
this generic representation is adequate for many
applications such as template editing and template solving,
and allows new algorithms to be readily prototyped with
minimal effort. In particular, the use of a uniform
polynomial system for the constraints permits the
development of an efficient constraint solver which is
described in section 4.3.
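
To make the polynomial representation concrete, here is a minimal Python sketch of a 2D configuration and three polynomial residuals. The class and function names are hypothetical and do not come from the I²F libraries; the sketch only assumes the location, unit orientation vector and inverse size parameters described above.

```python
from dataclasses import dataclass


@dataclass
class Configuration:
    lx: float  # location l = (lx, ly)
    ly: float
    ox: float  # orientation unit vector o = (cos(theta), sin(theta))
    oy: float
    kx: float  # inverse sizes k = (1/a, 1/b)
    ky: float


def unit_orientation_residual(c):
    """Polynomial constraint keeping o a unit vector: ox^2 + oy^2 - 1 = 0."""
    return c.ox ** 2 + c.oy ** 2 - 1.0


def ellipse_residual(c, x, y):
    """Zero when image point (x, y) lies on the ellipse described by c.

    Only sums and products of the parameters appear, so the expression is a
    polynomial in the configuration variables (no division, no trigonometry).
    """
    dx, dy = x - c.lx, y - c.ly
    local_x = c.ox * dx + c.oy * dy    # rotate image offset into the local frame
    local_y = -c.oy * dx + c.ox * dy
    return (c.kx * local_x) ** 2 + (c.ky * local_y) ** 2 - 1.0


def parallel_residual(c1, c2):
    """Zero when two configurations have parallel orientation vectors."""
    return c1.ox * c2.oy - c1.oy * c2.ox
```

Storing the inverse size and the unit vector, rather than the size and the angle, is what keeps every residual free of divisions and trigonometric calls.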

2.1  Template  Variants 

Templates are defined in terms of a set of geometric
entities and a set of spatial, functional, or descriptive
objects. Spatial relationships between geometric objects
are defined by a constraint (see section 2). Functional
objects define the context within which geometry is used in
an analysis algorithm. For example, a checker is used for
local intensity signal analysis where a geometric primitive
defines the image contour/region of interest extracted from
the image. Curve-based checkers include a contour-checker
(this extracts an intensity signal along a contour), and a
hole-checker (this enhances and extracts the signal profile
along low contrast linear features). These are generic
image processing representations that are also useful for
building flaw-detection algorithms (for an example see
section 4.4). A monitor can be attached to a configuration
and is used to record (i.e. monitor) a configuration
parameter or dimension over image sequences. Descriptive
labels can also be attached to geometric primitives, for
example a hole to a line primitive, a row to a group of
circles or a cavity to a polygon/face. This provides a
higher-level interface to analysis which is more in line
with the terminology used in inspection.
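
The role of a monitor can be illustrated with a small Python sketch that accumulates running statistics of one configuration parameter over an image sequence. The class name, its methods and the three-sigma range are assumptions made for illustration, not part of the system described here.

```python
import math


class Monitor:
    """Records one configuration parameter over an image sequence (hypothetical)."""

    def __init__(self, name):
        self.name = name
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations (Welford's method)

    def record(self, value):
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (value - self.mean)

    def std(self):
        """Sample standard deviation of the recorded values."""
        if self.count < 2:
            return 0.0
        return math.sqrt(self._m2 / (self.count - 1))

    def normal_range(self, n_sigma=3.0):
        """Interval mean +/- n_sigma * std, one possible 'normal range'."""
        spread = n_sigma * self.std()
        return self.mean - spread, self.mean + spread
```

For example, recording a hole diameter for each image of a batch of good parts would give a mean and spread against which later parts could be compared.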

3  Object-Oriented  Design  in  Machine 
Vision 

The main advantages of object-oriented design are
flexibility and code sharing. The definition of generic
object classes provides standard interfaces so that new
code can be quickly developed and integrated since the
important data structures and variables are already in
place. Two major vision systems have already been
implemented along object-oriented design principles, the
Cartographic Modeling Environment [Hanson and Quam, 1988]
and Power Vision [McConnell and Lawton, 1988]. The Image
Understanding Environments (IUE) Project funded by DARPA
[Mundy et al., 1992] has also made a major contribution to
this area.

The I²F System is an image understanding software system
designed using object-oriented methodology and implemented
using the C++ language and the X-toolkit, InterViews
[Linton et al., 1988]. Key features of our system include:

•  The use of the subject-view paradigm for providing the
relationships between an object, i.e. the subject, and the
graphics display or the view of the object. This approach
permits the development of object libraries which are not
dependent on specific display mechanisms.

•  Adherence to the concepts and formats defined by the
PIK (Programmers Imaging Kernel) standard [ANSI, 1990]. It
is expected that many image accelerator manufacturers will
support the PIK standard so that the code developed in I²F
will transparently run with increased throughput on a PIK
standard accelerator.

•  An extensive set of image feature classes which are
closely integrated with geometric primitives to facilitate
the geometric representation of image events.

•  New object concepts to support geometric constraint
programming. A hierarchy of geometric primitives and
parameterized transformations of the primitives is provided
to allow the description of curved shapes and to account
for global geometric relations between primitives.
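
The subject-view separation in the first item above can be sketched as follows. The system itself is written in C++ with InterViews; this Python fragment uses hypothetical class names and only illustrates the idea that a geometric object knows nothing about how it is displayed.

```python
class Subject:
    """Base class for objects that can be observed by any number of views."""

    def __init__(self):
        self._views = []

    def attach(self, view):
        self._views.append(view)

    def notify(self):
        for view in self._views:
            view.update(self)


class LineSegment(Subject):
    """A geometric subject with no dependence on any display mechanism."""

    def __init__(self, start, end):
        super().__init__()
        self.start, self.end = start, end

    def move_end(self, end):
        self.end = end
        self.notify()  # every attached view refreshes itself


class TextView:
    """One possible view; a graphical view would implement the same update()."""

    def __init__(self):
        self.lines = []

    def update(self, subject):
        self.lines.append(f"segment {subject.start} -> {subject.end}")
```

Because the subject only calls `update`, a library of geometric classes written this way could be reused with a different display toolkit by supplying new view classes.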

Over a 2 year period the I²F system has matured into a
collection of libraries containing approximately 300
classes dedicated to image analysis algorithms, display
utilities and interactive tools (Figure 2). The software is
divided into 2 major library groups: the IU Standards,
which contain functionality like segmentation, image
filtering, geometry and topology that is common to all
(i.e. not just industrial image interpretation) IU
applications, and the I²F Standards, which are built on top
of the IU Standards and include template-specific
representations and functionality. As illustrated in
Figure 3, individual applications are built on top of these
two library groups.

4  The I²F Inspection System

In this section we discuss the current implementation of
an inspection system which focuses on X-ray image analysis
of parts. A flow chart of the system is shown in Figure 4.

4.1  Template  Creation 

A typical example is shown in Figure 5. The general
template creation process involves first specifying a set
of geometric primitives and then establishing the
relationships between them. In our system this is achieved
using a graphical template editing tool implemented using
InterViews [Linton et al., 1988]. The tool allows the user
to build a template composed of a selection of the 4 types
of geometric primitive specified in section 1.1 which are
related by any of the 12 possible constraint types. A
"snap-shot" view in the process of creating a template is
illustrated in Figure 5a. The set of configurations in the
template contains a number of lines and a number of points.
These geometric entities (or rather their counterparts in
the image) are subjected to a number of constraints. The
template after deformation is shown in Figure 5b.

Figure 4: Flow diagram of the I²F system.

4.2  Feature  Extraction  and  Correspon¬ 
dence 

In the current implementation feature extraction is
achieved by either a combination of morphological signal
enhancement and segmentation [Noble, 1992] or the Canny
edge detector [Canny, 1983]. We use eigenvectors of the
feature point scatter matrix to estimate the parameters of
experimental geometric primitives from the segmentation
output. A feature correspondence step then performs local
adjustment to register the image features with the
template.
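
As an illustration of the scatter matrix step, the sketch below fits a line to a set of feature points by taking the principal eigenvector of their 2x2 scatter matrix. It is a minimal example under the assumption that the primitive is a straight line; the function name is hypothetical and the closed-form angle is used in place of a general eigen-solver.

```python
import math


def fit_line(points):
    """Return (centroid, unit direction) of the line best fitting 2D points.

    The direction is the eigenvector of the scatter matrix with the largest
    eigenvalue; for a symmetric 2x2 matrix its angle has a closed form.
    """
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    syy = sum((y - mean_y) ** 2 for _, y in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    theta = 0.5 * math.atan2(2.0 * sxy, sxx - syy)
    return (mean_x, mean_y), (math.cos(theta), math.sin(theta))
```

The centroid and direction returned here correspond to the location and orientation parameters of a line configuration.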

Our philosophy is to keep the local feature correspondence
simple and to rely on a global template registration step
to place the template in close approximation to the image
features. This is done by registering on one or more
geometric features on a part that are invariant to
part-to-part variations. These features are determined in
an experiment run on a set of good parts using a monitor
template to determine registration features which do not
shift between images. The geometric transformation is
computed for each new image and applied to the inspection
template prior to applying the inspection algorithm.

4.3  Solving  Constraints 

The objective of the constraint solver is to find an
instance of the inspection template which satisfies all of
the geometric constraints defined by the template and, at
the same time, minimizes the mean-square error between the
template primitives and the image features. The mean-square
error can be expressed as a convex function of the template
parameters and a geometric description of the image
features.

Figure 5: Template construction and solving.

The two goals of finding the global minimum of a convex
function, $\nabla f(\mathbf{x}) = 0$, and satisfying the
constraints, $\mathbf{h}(\mathbf{x}) = 0$, are combined to
give a constrained minimization problem. A linear
approximation to this optimization problem is:

$$\begin{cases} \nabla^2 f(\mathbf{x})\, d\mathbf{x} = -\nabla f(\mathbf{x}) \\ \nabla \mathbf{h}(\mathbf{x})\, d\mathbf{x} = -\mathbf{h}(\mathbf{x}) \end{cases} \qquad (1)$$

Since the two goals cannot in general be simultaneously
satisfied, a least-square-error satisfaction of
$\nabla f(\mathbf{x}) = 0$ is sought. The constraint
equations are multiplied by a factor $\sqrt{c}$, which
determines the weight given to satisfying the constraints
versus minimizing the cost function. Each iteration of (1)
has a line search that minimizes the least-square-error:

$$m(\mathbf{x}) = |\nabla f(\mathbf{x})|^2 + c\,|\mathbf{h}(\mathbf{x})|^2 \qquad (2)$$

which is a merit function similar to the objective of the
standard penalty method.

Starting with $\sqrt{c} = 0$, system (1) converges to the
unconstrained global minimum first, and so avoids local
minima and singularities on the constraint surface. This
convergence is efficient, because the fitting error
$f(\mathbf{x})$ is well approximated by a quadratic, and so
the Hessian $\nabla^2 f$ is almost constant. When the
Hessian $\nabla^2 f$ is constant, the best-fit surface is a
linear subspace with zero curvature and no singularity, a
lot simpler than the constraint surface. Note that this
approach is very similar to the construction of a Tikhonov
stabilizing functional [Bertero et al., 1988]. The convex
fitting function $f(\mathbf{x})$ can be viewed as a
regularizer in the solution of $d\mathbf{x}$. It makes the
constraint problem well-posed by using empirical data
whenever additional constraints are needed to pin down free
variables in $\mathbf{h}(\mathbf{x}) = 0$. Levenberg and
Marquardt have shown that varying $\sqrt{c}$ by factors of
10 is an effective method to force convergence for
nonlinear systems [Marquardt, 1963; Press et al., 1988].
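
The iteration can be sketched on a toy problem. The Python fragment below is not the I²F solver; it assumes a hypothetical two-parameter template with fitting error f(x) = |x - p|^2 for an observed point p and a single constraint h(x) = |x|^2 - 1 = 0, and it solves the weighted system (1) in the least-squares sense with a backtracking line search on the merit function (2).

```python
import math


def solve_template(p, sqrt_c_schedule=(0.0, 1.0, 10.0, 100.0, 1000.0), iters=20):
    """Minimise f(x) = |x - p|^2 subject to h(x) = |x|^2 - 1 = 0 (toy example)."""

    def grad_f(x):
        return [2.0 * (x[0] - p[0]), 2.0 * (x[1] - p[1])]

    def h(x):
        return x[0] ** 2 + x[1] ** 2 - 1.0

    def merit(x, c):  # equation (2)
        g = grad_f(x)
        return g[0] ** 2 + g[1] ** 2 + c * h(x) ** 2

    x = [0.0, 0.0]
    for sqrt_c in sqrt_c_schedule:  # the constraint weight grows by factors of 10
        c = sqrt_c ** 2
        for _ in range(iters):
            gf, hx = grad_f(x), h(x)
            gh = [2.0 * x[0], 2.0 * x[1]]
            # Normal equations of the stacked system (1); the Hessian of f is 2I.
            a11 = 4.0 + c * gh[0] ** 2
            a12 = c * gh[0] * gh[1]
            a22 = 4.0 + c * gh[1] ** 2
            r1 = -(2.0 * gf[0] + c * hx * gh[0])
            r2 = -(2.0 * gf[1] + c * hx * gh[1])
            det = a11 * a22 - a12 * a12
            dx = [(a22 * r1 - a12 * r2) / det, (a11 * r2 - a12 * r1) / det]
            # Backtracking line search: halve the step until the merit does not rise.
            t = 1.0
            while t > 1e-8 and merit([x[0] + t * dx[0], x[1] + t * dx[1]], c) > merit(x, c):
                t *= 0.5
            x = [x[0] + t * dx[0], x[1] + t * dx[1]]
    return x
```

With the weight at zero the first stage lands on the unconstrained best fit p, and later stages pull the solution onto the constraint surface, here the point of the unit circle closest to p.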

4.4  Part  Verification  and  Flaw  Decision- 
Making 

The output from the constraint solver is a set of deformed
primitives which can be used to further refine the
parameter values and tolerances of the inspection template,
or for flaw decision-making. For example, the parameters
derived from the deformed primitives can be compared to the
template parameters to detect geometric flaws such as
inaccurate drilled hole diameters [Noble et al., 1992].

The adapted inspection template primitives can also provide
the context for applying specialized algorithms to
characterize shape and intensity-based properties of subtle
flaws. In Figure 6 we illustrate one example in which a
checker is associated with each of 3 drilled holes. In
physical terms, flaws can arise if the drilling process
goes wrong and removes surplus material. In an x-ray image
an absence of material appears as an unexplained low
intensity region. Automated flaw inspection involves
extracting 1D flaw signals in the direction along the hole
and classifying the signal by comparing the flaw profile
against the expected profile in intervals along the signal
length. We are currently implementing a database of generic
flaw algorithms of this kind that can be selected through
an inspection algorithm editor by a user to build
specialized inspection algorithms for parts with different
geometries. In practice, a typical part may have to be
checked for between 1 and 20 different flaw types.
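
A simple version of the interval comparison might look like the sketch below. The interval length, the mean absolute deviation measure and the tolerance value are all assumptions chosen for illustration; the paper does not specify how the profiles are compared.

```python
def classify_flaw_signal(signal, expected, interval=4, tolerance=10.0):
    """Flag intervals where a 1D signal departs from the expected profile.

    Returns a list of (start, end, deviation) tuples, one per interval whose
    mean absolute deviation from the expected profile exceeds `tolerance`.
    """
    if len(signal) != len(expected):
        raise ValueError("signal and expected profile must have equal length")
    flagged = []
    for start in range(0, len(signal), interval):
        segment = signal[start:start + interval]
        reference = expected[start:start + interval]
        deviation = sum(abs(s - r) for s, r in zip(segment, reference)) / len(segment)
        if deviation > tolerance:
            flagged.append((start, start + len(segment), deviation))
    return flagged
```

An empty result would mean the extracted profile is consistent with the expected one everywhere along the hole.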

5  Toward  Continuous  Product  Im¬ 
provement 

By computer vision applied to continuous product
improvement we mean using image-based analysis techniques
to understand manufacturing process variability (how a
design part differs from a manufactured part) and to
implement vision systems for monitoring part manufacture
on-line. To date, our research in this area has focussed on
developing techniques to reliably extract manufactured part
dimensions and models from images in the presence of
measurement error. The issue here is that what you measure
in an image is a nominal dimension + manufacturing
tolerance + imaging noise. However, to verify that a part
meets design specs, you need to know that a part dimension
lies within manufacturing tolerance bounds. Put another
way, you cannot take the estimates of the parameter
distributions computed from the image as the parameter
distributions of the part because imaging the part has
introduced some measurement error. (Note that since you are
typically dealing with a 2D projection of a 3D object, a
correction may also have to be made for pose variation
between images. However, we do not consider this problem
here.) So how do you distinguish between variability due to
manufacturing and variability due to imaging noise? Our
approach to solving this problem is based on the analysis
of variance components using Bayesian techniques and is
briefly described below.

5.1  Bayesian Part Tolerancing with Measurement Error

Part tolerancing is generally perceived as the problem of
determining the variations in dimensions for an object
where the errors originate from the manufacturing process.
On the other hand, measurement error analysis typically
refers to the problem of quantifying the variations in
measurements due to the sensing/imaging process. Although
the solid modelling community is beginning to make progress
in the first area [Parkinson, 1984; Light and Gossard,
1982; Turner, 1988; Requicha, 1983; Requicha and Chan,
1986; Fleming, 1988], and there has been considerable work
in the latter area (though surprisingly not so in computer
vision), little if any attention has been given to tackling
the problem of part tolerancing with measurement.

Suppose we can take a number of measurements of each part,
Figure 7. Manufacturing error (part tolerance) is added
when the part is made. After imaging the part, further
measurement error is introduced. If we assume an additive
noise model then the jth measurement on the ith part,
$x_{ij}$, can be modeled by:

$$x_{ij} = x + \epsilon_i + \eta_{ij} \qquad (3)$$

Figure 7: Error model.

Here, $x$ is the true (unknown) dimension, $\epsilon_i$ is
the perturbation due to manufacturing error (tolerance) and
$\eta_{ij}$ is the perturbation due to measurement error.
This is called a random effect model [Box and Tiao, 1973]
and provides a statistical model for part tolerancing with
measurement error. Note also that we could go one step
further by adding another 'process' layer to the hierarchy
in Figure 7 and represent batch-to-batch variations by a
third variance component.

However, consider the case of two variance components,
where our goal is to separate out the part tolerance
(process variation) from the measurement error.
It turns out that the solution to this problem for the
case of a single dimension can be found by Bayesian
variance component analysis of random effect models
using the Gibbs sampling method [Hastings, 1970,
Geman and Geman, 1981, Gelfand et al., 1990, Tanner,
1991]. The Bayesian philosophy allows you to
input prior knowledge (the "design" dimension), and
the data samples modify this to reflect manufacturing
process variability. Further, an attractive practical
advantage of this approach is that it can be realized
very simply with an experimental setup whereby separate
images are taken of different samples of a part
(figure 8). Finally, an advantage in using Gibbs sampling
is that it provides a mechanism for estimating,
numerically, a parameter distribution rather than just
a "best" estimate for the nominal value of a parameter.
This is desirable from a manufacturer's perspective,
as it has been shown that tolerance analysis using
parameter distributions (statistical tolerancing) leads
to larger allowable tolerance limits on design variables
than can be achieved using upper/lower limit analysis
(worst case tolerancing) [Michael and Siddall, 1981].
Hence, parts can be made using less precise (and hence
cheaper) manufacturing processes provided that the
probability distributions of critical part parameters
can be monitored.
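To make the two-component model concrete, the following is a minimal sketch that simulates repeated measurements of several parts and separates the part-to-part variance from the measurement variance. It deliberately uses the classical method-of-moments (one-way ANOVA) estimator as a simpler stand-in; it is not the Bayesian Gibbs sampling analysis described above. The function names, the Gaussian noise assumption, and the balanced design (same number of measurements per part) are all assumptions made for illustration.

```python
import random


def simulate_measurements(x_true, sigma_part, sigma_meas, n_parts, n_repeats, seed=0):
    """Draw x_ij = x + (part perturbation)_i + (measurement perturbation)_ij."""
    rng = random.Random(seed)
    data = []
    for _ in range(n_parts):
        part_offset = rng.gauss(0.0, sigma_part)  # one draw per part
        row = [x_true + part_offset + rng.gauss(0.0, sigma_meas)  # one draw per measurement
               for _ in range(n_repeats)]
        data.append(row)
    return data


def variance_components(data):
    """Method-of-moments (one-way ANOVA) estimates for a balanced design.

    Returns (grand mean, part-to-part variance, measurement variance).
    """
    n_parts, n_repeats = len(data), len(data[0])
    part_means = [sum(row) / n_repeats for row in data]
    grand_mean = sum(part_means) / n_parts
    ss_within = sum((v - m) ** 2 for row, m in zip(data, part_means) for v in row)
    ss_between = n_repeats * sum((m - grand_mean) ** 2 for m in part_means)
    ms_within = ss_within / (n_parts * (n_repeats - 1))
    ms_between = ss_between / (n_parts - 1)
    var_meas = ms_within
    # E[ms_between] = var_meas + n_repeats * var_part, so solve for var_part.
    var_part = max((ms_between - ms_within) / n_repeats, 0.0)
    return grand_mean, var_part, var_meas
```

Unlike the Gibbs sampler, this estimator yields only point estimates and cannot incorporate a prior "design" dimension, which is why it serves here only as an illustration of what separating the two components means.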

We have performed preliminary experiments using
a Gibbs sampler for single dimension part tolerancing
with measurement error and have recently started
looking at variance component analysis for constraint
templates. We can analyze the case of a linear constraint
(e.g., a linear size gradient) using the Gibbs sampling
solution to the Bayesian linear model [Lindley
and Smith, 1972]. The extension to nonlinear constraints
is not so straightforward and is the subject of
our current research. Our ultimate goal is to be able
to derive a tolerance template from sequences of images
of good parts where tolerances capture process
variability and to use this to guide part verification
and flaw decision-making.

Figure 6: Template-based flaw signal decision-making.

Figure 8: Variance component analysis for a single
parameter.

6  Summary 

In this paper we have presented an overview of
the I"F System, an object-oriented image analysis being
developed for machine vision industrial inspection
and process monitoring. We have described aspects
of the system design and a software inspection system
for detecting flaws in parts. We have also discussed
some preliminary work in the area of part tolerancing.
There are many topics we plan to study
further, including: multi-scale feature correspondence;
multivariable extensions of variance component analysis
for constraint templates of parts; algorithm performance
assessment; application to other modalities
(ultrasound and infra-red images); and part integrity
verification for 3D (volumetric) images.
Abstract

We  describe  three  components  of  our  Vision 
Algorithm  Compiler  for  recognizing  rigid  ob¬ 
jects  in  range  images:  realistic  sensor  model¬ 
ing,  a  novel  hypothesis-generation  algorithm 
and  a  robust  localization  method. 

We  use  sensor  modeling  to  build  prior  models 
of  object  appearances  that  account  for  con¬ 
straints  due  to  sensor  and  feature-extraction 
algorithm  characteristics  in  addition  to  model 
geometry.  We  approach  hypothesis  genera¬ 
tion  as  a  search  for  the  most  likely  set  of  hy¬ 
potheses  based  on  our  prior  knowledge.  The 
Markov  random  field  (MRF)  formalism  and 
Highest  Confidence  First  estimation  provide 
us  with  an  efficient  and  effective  technique  for 
performing  this  search.  Our  algorithm  has 
shown  the  ability  to  recognize  objects  while 
limiting  the  number  of  hypotheses  requiring 
verification. 

Our  pose  refinement  algorithm  uses  a  robust 
estimator  to  deal  with  the  problem  of  abun¬ 
dant  outliers  from  our  models  in  images  due 
to  occlusion  and  noise.  The  algorithm  has 
been found to be much less sensitive to outliers
than the least-squares solution.


1  Introduction 

Recognizing  known  objects  in  images  is  a  fundamen¬ 
tal  problem  in  computer  vision.  Much  work  has 
been  done  on  object  recognition  in  intensity  images 
as  well  as  range  images.  Several  researchers  have 
developed  model-based  vision  systems  for  object 
recognition in range images [Bhanu, 1984, Bolles et
al., 1987, Grimson and Lozano-Perez, 1987, Ikeuchi
and Kanade, 1988, Fan, 1990, Kim and Kak, 1991,
Stein and Medioni, 1992].

Most  model-based  vision  systems  do  not  utilize  re¬ 
alistic  prior  models  of  the  appearance  of  the  ob¬ 
jects.  This  affects  both  efficiency  and  robustness 
of  the  recognition  algorithm.  Without  an  accurate 
model  of  sensor  and  segmentation  characteristics, 
the  hypothesis-selection  procedure  must  compensate 
for  the  inaccuracies  by  loosening  the  constraints  and, 
thus,  increasing  the  number  of  incorrect  hypothe¬ 
ses  that  are  generated.  Our  solution  is  to  use  sen¬ 
sor  modeling  to  build  accurate  prior  models  of  con¬ 
straints  due  to  sensor  and  segmentation  characteris¬ 
tics  in  addition  to  model  geometry. 

A  requirement  that  image  features  be  separated  into 
groups  belonging  to  single  objects  is  a  weak  point 
common to many recognition systems. This grouping
operation is not in general possible with a purely
data-driven segmentation algorithm. Our algorithm
does  not  require  that  image  features  are  segmented 
into  groups  belonging  to  single  objects. 

When  unknown  objects  are  present  in  the  image,  the 
performance  of  most  systems  degrades  drastically. 
A  recognition  system  should  not  expend  precious 
resources  on  unlikely  possibilities.  Our  hypothesis- 
generation  algorithm  is  intended  to  reduce  the  num¬ 
ber  of  hypotheses  requiring  verification  by  accurate 
selection  of  hypotheses  and  optimal  ordering  of  hy¬ 
potheses  for  verification.  We  formulate  the  task  of 
“optimal”  hypothesis  selection  as  a  search  for  the 
most  likely  set  of  matches  based  on  our  prior  knowl¬ 
edge.  This  is  accomplished  by  integrating  observed 
image  features  and  our  prior  knowledge  (constraints) 
in  the  formalism  of  Markov  random  fields  (MRF). 
With  the  MRF  formulation,  our  search  for  the  most 
likely  hypotheses  is  phrased  as  a  maximum  a  poste¬ 
riori  (MAP)  estimate  of  the  MRF.  The  number  of 
verifications  is  further  reduced  by  ordering  the  hy¬ 
potheses  in  terms  of  their  likelihood  based  on  our 
prior  knowledge. 

The effects of partial occlusion are not explicitly
modeled by most recognition systems. The assumption
that enough unoccluded features will be visible
to perform recognition/localization is not always
true. Our localization algorithm explicitly models
the effects of partial occlusion by using an error
distribution that is relatively insensitive (compared to
least-squares approaches) to outliers due to occluded
features or noise.

In  this  paper,  we  present  three  components  of 
our  recognition  system  which  take  steps  to  re¬ 
duce  or  eliminate  the  above  problems.  We  present 
the  sensor-modeling  approach  to  generating  real¬ 
istic  constraints  for  object  recognition,  an  effi¬ 
cient  hypothesis-generation  algorithm,  and  a  robust 
method  for  localizing  hypothesized  models  in  range 
images.  Sensor  modeling,  hypothesis  generation  uti¬ 
lizing  MRFs,  and  localization  are  key  elements  of  our 
Vision  Algorithm  Compiler  (VAC)  for  object  recog¬ 
nition  [Wheeler  and  Ikeuchi,  1992].  Our  current 
system  is  designed  to  recognize  polyhedral  models 
in range images. There are three distinct components
of our VAC system: user-defined modules, the
compiler,  and  the  executable  recognition  program. 
The  user-defined  modules  consist  of  object  models, 
sensor  models,  image  processing  and  segmentation 
modules,  and  the  feature  modules.  The  VAC  uses 
the  user-defined  modules  and  sensor  modeling  to 
generate  the  recognition  program  for  the  specified 
models  and  sensors.  The  compiled  prior  models  are 
integrated  with  the  hypothesis-generation  and  local¬ 
ization  algorithms  to  form  the  executable  recogni¬ 
tion  program. 

In Section 2, we describe the sensor-modeling process
for generating realistic prior models of the object’s
appearance. We present our MRF formulation for
hypothesis generation in Section 3. Section 4 describes
our robust localization algorithm. In Section 5,
we present some results from recognition and
localization experiments. Section 6 summarizes the
advantages of our methods.

2  Sensor  Modeling 

The  constraints  used  by  the  hypothesis-generation 
process  of  most  model-based  vision  systems  are 
based  solely  on  the  geometric  models  of  the  objects 
and  do  not  account  for  sensor  or  feature-extraction 
characteristics.  Without  an  accurate  model  of  sensor 
and  segmentation  characteristics,  the  hypothesis- 
selection  procedure  must  compensate  for  the  inac¬ 
curacies  by  loosening  the  constraints  and,  thus,  in¬ 
creasing  the  number  of  incorrect  hypotheses  that  are 
generated.  Our  solution  is  to  use  sensor  modeling 
to  build  accurate  prior  models  of  constraints  due 
to  sensor  and  segmentation  characteristics  in  addi¬ 
tion  to  model  geometry.  The  inclusion  of  imaging 
and  processing  effects  is  the  essential  difference  be¬ 
tween  our  prior  models  and  the  view-variation  dis¬ 
tributions  used  by  [Burns  and  Riseman,  1992]. 


Our current system’s sensor modality is range data,
and our low-level vision routines supply us with
segmented planar surfaces. In our application, a
hypothesis is a match between a planar region $R_i$
of the image and model face $M_j$, and each region
$R_i$ is described by a vector of feature values
$\mathbf{f}_{R_i} = (f^1_{R_i}, f^2_{R_i}, \ldots, f^n_{R_i})$.
The features for this system are specified over 3-D
surfaces corresponding to planar regions extracted
from images. Our first-order features include region
area, maximum second moment, minimum second
moment, and maximum axis length. Second-order
features include simultaneous visibility, relative
orientation, and maximum distance between surfaces.

The constraints used by our hypothesis generation
algorithm are in the form of probability distributions
of the appearance of model faces, represented
by conditional distributions $P(\mathbf{f}_{R_i} \mid M_j)$.
The prior distributions are approximated by generating
many sample images (320 images) of our object
models and segmenting the images using our low-level
vision routines. The sample range images are
generated using an appearance simulator developed
by [Fujiwara et al., 1991]. The simulated images are
segmented and the features of each image region are
calculated. We can then use these sample segmented
images to compute a prior distribution
$P(f^n_{R_i} \mid M_j)$ for each feature and each model
face $M_j$. Figure 1 shows an example iteration of this
process. In this work, the viewing directions are
“uniformly” distributed on the unit sphere; however,
it would be easy to modify the distribution to reflect
real world constraints and biases for particular
objects (e.g., the bottom of a stapler is rarely visible).

Since  this  is  a  simulation,  we  know  the  correspon¬ 
dence  between  the  model  surfaces  and  the  segmented 
regions.  Thus,  we  can  build  a  list  of  the  sampled 
feature  values  for  each  model  surface.  The  feature 
values  for  each  model  face  are  tabulated  and  used 
to  form  the  prior  distribution.  Figure  2  shows  a 
sample  distribution  of  a  model  face’s  area  value  as 
computed  using  our  sensor  model.  The  simulated 
distributions  are  not  normally  distributed,  and  they 
are  biased  due  to  inherent  characteristics  of  the  seg¬ 
mentation  algorithm.  Additional  bias  occurs  from 
self  occlusion  when  viewing  some  object  from  cer¬ 
tain  directions.  There  are  secondary  modes  corre¬ 
sponding  to  oversegmentations,  where  a  single  ob¬ 
ject  surface  is  segmented  into  multiple  regions.  The 
sensor-modeling  approach  builds  models  of  the  in¬ 
formation  that  the  recognition  algorithm  will  actu¬ 
ally  have  available  when  viewing  a  known  object. 
This  information  is  dependent  on  the  imaging  pro¬ 
cess  and  segmentation  algorithm,  in  addition  to  the 
known  geometric  characteristics  of  the  model.  An 
additional  benefit  of  this  approach  is  that  model  sur¬ 
faces  that  are,  because  of  geometric  properties  of  the 
object,  not  detectable  by  the  segmentation  program 
will  not  affect  the  hypothesis  generation. 
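The tabulation step described above can be sketched as a simple histogram estimate. This is only an illustration under stated assumptions: the fixed `bin_width`, the `floor` probability returned for unseen bins, and the function names are hypothetical and are not taken from the system described here.

```python
import math
from collections import Counter, defaultdict


def build_prior(samples, bin_width):
    """Tabulate feature values per model face into a normalized histogram.

    samples: iterable of (model_face, feature_value) pairs obtained from
    simulated, segmented images where the face/region correspondence is known.
    Returns {model_face: {bin_index: probability}}.
    """
    counts = defaultdict(Counter)
    for face, value in samples:
        counts[face][math.floor(value / bin_width)] += 1
    priors = {}
    for face, hist in counts.items():
        total = sum(hist.values())
        priors[face] = {b: c / total for b, c in hist.items()}
    return priors


def prior_probability(priors, face, value, bin_width, floor=1e-6):
    """Look up the tabulated probability of a feature value for a face.

    `floor` is a hypothetical small probability for bins never observed.
    """
    return priors.get(face, {}).get(math.floor(value / bin_width), floor)
```

Because the histogram is built from segmented simulated images rather than from the geometric model alone, a secondary mode (for example, a smaller area caused by oversegmentation) appears in the table with its observed frequency instead of being treated as an error.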
Figure 1: Generation of feature distributions. An object model and viewing direction are selected. The
simulator is used to produce a range image of the object, which is then segmented into regions that are
used to compute the feature distributions.


Figure  2:  An  example  distribution  of  a  given  fea¬ 
ture  value  (area)  over  a  model  face.  The  distribu¬ 
tion  was  generated  by  sampling  resulting  area  values 
from  synthetic  images  of  views  of  the  object.  A  nor¬ 
mal  distribution  centered  around  the  actual  model 
area  value  is  shown  to  demonstrate  the  difference  be¬ 
tween  the  usual  assumption  of  performance  and  the 
actual  performance  of  the  segmentation  program. 


3  Hypothesis  Generation 

Given a set of primitive features (e.g., planar surfaces
or regions) extracted from the input image by
a feature-extraction algorithm (e.g., segmentation or
edge detection), the hypothesis generation procedure
produces a set of possible matches between model
primitives and image primitives (henceforth referred
to simply as hypotheses). Optimally, the generated
hypotheses include all of the correct correspondences and
exclude as many incorrect hypotheses as possible.
To exclude incorrect matches, we must apply constraints
derived from our prior knowledge.

We  are  considering  many  hypotheses  simultaneously 
and  wish  to  choose  the  most  likely  subset  of  these. 
We  can  think  of  the  hypotheses  as  forming  a  random 
field  of  variables  each  of  which  can  be  assigned  a 
discrete  value  of  on  or  off.  A  hypothesis  labeled 
on  indicates  that  the  hypothesis  is  assumed  to  be 
correct. 

The  hypotheses  display  Markovian  characteristics. 
For  example,  if  two  hypotheses  provide  mutual  sup¬ 
port  for  each  other,  and  one  of  them  is  correct,  it  is 
more  likely  that  the  other  is  correct.  A  similar  de¬ 
pendency  exists  between  contradicting  hypotheses. 
These  dependencies  can  be  thought  of  in  terms  of 
conditionally  dependent  probability  distributions. 

3.1  Formulation  of  Hypothesis 
Generation  using  Markov 
Random  Fields 

MRFs are used to represent the probability distribution
of values, $\omega_i$, of a set of random variables, $X_i$,
each of which may be conditionally dependent on a
set of neighbor variables. Given a set of independent
observations, $O_i$, the most likely state of the MRF
variables can be found by minimizing its posterior
energy function

$$U(\omega \mid O) = \sum_{c \in C} V_c(\omega) - T \sum_i \log P(O_i \mid \omega_i)$$  (1)

where $C$ is the set of cliques of related (neighbor)
variables in the MRF. The posterior distribution is
expressed in terms of things we may be able to calculate or
specify: clique potentials $V_c(\omega)$ (which represent
higher-order, prior constraints among related variables) and
prior distributions for our observations, $P(O_i \mid \omega_i)$.
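As a small illustration of how the posterior energy of Equation 1 is evaluated for one complete labeling, consider the sketch below. The callables `clique_potential` and `likelihood`, the dictionary layout of `labels`, and the reading of T as a `temperature` weight defaulting to 1 are assumptions made for this example.

```python
import math


def posterior_energy(labels, cliques, clique_potential, likelihood, temperature=1.0):
    """Sum of clique potentials minus T times the summed observation log-likelihoods.

    labels: {variable: label} for every variable in the field.
    cliques: iterable of tuples of variables.
    clique_potential(clique, labels): potential of one clique under `labels`.
    likelihood(variable, label): probability of that variable's observation
    given the label (must be positive).
    """
    prior_term = sum(clique_potential(clique, labels) for clique in cliques)
    data_term = sum(math.log(likelihood(var, lab)) for var, lab in labels.items())
    return prior_term - temperature * data_term
```

A lower energy corresponds to a more probable labeling, so comparing two candidate labelings only requires evaluating this function on each.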

By representing our hypothesis space using a MRF,
we can formulate the search for the most likely hypotheses
as a maximum a posteriori (MAP) estimate
of the MRF. For a review of MRFs and their applications
to computer vision, we refer the reader
to the descriptions found in [Chou and Brown, 1990,
Cooper, 1989].

We  phrase  our  search  for  the  most  likely  hypotheses 
as  a  MAP  estimation  problem  by  defining  our  con¬ 
straints  in  terms  of  clique  potentials  and  likelihoods 
in  the  MRF  framework.  With  this  formulation,  we 
can  apply  a  MAP  estimation  procedure  to  our  MRF 
with  the  result  being  the  set  of  hypotheses  with  the 
highest  probability  of  occurring  based  on  our  prior 
knowledge.  Each  variable  in  the  MRF  represents  a 
match  hypothesis,  (Ri,Mj),  between  region  Ri  and 
model  face  Mj .  The  variables  can  be  labeled  either 
on  or  off  indicating  our  belief  or  disbelief  in  the  hy¬ 
pothesis. 

The $i$th region is described by a vector of feature
values $\mathbf{f}_{R_i} = (f^1_{R_i}, \ldots, f^n_{R_i})$.
For computational reasons, we assume that these
features are independent for a given model face. If
the features are not independent, then we have
redundant features which are not providing new
information and should be removed. The independence
assumption gives us:

$$\log P(\mathbf{f}_{R_i} \mid M_j) = \sum_n \log P(f^n_{R_i} \mid M_j).$$  (2)

We need to determine the likelihood that an image
region, $R_i$, arose from the presence of a model face,
$M_j$, in the scene. This is the probability of observing
$R_i$ assuming that the match hypothesis $(R_i, M_j)$ is
correct. We model this using a prior distribution. In
terms of our label set, we equate

$$P(R_i \mid (R_i, M_j) = \mathrm{ON}) = P(R_i \mid M_j) = P(\mathbf{f}_{R_i} \mid M_j),$$  (3)

and we can easily calculate it with equation 2. To
calculate the likelihood that a hypothesis $(R_i, M_j)$
is incorrect, we equate this to the likelihood that $R_i$
actually arose from any of the other model faces:

$$P(R_i \mid (R_i, M_j) = \mathrm{OFF}) = P(R_i \mid M_k) = $$  (4)

Equations 3 and 4 provide us with the prior probabilities
of the observations $P(O_i \mid \omega_i)$ required for
the posterior energy function of Equation 1. Next,
we need to specify the clique potentials, which first
requires a definition of the neighborhoods of the
hypotheses (variables).

The two neighborhoods over the hypotheses are $N^+$
for supporting hypotheses and $N^-$ for contradictory
hypotheses. The rule that determines the $N^-$
neighborhood is:

$$\forall R_i, M_m, M_n \neq M_m: \quad ((R_i, M_m), (R_i, M_n)) \in N^-$$  (5)
Figure 3: Clique potentials for the six possible configurations
of hypothesis labels and neighbor types
in 2-cliques and the two possible 1-cliques.

The above rule essentially states
that hypotheses corresponding to the same region
are contradictory; we would like one hypothesis per
region. The rule that determines the $N^+$ neighborhood
is:

$$\forall R_i, R_j \neq R_i, M_m, M_n \neq M_m: \quad (\mathrm{model}(M_m) = \mathrm{model}(M_n)) \wedge \mathrm{consistent}((R_i, M_m), (R_j, M_n)) \Rightarrow ((R_i, M_m), (R_j, M_n)) \in N^+$$  (6)

where $\mathrm{model}(M_m)$ is the model in the model base
to which the face $M_m$ belongs, and $\mathrm{consistent}()$
determines whether the two hypotheses are spatially
and geometrically consistent based on the relational
features. This rule specifies that if two hypotheses
are consistent with respect to our prior constraints
then they provide mutual support for each other.
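The two neighborhood rules can be sketched as a single pass over all pairs of hypotheses. This is a minimal illustration, assuming each hypothesis is a `(region, face)` tuple, that `model_of` maps a face to its model, and that `consistent` is a caller-supplied predicate standing in for the relational-feature test; none of these names come from the system itself.

```python
from itertools import combinations


def build_neighborhoods(hypotheses, model_of, consistent):
    """Return (n_minus, n_plus) as sets of unordered hypothesis pairs.

    n_minus: pairs matching the same region to different faces (contradictory).
    n_plus: pairs matching different regions to different faces of the same
    model that also pass the consistency predicate (mutually supporting).
    """
    n_minus, n_plus = set(), set()
    for h1, h2 in combinations(hypotheses, 2):
        (r1, m1), (r2, m2) = h1, h2
        if r1 == r2 and m1 != m2:
            n_minus.add(frozenset((h1, h2)))
        elif (r1 != r2 and m1 != m2
              and model_of[m1] == model_of[m2]
              and consistent(h1, h2)):
            n_plus.add(frozenset((h1, h2)))
    return n_minus, n_plus
```

Enumerating all pairs is quadratic in the number of hypotheses, which is one reason thresholding away unlikely hypotheses before this step keeps the resulting field small.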

If  it  is  possible  to  group  regions  into  sets  belong¬ 
ing  to  the  same  object,  a  neighborhood  specification 
rule  can  easily  be  added  to  enforce  the  required  con¬ 
straints.  We  do  not  use  this  constraint  since  we  do 
not  assume  that  grouping  regions  belonging  to  the 
same  object  is  possible  from  a  purely  data-driven 
approach  to  segmentation. 

For efficiency, we limit our energy function
to 1-cliques and 2-cliques. The clique potentials
(corresponding to $V_c(\omega)$ in equation 1) used
in our experiments appear in Figure 3. For example,
the first clique in the figure shows that when
the hypothesis is on and a consistent ($N^+$) neighbor
hypothesis is on, -5.0 is the potential of that
2-clique. The potentials were determined experimentally
to conform with our sense of consistency
and mutual support among hypotheses; of course, a
systematic method would be preferred. We are able
to compute distributions over relations between regions,
but at this point have not integrated these
distributions into our formulation. Instead, we generate
thresholds from these distributions to compute
the relation $\mathrm{consistent}()$.

We can now construct a MRF and search for the
most likely hypotheses. To help the reader visualize
a typical resulting MRF, a very simple example is
shown in figure 4. This is an example MRF constructed
for a model base containing two similar geometric
models and an image of the first model. In
this case, a hypothesis is generated for all pairs of
regions $R_i$ and model faces $M_j$ that have nonzero
conditional probabilities $P(R_i \mid M_j)$. The neighborhood
relation of these hypotheses is simply that hypotheses
for the same region are inconsistent, and
hypotheses for the same model are consistent.

Figure 4: An example MRF produced from a simple
scene containing two regions with a model base
containing a tetrahedron.


3.2  Highest  Confidence  First  Search 

At run time, the recognition program segments the
image and computes the first-order features over all
regions and relational features over all pairs of regions.
The segmentation algorithm used here is the
same as that used in the sensor modeling phase. The
segmented regions are used to build our MRF, which
represents our hypotheses and prior constraints. Using
equations 2 and 4, we first compute the log-likelihoods
of the observation of $R_i$, given that hypothesis
$(R_i, M_m)$ is correct, $\log P(R_i \mid (R_i, M_m) = \mathrm{ON})$,
and incorrect, $\log P(R_i \mid (R_i, M_m) = \mathrm{OFF})$.
When computing $P(\mathbf{f}_{R_i} \mid M_m)$, we use a lower bound
on the probability to cut off the computation and
essentially throw away highly unlikely hypotheses.
This lower bound is chosen such that reasonable hypotheses
are not thrown away; it seems to have
very little effect on the final results, while reducing
the size of the MRF considerably. With the thresholded
hypotheses, we then determine the neighborhood
systems $N^+$ and $N^-$ using the rules specified
in Equations 6 and 5.

Once the MRF is created, we wish to find the
most likely set of hypotheses based on the constraints
of the image. We use an estimation procedure
called Highest Confidence First (HCF), developed
by [Chou and Brown, 1990]. HCF is a suboptimal
estimation procedure but was chosen because
of its efficiency and evidence of good performance
in other applications [Chou and Brown, 1990,
Cooper, 1989]. HCF performs a steepest-descent
search in an augmented state space. The MRF variables
are placed in a heap ordered by the confidence
in the variable’s current value. The confidence is defined
as the energy difference between the current
and best value. The variable at the top of the heap
has its value changed to the best possible (in the
current state of the MRF). The confidence values of
that variable and its neighbor variables are recalculated,
and the heap is adjusted. The use of confidence
values essentially forces the algorithm to start
with variables where a value is “most obvious” or has
the least competition among possible values. The
behavior of the HCF search for the most likely set
of hypotheses is very similar to the idea of the “focus
feature” method of [Bolles et al., 1987]. When there
are obvious matches available, the HCF search dives
in by turning on the most obvious match first. This
creates a ripple effect for matches consistent with
obvious matches. Given a MRF with $N$ variables,
the HCF algorithm takes $O(N \log N)$ to initially create
the heap and $O(\log N)$ to adjust the heap after
modifying a variable’s value, assuming the size of
the neighborhoods is constant. In practice, [Chou
and Brown, 1990] found that the variables are modified
slightly more than once on average (consistent
with our experience in this application), giving an
$O(N \log N)$ performance.
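The search described above can be illustrated with a simplified HCF-style greedy labeling over binary on/off hypotheses. This sketch is written under several assumptions and is not the implementation of [Chou and Brown, 1990]: it selects the least stable variable by a linear scan rather than a heap, it treats every variable as initially uncommitted, and only the -5.0 potential for two supporting hypotheses that are both on comes from the text; the +5.0 penalty for two contradictory hypotheses both on, the zero potentials elsewhere, and all names are hypothetical.

```python
ON, OFF = "on", "off"


def pair_potential(kind, a, b):
    """Hypothetical 2-clique potentials, symmetric in the two labels."""
    if a == ON and b == ON:
        return -5.0 if kind == "+" else 5.0
    return 0.0


def local_energies(site, labels, loglik, neighbors, temperature=1.0):
    """Energy of each label at `site`, counting only committed neighbors."""
    energies = {}
    for label in (ON, OFF):
        e = -temperature * loglik[site][label]
        for other, kind in neighbors.get(site, ()):
            if labels[other] is not None:
                e += pair_potential(kind, label, labels[other])
        energies[label] = e
    return energies


def stability(site, labels, loglik, neighbors):
    """Negative values mean that changing (or committing) the site lowers the energy."""
    e = local_energies(site, labels, loglik, neighbors)
    current = labels[site]
    if current is None:  # uncommitted: a larger gap means a more obvious choice
        return -abs(e[ON] - e[OFF])
    other = OFF if current == ON else ON
    return e[other] - e[current]


def hcf_labeling(loglik, neighbors):
    """loglik: {site: {ON: log P, OFF: log P}}; neighbors: {site: [(site, '+' or '-')]}."""
    labels = {site: None for site in loglik}
    while labels:
        # Least stable site first; uncommitted sites win ties.
        site = min(labels, key=lambda s: (stability(s, labels, loglik, neighbors),
                                          labels[s] is not None))
        if labels[site] is not None and stability(site, labels, loglik, neighbors) >= 0:
            break  # every site is committed and no single change lowers the energy
        e = local_energies(site, labels, loglik, neighbors)
        labels[site] = ON if e[ON] < e[OFF] else OFF
    return labels
```

In the example used to check this sketch, the strongly supported match is committed first, which then pulls its supporting neighbor on and pushes the contradictory hypothesis off; this is the ripple effect described in the text. Replacing the linear scan with a heap keyed on stability recovers the complexity discussed above.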

After  HCF  estimation  is  completed,  the  hypotheses 
labeled  on  are  considered  for  verification.  From  the 
results  of  HCF,  we  can  create  a  list  of  consistent 
cliques  (of  order  3  and  lower)  of  matches  using  the 
active  hypotheses  and  their  neighbor  relations.  The 
verification  phase  must  determine  which  of  these  hy¬ 
potheses  describe  objects  that  are  in  the  scene. 

A successful verification of a hypothesis eliminates
other competing hypotheses from consideration.
Therefore, a good ordering of the hypotheses
for verification can reduce the number of hypotheses
requiring verification. Traditionally, hypotheses
are ordered based on the saliency (discrimination
ability) or size of the image features. These
heuristics have proven useful for hypothesis ordering;
however, to minimize verifications, we want to
first verify the most likely hypotheses, not necessarily
those with the most salient or largest image
features. We order the hypothesis cliques by the
average of the likelihood ratios,
$P(R_i \mid (R_i, M_j) = \mathrm{ON}) / P(R_i \mid (R_i, M_j) = \mathrm{OFF})$
(see section 3.1), of their constituent match hypotheses,
checking the most likely first. Thus, our hypothesis-generation
method is “optimal” in terms of ordering for verification
with respect to our prior knowledge.
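The ordering rule can be sketched in a few lines. The sketch assumes the log-likelihoods for the on and off labels are already available per hypothesis in two dictionaries; the function and argument names are hypothetical.

```python
import math


def order_cliques(cliques, loglik_on, loglik_off):
    """Sort hypothesis cliques by the average ON/OFF likelihood ratio, largest first.

    cliques: list of tuples of hypotheses.
    loglik_on / loglik_off: {hypothesis: log-likelihood under that label}.
    """
    def mean_ratio(clique):
        ratios = [math.exp(loglik_on[h] - loglik_off[h]) for h in clique]
        return sum(ratios) / len(ratios)

    return sorted(cliques, key=mean_ratio, reverse=True)
```

Working with differences of log-likelihoods avoids dividing by very small probabilities directly, although a very large difference can still overflow `math.exp`.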

4  Localization 

Several factors exacerbate the localization problem:
we may not have enough constraints from our
matches to determine the location of the model accurately,
inaccuracies in our region data due to noise
and partial occlusion will lead to errors in location
estimates, and our objects may vary slightly from
the models, causing errors in alignment along edges
and surfaces. Using primitive matches alone, we
are able to get crude estimates of rigid transformation
parameters. Thus, localization based on our
matches is assumed to be inaccurate but can serve
as a good starting point for a local search for the
best set of model parameters.

Figure 5: Comparison of Gaussian and Lorentzian
distributions and their effect on outliers. The
Lorentzian is in bold. (a) Gaussian and Lorentzian
distributions, (b) $\rho(z)$ of the Gaussian and
Lorentzian distributions, (c) the derivative of $\rho(z)$,
which is the magnitude of the “force” corresponding
to the data error, (d) the weight of the error vector
as a function of the error magnitude for Gaussian
and Lorentzian distributions.

We  can  define  a  parameterized  template  to  model 
our  object  and  specify  an  energy  function  over  the 
model  parameters  which  relates  how  closely  the 
model  matches  the  image  data.  Then,  we  can  per¬ 
form  a  search  over  the  parameter  space  to  find  the 
best  parameters  by  minimizing  the  energy  func¬ 
tion.  Since  we  are  dealing  with  3-D  images,  we  de¬ 
fine  the  template  of  a  model  to  be  a  set  of  points 
sampled  from  the  surface  of  the  model.  Our  con¬ 
straint  on  the  templates  is  that  visible  points  on  the 
model  surface  match  range  data  points  in  the  image. 
The template of a rigid model is parameterized by
rigid-body transformation parameters (rotation and
translation).

We assume that parts of the object surfaces are often
occluded. Occlusion can be due to self-occlusion,
nearby objects in the scene, and even sensor shadows
(visible portions of the scene which don’t receive
light from the light-striper in a light-stripe range
finder). Occluded points are considered to be outliers,
as are noisy points due to illumination irregularities
and sensor error. If outliers are likely, a
least-squared-error estimation procedure is not desirable
since the estimated parameters will be affected
more by noise than by the actual data. The
shape of the error distribution determines how likely
outliers are assumed to occur. Least-squares estimates
are very sensitive to outliers since every error
is weighted in proportion to its magnitude, however
large. Instead, we would like an estimation procedure
which throws out (or gives low weight to) the
true outliers. This simply corresponds to a MAP
estimate using a distribution where large errors are
more likely than in a normal distribution.


In this work, we use a Lorentzian distribution

$$P(z) \propto \frac{1}{1 + \frac{1}{2} z^2}$$

to perform the MAP estimate of our model parameters.
The Lorentzian is similar in shape to the
Gaussian distribution, but the tail of the distribution
is much heavier, indicating that outliers are assumed
to occur with a higher (relative) probability
than in the Gaussian noise model. Figure 5 compares
the (unnormalized) Gaussian distribution with
the Lorentzian distribution. The important graph
is Figure 5(d), which shows the weighting (relative
to magnitude) of the error vectors under the Gaussian
and Lorentzian distributions. The effect of the
Lorentzian is to give progressively smaller weight,
approaching zero, to the true outliers, hence improving
the estimation of parameters. It is thus, in some
sense, providing a robust estimate of parameters.
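The difference in weighting can be made concrete with a short numerical sketch. It assumes the penalty for the Lorentzian is log(1 + (z/sigma)^2 / 2), takes (z/sigma)^2 / 2 as the corresponding Gaussian penalty, and defines the weight as the derivative of the penalty divided by the error magnitude; the function names and this particular definition of weight are assumptions made for illustration.

```python
import math


def rho_gaussian(z, sigma):
    """Negative log of an unnormalized Gaussian."""
    return 0.5 * (z / sigma) ** 2


def rho_lorentzian(z, sigma):
    """Negative log of an unnormalized Lorentzian."""
    return math.log(1.0 + 0.5 * (z / sigma) ** 2)


def weight_gaussian(z, sigma):
    """d(rho)/dz divided by z: constant, so large errors pull as hard as they are large."""
    return 1.0 / sigma ** 2


def weight_lorentzian(z, sigma):
    """d(rho)/dz divided by z: decays toward zero as the error grows."""
    return 1.0 / (sigma ** 2 * (1.0 + 0.5 * (z / sigma) ** 2))
```

With sigma equal to 1, an error of magnitude 10 keeps a weight of 1 under the Gaussian penalty but only 1/51 under the Lorentzian penalty, which is the behavior that makes the estimate less sensitive to outliers.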

The goal is to improve our model parameter estimate
using the range data of our image. We define $q$ to
be the vector of model parameters (rigid body
translation and rotation parameters). Using

$$\rho(z/\sigma) = -\log P(z/\sigma) = \log\left(1 + \tfrac{1}{2}(z/\sigma)^2\right),$$  (7)

we can find the MAP estimate of $P(q) = \prod_{i \in V(q)} P(z_i(q)/\sigma)$
by minimizing the energy function

$$E(q) = \sum_{i \in V(q)} \rho(z_i(q)/\sigma)$$  (8)

where $V(q)$ is the set of visible model points for the
given model parameters $q$, $z_i(q)$ is the error of the
$i$th model point given the model parameters $q$, and $\sigma$
is the normalizing factor for the Lorentzian function
which specifies the width of the distribution.

We define the error to be the distance between the
model point and the data point nearest the model
point:

$$z_i(q) = \min_{\alpha \in D} \lVert x_i(q) - \alpha \rVert \qquad (9)$$

where $D$ is the set of three dimensional data points
in the image, and $x_i(q)$ is the world coordinate of the
ith model point transformed using the model parameters
$q$. The calculation of the nearest data point $\alpha$ is
optimized by using a k-dimensional nearest-neighbor
search [Friedman et al., 1977].
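
The energy and error definitions above can be sketched in a few lines. This is a minimal illustration, not the authors' Common Lisp implementation: it assumes the model points have already been transformed by the pose and filtered for visibility, it uses a brute-force nearest-neighbour search in place of the k-dimensional search cited above, and all function names are hypothetical.

```python
import math

def lorentzian_rho(z, sigma):
    """Negative log of the unnormalized Lorentzian: log(1 + 0.5 * (z / sigma)**2)."""
    return math.log(1.0 + 0.5 * (z / sigma) ** 2)

def quadratic_rho(z, sigma):
    """Negative log of the unnormalized Gaussian (least-squares penalty)."""
    return 0.5 * (z / sigma) ** 2

def nearest_distance(point, data_points):
    """Error z_i: distance from a model point to its nearest data point (brute force)."""
    return min(math.dist(point, d) for d in data_points)

def energy(model_points, data_points, sigma, rho=lorentzian_rho):
    """Sum of rho(z_i / sigma) over transformed, visible model points."""
    return sum(rho(nearest_distance(x, data_points), sigma) for x in model_points)

# Toy data: three range points and two candidate model placements.
data = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
aligned = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
with_outlier = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 10.0)]

e_robust = energy(with_outlier, data, sigma=1.0)
e_least_squares = energy(with_outlier, data, sigma=1.0, rho=quadratic_rho)
print(e_robust, e_least_squares)
```

In this toy case the single distant point dominates the least-squares energy but contributes only logarithmically to the Lorentzian energy, which is the behaviour the robust formulation relies on.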

We use aspect information computed offline to define
an  efficient  approximation  of  V{q)  for  determining 
the  visibility  of  each  point.  With  the  definition  of 
the  energy  E,  we  utilize  a  form  of  gradient-descent 
search  to  minimize  the  energy.  In  order  to  reduce 
the  effect  of  discontinuities  in  E  produced  from  us¬ 
ing  sampled  aspects  to  determine  model  point  visi¬ 
bility,  we  perform  all  gradient-descent  line  searches 
utilizing  the  same  set  of  visible  points  for  the  en¬ 
tire  line  search.  Thus,  the  energy  function  is  kept 
smooth  throughout  the  line  search. 
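
The idea of freezing the visible set during each line search can be sketched on a toy problem. The following is only an illustration under strong assumptions: the pose is a single 1-D translation `t`, visibility is a hypothetical window test rather than sampled aspects, the descent direction is the sign of a numerical derivative, and the line search simply tries a fixed list of step sizes.

```python
import math

def rho(z, sigma=1.0):
    """Lorentzian penalty log(1 + 0.5 * (z / sigma)**2)."""
    return math.log(1.0 + 0.5 * (z / sigma) ** 2)

def visible_points(t, model, lo=-1.0, hi=4.2):
    """Hypothetical visibility test: a translated model point is visible inside [lo, hi]."""
    return [i for i, m in enumerate(model) if lo <= m + t <= hi]

def energy(t, visible, model, data):
    """Energy of translation t, evaluated on a fixed list of visible model indices."""
    return sum(rho(min(abs(model[i] + t - d) for d in data)) for i in visible)

def line_search(t, direction, model, data, steps=(1.0, 0.5, 0.25, 0.125, 0.0625)):
    """Try several step sizes along direction, keeping the visible set frozen."""
    visible = visible_points(t, model)  # computed once, reused for every candidate
    best_t, best_e = t, energy(t, visible, model, data)
    for s in steps:
        cand = t + s * direction
        e = energy(cand, visible, model, data)
        if e < best_e:
            best_t, best_e = cand, e
    return best_t

def localize(t0, model, data, iterations=10, h=1e-4):
    """Repeated line searches along the sign of the negative numerical derivative."""
    t = t0
    for _ in range(iterations):
        visible = visible_points(t, model)
        grad = (energy(t + h, visible, model, data)
                - energy(t - h, visible, model, data)) / (2 * h)
        if grad == 0.0:
            break
        t = line_search(t, -math.copysign(1.0, grad), model, data)
    return t

model = [0.0, 1.0, 2.5, 4.0]
data = [m + 0.3 for m in model]   # data is the model shifted by 0.3
t_final = localize(0.0, model, data)
print(t_final)
```

In this example the last model point leaves the visibility window partway along the first search direction. Because the visible set is recomputed only between line searches, the energy compared within one search is always summed over the same points.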
Figure 6: Example recognition results: intensity image
(left), segmented regions (middle), wire frame
overlay of models (right). The table lists the number
of hypotheses generated and verified for each iteration
of the algorithm using the specified occlusion
parameter for computing the log-likelihoods.


5 Recognition Results and Experiments

To  evaluate  the  performance  of  our  hypothesis- 
generation  algorithm,  we  are  interested  in  the  num¬ 
ber  of  hypotheses  requiring  verification  since  the  ver¬ 
ification  stage  requires  localization  which  is  the  ex¬ 
pensive  component  of  our  algorithm.  Experiments 
were conducted using a model-base of 8 polyhedral
objects  including  a  stapler,  hole-cube,  rolodex,  cas¬ 
tle,  tape  dispenser,  stick,  note  dispenser,  and  pencil 
box. 

On  tested  images  containing  known  models,  the  cor¬ 
rect  hypotheses  were  consistently  high  in  the  list  of 
selected  hypotheses  ordered  for  verification.  When 
this  occurs,  the  number  of  hypotheses  verified  is 
greatly  reduced.  After  the  known  objects  are  recog¬ 
nized,  all  that  is  left  are  hypotheses  for  regions  that 
do  not  correspond  to  known  objects.  Unfortunately, 
we  must  verify  all  of  the  hypotheses  generated  for 
these  regions  since  the  verification  will  not  succeed. 
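
The effect of ordering on the number of verifications can be illustrated with a small sketch. It is a simplification, not the paper's algorithm: each hypothesis is assumed to be a (log-likelihood, region, model) triple, a region is treated as explained once one of its hypotheses verifies, and `verify` is a hypothetical stand-in for the expensive localization-based verification.

```python
def recognize(hypotheses, verify):
    """Verify hypotheses in decreasing log-likelihood order.

    Returns the accepted (region, model) pairs and the number of
    verification calls that were actually made.
    """
    accepted, explained, n_verified = [], set(), 0
    for score, region, model in sorted(hypotheses, key=lambda h: h[0], reverse=True):
        if region in explained:
            continue  # region already recognized; no verification needed
        n_verified += 1
        if verify(region, model):
            accepted.append((region, model))
            explained.add(region)
    return accepted, n_verified

# Toy scene: region r1 is a stapler, region r2 is an unknown object.
hypotheses = [
    (-1.0, "r1", "stapler"),
    (-3.0, "r1", "castle"),
    (-4.0, "r1", "stick"),
    (-2.0, "r2", "rolodex"),
    (-5.0, "r2", "pencil box"),
]
truth = {("r1", "stapler")}
accepted, n_verified = recognize(hypotheses, lambda r, m: (r, m) in truth)
print(accepted, n_verified)
```

Here the correct hypothesis for r1 is ranked first, so its two competitors are never verified, while both hypotheses for the unknown region r2 must be tried and rejected.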


Figure 7: Comparison of the results of the least-squares
formulation and the robust formulation of
the error distribution on three localization problems:
initial model location (top), final localized model location
using least-squares (left), and final localized
model location using robust formulation (right).

Figure 6 shows the recognition results of our algorithm
on a sample image. This example is typical
of other tests and demonstrates the ability of our
hypothesis generation to select accurate hypotheses
while limiting the size of the search. It also shows
that our method can function in spite of slight partial
occlusions and segmentation “errors” (see the
stapler).

We performed a test on an image containing no
known models. The algorithm greedily selected 107
hypothesis cliques that it thought were likely enough
to justify verifying out of an absolute worst case of
≈ 2338^. In the case of no known models in an
image, each verification fails, which means that all
hypothesis cliques generated by the hypothesis generation
phase must be verified. “Greedy” hypothesis
generation is a beneficial attribute because almost
every scene will most likely contain unknown
objects, and the performance of the system will degrade
quickly if the system attempts to verify every
possibility. For all of the images tested, model symmetry
is the major source of unnecessary hypothesis
verification.

We performed tests to compare the performance of
our localization algorithm using both the Gaussian
and Lorentzian distributions. Figure 7 compares
their solutions for an example localization task. The
least-squares solution is noticeably occlusion sensitive
while the results using the Lorentzian distribution
are relatively insensitive to occlusion. In experiments,
we have found that the initial (rough)
location estimates can be perturbed by a few cm in
translation and around 20 degrees in rotation without
affecting the resulting solution.


Our prototype recognition program was implemented
in Common Lisp for a Sun 1 workstation.
The approximate execution time was 2 minutes for
the segmentation, 2-5 minutes for building the MRF,
and 2-5 seconds to perform HCF and order the hypotheses
for verification. The prototype of the localization
procedure takes approximately 1-3 minutes
per localization. We have not concentrated on making
the implementation efficient. Instead, we have
opted for a fast development environment in which
to test our ideas.

6  Conclusions 

We  have  introduced  the  use  of  sensor  modeling  for 
hypothesis  generation  for  object  recognition.  The 
sensor-modeling approach has the following advantages:

•  it  provides  realistic  (accurate)  constraints  for 
“optimal”  hypothesis  generation  by  explicitly 
modeling  the  effects  of  the  sensor,  the  seg¬ 
mentation  algorithm,  the  geometry  of  the  ob¬ 
jects  (including  self-occlusion),  and  feature  de¬ 
tectability, 

•  it  can  be  used  to  model  the  effect  of  partial 
occlusion  through  simulation, 

•  it  builds  prior  models  that  are  robust  with  re¬ 
spect  to  segmentation  capabilities,  and 

•  real  world  constraints  about  likely  viewing  di¬ 
rections  for  particular  objects  can  be  utilized  by 
the  sensor-modeling  approach  to  improve  the 
hypothesis  generation  performance. 

The  MRF  formalism  combined  with  sensor  modeling 
provides  a  framework  for  “optimal”  hypothesis  gen¬ 
eration  with  respect  to  the  prior  knowledge  from  our 
sensor  model.  HCF  estimation  provides  an  efficient 
and  effective  method  of  performing  the  estimation 
over  our  MRF.  Our  algorithm  does  not  require  that 
image  features  are  grouped  into  sets  belonging  to 
single  objects.  In  experiments,  our  hypothesis  gen¬ 
eration  algorithm  has  demonstrated  the  ability  to  re¬ 
duce  the  number  of  hypotheses  requiring  verification 
by  accurately  selecting  hypotheses  and  “optimally” 
ordering  the  hypotheses  for  verification. 

Our localization algorithm is driven by the range
data and does not rely on matches between high-level
image features and model features. The pose
estimate is refined through energy minimization in a
manner similar to deformable templates, active contours,
and snakes [Kass et al., 1987]. The novelty
of  our  algorithm  is  the  use  of  a  robust  estimator  for 
the  energy  function  being  minimized.  The  robust 
estimator  is  insensitive  (relative  to  least-squares  ap¬ 
proaches)  to  outliers  occurring  due  to  partial  occlu¬ 
sion  of  the  object  as  well  as  other  effects  of  compli¬ 
cated  scenes. 
Abstract


Geometric  invariants  and  quasi-invariants  have 
become  effective  methods  for  real  vision  prob¬ 
lems.  Quasi-invariants  extend  invariants  to 
widespread  occurrence  in  computer  vision. 

This  paper  presents  analytic  results  for  quasi¬ 
invariants  and  practical  methods  for  their  use. 

The  mathematical  definition  of  quasi-invariants 
has  been  used  for  more  than  20  years.  We  in¬ 
troduce  a  number  of  quasi-invariants  and  prove 
their  quasi-invariance.  We  compare  the  taxon¬ 
omy  of  invariants  with  that  of  quasi-invariants. 
Several  common  quasi-invariants  are  nearly  in¬ 
variant. 

The intent in using quasi-invariants is to make
approximate  measurements  of  a  class  of  Eu¬ 
clidean  invariants,  3-space  shape  parameters. 

We  discuss  examples  of  the  use  of  invariants 
and  quasi-invariants  in  computer  vision,  in 
grouping  and  hypothesis  generation  for  recogni¬ 
tion.  Generalized  cylinders  (GCs)  express  pow¬ 
erful  quasi-invariants  for  grouping  image  fea¬ 
tures  that  define  a  mechanism  for  discrimina¬ 
tion  of  GC  parts,  figure-ground  discrimination. 
Quasi-invariants  permit  estimates  of  shapes  of 
GC  parts  and  objects  formed  of  GC  parts  that 
enable  generation  of  hypotheses  that  are  ver¬ 
ified  by  globally  coherent  interpretation  by  a 
Bayes  net  for  evidential  inference. 

Quasi-invariants  are  inexact;  a  quantitative 
probabilistic  interpretation  enables  efficient  use 
of  quasi-invariants  in  recognition  in  a  paradigm 
of  hypothesis  generation  and  verification.  For 
a  majority  of  cases,  quasi-invariant  observables 
are  approximations  to  corresponding  body 
measurements  in  space;  those  quasi-invariants 
generate  accurate  hypotheses.  Distributions  of 
deviations  are  known  or  can  be  derived  for 

This research was supported in part by a contract
from the Air Force, F30602-92-C-0105 through RADC from
DARPA SISTO, “Model-based Recognition of Objects in
Complex Scenes: Spatial Organization and Hypothesis
Generation”.


use  in  evidential  inference.  For  a  minority  of 
cases,  quasi-invariant  observables  give  bad  es¬ 
timates  of  corresponding  body  measurements; 
those  quasi-invariants  generate  false  hypothe¬ 
ses  that  are  rejected  by  a  verification  phase,  in 
most  cases  at  low  cost.  This  hypothesis  gen¬ 
eration  and  verification  mechanism  depends  on 
making  some  accurate  measurements  and  some 
accurate  hypotheses;  the  mechanism  tolerates 
local  errors.  Computational  complexity  can  be 
low.  Limited  completeness  results  are  shown. 


1. INTRODUCTION

Geometric  quasi-invariants  are  one  of  the  oldest 
threads  in  computer  vision,  dating  to  the  late  1960s. 
Binford  introduced  them  to  extend  geometric  invari¬ 
ants  in  shape  perception.  The  intent  and  use  of  quasi¬ 
invariants  is  somewhat  different  from  that  of  invariants. 

Geometric  invariants  have  become  a  major  topic  in 
computer  vision,  beginning  about  1980.  Invariants  have 
a  long  tradition  in  mathematics;  many  of  the  classic 
results  in  algebraic  invariants  date  from  the  last  cen¬ 
tury.  Invariants  have  potential  advantage  in  computa¬ 
tional  complexity  over  view-sensitive  methods,  e.g.  as¬ 
pect  graphs.  Invariants  enable  view-insensitive  classifi¬ 
cation  of  objects.  Aspect  graph  methods  have  very  high 
computational complexity. Both aspect graphs and invariants
accommodate very little variability in object class.

Invariants computed on one view of an object can be
used as a key to index into a structured database of
object models. A number of important invariants can
be measured well from experimental image data. Where
they  exist,  invariants  have  great  utility.  In  many  cases, 
there  are  no  invariants  available.  Invariants  are  known 
for  special  cases:  e.g.  4  or  more  points  on  a  line;  5  or 
more  points  in  the  plane;  algebraic  curves  in  the  plane; 
algebraic  surfaces  in  3-space;  multiple  views  of  objects. 
There  are  known  to  be  no  invariants  for  single  views  of 
general  3d  point  sets. 

Invariants  are  valuable  in  important  special  cases. 
Their  limitations  are:  1.  There  are  no  invariants  for 
many  important  cases;  invariants  are  relatively  few  in 
practical vision. 2. Invariants are useful for individual
objects  identical  to  models  in  the  database;  they  do  not 
tolerate  variation,  e.g.  stores  on  aircraft  or  open  tank 
hatch,  i.e.  articulation.  3.  Current  use  of  invariants  is 
computationally  expensive,  e.g.  computing  and  index¬ 
ing  two  invariants  for  all  combinations  of  5  points  in  the 
plane. 

The  pose  estimation  problem  is  to  identify  classes  of 
objects  from  known  classes  and  their  poses  from  images 
of  scenes  of  objects;  an  object  class  is  composed  of  iden¬ 
tical  objects  without  variation,  i.e.  no  surface  marks,  no 
articulation,  no  shape  variation.  Typical  pose  estimation 
methods  are  practical  only  with  few  objects. 

Pose  estimation  is  a  very  limited  problem  compared 
to  biological  vision.  The  generic  interpretation  prob¬ 
lem  is  to  discriminate  objects,  identify  their  classes  and 
estimate their shapes and their poses as completely as
possible  with  available  information;  object  classes  are 
composed  of  objects  with  considerable  variation,  e.g.  hu¬ 
mans  and  trees;  some  object  classes  may  not  be  known. 

View-sensitive  methods  for  pose  estimation  match  a 
dense  set  of  views  of  all  objects  with  all  combinations  of 
image  features.  The  computational  complexity  of  aspect 
graph  methods  is  proportional  to  the  number  of  com¬ 
binations  of  image  features  times  the  number  of  views 
summed  over  all  objects.  One  great  contribution  to  com¬ 
putational  complexity  in  recognition  by  view-sensitive 
methods  is  matching  the  number  of  views  of  objects. 
The  dominant  contribution  to  computational  complex¬ 
ity  is  matching  all  combinations  of  image  features.  Scene 
complexity  dominates  over  the  number  of  views  of  ob¬ 
jects.  Both  aspect  graph  methods  and  invariant  methods 
introduce  simple  grouping  or  ad  hoc  grouping  to  reduce 
combinatorics  of  sets  of  image  features.  This  work  pro¬ 
vides  principled  mechanisms  for  grouping. 

Each  view  is  a  different  object  in  the  aspect  graph 
paradigm.  There  are  no  object  classes,  i.e.  all  objects  in 
a class are identical; there are no relations among objects
other than name. E.g., the class “cars” has no meaning
other  than  a  name  that  includes  individual  cars.  There 
is  no  idea  of  similarity.  Invariants,  where  available,  re¬ 
duce  this  complexity;  they  are  view-invariant  and  enable 
indexing  for  object. 

Grouping  to  avoid  matching  all  combinations  of  im¬ 
age  features  is  the  figure-ground  discrimination  prob¬ 
lem.  From  the  point  of  view  of  computational  com¬ 
plexity,  figure-ground  discrimination  (grouping)  is  a  cen¬ 
tral  problem  in  recognition.  An  effective  mechanism  for 
figure-ground  discrimination  distinguishes  few  possible 
objects  among  many  possible  combinations  of  image  fea¬ 
tures  with  low  complexity.  Grouping  typically  refers  to 
ad  hoc  methods  for  figure-ground  discrimination. 

Avoiding  matching  all  views  of  all  objects  is  the  hy¬ 
pothesis  generation  and  indexing  problem.  Effective  hy¬ 
pothesis  generation  and  indexing  generates  few  hypothe¬ 
ses  including  the  correct  set  of  hypotheses,  i.e.  few  views 
of  few  objects.  Hypothesis  generation  and  indexing  is  a 
second  important  problem.  Invariants  aid  in  hypothe¬ 
sis  generation  and  indexing  but  give  little  aid  in  figure- 
ground  discrimination. 

Quasi-invariants were introduced to extend invariants
to more vision problems, e.g. figure-ground discrimination
and hypothesis generation in real, complex vision
problems.  E.g.,  for  five  points  in  a  plane  there  are  two 
invariants.  There  are  no  invariants  for  3  or  4  points  in 
a  plane,  although  quadrilaterals  (4  points  in  a  plane) 
or  triangles  (3  points)  are  important.  There  are  many 
quasi-invariants;  they  are  widely  applicable,  e.g.  in  cases 
in  which  there  are  known  to  be  no  invariants.  There  are 
4 quasi-invariants for 4 points in a plane and 2 quasi-invariants
for 3 points in a plane.

Quasi-invariants  provide  a  mechanism  for  figure- 
ground  discrimination  with  unknown  objects  with  great 
variation,  based  on  quasi-invariant  relations  on  GC 
parts.  Quasi-invariants  provide  a  mechanism  for  hypoth¬ 
esis  generation  by  estimating  the  approximate  shape  of 
GC  parts  in  3-space  and  shape  of  objects  formed  of  GC 
parts.  Estimating  3-space  shape  from  image  measure¬ 
ments  is  an  inverse  process.  Indexing  is  done  using  par¬ 
tial,  structured  3d  shape  descriptors  as  in  Nevatia  and 
Binford  [Nevatia  83].  Structured  means  part-whole  de¬ 
scriptions.  Verification,  i.e.  matching,  is  accomplished 
with  full  3d  shape  descriptions.  This  can  be  called  struc¬ 
tured,  3d  interpretation  by  inverse  methods. 

The  structured  3d  interpretation  paradigm  matches 
object  descriptions  in  3-space.  In  contrast,  view- 
sensitive  methods  match  image  descriptions.  There  is 
much  more  variation  in  images  than  there  is  in  3-space 
because  of  surface  markings,  pose,  lighting,  sensor  re¬ 
sponse  and  obscuration.  Our  paradigm  reduces  com¬ 
plexity  of  matching  by  reducing  complexity  of  descrip¬ 
tions  at  the  cost  of  complexity  in  generating  3d  descrip¬ 
tions.  For  the  structured  3d  approach  to  succeed,  seg¬ 
mentation,  figure-ground  discrimination  and  3d  shape 
inference  must  all  be  effective.  Most  researchers  choose 
to  avoid  dependence  on  effective  segmentation,  figure- 
ground  discrimination  and  3d  shape  inference.  View- 
sensitive  methods  suffer  complexity  of  matching  com¬ 
plex  descriptions  by  ignoring  generating  3d  image  de¬ 
scriptions.  In  reality,  they  cannot  avoid  grouping,  i.e. 
figure-ground  discrimination;  it  is  essential  to  reduce  the 
combinatorics  of  combinations  of  image  features.  The 
balance  in  terms  of  complexity  swings  far  in  favor  of  gen¬ 
erating  3d  descriptions.  The  majority  of  recognition  by 
aspect  graphs  uses  3d  data  with  3d  objects,  even  though 
it  is  possible  to  measure  Euclidean  invariants  directly 
with  3d  data.  Recognition  from  monocular  intensity 
imagery  is  much  more  difficult,  using  2d  data  for  3d 
objects.  Structured,  3d  interpretation  is  effective  with 
3d  data,  but  monocular  image  intensity  data  presents  a 
great  challenge.  Striking  results  have  been  achieved  with 
monocular  intensity  images. 

The  structured  3d  approach  reduces  computational 
complexity  by  reducing  the  number  of  combinations  of 
image  features  by  figure-ground  discrimination  and  re¬ 
duces  the  number  of  object  hypotheses  by  hypothesis 
generation  and  indexing. 

In  this  paper,  we  define  quasi-invariants,  give  exam¬ 
ples  of  quasi-invariants,  and  discuss  systematics  of  the 
family  of  quasi-invariants.  We  also  discuss  the  use  of 
quasi-invariants  in  computer  vision. 
2. USE OF INVARIANTS


The  use  of  invariants  is  discussed  to  clarify  strategies 
for  use  of  quasi-invariants.  Consider  an  example:  the 
two  invariants  for  5  points  in  the  plane  [Barrett  91]. 

[Barrett  91]  show  an  example  of  recognition  of  aircraft 
based  on  the  two  invariants  for  5  points  in  a  plane.  They 
compute invariants for 5-tuples of points in the image
and index into the database to match invariants for
models.  An  aircraft  is  not  planar  but  it  was  possible  to 
find  5  points  on  a  plane. 

[Mundy  92]  also  demonstrate  recognition  of  buildings 
using  invariants  for  five  points  in  a  plane.  A  rectan¬ 
gular  building  has  only  4  points,  but  there  might  be 
small  structures  on  the  roof.  An  L-shaped  building  has 
6  points  in  a  plane;  a  U-shaped  building  has  8  points  in 
a  plane.  In  these  examples,  there  are  typically  10,000 
points or lines in a large image. The number of combinations
of 5 points is $\approx 10^{18}$ invariant calculations
and indexes. That number is prohibitive. Simple
grouping  was  used,  corresponding  to  curvilinearity  and 
intersection  at  vertex.  Although  the  paradigm  initially 
was  single  step  indexing,  it  has  been  found  essential  to 
test  hypotheses  generated  by  indexing.  For  now,  points 
are  somewhat  difficult  to  estimate. 
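
The scale of the indexing problem can be checked with a quick count. This sketch assumes exactly 10,000 point features and counts both unordered 5-point subsets and ordered 5-tuples; which of the two applies depends on whether the invariants used are sensitive to point order, which the text does not specify.

```python
import math

n_points = 10_000   # assumed number of point features in a large image
subset_size = 5

subsets = math.comb(n_points, subset_size)   # unordered 5-point subsets
ordered = math.perm(n_points, subset_size)   # ordered 5-tuples

print(f"unordered 5-point subsets: {subsets:.3e}")
print(f"ordered 5-tuples:          {ordered:.3e}")
```

Either count is far beyond exhaustive evaluation, which is why some form of grouping is needed before invariants are computed.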

3.  USE  OF  QUASI-INVARIANTS 

To motivate the mathematical analysis of quasi-invariants,
this section examines their effective use
in computer vision. Quasi-invariants have been used in
stereo  correspondence  [Arnold  and  Binford  80],  in  recog¬ 
nition  with  Bayes  nets  [Binford  87]  and  in  interpretation 
of  complex  objects  [Sato  and  Binford  92a,  92b]. 

A  quasi-invariant  was  found  for  two  views  of  an  edge 
in  stereo  vision  [Arnold  and  Binford  80].  For  human 
stereo,  two  views  of  an  edge  element  have  approximately 
the  same  angle  in  two  views  in  the  canonical  stereo  co¬ 
ordinate  system.  This  stereo  quasi-invariant  is  nearly 
invariant.  The  difference  between  the  two  angles  has  a 
distribution  with  1  degree  full  width  at  half  maximum. 
Another  stereo  quasi-invariant  was  found  for  the  dis¬ 
tance  between  pairs  of  edges  in  two  views.  See  figure  1. 
The  quasi-invariants  were  incorporated  in  two  systems 
for  stereo  reconstruction  [Arnold  83,  Baker  and  Binford 
81]. 

Quasi-invariants  for  angles  and  ratios  of  lengths  of 
straight  lines  were  developed.  Statistical  distributions 
were  incorporated  into  a  Bayes  net  for  recognition  of 
plumbing  parts;  valve,  elbow  or  other  [Binford  87]. 

Generalized  cylinders  provide  quasi-invariants  that 
group  curves  on  surfaces  to  form  hypotheses  of  object 
parts.  This  provides  a  viable  mechanism  for  figure- 
ground  discrimination  in  complex  scenes.  The  gener¬ 
alized  cylinder  representation  generates  quasi-invariants 
for  a  large  class  of  shapes.  Parallel  cross  sections  for 
cylinders  with  arbitrary  cross  sections,  SCGCs,  straight 
constant  cross  section  GCs,  have  edges  that  are  parallel 
in space. Parallelism is quasi-invariant in projection, as
shown below. Parallel cross sections for SHGCs, straight
homogeneous  GCs,  have  edges  that  are  scaled  versions 
of  one  another.  That  property  is  quasi-invariant.  It  in¬ 
cludes  the  previous  case  of  cylinders.  A  search  is  fea¬ 
sible  for  corresponding  pairs  of  curves  that  scale.  De¬ 
tecting  correspondence  appears  to  have  low  complexity. 
Complete  3d  descriptions  were  generated  for  a  number 
of  complex  objects  [Sato  and  Binford  92a,  92b]. 

Statistical  distributions  for  quasi-invariants  enable 
their  use  in  Bayesian  networks.  It  is  important  to  note 
that  successful  recognition  is  not  dependent  on  assump¬ 
tions  behind  these  statistical  distributions,  but  compu¬ 
tational  complexity  depends  on  distributions  of  pose  of 
objects.  That  is  because  the  statistical  distributions  are 
used  primarily  in  indexing  where  they  measure  the  prob¬ 
ability  of  detection  and  the  probability  of  false  alarms, 
i.e.  the  probability  that  a  surface  will  be  viewed  from  a 
favorable  viewpoint  that  aids  indexing.  Indexing  oper¬ 
ates  in  the  paradigm  of  hypothesize  and  test.  In  typical 
cases,  only  part  of  the  information  can  be  used  to  gen¬ 
erate  hypotheses  with  low  complexity.  There  is  more 
information  than  is  used  in  indexing,  adequate  informa¬ 
tion  to  verify  correct  identity  if  the  correct  hypotheses 
were  known.  I.e.,  typically,  if  we  were  told  the  correct  hy¬ 
pothesis,  there  would  be  more  than  enough  information 
to  test  that  hypothesis;  verification  would  require  little 
computation. A key issue is to use all available information
in verification, i.e. matching. Quasi-invariants
facilitate  indexing  by  building  an  approximate  3d  model 
from  single  monocular  images  or  from  partial  3d  data. 

Quasi-invariants  exist  for  a  broad  class  of  measure¬ 
ments  that  parameterize  shape,  i.e.  ratios  of  dimensions, 
angles,  ratios  of  curvatures.  Quasi-invariants  are  valu¬ 
able  because  they  are  almost  always  available. 

4.  DEFINITION  OF  INVARIANTS 

For  intuition  in  defining  quasi-invariants,  consider  a 
definition  of  invariants  and  a  concrete  example. 

Let $A$ be a collection of appearances (mathematical
objects). Let $V$ be a collection of values (mathematical
objects). Let $p : A \to V$ be a function that defines an
equivalence relation on $A$:

$$a_1, a_2 \in A; \quad a_2 \approx a_1 \iff p(a_2) = p(a_1) \qquad (1)$$

Comment: appearances and values are names for intuition
only; they have no mathematical significance.

A mapping $\phi : A \to V$ is an invariant of the equivalence
relation defined by $p$ if it is constant on equivalence
classes of $A$. A complete invariant distinguishes any two
inequivalent objects.

An invariant satisfies

$$a_1, a_2 \in A; \quad a_1 \approx a_2 \implies \phi(a_1) = \phi(a_2) \qquad (2a)$$

A complete invariant satisfies:

$$\phi(a_1) = \phi(a_2) \implies a_1 \approx a_2 \qquad (2b)$$
Example

A concrete example will help make the definition clear.
For appearances $A$ take the set of all views of all sets of 4
colinear points in 3-space. For the equivalence relation,
consider equivalent any two views of 4 colinear points
with the same cross ratio computed in three space. $p : A \to R^1$
is the cross ratio computed in three space. Sets of 4
colinear points with different lengths but the same cross
ratio are distinct in a Euclidean sense but are equivalent
under $p$.

Let  G  be  the  set  of  translations  with  rigid  rotations;  it 
is  a  noncompact  group.  Intersect  G  with  the  fat  sphere, 
i.e.  the  portion  of  the  infinite  viewing  ball  beyond  a 
minimum  radius.  An  equivalence  class  of  appearances 
is  a  set  of  4  colinear  points  with  the  same  cross  ratio 
computed  in  three  space,  with  origin  in  the  fat  sphere, 
i.e.  a  subset  of  all  translations  and  rotations. 

The cross ratio of 4 image points, $\phi : A \to R^1$, is a
scalar function that projects 4 colinear points and computes
a scalar, i.e. $V = R^1$. The cross ratio of 4 image
points  is  known  to  be  a  projective  invariant;  it  is  invari¬ 
ant  for  the  equivalence  relation  defined  by  p,  i.e.  it  is 
constant  on  all  sets  of  4  colinear  points  with  the  same 
cross  ratio  computed  in  space. 

$\phi$, the cross ratio of 4 image points, is complete in
this example; $\phi$ has different values for any pair of sets of 4
colinear  points  with  different  values  of  p,  the  cross  ratio 
computed  in  three  space.  The  cross  ratio  is  not  com¬ 
plete  on  the  other  equivalence  relation  mentioned  above, 
equality  of  Euclidean  lengths  of  the  three  intervals  of  4 
colinear  points. 
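
The invariance claimed in this example can be checked numerically. The sketch below assumes an ideal pinhole camera at the origin looking along +Z and one common convention for the cross ratio; the function names and the two test lines are illustrative choices, not taken from the original text.

```python
def cross_ratio(t1, t2, t3, t4):
    """Cross ratio of four colinear points given by scalar positions along their line."""
    return ((t3 - t1) * (t4 - t2)) / ((t3 - t2) * (t4 - t1))

def project(point, focal=1.0):
    """Pinhole projection of (X, Y, Z) onto the image plane Z = focal."""
    X, Y, Z = point
    return (focal * X / Z, focal * Y / Z)

def line_points(origin, direction, params):
    """Points origin + t * direction in 3-space for each parameter t."""
    return [tuple(o + t * d for o, d in zip(origin, direction)) for t in params]

def image_params(image_points):
    """Signed positions of colinear image points along their common image line."""
    (x0, y0), (x1, y1) = image_points[0], image_points[-1]
    dx, dy = x1 - x0, y1 - y0
    norm2 = dx * dx + dy * dy
    return [((x - x0) * dx + (y - y0) * dy) / norm2 for x, y in image_points]

params = [0.0, 1.0, 3.0, 4.0]          # positions of the 4 points along the 3-space line
space_value = cross_ratio(*params)      # cross ratio computed in three space

views = [
    ((-1.0, 0.5, 5.0), (1.0, 0.2, 0.5)),    # (origin, direction) of the line, view 1
    ((2.0, -1.0, 8.0), (0.3, 1.0, -0.4)),   # a different pose, view 2
]
image_values = []
for origin, direction in views:
    image = [project(p) for p in line_points(origin, direction, params)]
    image_values.append(cross_ratio(*image_params(image)))

print(space_value, image_values)
```

Both views give the same value as the cross ratio computed in space, up to floating-point error, although the projected interval lengths themselves differ between the views.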

It  is  useful  to  think  of  internal  structure  independent 
of  external  variation.  Internal  structure  is  equivalent  to 
the  set  of  equivalence  classes  determined  by  p,  the  cross 
ratio  in  this  example.  In  other  cases,  internal  structure 
is  parameterized  by  various  measures,  e.g.  Euclidean 
lengths  of  intervals,  other  shape  parameters  for  more 
complex  structures,  or  by  kinematics  or  dynamics,  e.g. 
quantum  numbers.  External  variation  is  determined  by 
$G$, e.g. the translations with rigid rotations $T_3 \times SO_3$.

The classical definition of invariants is slightly less
general. Consider affine invariants, invariant under linear
transformations. $K$ is a field of characteristic zero,
e.g. the reals $R$ or $C$, the complex numbers. $G$ is
$GL(n, K)$, the general linear group with dimension $n$
over field $K$. $F$ is a homogeneous form of degree $d$ in
$n$ variables $\xi_1, \ldots, \xi_n$ with coefficients in $K$. $\sigma \in G$ is a
linear transform on an affine space, the action of $G$ on
the polynomial ring $R$, a matrix defined by the rational
representation of $\sigma$.

$g$ is a relative invariant if $\sigma g = \alpha(\sigma) g$; further, $\alpha(\sigma) = (\det \sigma)^w$.
$g$ is an invariant of weight $w$. $g$ is an absolute
invariant if $w = 0$.

Relative invariants on the ring $A[\xi_1, \ldots, \xi_n]$ are covariants.
I.e., covariants involve both coefficients and coordinates.
Not every equivalence relation is determined by
group action.
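
A standard worked example may help fix the terminology. It is not taken from the original text: it assumes the binary quadratic form $F = a\xi_1^2 + b\xi_1\xi_2 + c\xi_2^2$ and the substitution $\xi = \sigma\xi'$ with $\sigma \in GL(2, K)$, and shows that the discriminant $\Delta$ is a relative invariant of weight 2.

```latex
\begin{align*}
F(\xi) &= \xi^{T} M \xi, \qquad
M = \begin{pmatrix} a & b/2 \\ b/2 & c \end{pmatrix}, \qquad
\Delta = b^2 - 4ac = -4 \det M, \\
F(\sigma \xi') &= \xi'^{T} \left(\sigma^{T} M \sigma\right) \xi'
\;\Rightarrow\; M' = \sigma^{T} M \sigma, \\
\Delta' &= -4 \det M' = (\det \sigma)^2 \left(-4 \det M\right) = (\det \sigma)^2 \, \Delta .
\end{align*}
```

With this convention $\Delta$ has weight $w = 2$, so it is left unchanged only by transformations with $\det \sigma = \pm 1$.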

5.  DEFINITION  OF  QUASI-INVARIANTS 


For intuition, the definition of quasi-invariants (semi-invariants)
will be as nearly parallel as possible to the
definition  of  invariants  and  the  example  above.  Later, 
the  definition  of  quasi-invariants  will  be  compared  to  ex¬ 
isting  mathematical  usage  of  the  terms  semi-invariants 
or  quasi-invariants. 

Let $A$ be a collection of appearances (mathematical
objects). Let $V$ be a collection of values (mathematical
objects). Let $p : A \to V$ be a function that defines
equivalence classes on $A$ by:

$$a_1, a_2 \in A; \quad a_1 \approx a_2 \iff p(a_2) = p(a_1).$$

A mapping $\phi : A \to V$ is a quasi-invariant (semi-invariant)
of $p$ at $a \in A$ if it is locally constant on
equivalence classes of $A$ (defined below) and if $\phi$ is
locally equivalent to $p$ at $a$.

Exposition 

$\phi$ is locally constant, i.e. locally invariant, if the Taylor
series for $\phi$ at $a \in A$ is constant to second order
under differential group actions. $\phi$ is locally equivalent
to $p$ at $a$ if it has the same Taylor series to first order.

Make definitions to implement those Taylor series. Let
$A = R^{n_a}$ and $V = R^{n_v}$. Let $\gamma \in G$ be an infinitely
differentiable representation of $G$ with parameterization
$u \in R^m$. Let $p$ be infinitely differentiable with parameterization
$v \in R^n$.


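A mapping $\phi : A \to V$ is quasi-invariant if the following
two conditions hold:

$$\frac{\partial \phi_i}{\partial u_j} = 0, \qquad \frac{\partial \phi_i}{\partial v_k} = \frac{\partial p_i}{\partial v_k}$$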
Comments 

A  differentiable  representation  of  G  implies  a  met¬ 
ric  and  differentiability  of  a  €  A.  A  quasi-invariant  is 
a  complete  quasi-invariant.  It  distinguishes  any  two  in¬ 
equivalent  objects  locally.  A  trivial  quasi-invariant  has 
the same value on all equivalence classes. Trivial quasi-invariants
have no interest.

An example helps to clarify the definition of quasi-invariants. See figure 2a. The collection of appearances $A$ is the set of all views of all sets of three colinear points. $G$ is the set of translations and rigid rotations intersected with the fat sphere. The equivalence classes are defined by $p$: two views are equivalent if they show 3 colinear points with the same colinear ratio computed in three space. Equivalence classes are therefore all views of 3 colinear points with the same colinear ratio computed in three space.

$\phi : A \to V$ is a scalar function, e.g. the projected colinear ratio of 3 projected image points. $\phi$, the projected colinear ratio, is quasi-invariant at $z = \infty, \theta = 0$; the gradient is zero there, both partials vanish. The projected colinear ratio is constant to second order at $z = \infty, \theta = 0$ over equivalence classes, i.e. colinear triples with the same colinear ratios evaluated in three space. The projected colinear ratio is even constant to third order at $z = \infty, \theta = 0$, i.e. second partials are zero also. The projected colinear ratio at $z = \infty, \theta = 0$ is linearly equivalent to the true colinear ratio evaluated in three space; it has the same value and first derivative at $z = \infty, \theta = 0$. The colinear ratio is locally complete on $A$. It takes a different value near $z = \infty, \theta = 0$ for non-equivalent triples of points, i.e. views of 3 colinear points with different ratio. The colinear ratio is not complete on equivalence classes defined by Euclidean length of intervals. Two sets of 3 colinear points with the same colinear ratio could differ in Euclidean lengths of intervals.

Previous  Definitions  of  Semi-Invariants 

Two uses exist for the term semi-invariant. The definition similar to the one given here is that of semi-invariants of Lie groups. Let $G$ be a Lie group. Let $K$ be $\mathbb{R}$ or $\mathbb{C}$. $\rho$ is a differentiable representation of $G$, $\rho(G) \subset GL(n, K)$, the general linear group. $\rho$ defines an action of $G$ on $K^n$.

For infinitesimal $X_a$, a semi-invariant or invariant element $f$ satisfies: $X_a f = 0 \ (\forall X_a)$ or $X_a f = \alpha_a f,\ \alpha_a \in K \ (\forall X_a)$.

Another definition is equivalent to relative invariant; that definition is uninteresting. $R$ is a commutative ring. The group action of $\sigma \in G$ defines an automorphism $f \to \sigma f \in R$ with $\sigma(\tau f) = (\sigma\tau) f$ for $\sigma, \tau \in G$.

$f$ in $R$ is $G$-semi-invariant if for each $\sigma \in G$, $f$ is invariant up to an invariant multiplier depending on $\sigma$: $\sigma f = a(\sigma) f$. That definition is equivalent to relative invariant. There is an integer $w$ such that $a(\sigma) = (\det(\sigma))^w$.

Stability  of  Quasi-Invariants 

The intent of quasi-invariants is that a quasi-invariant $\phi$ approximates a 3-space measurement $p$. For $p$ to be interesting, it must be a Euclidean invariant. E.g., as above, choose $p$ to be the colinear ratio measured in 3-space, a Euclidean invariant. The image colinear ratio $\phi$ approximates $p$.

As defined, quasi-invariants are local invariants. Continuity is a very weak condition. Even if $\phi$ has continuous derivatives of all orders, it could be very badly behaved, i.e. $\phi$ could be a poor approximation to $p$ except very near the point of quasi-invariance. The utility of quasi-invariants depends on their stability. It turns out that many quasi-invariants are quite stable.

For quasi-invariants to be useful, the quasi-invariant $\phi$ must be a reasonable approximation to $p$ over a large part of the parameter range. In equation ???, the second term is zero because the gradient is zero. The range over which the quasi-invariant is useful is determined by the range over which the quadratic term and all higher terms have a sum small compared to the constant term, e.g. 1/3 of the constant term. For quasi-invariant examples of interest, it will be useful to examine the quadratic term to determine the range of stability. By definition, the linear term is zero.
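The expansion behind this argument can be sketched as follows. The symbols $v_0$ (the point of quasi-invariance), $\Delta v$ (a displacement in the view parameters) and $H$ (the matrix of second partial derivatives of $\phi$ at $v_0$) are notation assumed here for illustration; they are not taken from the equation referenced above.

```latex
\[
\phi(v_0 + \Delta v) \;=\; \phi(v_0) \;+\; \nabla\phi(v_0)^{T}\,\Delta v
  \;+\; \tfrac{1}{2}\,\Delta v^{T} H\,\Delta v \;+\; \cdots ,
\qquad \nabla\phi(v_0) = 0 .
\]
% Stability criterion suggested in the text (with the 1/3 threshold as an example):
\[
\Bigl|\,\tfrac{1}{2}\,\Delta v^{T} H\,\Delta v + \cdots \Bigr| \;\le\; \tfrac{1}{3}\,\bigl|\phi(v_0)\bigr| .
\]
```

Under this reading, the useful range of a quasi-invariant is the set of displacements $\Delta v$ for which the second inequality holds.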

That  condition  is  related  to  a  probabilistic  interpre¬ 
tation  of  quasi-invariants.  In  many  situations,  it  is  not 
possible  to  restrict  very  much  the  viewing  conditions  of 
the  observer  relative  to  the  object.  If  it  were  possible 
to  restrict  the  view,  that  would  be  very  useful  and  could 
probably  be  included  in  an  analysis  like  this.  For  a  quasi¬ 
invariant  to  be  useful,  it  should  be  a  good  approximation 
over  a  large  fraction  of  views,  i.e.  over  a  large  part  of 
the  fat  view  sphere,  the  hollow  viewing  ball. 

For  several  of  the  quasi-invariants  considered  here,  all 
second  partial  derivatives  vanish.  This  makes  the  or¬ 
der  of  zero  higher  at  m,  hence  intuitively  stronger.  The 
analysis  of  the  range  of  useful  quasi-invariance  then  re¬ 
quires  third-order  differentiability  and  equivalent  behav¬ 
ior  for  third-order  terms  of  the  Taylor  expansion.  There 
is  an  equivalent  definition  for  third-order  differentials. 
Even  analytic  functions  contain  many  wildly  misbehav¬ 
ing  functions.  These  conditions  on  differentials  are  much 
stronger  than  continuity.  Fortunately,  those  differentials 
are  small  enough  for  quasi-invariants  to  be  quite  useful. 

6.  SUMMARY 

INVARIANTS  AND  QUASI-INVARIANTS 

For  an  excellent  summary  of  results  in  invariants  and 
their  relations  to  computer  vision,  see  [Mundy  91a],  es¬ 
pecially  the  introduction  [Mundy  and  Zisserman  91b]. 

Colinear  Points 

For  the  line,  there  are  no  invariants  for  1,  2  or  3  co¬ 
linear  points.  For  4  colinear  points,  the  cross  ratio  is 
invariant.  For  more  than  4  points,  other  invariants  ex¬ 
ist. 

For  3  points,  the  colinear  ratio  is  quasi-invariant.  For 
more  than  3  points,  quasi-invariants  and  invariants  exist. 
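The difference between an invariant and a quasi-invariant on the line can be checked numerically. The following is a minimal sketch, assuming points are given by 1-D coordinates along the line and that projection acts as a 1-D projective map; the function names and the map coefficients are hypothetical choices made for illustration.

```python
def cross_ratio(a, b, c, d):
    """Cross ratio (a, b; c, d) of four colinear points given by 1-D coordinates."""
    return ((c - a) * (d - b)) / ((c - b) * (d - a))


def colinear_ratio(a, b, c):
    """Ratio of one interval to the whole span for three colinear points."""
    return (c - b) / (c - a)


def projective_map(x, h=(2.0, 1.0, 0.3, 1.5)):
    """1-D projective map x -> (h0*x + h1) / (h2*x + h3); coefficients are arbitrary."""
    h0, h1, h2, h3 = h
    return (h0 * x + h1) / (h2 * x + h3)


points = [0.0, 1.0, 2.5, 4.0]
mapped = [projective_map(x) for x in points]

cr_before = cross_ratio(*points)
cr_after = cross_ratio(*mapped)
r_before = colinear_ratio(*points[:3])
r_after = colinear_ratio(*mapped[:3])

print("cross ratio   :", cr_before, cr_after)   # unchanged by the map
print("colinear ratio:", r_before, r_after)     # changed by the map
```

The cross ratio of four points survives an arbitrary projective map, while the three-point colinear ratio does not; it is only approximately preserved near the affine limit, which is what makes it a quasi-invariant rather than an invariant.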

Coplanar  Points 

For  coplanar  point  sets,  there  are  invariants  for  lines 
embedded  in  the  plane.  Colinearity  of  three  or  more 
points  is  invariant.  The  invariants  of  the  line  are  inher¬ 
ited  in  the  plane. 

For  coplanar  point  sets  that  are  not  colinear,  there  are 
no  invariants  for  1,  2,  3,  4  points.  There  are  2  invariants 
for  5  points.  Because  lines  and  points  are  dual,  there  are 
also  2  invariants  for  5  non-degenerate  lines.  For  more 
than  5  points  or  more  than  5  lines,  more  invariants  exist. 
The  invariants  assume  correspondence  of  points. 

With one fewer point, quasi-invariants exist. For 4 points in the plane, 4 quasi-invariants exist. 4 non-degenerate lines form a quadrilateral, hence 4 quasi-invariants exist for four lines in the plane.

With  3  points,  the  angle  and  ratio  of  lengths  or  longi¬ 
tudinal  and  transverse  ratios  are  two  quasi-invariants.  3 
points  define  the  simplex,  i.e.  quasi-invariants  exist  on 
any  face  of  a  polyhedron.  3  non-degenerate  lines  form  a 
triangle;  for  3  non-degenerate  lines  there  are  two  quasi¬ 
invariants.  Just  as  with  invariants,  quasi-invariants  re¬ 
late  corresponding  points. 

For  5  or  more  points,  there  are  invariants  and  quasi- 
invariants. 


Plane  Curves 

Because  all  conics  are  equivalent  under  projection, 
there  are  no  invariants  for  conics.  For  two  conics  there 
is  an  invariant  involving  the  bi-tangents,  lines  tangent 
to  two  conics.  There  is  an  invariant  for  a  conic  and  two 
straight  lines. 

One  quasi-invariant  exists  for  a  single  conic.  E.g. 
for  an  ellipse,  the  ratio  of  minor/major  axis  is  quasi¬ 
invariant. 

Invariants  exist  for  algebraic  curves  of  third  and  higher 
order.  Classical  results  of  algebraic  curves  give  an  exten¬ 
sive  theory  of  invariants. 

3  space:  Single  View 

Invariants  exist  for  algebraic  surfaces.  They  appear  to 
be  sensitive  to  measurement  error.  Invariants  are  known 
not  to  exist  for  single  views  of  general  point  sets  in  3D. 

A large number of quasi-invariants are known for 3D surfaces. The utility of quasi-invariants is related to the generalized cylinder representation of object parts [Sato and Binford 92a].

3 space: Multiple Views

While no invariants exist for single views of general point sets in 3D, there are useful invariants for multiple views [Barrett 91].

Quasi-invariants for stereo were found for angles of edges and intervals between edges, as mentioned above [Arnold and Binford 80].

7. QUASI-INVARIANTS FOR COPLANAR POINTS

For three points, there are two quasi-invariants. For four coplanar points, there are four quasi-invariants. Figure 2b shows four coplanar points in a general position relative to the view frame: $p_0, p_1, p_2, p_3$. Those four points in a body-centered frame are $p_0^*, p_1^*, p_2^*, p_3^*$:

$$p_0^* = \begin{bmatrix} 0 \\ 0 \\ 0 \end{bmatrix};\quad
p_1^* = \begin{bmatrix} l_1 \\ 0 \\ 0 \end{bmatrix};\quad
p_2^* = \begin{bmatrix} l_2 \cos\alpha \\ l_2 \sin\alpha \\ 0 \end{bmatrix};\quad
p_3^* = \begin{bmatrix} l_3 \cos\beta \\ l_3 \sin\beta \\ 0 \end{bmatrix} \tag{7.1}$$

with

$$q_1 = p_1 - p_0;\quad q_2 = p_2 - p_0;\quad q_3 = p_3 - p_0;$$

$$l_1 = |q_1|;\quad l_2 = |q_2|;\quad l_3 = |q_3|;$$

$$e_1 = \frac{q_1}{l_1};\quad e_3 = \frac{q_1 \times q_2}{l_1 l_2};\quad e_2 = e_3 \times e_1;$$

$$\cos\alpha = \frac{q_2 \cdot q_1}{l_1 l_2};\quad \cos\beta = \frac{q_3 \cdot q_1}{l_1 l_3};\quad n = \frac{q_1 \times q_2}{|q_1 \times q_2|}.$$

Choose the view frame as in figure 2c with an orthonormal triple $E_1, E_2, E_3$, with $E_1$ along an arbitrary axis ("right") in the image plane, and with $E_3$ along the principal axis, the normal to the plane of projection, toward the center of projection. Then $E_2$ is "down" in the image plane in a right-handed frame.

Consider four coplanar points initially in a frame aligned with the view frame:

$$e_1 = E_1;\quad e_2 = E_2;\quad e_3 = E_3.$$

Transform the coplanar points into the four points in a general frame:

$$p_0^* \to p_0;\quad p_1^* \to p_1;\quad p_2^* \to p_2;\quad p_3^* \to p_3.$$

Rotate through Euler angles $\phi, \theta, \psi$, and translate to $p_0$. Rotate first about $e_3$ through $\phi$ to align $e_2 \to e_2'$ perpendicular to the plane of $E_3$ and $n$. Then rotate by $\theta$ about the transformed $e_2$, i.e. about $e_2'$, to bring the normal of the view frame into the normal to the plane of the four coplanar points: $e_3 \to n$. Rotate by $\psi$ about $n$.

The rotation matrices are $R_\phi, R_\theta, R_\psi$:

$$R_\phi = \begin{bmatrix} \cos\phi & -\sin\phi & 0 \\ \sin\phi & \cos\phi & 0 \\ 0 & 0 & 1 \end{bmatrix};\quad
R_\theta = \begin{bmatrix} \cos\theta & 0 & \sin\theta \\ 0 & 1 & 0 \\ -\sin\theta & 0 & \cos\theta \end{bmatrix};\quad
R_\psi = \begin{bmatrix} \cos\psi & -\sin\psi & 0 \\ \sin\psi & \cos\psi & 0 \\ 0 & 0 & 1 \end{bmatrix} \tag{7.2}$$

The combined rotation matrix is

$$R = R_\psi R_\theta R_\phi \tag{7.3}$$

The coordinates of the points $p_0, p_1, p_2, p_3$ in the view frame are:

$$p_i = R_\psi R_\theta R_\phi\, p_i^* + \Delta \tag{7.4}$$

Projected coordinates of the points are:

$$P_i = \frac{z_0}{z_i}\, p_i \tag{7.5}$$

where $z_0$ is the image distance and $z_i$ is the distance to point $p_i$ in the view frame. Define image difference vectors:

$$Q_1 = P_1 - P_0;\quad Q_2 = P_2 - P_0;\quad Q_3 = P_3 - P_0 \tag{7.6}$$
Quasi-Invariants for Four Coplanar Points

Define two quasi-invariants for images of four coplanar points in a general frame, in terms of image difference vectors $Q_1, Q_2, Q_3$:

$$M_1 = \frac{-Q_{22}Q_{31} + Q_{21}Q_{32}}{-Q_{22}Q_{11} + Q_{21}Q_{12}};\qquad
M_2 = \frac{-Q_{32}Q_{11} + Q_{12}Q_{31}}{-Q_{22}Q_{11} + Q_{21}Q_{12}} \tag{7.7}$$

where $Q_{ij}$ is the $j$th component of the $i$th image difference vector $Q_i$. There are two additional quasi-invariants for three coplanar points. For now, consider $M_1, M_2$. They are strong quasi-invariants.

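To show that $M_1, M_2$ are quasi-invariant under projection at $z = \infty, \theta = 0$, show that they are equal to the Euclidean invariants and that the gradient with respect to view frame parameters vanishes:

$$M_1 \to M_1^*;\quad M_2 \to M_2^*;\quad z \to \infty,\ \theta \to 0,$$

$$\frac{\partial M_i}{\partial x} = \frac{\partial M_i}{\partial y} = \frac{\partial M_i}{\partial z} = \frac{\partial M_i}{\partial \phi} = \frac{\partial M_i}{\partial \theta} = \frac{\partial M_i}{\partial \psi} = 0;\quad z \to \infty;\ \theta \to 0. \tag{7.8}$$

Substitute variables from equations 7.6, 7.5, 7.4, 7.2 and 7.1 into equation 7.7. Conditions 7.8 were demonstrated in Maple; however, only the condition $z \to \infty$ is necessary; the condition $\theta \to 0$ is not necessary.

A stronger result was proved: all second derivatives with respect to the view frame parameters vanish at $z \to \infty$. These are strong quasi-invariants, constant to third order.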
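The limiting behaviour of $M_1, M_2$ can be illustrated numerically. The sketch below assumes the rotation order $R_\psi R_\theta R_\phi$, a plain perspective projection that divides by depth, and a made-up body-frame configuration; the helper names, angles, offsets and lengths are hypothetical and chosen only for illustration. It is a numerical check, not the symbolic demonstration mentioned in the text.

```python
import math


def rot_z(a):
    """Rotation about the third axis by angle a."""
    c, s = math.cos(a), math.sin(a)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


def rot_y(a):
    """Rotation about the second axis by angle a."""
    c, s = math.cos(a), math.sin(a)
    return [[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]]


def apply(m, v):
    """Multiply a 3x3 matrix by a 3-vector."""
    return [sum(m[i][k] * v[k] for k in range(3)) for i in range(3)]


def cross2(a, b):
    """z-component of the cross product of two 2-D vectors."""
    return a[0] * b[1] - a[1] * b[0]


def area_ratios(pts):
    """M1, M2 of equation 7.7 from four points (only x, y are used; index 0 is the reference)."""
    q1, q2, q3 = [[p[0] - pts[0][0], p[1] - pts[0][1]] for p in pts[1:]]
    denom = cross2(q2, q1)
    return cross2(q2, q3) / denom, cross2(q3, q1) / denom


def image_points(body_pts, phi, theta, psi, offset, z0=1.0):
    """Rotate by R_psi R_theta R_phi, translate by offset, then project by z0 / depth."""
    out = []
    for p in body_pts:
        v = apply(rot_z(psi), apply(rot_y(theta), apply(rot_z(phi), p)))
        x, y, z = (v[k] + offset[k] for k in range(3))
        out.append([z0 * x / z, z0 * y / z])
    return out


# Hypothetical body-frame configuration (lengths and angles are made up).
l1, l2, l3, alpha, beta = 1.0, 1.2, 0.8, 1.0, 2.0
body = [[0.0, 0.0, 0.0],
        [l1, 0.0, 0.0],
        [l2 * math.cos(alpha), l2 * math.sin(alpha), 0.0],
        [l3 * math.cos(beta), l3 * math.sin(beta), 0.0]]
m1_true, m2_true = area_ratios(body)


def error_at(z):
    """Largest deviation of the image ratios from the body-frame ratios at distance z."""
    m1, m2 = area_ratios(image_points(body, 0.3, 0.6, -0.4, [0.3, -0.2, z]))
    return max(abs(m1 - m1_true), abs(m2 - m2_true))


for z in (10.0, 100.0, 1e4):
    print(z, error_at(z))
```

Because $M_1$ and $M_2$ are ratios of signed triangle areas, any affine map of the plane leaves them unchanged, so the deviation shrinks as the distance grows even though the plane stays tilted ($\theta = 0.6$ in this sketch).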


Three  Coplanar  Points 

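With three coplanar points with the notation above, there are two independent quasi-invariants at $z \to \infty, \theta \to 0$: the angle between the two difference vectors and the ratio of lengths of difference vectors:

$$\cos\gamma = \frac{Q_2 \cdot Q_1}{|Q_2|\,|Q_1|};\qquad \rho = \frac{|Q_2|}{|Q_1|}.$$

Those quasi-invariants are equivalent to the ratios of longitudinal and transverse components of $Q_2$ to the length of $Q_1$:

$$\rho_l = \frac{Q_2 \cdot Q_1}{|Q_1|^2};\qquad \rho_t = \frac{Q_2 \times Q_1}{|Q_1|^2}.$$

Both $\gamma, \rho$ satisfy the condition that the image estimates are identical to the Euclidean invariant at $z \to \infty, \theta \to 0$:

$$\gamma \to \gamma^*;\quad \rho \to \rho^*;\quad z \to \infty,\ \theta \to 0,$$

$$\frac{\partial \gamma}{\partial x} = \frac{\partial \gamma}{\partial y} = \frac{\partial \gamma}{\partial z} = \frac{\partial \gamma}{\partial \phi} = \frac{\partial \gamma}{\partial \theta} = \frac{\partial \gamma}{\partial \psi} = 0;\quad z \to \infty;\ \theta \to 0,$$

$$\frac{\partial \rho}{\partial x} = \frac{\partial \rho}{\partial y} = \frac{\partial \rho}{\partial z} = \frac{\partial \rho}{\partial \phi} = \frac{\partial \rho}{\partial \theta} = \frac{\partial \rho}{\partial \psi} = 0;\quad z \to \infty;\ \theta \to 0.$$

The same holds for the longitudinal and transverse ratios:

$$\rho_l \to \rho_l^*;\quad \rho_t \to \rho_t^*;\quad z \to \infty,\ \theta \to 0,$$

with vanishing partial derivatives with respect to $x, y, z, \phi, \theta, \psi$ at $z \to \infty;\ \theta \to 0$. (7.9)

Figure 3 shows the two-dimensional distribution of $\gamma, \rho$ in terms of fractions of the view sphere. The two quasi-invariants are not strong quasi-invariants, i.e. second derivatives are non-vanishing. They are nearly independent.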


The  Image  Colinear  Ratio  is  Quasi-Invariant 


The cross ratio is invariant for four colinear points. The image colinear ratio is quasi-invariant under projection for three colinear points at $z_2 = \infty, \theta = 0$ on the fat sphere.

Three points in space $x_1, x_2, x_3$ project to three image points $X_1, X_2, X_3$:

$$p_i = \begin{bmatrix} x_i \\ 0 \\ z_i \end{bmatrix};\qquad
P_i = \begin{bmatrix} X_i \\ 0 \\ z \end{bmatrix} = \frac{z}{z_i}\begin{bmatrix} x_i \\ 0 \\ z_i \end{bmatrix};\qquad i = 1, 2, 3.$$

Represent the line in space by the center point $z_2$, two line segments $l_1, l_2$, and two angles, azimuth $\phi$ and inclination $\theta$: $z_2, l_1, l_2, \phi, \theta$. Choose $e_1$ and $x_1$ along the line, and $e_2$ and $x_2$ normal to the line. Thus $\phi = 0$.


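$$x_1 = x_2 - l_1 \cos\theta;\quad z_1 = z_2 - l_1 \sin\theta$$

$$x_3 = x_2 + l_2 \cos\theta;\quad z_3 = z_2 + l_2 \sin\theta$$

The colinear ratio is computed in 3-space:

$$r = \frac{l_2}{l_1 + l_2} \tag{7.11}$$

The image colinear ratio is computed in the image by the ratio of one interval in the image to the sum of intervals.

$$R = \frac{X_3 - X_2}{X_3 - X_1}
  = \frac{\dfrac{x_3}{z_3} - \dfrac{x_2}{z_2}}{\dfrac{x_3}{z_3} - \dfrac{x_1}{z_1}}
  = \frac{(x_3 z_2 - x_2 z_3)\, z_3 z_1}{(x_3 z_1 - x_1 z_3)\, z_3 z_2} \tag{7.12}$$

From equation 7.12, calculate the ratio using equation 7.11 for $x_i, z_i$:

$$R = \frac{l_2}{l_1 + l_2}\cdot
\frac{z_2 \cos\theta - x_2 \sin\theta - l_1 \sin\theta \left(\cos\theta - \dfrac{x_2}{z_2}\sin\theta\right)}{z_2 \cos\theta - x_2 \sin\theta} \tag{7.13}$$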
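Equation 7.13 can be cross-checked numerically. A minimal sketch follows, assuming image coordinates $X_i = x_i / z_i$ (any common image-distance factor cancels in the ratio) and the point layout given above with the center point at $(x_2, z_2)$. Working the algebra through, the numerator of 7.13 factors as $(z_2\cos\theta - x_2\sin\theta)(1 - (l_1/z_2)\sin\theta)$, so the sketch also compares against the simplified form $R = \frac{l_2}{l_1+l_2}\left(1 - \frac{l_1}{z_2}\sin\theta\right)$; that simplification is our own step, not one stated in the text, and the function names are hypothetical.

```python
import math


def image_colinear_ratio(x2, z2, l1, l2, theta):
    """Image colinear ratio (X3 - X2) / (X3 - X1) for three colinear points."""
    c, s = math.cos(theta), math.sin(theta)
    x1, z1 = x2 - l1 * c, z2 - l1 * s
    x3, z3 = x2 + l2 * c, z2 + l2 * s
    big_x1, big_x2, big_x3 = x1 / z1, x2 / z2, x3 / z3   # perspective projection
    return (big_x3 - big_x2) / (big_x3 - big_x1)


def simplified_ratio(z2, l1, l2, theta):
    """Simplified form r * (1 - (l1 / z2) * sin(theta)), independent of x2."""
    return l2 / (l1 + l2) * (1.0 - (l1 / z2) * math.sin(theta))


cases = [(0.7, 5.0, 1.0, 2.0, 0.4),
         (0.0, 3.0, 0.5, 0.5, 1.2),
         (-1.0, 50.0, 2.0, 1.0, 0.1)]
for x2, z2, l1, l2, theta in cases:
    print(image_colinear_ratio(x2, z2, l1, l2, theta),
          simplified_ratio(z2, l1, l2, theta))
```

In this form the relative deviation from the 3-space ratio is $(l_1/z_2)\sin\theta$, which vanishes as $z_2 \to \infty$ or $\theta \to 0$.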
Now, determine that the colinear ratio is quasi-invariant by computing its derivatives at $z_2 = \infty, \theta = 0$.

$$\frac{\partial R}{\partial z_2} = \frac{\partial R}{\partial \theta} = 0.$$

The colinear ratio is quasi-invariant at $z_2 = \infty;\ \theta = 0$ if derivatives vanish. It is an affine invariant.

Now consider the worst case, i.e. the largest values of $l_1/z_2$. Human and aerial limits that follow are extremes. For most objects, $l_1/z_2$ values are much smaller.

•  For human perception, a line 20 cm at 30 cm, $l_1/z_2 < .1$, subtends a large visual angle.

•  For aerial photography with a 9" by 9" image with a focal length of f = 12", $l_1/z_2 < 4.5/12 = .375$. At 6 miles altitude, $l_1 + l_2 > 4.5$ miles.

Consider $\epsilon$ contours of the colinear ratio:

$$R = \frac{l_2}{l_1 + l_2}\,(1 - \epsilon).$$

Again, consider $x_2 = 0$ for simplicity.

$$\frac{l_1}{z_2}\sin\theta = \epsilon;\qquad \sin\theta = \frac{\epsilon\, z_2}{l_1}.$$

•  The colinear ratio is invariant to 10% over the human limit $l_1/z_2 < .1$.

•  The colinear ratio is invariant to 30% over almost the full range of interest for aerial photography.

The colinear ratio for 3 colinear points is a strong quasi-invariant. Figure 4 shows $\epsilon$ contours for .1, .2, .3. The area above and to the left of a contour is the range of angles and distances on the fat sphere for which the relative deviation of the colinear ratio is less than $\epsilon$.
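The contour relation can be turned into a small calculation. This sketch assumes the contour condition $(l_1/z_2)\sin\theta = \epsilon$ and returns, for a given $\epsilon$ and $l_1/z_2$, the largest inclination for which the deviation stays within $\epsilon$; the function name is hypothetical.

```python
import math


def max_inclination(eps, l1_over_z2):
    """Largest inclination theta (radians) with (l1/z2) * sin(theta) <= eps."""
    s = eps / l1_over_z2
    if s >= 1.0:
        return math.pi / 2      # every inclination satisfies the bound
    return math.asin(s)


for eps in (0.1, 0.2, 0.3):
    for ratio in (0.1, 0.375):
        print(eps, ratio, math.degrees(max_inclination(eps, ratio)))
```

At the human limit $l_1/z_2 = .1$ the 10% bound holds at every inclination, while at the aerial limit $l_1/z_2 = .375$ the 10% bound restricts the inclination and the 30% bound holds for inclinations up to about 53 degrees.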


Variance of Colinear Ratio

Now compute the variance of the colinear ratio from the true value $l_2/(l_1 + l_2)$.

Assume that the distribution of viewpoints is uniform on $l_1/z_2$ over $[0, q]$, where $q$ is .1 for human vision and .375 for aerial photography. Assume that the distribution is uniform on $\theta$.

$$V(R) = \int_0^{q}\!\!\int_0^{\pi/2} \left(\frac{l_1}{z_2}\sin\theta\right)^{2} d\theta\; d\!\left(\frac{l_1}{z_2}\right) = \frac{\pi}{12}\, q^{3}$$

•  For the human limit, $q = .1$:

$$V = \frac{\pi}{12}\,(.1)^3 = .000262.$$

The standard deviation is $\sigma = .016$; invariant to 1.6%.

•  For aerial photography, $q = .375$:

$$V = \frac{\pi}{12}\,(.375)^3 = .0138.$$

The standard deviation is $\sigma = .117$, invariant to 12%.

These values are very tight, nearly invariant.
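The quoted values can be reproduced with a short calculation. The sketch assumes the variance is the double integral of $((l_1/z_2)\sin\theta)^2$ over $l_1/z_2 \in [0, q]$ and $\theta \in [0, \pi/2]$, which evaluates to $\pi q^3/12$; this reading is an assumption, adopted because it matches the standard deviations stated in the text. The function names are hypothetical.

```python
import math


def variance_closed_form(q):
    """pi * q**3 / 12, the assumed closed form of the double integral."""
    return math.pi * q ** 3 / 12.0


def variance_numeric(q, n=200):
    """Midpoint-rule evaluation of the same double integral, as a cross-check."""
    du = q / n
    dt = (math.pi / 2) / n
    total = 0.0
    for i in range(n):
        u = (i + 0.5) * du
        for j in range(n):
            t = (j + 0.5) * dt
            total += (u * math.sin(t)) ** 2
    return total * du * dt


for q in (0.1, 0.375):
    v = variance_closed_form(q)
    print(q, v, math.sqrt(v), variance_numeric(q))
```

For $q = .1$ this gives a variance near .000262 and a standard deviation near .016; for $q = .375$ it gives a variance near .0138 and a standard deviation near .117.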

Length  is  not  Quasi-Invariant 

Projected  length  is  not  a  quasi-invariant  under  per¬ 
spective  or  weak  perspective.  Projected  length  does 
equal  Euclidean  length  at  one  distance  but  the  gradient 
with  respect  to  view  parameters  does  not  vanish  there. 
At  z  =  00,  the  gradient  does  vanish,  but  the  value  of 
projected  length  does  not  converge  to  Euclidean  length. 
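The contrast with the colinear ratio can be made concrete. The sketch below assumes a fronto-parallel segment of length `l` at distance `z`, projected with image distance `z0`, so that the projected length is `z0 * l / z`; these names and the fronto-parallel setup are assumptions made for illustration.

```python
def projected_length(l, z, z0=1.0):
    """Perspective projection of a fronto-parallel segment of length l at distance z."""
    return z0 * l / z


def d_projected_length_dz(l, z, z0=1.0):
    """Derivative of the projected length with respect to distance z."""
    return -z0 * l / z ** 2


l = 2.0
# At z = z0 the value matches the Euclidean length, but the gradient is not zero.
print(projected_length(l, 1.0), d_projected_length_dz(l, 1.0))
# At large z the gradient vanishes, but the value tends to 0 rather than to l.
print(projected_length(l, 1e9), d_projected_length_dz(l, 1e9))
```

Neither point satisfies both conditions of a quasi-invariant at once: where the value is right the gradient is nonzero, and where the gradient vanishes the value is wrong.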
Figure 2a: Three colinear points.


Figure 2b: Four coplanar points.


Figure 2c: Egocentric coordinate frame.


Figure 3: 2-d distribution of angle and length ratio for three points.


Figure 4: Epsilon contours of colinear ratio quasi-invariant.
All values above and left of epsilon contour lie within the contour.
Abstract

In this paper, we introduce a new surface representation for recognizing curved objects. Our approach begins by representing an object by a discrete mesh of points built from range data or from a geometric model of the object. The mesh is computed from the data by deforming a standard shaped mesh, for example, an ellipsoid, until it fits the surface of the object. We define local regularity constraints that the mesh must satisfy. We then define a canonical mapping between the mesh describing the object and a standard spherical mesh. A surface curvature index that is pose-invariant is stored at every node of the mesh. We use this object representation for recognition by comparing the spherical model of a reference object with the model extracted from a new observed scene. We show how the similarity between reference model and observed data can be evaluated and we show how the pose of the reference object in the observed scene can be easily computed using this representation.

1.0  Introduction 

In this paper, we propose a new representation for 3-D objects which is suitable for object recognition and pose determination. Many representations have been proposed to address this recognition problem. Local approaches attempt to represent objects as sets of primitives such as faces or edges. Most early local methods handle polyhedral objects and report effective and encouraging results. Representative systems include [12][19][15]. Few systems can handle curved surfaces. Examples include early work in which primitive surfaces enclosed by orientation discontinuity boundaries are extracted from range data [20]. Other systems determine primitive surfaces which satisfy planar or quadric equations [9]. Techniques based on differential geometry such as [3] segment range images using Gaussian curvatures.

The global methods assume one particular coordinate system attached to an object and represent the object as an implicit or parametric function in this coordinate system. The resulting representation is global in that the implicit function represents the entire shape of the object or of a large portion of the object. The generalized cylinder (GC) is a representative of this group. Although encouraging results have been obtained in recognizing GCs in intensity images, using generalized cylinders for recognition is difficult due to the difficulty of extracting GC parameters from input images.

Superquadrics (SQ) representation also belongs to the class of global representations [21]. The SQs represent a limited set of shapes which can be extended by adding parameters to the generic implicit equation of SQs. A possible extension is to segment objects into sets of superquadrics [10], although computational complexity of the scene analysis may become prohibitive. An interesting attempt to handle a large class of natural objects is discussed in [4].

EGI and CEGI map surface orientation distributions to the Gaussian sphere [13][14][17][16]. Since the Gauss map is independent of translation, the representation is quite suitable to handle convex curved objects. In this case, recognition proceeds by finding the rotation that maximizes the correlation between two EGIs [14][7]. However, when part of the object is occluded, these techniques cannot reliably extract the representation.

Recently, new approaches based on the idea of fitting a bounded algebraic surface of fixed degree to a set of data points [22][23][11] have been proposed. Although encouraging results have been obtained in this area, more research is needed in the areas of bounding constraints, convergence of surface fitting, and recognition before this approach becomes practical. For a survey of other techniques that can be used for global surface fitting, see [5].

Another class of approaches attempts to match sets of points directly without any prior surface fitting. An example is the work by [2] in which the distance between point sets is computed and minimized to find the best transformation between model and scene.

Our approach is a combination of the point set matching and of the original EGI approach. As in the case of point set matching, we want to avoid fitting analytical surfaces to represent an object. Instead, we use a representation that simply consists of a collection of points, or nodes, arranged in a mesh covering the entire surface of the object. This has the advantage that the object can have any arbitrary shape, as long as that shape is topologically equivalent to the sphere. To avoid problems with variable density of nodes on the mesh, we need to define regularity constraints that must be enforced. We use an extension of the deformable surfaces algorithms introduced in [8] to compute the meshes. As in the EGI algorithms, each node of the mesh is mapped onto a regular mesh on the unit sphere, and a measure of local surface curvature, the simplex angle, is stored at the corresponding node on the sphere. We call the corresponding spherical representation the Simplex Angle Image (SAI). Finally, we define the regularity constraints such that if $M$ is the mesh representing an object, and $M'$ is the mesh representing the same object after transformation by a combination of rotation, translation, and scaling, then the corresponding distributions of simplex angles on the spherical representations $S$ and $S'$ are the same up to a rotation. Therefore, to determine whether two objects are the same, we need only compare the corresponding spherical distributions.

This approach appears similar to the EGI approach. However, one fundamental difference is that a unique mesh, up to rotation, translation and scale, can be reconstructed from a given SAI. In the case of the EGI, this property is true only for convex objects. Another fundamental difference is that the SAI preserves connectivity in that patches that are connected on the surface of the input object are still connected in the spherical representation. The latter is the main reason why our approach can be applied to arbitrary non-convex objects. Connectivity conservation is also the reason why the SAI can be used for recognition even in the presence of significant occlusion, as we will see later in the paper, whereas EGI and other global representations cannot.

The paper is organized as follows. In Section 2.0, we describe a simple representation of closed 2-D curves which we extend to three dimensional surfaces in Section 3.0. In Sections 3.0 to 5.0 we describe the fundamentals of the SAI algorithms in the case of complete object models. In Section 4.0, we show how to obtain SAIs from range data. In Section 5.0, we describe the SAI matching. We address the problem of occlusion and partial models in Section 6.0.

2.0  Representation  of  2-D  Curves 

A standard approach to representing and recognizing contours is to approximate contours by polygons, and to compute a quantity that is related to the curvature of the underlying curve. The similarity between contours can then be evaluated by comparing the distribution of curvature measurement at the vertices of the polygons. In this section, we introduce the basic concepts that can be used to manipulate polygonal representations of contours. The concepts discussed in this section are well known and have been studied extensively. Our purpose here is to introduce them in a way that facilitates their extension to three-dimensional surfaces.

In order to quantify the curvature of a contour, we use the angle $\varphi$ between consecutive segments. Like the curvature, the angle $\varphi$ is independent of rotation and translation. One problem is that if the lengths of the segments representing the curve are allowed to vary, the value of $\varphi$ depends not only on the shape of the curve but also on the distribution of points on the curve. One way to enforce uniform distribution is to impose a local regularity condition on the distribution of vertices. The local regularity condition simply states that all the segments must have the same length. Another geometric definition of this condition is illustrated in Figure 1. The condition that the lengths of the two segments $PP_1$ and $PP_2$ are the same is equivalent to the condition that the projection of node $P$ on the line joining its two neighbors $P_1$ and $P_2$ coincides with the center of $P_1$ and $P_2$. This is obviously a more complicated way of formulating the simple regularity condition, but it will become useful when we extend this notion to three dimensions.


Figure  1:  Local  Regularity 


The last step in representing two-dimensional contours is to build a circular representation that can be used for recognizing contours. Let us assume that the contour is divided into $N$ segments with vertices $P_1, \ldots, P_N$, and with corresponding angles $\varphi_1, \ldots, \varphi_N$. Let us divide the unit circle using $N$ equally spaced vertices $C_1, \ldots, C_N$. Finally, let us store the angle $\varphi_i$ associated with $P_i$ at the corresponding circle point $C_i$ (Figure 2). The circular representation of the contour is invariant by rotation, translation, and scaling. As the density of points increases, the circular representations of two contours are identical up to a rotation of the unit circle. This property allows for comparing contours by declaring that two contours are identical if there exists a rotation of the unit circle that brings their representations in correspondence. Also, when comparing contours, the distribution of the vertices $C_i$ on the circle must be uniform. We refer to this property as global regularity.


Figure 2: Mapping from Shape to Unit Circle
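The circular representation and its comparison up to a rotation of the unit circle can be sketched in a few lines. The sketch below assumes a closed polygon given as a list of 2-D vertices, uses the signed turning angle at each vertex as the stored angle, and searches over cyclic shifts for the best match. The example polygon, the function names and the sum-of-squares cost are illustrative assumptions; the polygon is not resampled to equal-length segments, so it only stands in for an already regularized contour.

```python
import math


def turning_angles(vertices):
    """Signed angle between consecutive segments at each vertex of a closed polygon."""
    n = len(vertices)
    angles = []
    for i in range(n):
        ax, ay = vertices[i - 1]
        bx, by = vertices[i]
        cx, cy = vertices[(i + 1) % n]
        ux, uy = bx - ax, by - ay          # incoming segment
        vx, vy = cx - bx, cy - by          # outgoing segment
        angles.append(math.atan2(ux * vy - uy * vx, ux * vx + uy * vy))
    return angles


def best_rotation(a, b):
    """Cyclic shift k of b minimising the sum of squared differences to a."""
    n = len(a)

    def cost(k):
        return sum((a[i] - b[(i + k) % n]) ** 2 for i in range(n))

    k = min(range(n), key=cost)
    return k, cost(k)


def similarity(pt, angle=0.7, scale=2.5, shift=(3.0, -1.0)):
    """Rotate, scale and translate a 2-D point (arbitrary parameters)."""
    c, s = math.cos(angle), math.sin(angle)
    return (scale * (c * pt[0] - s * pt[1]) + shift[0],
            scale * (s * pt[0] + c * pt[1]) + shift[1])


contour = [(0, 0), (4, 0), (5, 2), (3, 4), (1, 3), (-1, 1)]
# Same contour after a similarity transform, with the starting vertex moved by two.
moved = [similarity(contour[(j + 2) % len(contour)]) for j in range(len(contour))]

shift, residual = best_rotation(turning_angles(contour), turning_angles(moved))
print(shift, residual)
```

The angles are unchanged by the similarity transform, so the only freedom left is the starting vertex, which the cyclic search recovers.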

3.0  Representation  of  3-D  Surfaces 

In this section we extend to three-dimensional surfaces the concepts of curvature measure (Section 3.3), local and global regularity (Section 3.2 and Section 3.1), and circular representation (Section 3.4). We consider the case of representing surfaces topologically equivalent to the sphere. (Cases in which only part of the surface is visible will be addressed in Section 6.0.) Detailed presentations of the basic results on semi-regular tessellations, triangulations, and duality can be found in [18][24][25].

3.1  Global  Regularity  and  Mesh  Topology 

A natural discrete representation of a surface is a graph of points, or mesh, such that each node is connected to each of its closest neighbors by an arc of the graph. It is desirable for many algorithms to have a constant number of neighbors at each node. We use a class of meshes such that each node has exactly three neighbors. Such meshes can always be constructed as the dual of a triangulation.

As mentioned in the previous section, global regularity can be easily achieved in two dimensions since a curve can always be divided into an arbitrary number of segments of equal length. It is well known that only approximate global regularity can be achieved in three dimensions. We use a quasi-regular triangulation constructed by subdividing each triangular face of a 20-face icosahedron into $N^2$ smaller triangles. The final mesh is built by taking the dual of the $20N^2$ faces triangulation, yielding a mesh with the same number of nodes. For the experiments presented in this paper, we used a subdivision frequency of $N = 7$ for a total number of nodes of 980.
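The node count follows directly from the subdivision frequency. A minimal sketch, assuming each icosahedron face is split into $N^2$ triangles and the mesh is the dual of that triangulation; the function name and dictionary keys are hypothetical.

```python
def icosahedron_subdivision(n):
    """Element counts for a frequency-n subdivided icosahedron and its dual mesh."""
    faces = 20 * n * n               # triangles after subdivision
    edges = 3 * faces // 2           # each triangle has 3 edges, each edge shared by 2
    vertices = edges - faces + 2     # Euler's formula V - E + F = 2 on the sphere
    return {
        "triangles": faces,
        "triangulation_vertices": vertices,
        "dual_nodes": faces,         # one mesh node per triangle
        "dual_arcs": edges,          # each node has exactly three neighbors
    }


counts = icosahedron_subdivision(7)
print(counts)
```

For $N = 7$ this gives 980 mesh nodes, and the arc count equals $3 \times 980 / 2$, consistent with every node having exactly three neighbors.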

3.2  Local  Regularity 

The  definition  of  local  regularity  in  three  dimensions 
is  a  straightforward  extension  of  the  definition  of  Section 
2.0. Let P be a node of the mesh, $P_1$, $P_2$, $P_3$ be its three
neighbors, G be the centroid of the three points, and Q be
the projection of P on the plane defined by $P_1$, $P_2$, and $P_3$
(Figure 3). The local regularity condition simply states
that Q coincides with G. This is the same condition as in
two dimensions, replacing the triangle ($P_1$, $P_2$, P) of
Figure 1 by the tetrahedron ($P_1$, $P_2$, $P_3$, P). The local
regularity condition is invariant under rotation, translation, and
scaling.
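
The condition can be tested numerically. The following is a minimal sketch, assuming points are given as 3-tuples of floats; the helper names and the tolerance are illustrative choices, not part of the original method.

```python
import math


def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def regularity_offset(p, p1, p2, p3):
    """Distance between Q, the projection of p on the plane of the
    three neighbors, and G, the centroid of the three neighbors."""
    g = tuple((p1[i] + p2[i] + p3[i]) / 3.0 for i in range(3))
    normal = _cross(_sub(p2, p1), _sub(p3, p1))
    normal_sq = _dot(normal, normal)
    if normal_sq == 0.0:
        raise ValueError("the three neighbors are collinear")
    # Remove from p its component along the plane normal.
    t = _dot(_sub(p, p1), normal) / normal_sq
    q = tuple(p[i] - t * normal[i] for i in range(3))
    return math.dist(q, g)


def is_locally_regular(p, p1, p2, p3, tol=1e-9):
    """True when Q coincides with G up to the given tolerance."""
    return regularity_offset(p, p1, p2, p3) <= tol
```

With neighbors (1,0,0), (0,1,0), (0,0,1), the node (1,1,1) lies on the normal through the centroid and satisfies the condition, while (2,1,1) does not.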


G: center of ($P_1$, $P_2$, $P_3$)


Figure  3:  Local  Regularity  in  Three  Dimensions 


3.3  Discrete  Curvature  Measure 

Before  defining  the  discrete  curvature  measure,  we 
need to define some notation (Figure 4 (a)). Let P be a
node of the mesh, $P_1$, $P_2$, $P_3$ its three neighbors, O the
center of the sphere circumscribed to the tetrahedron (P,
$P_1$, $P_2$, $P_3$), Z the line passing through O and through the
center of the circle circumscribed to ($P_1$, $P_2$, $P_3$). Now, let
us consider the cross section of the surface by the plane
$\Pi$ containing Z and P. The intersection of $\Pi$ with the
tetrahedron is a triangle. One vertex of the triangle is P, and
the base opposite P is in the plane ($P_1$, $P_2$, $P_3$) (Figure 4
(b)). We define the angle $\varphi_0$ as the angle between the two
edges of the triangle intersecting at P. By definition, $\varphi_0$ is
the discrete curvature measure at node P. We call $\varphi_0$ the
simplex angle at P, since it is the extension to a
three-dimensional simplex, the tetrahedron, of the notion
introduced in Figure 1 for a two-dimensional simplex, the
triangle.


It  is  clear  that  the  simplex  angle  is  invariant  by  rota¬ 
tion,  translation,  and  scaling.  In  the  remainder  of  the 
paper, the simplex angle $\varphi_0$ at a node P will be denoted
by g(P). It is important to note that other definitions of
g(P) are possible as long as the definition guarantees that
g(P) is invariant under rigid transformations and scaling.
We selected this definition because it is easy to compute
and it is reasonably stable with respect to small variations
of the surface.


3.4  Simplex  Angle  Image 

We  have  extended  the  notions  of  regularity  and  sim¬ 
plex  angle  to  three-dimensional  surfaces;  we  can  now 
extend  the  circular  representation  developed  in  two 
dimensions to a spherical representation in three
dimensions. Let $\mathcal{M}$ be a mesh of points on a surface such that it
has the topology of the quasi-regular mesh of Section
3.1. Let S be a reference mesh with the same number of
nodes on the unit sphere. We can establish a one-to-one
mapping h between the nodes of $\mathcal{M}$ and the nodes of S.
The mapping h depends only on the topology of the mesh
and the number of nodes. Specifically, for a given size of
the mesh, $M = 20 \times N^2$, where N is the frequency of the
mesh (Section 3.1), we can define a canonical numbering
of the nodes that represents the topology of any M-mesh.
In other words, if two nodes from two different M-meshes
have the same index, so do their neighbors. With this
indexing system, h(P), where P is a node of the spherical
mesh, is the node of the object mesh that has the same
index as P.

Given h, we can store at each node P of S the simplex
angle of the corresponding node on the surface, g(h(P)).
The resulting structure is a quasi-regular mesh on the
unit sphere, each node being associated with a value
corresponding to the simplex angle of a point on the original
surface. By analogy with the EGI, we call this
representation the Simplex Angle Image (SAI). In the remainder
of the paper, we will denote by g(P) instead of g(h(P))
the simplex angle associated with the object mesh node
h(P) since there is no ambiguity.
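
Because h pairs nodes that share an index, building the SAI amounts to copying angles across equal indices. The sketch below assumes both meshes are stored as index-aligned lists (positions, angles, and neighbor index lists); this storage layout and the function names are illustrative assumptions.

```python
def build_sai(sphere_nodes, object_angles):
    """Attach to each unit-sphere node the simplex angle of the
    object-mesh node with the same index, i.e. g(h(P))."""
    if len(sphere_nodes) != len(object_angles):
        raise ValueError("both meshes must have the same number of nodes")
    return list(zip(sphere_nodes, object_angles))


def connectivity_is_conserved(sphere_neighbors, object_neighbors):
    """True when nodes with equal index have the same neighbor
    indices in both meshes, so that arcs map to arcs under h."""
    if len(sphere_neighbors) != len(object_neighbors):
        return False
    return all(sorted(a) == sorted(b)
               for a, b in zip(sphere_neighbors, object_neighbors))
```

In this layout the node positions never enter the mapping, which reflects the statement that h depends only on the topology of the mesh.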

If the original mesh $\mathcal{M}$ satisfies the condition of local
regularity, then the corresponding SAI has several
important properties. First, the SAI is invariant under
translation and scaling of the original object, given a mesh $\mathcal{M}$.
This condition is satisfied because the simplex angle
itself is invariant under translation and scaling (Section 3.3),
and because $\mathcal{M}$ still satisfies the local regularity condition
after translation and scaling (Section 3.2).

From this definition of the mapping h, we can easily
see the origin of the property of connectivity conservation
mentioned in the Introduction. If two nodes $P_1$ and $P_2$ are
connected on the spherical mesh, then the two
corresponding nodes $M_1 = h(P_1)$ and $M_2 = h(P_2)$ on the object
mesh are also connected by an arc of the object mesh.
The property holds because of the definition of h, which
depends only on the topology of the mesh, not on the
positions of the nodes.

The fundamental property of the SAI is that it
represents an object unambiguously up to a rotation. More
precisely, if $\mathcal{M}$ and $\mathcal{M}'$ are two meshes on the same object
with the same number of nodes, both satisfying the local
regularity condition, then the corresponding SAIs S and
S' are identical up to a rotation of the unit sphere. This
property holds even in the case of arbitrary non-convex
objects because of the connectivity conservation
property. Strictly speaking, this is true only as the number of
nodes becomes very large because the nodes of one
sphere do not necessarily coincide with the nodes of the
rotated version of the other sphere. (This problem is
addressed in Section 5.1.)

4.0  Building  SAIs  from  3-D  Data

In  the  previous  sections,  we  have  defined  the  notion 
of  locally  regular  mesh  and  its  associated  SAI.  In  this 
section, we describe the algorithm for computing such a
mesh from an input object. We assume that we have
some  input  description  of  an  object.  The  only  require¬ 
ment  is  that  the  input  description  allows  for  computing 
the  distance  between  an  arbitrary  point  in  space  and  the 
surface  of  the  object. 

The general approach is to first define an initial mesh
near the object (Section 4.2) and to deform it by moving
its nodes until the mesh satisfies two conditions (Section
4.1): It must be close to the input object, and it must
satisfy the local regularity condition. Once the mesh is
created, the simplex angle is computed at every node and is
mapped on the unit sphere (Section 4.3).

4.1  Mesh  Deformation 

The  formalism  of  deformable  surfaces  [8]  is  applied 
to  deform  the  mesh.  Specifically,  each  node  is  subject  to 
two types of forces. The first type of force brings a node
closer  to  the  input  surface,  while  the  second  type  forces 
the node to satisfy the normal constraint. Let $F_T$ be the
force of the first type applied at a given node P, and $F_G$
be the force of the second type at the same node. If $P_{t-1}$,
$P_t$, and $P_{t+1}$ are the positions of node P at three
consecutive iterations, the update rule is defined as:

$$P_{t+1} = P_t + F_T + F_G + D\,(P_t - P_{t-1})$$

This  expression  is  simply  the  discrete  version  of  the 
fundamental  equation  describing  a  mechanical  system 
subject to two forces and to a damping coefficient D. In
practice,  the  iterative  update  of  the  mesh  is  halted  when 
the  relative  displacements  of  the  nodes  from  one  iteration 
to  the  next  are  small. 

$F_T$ is defined by calculating the point $P_c$ from the
original surface that is closest to the node, that
is: $F_T = k\,\overrightarrow{PP_c}$, where $k$ is the spring constant of the
force, which must be between 0 and 1.

The curvature force $F_G$ is calculated by computing
the point $P_g$ that is on the line normal to the triangle
formed by the three neighbors of P and containing G
(Figure 3), and such that the mesh curvature at P and $P_g$
are the same: $g(P_g) = g(P)$. Those two conditions
ensure that $P_g$ satisfies the local regularity condition
while keeping the original mesh curvature. $F_G$ is defined
as a spring force proportional to the distance between P
and $P_g$: $F_G = \alpha\,\overrightarrow{PP_g}$.
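
The damped iteration can be illustrated on a toy problem. The sketch below is not the authors' implementation: it assumes the input surface is the unit sphere (so the closest-point query is a normalization), applies only the data force and leaves the curvature force out, and halts on a small absolute displacement. All names and parameter values are illustrative.

```python
import math


def closest_point_on_unit_sphere(p):
    """Toy stand-in for the data-dependent closest-point query."""
    norm = math.sqrt(sum(c * c for c in p))
    if norm == 0.0:
        raise ValueError("closest point is undefined at the sphere center")
    return tuple(c / norm for c in p)


def update(p_prev, p_cur, forces, damping):
    """One step: P_next = P_cur + sum(forces) + damping * (P_cur - P_prev)."""
    return tuple(
        p_cur[i] + sum(f[i] for f in forces) + damping * (p_cur[i] - p_prev[i])
        for i in range(3)
    )


def deform_node(start, k=0.2, damping=0.1, tol=1e-9, max_iter=1000):
    """Iterate the update until the displacement between two
    consecutive iterations falls below tol."""
    p_prev = p_cur = tuple(start)
    for _ in range(max_iter):
        target = closest_point_on_unit_sphere(p_cur)
        # Spring force toward the closest surface point, with constant k.
        data_force = tuple(k * (t - c) for t, c in zip(target, p_cur))
        p_next = update(p_prev, p_cur, [data_force], damping)
        if math.dist(p_next, p_cur) < tol:
            return p_next
        p_prev, p_cur = p_cur, p_next
    return p_cur
```

Starting from (2, 1, 2), the node slides along its radial direction and settles on the sphere. A curvature force would be passed as a second entry of the `forces` list.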

4.2  Initialization 

For  the  iterative  mesh  update  to  converge,  the  mesh 
must be initialized to some shape that is close to the initial
shape. We use two different approaches depending
on  whether  the  input  data  is  measured  on  the  object  by  a 
sensor,  or  a  synthetic  CAD  model. 

In  the  case  of  sensor  data,  we  use  the  techniques  pre¬ 
sented  in  [8]  using  deformable  surfaces  to  build  a  trian¬ 
gulated  mesh  that  approximates  the  object.  Once  a 
triangulation  is  obtained,  the  mesh  is  initialized  by  tes- 
sellating  the  ellipsoid  of  inertia  of  the  input  shape. 
Although  the  ellipsoid  is  only  a  crude  approximation  of 
the  object,  it  is  close  enough  for  the  mesh  deformation 
process  to  converge. 

In the case of a synthetic CAD model as input, for
example a polyhedron, the ellipsoid of inertia is computed
directly from the synthetic model. A regular mesh
is  mapped  on  the  ellipsoid  in  the  same  manner  as  in  the 
previous  case. 


Once  the  initial  ellipsoid  is  generated,  the  mesh  gen¬ 
eration  is  completely  independent  of  the  actual  format  of 
the  input  data.  The  only  data-dependent  operation  is  the 
computation  of  the  object  point  closest  to  a  given  node. 

4.3  From  Mesh  to  SAI

Once  a  regular  mesh  is  created  from  the  input  data,  a 
reference  mesh  with  the  same  number  of  nodes  is  created 
on  the  unit  sphere.  The  value  of  the  angle  at  each  node  of 
the  mesh  is  stored  in  the  corresponding  node  of  the 
sphere. The SAI building algorithm is illustrated in
Figure 6 with range data as input and in Figure 8 with a
polyhedral model as input. Figure 5 shows three views of a
green pepper from which three 240x256 range images
were taken using the OGIS range finder. The images are
merged and an initial description of the object is
produced using the deformable surface algorithm. Figure 6
(a) and Figure 6 (b) show the initial mesh mapped on the
ellipsoid and the mesh at an intermediate stage. Figure 6
(c) shows the final regular mesh on the object. Figure 6
(d) shows the corresponding SAI. The meshes are
displayed as depth-cued wireframes. The SAI is displayed
by placing each node of the sphere at a distance from the
origin that is proportional to the angle stored at that node.


Figure  5:  Three  Views  of  an  Object 


(a) Initial Ellipsoid  (b) Mesh After 10 Iterations

(c) Final Mesh  (d) SAI

Figure 6: Building SAI from Range Data

Figure 8 (a) and Figure 8 (b) show the mesh and the
SAI, respectively, in the case of an object initially
described as a polyhedron as generated by the VANTAGE
CAD system shown in Figure 7. The arrow
between Figure 8 (a) and Figure 8 (b) shows the
correspondence between the object mesh and its SAI. The vertical
crease in the middle of the SAI corresponds to the
concave region between the two cylinders. The top and
bottom regions of the SAI exhibit large values of the angle
corresponding to the transition between the cylindrical
and planar faces at both extremities of the object. In this
example, the SAI exhibits some noise in regions that are
near the edges between faces of the object. In practice,
the SAI is smoothed before being used for recognition.


(a)  Final  Mesh  (b)  SAI 

Figure  8:  Building  the  SAI  from  a  Polyhedral  Model 

5.0  Matching  Objects 

We  now  address  the  matching  problem:  Given  two 
SAIs, determine whether they correspond to the same
object and find the rigid transformation between the two
instances of the object. As discussed in Section 3.0, the
representations of a single object in two different poses
are related by a rotation of the underlying sphere.
Therefore, the most straightforward approach is to compute a
distance measure between the SAIs and to find the
rotation yielding minimum distance. The full 3-D
transformation can be computed based on this rotation. This
approach is expensive because it requires the testing of
the entire 3-D space of rotations. We discuss strategies to
reduce the search space in Section 5.3. Before that, in
Sections 5.1 and 5.2, we discuss the distance measure
and the computation of the final rigid transformation,
respectively.

5.1  Finding  the  Best  Rotation 

Let S and S' be the spherical representations of two
objects. Denoting by g(P), resp. g'(P), the value of the
simplex angle at a node P of S, resp. P of S', S and S' are
representations of the same object if there exists a
rotation R such that g'(P) = g(RP) for every point P
of S'. Since the SAI is discrete, g(RP) is not defined
because in general RP will fall between nodes of S. We
define a discrete approximation of g(RP), G(RP), by
interpolating the values of g at the four nodes of S
nearest to RP.

The problem now is to find this rotation using the
discrete representation of S and S'. This is done by defining
a distance D(S, S', R) between SAIs as the sum of squared
differences between the simplex angles at the nodes of
one of the spheres and at the nodes of the rotated sphere:

$$D(S, S', R) = \sum_{P \in S'} \left( g'(P) - G(RP) \right)^2$$

The minimum of D corresponds to the best rotation
that brings S and S' in correspondence.

It is important to note that the rotation is not the
rotation between the original objects; it is the rotation of the
spherical  representations.  An  additional  step  is  needed 
to  compute  the  actual  transformation  between  objects  as 
described  below. 
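
The distance for one candidate rotation can be sketched as follows. This is a simplified illustration: the interpolated value G(RP) is replaced by the value at the single nearest node rather than an interpolation over four nodes, rotations are plain 3x3 matrices stored as nested tuples, and all function names are assumptions made for the example.

```python
import math


def rotate(r, p):
    """Apply the 3x3 rotation matrix r (tuple of rows) to the point p."""
    return tuple(sum(r[i][j] * p[j] for j in range(3)) for i in range(3))


def rot_z(angle):
    """Rotation by `angle` radians about the z axis."""
    c, s = math.cos(angle), math.sin(angle)
    return ((c, -s, 0.0), (s, c, 0.0), (0.0, 0.0, 1.0))


def nearest_value(nodes, values, q):
    """Simplified stand-in for G(RP): the value stored at the unit
    sphere node closest to q (largest dot product)."""
    best = max(range(len(nodes)),
               key=lambda i: sum(a * b for a, b in zip(nodes[i], q)))
    return values[best]


def sai_distance(nodes_s, g_s, nodes_sp, g_sp, r):
    """Sum over the nodes P of S' of (g'(P) - G(RP))^2."""
    return sum(
        (gp - nearest_value(nodes_s, g_s, rotate(r, p))) ** 2
        for p, gp in zip(nodes_sp, g_sp)
    )
```

A brute-force matcher would evaluate `sai_distance` over a sampled set of rotations and keep the rotation with the smallest value.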

5.2  Computing  the  Full  Transformation

The  last  step  in  matching  objects  is  to  derive  the 
transformation  between  the  actual  objects,  given  the 
rotation  between  their  SAIs.  The  rotational  part  of  the 
transformation is denoted by $R_o$, the translational part by
$T_o$. Given a SAI rotation R, for each node P' of S' we
compute the node P of S that is nearest to RP'. Let M,
resp. M', be the point on the object corresponding to the
node P of S, resp. P' of S'. A first estimate of the
transformation is computed by minimizing the sum of the squared
distances between the points M of the first object and the
corresponding points $R_o M' + T_o$ of the second object. The
resulting transformation is only an approximation
because it assumes that the nodes from the two meshes
correspond exactly. We use an additional step to refine
the transformation by looking for the node M closest to
M' for every node of the mesh and by computing again
the optimal pose.

5.3  Reducing  the  Search  Space

As mentioned in Section 5.1, the brute force
approach to finding the best mesh rotation is too expensive
to be practical. However, several strategies can be
used to make it more efficient. The first strategy is to use
a coarse-to-fine approach to locating the minimum of the
function D. In this approach, the space of possible
rotations, defined by three angles of rotation about the
three axes, ($\varphi$, $\theta$, $\psi$), is searched using large angular steps
($\Delta\varphi$, $\Delta\theta$, $\Delta\psi$). After this initial coarse search, a small
number of locations are identified around which the
minimum may occur. The space of rotations is again
searched around each potential minimum found at the
coarse level using smaller angular steps ($\delta\varphi$, $\delta\theta$, $\delta\psi$).
Typical values are $\Delta\varphi = \Delta\theta = \Delta\psi = 10^\circ$. More levels of
search may be more efficient, although we have not yet
tried to determine the best combination of coarse-to-fine
levels.
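
The two-level search can be sketched generically. The code below assumes the cost is any callable of three angles in degrees, samples each angle over [-180, 180) at the coarse level, and refines within one coarse step of each retained candidate. The angle ranges, the number of retained candidates, and the names are illustrative assumptions.

```python
import itertools


def axis_samples(center, half_width, step):
    """Angles from center - half_width to center + half_width."""
    count = int(round(2.0 * half_width / step))
    return [center - half_width + i * step for i in range(count + 1)]


def coarse_to_fine_search(cost, coarse_step=10.0, fine_step=1.0, keep=3):
    """Minimize cost(phi, theta, psi) with one coarse and one fine pass.

    Returns (best_angles, best_cost).
    """
    # Drop the last sample: +180 degrees duplicates -180 degrees.
    coarse_axis = axis_samples(0.0, 180.0, coarse_step)[:-1]
    scored = sorted(
        (cost(*angles), angles)
        for angles in itertools.product(coarse_axis, repeat=3)
    )
    best_cost, best_angles = scored[0]
    # Refine around the `keep` most promising coarse locations.
    for _, center in scored[:keep]:
        axes = [axis_samples(c, coarse_step, fine_step) for c in center]
        for angles in itertools.product(*axes):
            value = cost(*angles)
            if value < best_cost:
                best_cost, best_angles = value, angles
    return best_angles, best_cost
```

With the default steps the coarse pass evaluates 36^3 rotations and each refinement 21^3, compared with 360^3 for an exhaustive search at one-degree resolution.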

The second approach is to use a priori knowledge
about the relative poses of the objects to reduce the
search space. For example, the rotation defined by the
axis of inertia of the SAIs can be used as a starting point
for the search. In practice, using the axis of inertia is very
effective in pruning the search space as long as the
visible part of the object is large enough.

5.4  Example 

Figure 9 shows three views of the same object as in
Figure 6 placed in a different orientation. A model is
built from the three corresponding range images using
the approach described in Section 4.3. Figure 12 illustrates the
difference of pose between the two models computed
from the two sets of images. Figure 10 shows the value
of the SAI distance measure. The distance measure is
displayed as a function of $\varphi$ and $\theta$. The displayed value
at ($\varphi$, $\theta$) is the minimum value found for all the possible
values of $\psi$. This display shows that there is a single
sharp minimum corresponding to the rotation that brings
the SAIs in correspondence. Figure 11 shows one of the
models backprojected in the image of the other using the
transformation computed from the SAI correspondence
using the algorithms of Section 5.2. Figure 11 (a) is the
original image; Figure 11 (b) is the backprojected model.
To illustrate the accuracy of the registration, Figure 12
(a) shows the superimposition of the cross sections of the
two models in the plane before matching. Figure 12 (b)
shows the same cross sections after matching. These
displays show that the transformation is correctly computed
in that the average distance between the two models after
transformation is on the order of the accuracy of the


Figure  9:  Three  Views  of  the  Object  of  Figure  6 


Figure 10: Distance Between SAIs as Function of ($\varphi$, $\theta$)


Figure  11:  Display  of  the  Model  in  the  Computed  Pose 


(a)  Before  matching  (b)  After  matching 

Figure 12: Relative Positions of the Two Objects


6.0  Partial  Views  and  Occlusion 

Up  to  now  we  have  assumed  that  we  have  a  complete 
model of the object, as in Figure 8, or that we have data
covering the entire surface of the object, as in Figure 6.
This assumption is appropriate for building reference
models of objects. During the recognition phase,
however, only a portion of the object is visible in the scene.
The matching algorithm of Section 5.0 must be modified
to allow for partial representations. The algorithm used
for  extracting  the  initial  surface  model  is  able  to  distin¬ 
guish  between  regions  of  the  mesh  that  are  close  to  input 
surfaces  or  to  data  points,  and  parts  that  are  interpolated 
between  input  data.  The  first  type  of  region  is  the  visible 
part  of  the  mesh,  and  the  second  type  is  the  occluded  part 
of  the  mesh. 

The situation is illustrated in Figure 14 in the case of
a two-dimensional contour. In Figure 14 (a) a contour is
approximated by a mesh of eight points. The mesh is
assumed to be regular, that is, all the points of the mesh
are equidistant. Let $L = 8l$ be the total length of the mesh.
Figure 14 (b) shows the same contour with one portion
hidden. The occluded portion is shown as a shaded
curve. The visible section is approximated by a regular
mesh of eight nodes of length $L_1 = 8l_1$. Since the occluded
part is interpolated as a straight line, the length of this
mesh is smaller than the total length of the mesh on the
original object: $L > L_1$. Conversely, the length of the part
of the representation corresponding to the visible part, $L_2$
shown in Figure 14 (d), is greater than the length of the
same section of the curve on the original representation,
$L^*$ shown in Figure 14 (c). In order to compute the
distance measure of Section 5.0, the SAI of the observed
curve must be scaled so that it occupies the same length
on the unit circle as in the reference representation of
the object. If $L^*$ were known, the scale factor would be
$k = L^*/L_2$. In reality, $L^*$ is not known because we do not yet
know which part of the reference curve corresponds to
the visible part of the observed curve. To eliminate $L^*$,
we use the relation $L_1/L = L^*/2\pi$. This relation simply
expresses the fact that the ratios of visible and total
length in object and representation spaces are the same,
which is always true when the mesh is regular. $L^*$ can be
eliminated from these two relations, yielding the
expression $k = 2\pi L_1/(L_2 L)$.

The situation is the same in three dimensions except
that lengths are replaced by areas $A$, $A_1$, $A_2$, $A^*$. The
previous expression becomes $k = 4\pi A_1/(A_2 A)$.

The direct extension from two to three dimensions is
only an approximation because the relation $A_1/A = A^*/4\pi$
holds only if the area per node is constant over the
entire mesh. In practice, however, the area per node is
nearly constant for a mesh that satisfies the local
regularity condition.
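
The elimination of the unknown reference length (or area) can be written out as a small calculation. This sketch assumes the visible and total measures on the object and the measure occupied by the visible part on the observed representation are already known; the parameter names are descriptive placeholders rather than the paper's notation.

```python
import math


def scale_factor_2d(visible_length, total_length, observed_sai_length):
    """Scale factor for a contour represented on the unit circle.

    The reference length on the circle is taken as
    2*pi * visible_length / total_length (equal ratios in object
    and representation spaces), then divided by the observed length.
    """
    reference_sai_length = 2.0 * math.pi * visible_length / total_length
    return reference_sai_length / observed_sai_length


def scale_factor_3d(visible_area, total_area, observed_sai_area):
    """Same computation on the unit sphere, whose area is 4*pi."""
    reference_sai_area = 4.0 * math.pi * visible_area / total_area
    return reference_sai_area / observed_sai_area
```

For example, if half of a contour is visible and its nodes occupy three quarters of the unit circle in the observed representation, the factor is 2/3, so the visible part is compressed back to half of the circle.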


(a) Object  (b) Partial View  (c) SAI

Figure 13: Matching Partial Representation in 2-D

Once k is computed, the appropriate scaling is
applied to the SAI by moving the nodes on the surface of
the sphere given the scaling ratio k. After scaling, the
distribution of nodes on the part of the SAI corresponding
to the visible part of the object in the scene and the
distribution in the corresponding region of the model SAI
are identical. The key in this algorithm is the
connectivity conservation property of the SAI mentioned
previously.

We  now  show  two  examples  of  recognition  in  the 
presence  of  occlusion.  In  the  first  example,  a  range  image 
of an isolated object is taken. Then a complete model of
the object is matched with the SAI representation from
range data. Figure 15 shows the model backprojected in
the observed image using the computed transformation.
In this example, the reference model was computed by
taking three registered range images of the object as in
the example of Figure 6. Only about 30% of the object is
visible in the image. The remaining 70% of the
representation built from the image is interpolated and is ignored
in the estimation of the SAI distance. Figure 17 displays
the graph of the distance between SAIs as a function of
rotation angles. Figure 16 shows two views of the
distance as a function of $\varphi$ and $\theta$. Figure 17 shows the same
function displayed in space. These displays
demonstrate that there is a well-defined minimum at the optimal
rotation  of  the  SAIs. 
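
Ignoring interpolated nodes in the distance can be sketched with a visibility mask. The example assumes node correspondences have already been resolved (after rotation and scaling), so that the three lists are index-aligned; the function name and data layout are illustrative.

```python
def masked_squared_distance(observed, reference, visible):
    """Sum of squared simplex-angle differences over visible nodes.

    observed, reference: simplex angles at corresponding nodes.
    visible: booleans, False for nodes marked as interpolated.
    """
    if not (len(observed) == len(reference) == len(visible)):
        raise ValueError("all three sequences must have the same length")
    return sum(
        (o - r) ** 2
        for o, r, v in zip(observed, reference, visible)
        if v
    )
```

Nodes flagged as interpolated contribute nothing, so a large mismatch in the occluded region does not affect the result.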

In the second example, the reference model is the
CAD  model  of  Figure  8.  The  result  of  the  matching  is 
shown  in  Figure  18  in  which  the  model  is  displayed  in  the 
orientation  computed  from  the  range  data  using  the  SAI 
matching.  Only  part  of  the  object  is  visible  in  the  image 
because  of  self  occlusion  and  because  of  occlusion  from 
other  objects  in  the  scene. 

In  both  examples,  the  deformable  surface  algorithm 
is used to separate the object from the rest of the scene
and to build an initial surface model, as described in [8].
The visible portion of the object mesh and of the
corresponding SAI is identified by marking a node of the
mesh as interpolated if there is no data point in its vicinity. To
illustrate the effect of scaling, Figure 19 (a) shows the
SAI computed from the image of Figure 15; Figure 19 (b)
shows the SAI after the scaling is applied to compensate
for occlusions. The density of points increases in the
region that corresponds to the visible part of the object
(indicated by the solid arrow). Conversely, the density of
points decreases in the region corresponding to the
occluded part of the object (indicated by the shaded
arrow). These examples show that the SAI matching
algorithm  can  deal  with  occlusions  and  partial  views, 
even  when  only  a  relatively  small  percentage  of  the  sur¬ 
face  is  visible. 


Figure  14:  Model  Registered  with  Input  Image 


Figure 16: Distance Between SAIs as Function of $\varphi$ and $\theta$


Figure 17: Model Registered with Input Image


occluded region   visible region

(a) SAI from Scene  (b) SAI after Scaling

Figure 18: Effect of Occlusion-Compensating Scaling

7.0  Conclusion 

In this paper, we introduced a new approach for
building and recognizing models of curved objects. The
basic representation is a mesh of nodes on the surface
that satisfies certain regularity constraints. We
introduced the notion of simplex angle as a curvature
indicator stored at each node of the mesh. We showed how a
mesh can be mapped into a spherical representation in a
canonical manner, and how objects can be recognized by
computing the distance between spherical
representations.

The SAI representation has many desirable properties
that make it very effective as a tool for 3-D object
recognition. The SAI is invariant with respect to
translation, rotation, and scaling of the object. This invariance
allows the recognition algorithm to compare shapes
through the computation of distances between SAIs
without requiring explicit matching between object
features or explicit computation of object pose. The SAI
preserves connectivity between parts of the object in that
nodes that are neighbors on the object mesh are also
neighbors on the SAI. Thanks to this connectivity
conservation, the SAI can handle non-convex objects,
partial views, and occluded objects.

Results  show  that  the  SAI  representation  can  be  used 
to  determine  the  pose  of  an  object  in  a  range  image.  This 
approach is particularly well suited for applications
dealing with natural objects. Typically, conventional object
modeling and recognition techniques would not work
due to the variety and complexity of natural shapes.

Many  issues  remain  to  be  addressed.  First,  we  need 
to  improve  the  search  for  the  minimum  distance  between 
SAIs  during  the  recognition  phase.  This  improvement 
can  be  achieved  by  improving  the  coarse-to-fine 
approach  to  extrema  localization,  and  by  using  cues 
computed from the original data to restrict the area in
which the extrema are searched. Another important
extension is the use of appearance information such as
hue or albedo, in addition to geometric information. Such
information can be included in the SAI representation by
storing appearance data at every node of the SAI in
addition to the curvature measure. Appearance data can be
incorporated in the distance measure between SAIs to
help disambiguate between shapes that are similar
geometrically but have different appearance features. We are
now pursuing such an approach using color images.

Abstract

An  extension  to  the  alignment  approach  is  pro¬ 
posed  that  includes  a  pose  refinement  step  be¬ 
fore  verification.  In  the  alignment  approach  the 
pose  estimates  of  the  initial  hypotheses  tend  to 
be  somewhat  inaccurate,  since  they  are  based 
on  minimal  sets  of  corresponding  features.  A 
method  is  described  that  refines  the  pose  esti¬ 
mate  while  simultaneously  identifying  and  in¬ 
corporating  the  constraints  of  all  supporting 
image  features.  The  strategy  also  makes  prac¬ 
tical  initial  alignments  based  on  low  resolution 
features  -  which,  being  less  numerous,  allow 
faster  running  times. 

Two  statistical  formulations  of  model-based 
recognition  are  described:  MAP  Model  Match¬ 
ing,  and  Posterior  Marginal  Pose  Estimation 
(PMPE).  These  formulations  use  a  normal 
model  for  feature  fluctuations.  Empirical  ev¬ 
idence  is  provided  from  the  domain  of  video 
edge  features  indicating  that  normal  probabil¬ 
ity  densities  are  good  models  of  feature  fluctu¬ 
ations  -  better  than  uniform  densities  in  that 
domain.  The  evidence  is  provided  in  the  form 
of  observed  and  fitted  cumulative  distributions. 
The  results  of  some  statistical  tests  are  re¬ 
ported. 

The Expectation-Maximization (EM) algorithm
is discussed as a method of carrying out
local searches in pose space of the PMPE
objective function. A recognition experiment is
described where the method is used with features
derived from synthetic range images.

1  Introduction 

1.1  The  Statistical  Approach  and  Object 
Recognition 

In  this  paper,  visual  object  recognition  is  approached 
via  the  principles  of  maximum-likelihood  (ML)  and 
maximum-a-posteriori probability (MAP). These principles,
along with specific probabilistic models of aspects
of object recognition, are used to derive objective functions
for evaluating and refining recognition hypotheses.
The ML and MAP criteria have a long history of successful
application in formulating decisions and in making
estimates from observed data. They have attractive
properties of optimality, and are often useful when
measurement errors are significant.

In other areas of computer vision, statistics has proven
useful as a theoretical framework. The work of Yuille,
Geiger and Bülthoff on stereo [1] is one example, while
the work of Geman and Geman on image restoration [2]
is another. The statistical approach that is used in this
paper converts the recognition problem into a well-defined
(although not necessarily easy) optimization problem.
This has the advantage of providing an explicit
characterization of the problem, while separating the
specification of the problem from the description of the
algorithms used to attack it. Ad hoc objective functions
have been profitably used in some areas of computer vision.
Such an approach is used by Barnard in stereo
matching [3], Blake and Zisserman [4] in image restoration
and Beveridge, Weiss and Riseman [5] in line segment
based model matching. With this approach, plausible
forms for components of the objective function are
often combined using trade-off parameters. Such trade-off
parameters are determined empirically. An advantage
of deriving objective functions from statistical theories is
that the assumptions become explicit, in that the forms
of the objective function components are clearly related
to specific probabilistic models. A second advantage is
that the trade-off parameters in the objective function
can be derived from measurable statistics of the domain.

1.2  Alignment 

The basic strategy of the alignment method of recognition
[6] is to use separate mechanisms for generating and
testing hypotheses.

Recently,  indexing  methods  have  become  available  for 
efficiently  generating  hypotheses  in  recognition.  These 
methods  avoid  a  significant  amount  of  search  by  look¬ 
ing  up,  in  pre-computed  tables,  the  object  features  that 
might  correspond  to  a  group  of  image  features.  The 
geometric  hashing  method  of  Lamdan  and  Wolfson  [7] 
uses  invariant  properties  of  small  groups  of  features  un¬ 
der  affine  transformations  as  the  lookup  key.  Clemens 
and  Jacobs  [8]  describe  an  indexing  method  that  gains 
efficiency  by  using  a  feature  grouping  process  to  select 
small  sets  of  image  features  that  are  likely  to  belong  to 
the  same  object  in  the  scene. 

1.3  Align - Refine - Verify

The  recognition  strategy  advocated  in  this  work  may  be 
summarized  as  “align-refine-verify.”  The  key  observa¬ 
tion  is  that  local  search  in  pose  space  may  be  used  to  re¬ 
fine  the  hypothesis  from  the  alignment  stage  before  ver¬ 
ification  is  carried  out.  With  the  alignment  method,  the 
pose  estimates  of  the  initial  hypotheses  tend  to  be  some¬ 
what  inaccurate,  since  they  are  based  on  minimal  sets 
of  corresponding  features.  Better  pose  estimates  (hence, 
better verifications) are likely to result from using all
supporting  image  feature  data,  rather  than  a  small  sub¬ 
set.  This  paper  describes  a  method  that  refines  the  pose 
estimate  while  simultaneously  identifying  and  incorpo¬ 
rating  the  constraints  of  all  supporting  image  features. 

The  strategy  also  makes  practical  initial  alignments 
based  on  lower  resolution  features.  Using  low  resolution 
features  in  indexing  is  attractive,  because  there  are  fewer 
features  at  low  resolution,  allowing  faster  running  times. 

1.4  Structure  of  the  Paper 

In  Sections  2  and  3  probabilistic  models  of  feature 
correspondences and feature fluctuations are described,
while  Section  4  discusses  linear  models  of  feature  projec¬ 
tion.  Section  5  describes  a  statistical  method  of  simul¬ 
taneously  evaluating  correspondences  and  object  pose. 
Building on this, Section 6 outlines a statistical method
of  object  pose  evaluation:  Posterior  Marginal  Pose  Esti¬ 
mation  (PMPE).  Section  7  describes  the  use  of  the  ex¬ 
pectation  -  maximization  (EM)  algorithm  for  solving  the 
PMPE  objective  function.  An  experiment  using  the  EM 
algorithm  with  features  derived  from  synthetic  range  im¬ 
ages  is  presented  in  Section  8,  and  related  work  is  dis¬ 
cussed  in  Section  9. 

2  Modeling  Feature  Correspondence 

Let the image that is to be analyzed be represented by
a set of v-dimensional point features

$$Y = \{Y_1, Y_2, \ldots, Y_n\} , \quad Y_i \in R^v .$$

Image features are discussed in more detail in Sections
3 and 4.

The object to be recognized is also described by a set
of features; these are represented by real matrices:

$$M = \{M_1, M_2, \ldots, M_m\} .$$

Additional details on object features appear in Section
4.


The interpretation of the features in an image is represented
by the variable $\Gamma$, which describes the mapping
from image features to object features. This is also
referred to as the correspondences.

$$\Gamma = \{\Gamma_1, \Gamma_2, \ldots, \Gamma_n\} , \quad \Gamma_i \in M \cup \{\perp\} .$$

In an interpretation, each image feature, $Y_i$, will be
assigned either to some object feature $M_j$, or to the
background, which is denoted by the symbol $\perp$. This
symbol plays a role similar to that of the null character
in the interpretation trees of Grimson and Lozano-Pérez
[9]. Each variable $\Gamma_i$ represents the assignment of the
corresponding image feature $Y_i$; it may take on as value
any of the object features $M_j$, or the background, $\perp$.

2.1  Independent  Correspondence  Model 

In  this  section  a  simple  probabilistic  model  of  correspon¬ 
dences  is  described.  The  intent  is  to  capture  some  infor¬ 
mation  bearing  on  correspondences  before  the  image  is 
compared  to  the  object.  This  model  has  been  designed 
to  be  a  reasonable  compromise  between  simplicity  and 
accuracy. 

In this model, the correspondence statuses of differing
image features are assumed to be independent, so that

$$p(\Gamma) = \prod_i p(\Gamma_i) . \tag{1}$$

There is evidence against using statistical independence
here; for example, occlusion is locally correlated.
Independence is used as an engineering approximation
that simplifies the resulting formulations of recognition.
It may be justified by the good performance of the recognition
experiments described in Section 8. Few recognition
systems have used non-independent models of correspondence.
Breuel described one approach [10]. In [11]
the independence assumption of Equation 1 has been
relaxed  in  a  Markov  Random  Field  (MRF)  model  of  cor¬ 
respondences  that  is  meant  to  capture  the  correlated  as¬ 
pects  of  occlusion. 

The  component  probability  function  is  designed  to 
characterize  the  amount  of  clutter  in  the  image,  but  to 
be  otherwise  as  non-committal  as  possible. 

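$$p(\Gamma_i) = \begin{cases} B & \text{if } \Gamma_i = \perp \\ \frac{1 - B}{m} & \text{otherwise} \end{cases} \tag{2}$$

Here $m$ is the number of object features. The joint model
$p(\Gamma)$ is the maximum entropy probability
function that is consistent with the constraint
that the probability of an image feature belonging to
the background is $B$. $B$ may be estimated by taking
simple statistics on images from the domain.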

The  independent  correspondence  model  is  used  in  the 
experiments  described  in  this  paper. 
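
The independent correspondence model is small enough to write out directly. The following is a minimal sketch, assuming that the background symbol is represented by `None` and object features by the indices 0 to m-1; the function names are illustrative and do not come from the paper.

```python
def correspondence_prior(num_object_features, background_prob):
    """Return p(Gamma_i) as a dict; the key None stands for the background."""
    m = num_object_features
    prior = {None: background_prob}
    for j in range(m):
        # The non-background mass is spread evenly over the m object features.
        prior[j] = (1.0 - background_prob) / m
    return prior


def joint_prior(assignment, num_object_features, background_prob):
    """Product of per-feature priors over one interpretation (Equation 1)."""
    prior = correspondence_prior(num_object_features, background_prob)
    p = 1.0
    for gamma_i in assignment:
        p *= prior[gamma_i]
    return p


prior = correspondence_prior(num_object_features=4, background_prob=0.6)
example = joint_prior([None, 0, 2], num_object_features=4, background_prob=0.6)
```

With these illustrative values, each image feature is background with probability 0.6 and matches any particular one of the four object features with probability 0.1.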

3  Modeling  Image  Features 

3.1  Uniform  Model  for  Background  Features 

Image features belonging to the background are assumed
to be uniformly distributed throughout the v-dimensional
coordinate space of the image,

$$p(Y_i \mid \Gamma, \beta) = \frac{1}{W_1 W_2 \cdots W_v} \quad \text{if } \Gamma_i = \perp . \tag{3}$$

(The PDF is zero outside the coordinate space of the
image.) The position and orientation, or pose, of the
object is described by $\beta$. The $W_1, \ldots, W_v$ are the
dimensions of the feature coordinate space.

Equation  3  describes  the  maximum  entropy  PDF  con¬ 
sistent  with  the  constraint  that  the  coordinates  of  image 
features  are  always  expected  to  lie  within  the  space  of 
the  image. 

3.2  Normal  Model  for  Matched  Features 
Image features that are matched to object features are
assumed to be normally distributed about their predicted
position in the image,

$$p(Y_i \mid \Gamma, \beta) = G_{\psi_{ij}}(Y_i - \mathcal{P}(M_j, \beta)) \quad \text{if } \Gamma_i = M_j . \tag{4}$$

$G_{\psi_{ij}}$ is the v-dimensional Gaussian probability density
function with covariance matrix $\psi_{ij}$,

$$G_{\psi_{ij}}(x) = (2\pi)^{-v/2} \, |\psi_{ij}|^{-1/2} \exp\left( -\tfrac{1}{2} x^T \psi_{ij}^{-1} x \right) .$$

The covariance matrix $\psi_{ij}$ is discussed more fully in Section 3.3.

When $\Gamma_i = M_j$, the predicted coordinates of image
feature $Y_i$ are given by $\mathcal{P}(M_j, \beta)$, the projection of object
feature j into the image with object pose $\beta$. Projection
and pose are discussed in more detail in Section 4.
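
The two component densities can be evaluated in a few lines. This sketch assumes v = 2 and uses the closed-form inverse of a 2x2 covariance matrix; the helper names and the image size in the example are hypothetical.

```python
import math


def background_density(widths):
    """Uniform density 1 / (W_1 W_2 ... W_v) over the image coordinate space."""
    p = 1.0
    for w in widths:
        p /= w
    return p


def gaussian_density_2d(x, cov):
    """Zero-mean 2-D Gaussian density with covariance matrix cov at point x."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    # Quadratic form x^T cov^{-1} x, using inverse = [[d, -b], [-c, a]] / det.
    q = (d * x[0] * x[0] - (b + c) * x[0] * x[1] + a * x[1] * x[1]) / det
    return math.exp(-0.5 * q) / (2.0 * math.pi * math.sqrt(det))


identity = ((1.0, 0.0), (0.0, 1.0))
peak = gaussian_density_2d((0.0, 0.0), identity)
off_peak = gaussian_density_2d((1.0, 0.0), identity)
clutter = background_density((640.0, 480.0))
```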

3.2.1  Empirical  Evidence  for  Normal  Model  of 
Feature  Fluctuations 

This  section  describes  some  empirical  evidence  from 
the  domain  of  video  image  edge  features  indicating  that, 
in  that  domain,  normal  probability  densities  are  good 
models  of  feature  fluctuations,  and  that  they  are  bet¬ 
ter  than  uniform  probability  densities.  The  evidence  is 
provided  in  the  form  of  observed  and  fitted  cumulative 
distributions.  Additionally,  the  results  of  some  statisti¬ 
cal  tests  are  reported. 

The data analyzed are the perpendicular deviations
of edge features derived from video images. The
features are shown in Figure 1.

The  model  features  are  from  mean  edge  images  (these 
are  averaged  with  respect  to  illumination),  and  the  edge 
operator  used  in  obtaining  the  image  features  is  ridges 
in  the  magnitude  of  the  image  gradient.  These  edge 
detection methods are described in [11]. The smoothing
standard deviation used in the edge detection was 1.93
pixels. The object is at the same pose in the model and
image scenes; the variation in features is due to changes
in lighting, the presence of background features in the
“image” scene and the vagaries of edge detection.

For  the  analysis  in  this  section,  the  feature  data  con¬ 
sists  of  the  average  of  the  X  and  Y  coordinates  of  the 
pixels  from  edge  curve  fragments.  The  features  are  dis¬ 
played  as  circular  arc  fragments  for  clarity.  The  edge 
curves  were  broken  arbitrarily  into  10  pixel  fragments. 

Correspondences  from  image  features  to  model  fea¬ 
tures  were  established  by  a  neutral  subject  using  a 
mouse.  These  correspondences  are  indicated  by  heavy 
lines  in  Figure  2.  Perpendicular  and  parallel  deviations 
of  the  corresponding  features  were  calculated  with  re¬ 
spect  to  the  normal  vectors  to  edge  curves  at  the  image 
features. 

Figure 3 shows the cumulative distribution of the perpendicular
deviations. The cumulative distribution of
a fitted normal density is plotted in the left figure as
heavy dots over the observed distribution. The distribution
was fitted to the data using the maximum-likelihood
method. The left figure shows good agreement between
the observed distribution and the fitted normal distribution.
The observed cumulative distribution is shown
again on the right in Figure 3, this time with the cumulative
distribution of a fitted uniform density over-plotted
in heavy dots. As before, the uniform density was fitted
to the data using the maximum-likelihood method.
This figure shows relatively poor agreement between the
observed and fitted distributions, in comparison to the
normal density. Similar results were obtained for the parallel
deviations, and for a similar set of coarse features [11].

Kolmogorov-Smirnov tests were carried out [11] to
evaluate  the  compatibility  of  the  data  with  the  fitted 
normal  and  uniform  distributions.  In  the  cases  of  fine 
perpendicular  and  parallel  deviations,  and  coarse  per¬ 
pendicular  deviations,  refutation  of  the  uniform  model 
is  strongly  indicated.  Strong  contradictions  of  the  fitted 
normal  models  are  not  indicated  in  any  of  the  cases. 
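
The comparison of fitted normal and uniform models can be illustrated with a small sketch. It uses synthetic deviations drawn from a normal distribution, not the video edge data analyzed in the paper, and it computes only the Kolmogorov-Smirnov statistic (the largest gap between the empirical and fitted cumulative distributions), not a significance level. The sample size, seed and spread are assumptions chosen for illustration.

```python
import math
import random


def normal_cdf(x, mu, sigma):
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def uniform_cdf(x, lo, hi):
    if x <= lo:
        return 0.0
    if x >= hi:
        return 1.0
    return (x - lo) / (hi - lo)


def ks_statistic(samples, cdf):
    """Largest gap between the empirical CDF of samples and a model CDF."""
    xs = sorted(samples)
    n = len(xs)
    gap = 0.0
    for k, x in enumerate(xs):
        f = cdf(x)
        gap = max(gap, (k + 1) / n - f, f - k / n)
    return gap


random.seed(0)
deviations = [random.gauss(0.0, 1.5) for _ in range(500)]  # synthetic stand-in

# Maximum-likelihood fits: sample mean and (biased) standard deviation for
# the normal model, sample minimum and maximum for the uniform model.
n = len(deviations)
mu = sum(deviations) / n
sigma = math.sqrt(sum((d - mu) ** 2 for d in deviations) / n)
lo, hi = min(deviations), max(deviations)

d_normal = ks_statistic(deviations, lambda x: normal_cdf(x, mu, sigma))
d_uniform = ks_statistic(deviations, lambda x: uniform_cdf(x, lo, hi))
```

On data of this kind the statistic for the fitted normal model is much smaller than that for the fitted uniform model, which is the qualitative pattern reported above.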

3.3  Oriented  Stationary  Statistics 

The  covariance  matrix  that  appears  in  the  model  of 
matched  image  features  in  Equation  4  is  allowed  to  de¬ 
pend  on  both  the  image  feature  and  the  object  feature 
involved  in  the  correspondence.  Indexing  on  i  allows  de¬ 
pendence  on  the  image  feature  detection  process,  while 
indexing  on  j  allows  dependence  on  the  identity  of  the 
model feature. This is useful when some model features
are known to be noisier than others. This flexibility is carried
through the formalism of later sections. Although
such  flexibility  can  be  useful,  substantial  simplification 
results  by  assuming  that  the  features’  statistics  are  sta¬ 
tionary  in  the  image.  In  its  strict  form  this  assumption 
may  be  too  limiting.  This  section  outlines  a  compromise 
approach,  oriented  stationary  statistics,  that  was  used 
in  the  experiments  described  in  Section  8. 

The  Oriented  Stationary  Statistics  method  involves 
attaching  a  coordinate  system  to  each  image  feature. 
The  coordinate  system  has  its  origin  at  the  point  lo¬ 
cation  of  the  feature,  and  is  oriented  with  respect  to  the 
direction  of  the  underlying  curve  at  the  feature  point. 
When  (stationary)  statistics  on  feature  deviations  are 
measured,  they  are  taken  relative  to  these  coordinate 
systems.  When  an  image  is  presented  for  recognition, 
the  constant  feature  covariance  is  specialized  by  rotating 
it  to  orient  it  with  respect  to  each  image  feature.  Ori¬ 
ented  Stationary  Statistics  is  described  in  more  detail  in 
[11]. 
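
The specialization step can be sketched as a rotation of a single stationary covariance matrix. The sketch assumes 2-D point features, a local frame whose first axis lies along the curve tangent, and a tangent angle measured counter-clockwise from the image x axis; these conventions and the numerical values are illustrative assumptions, not details taken from [11].

```python
import math


def matmul(a, b):
    """Plain nested-list matrix product."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def oriented_covariance(stationary_cov, tangent_angle):
    """Rotate the stationary covariance into the frame of one image feature."""
    c, s = math.cos(tangent_angle), math.sin(tangent_angle)
    rot = [[c, -s], [s, c]]
    rot_t = [[c, s], [-s, c]]
    return matmul(matmul(rot, stationary_cov), rot_t)


# Variance 4.0 along the curve and 0.25 perpendicular to it (hypothetical).
stationary = [[4.0, 0.0], [0.0, 0.25]]
vertical = oriented_covariance(stationary, math.pi / 2.0)
```

For a feature on a vertical curve the large variance moves to the image y direction, as expected.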

4  Linear  Projection  Models 

Pose  determination  is  frequently  framed  as  an  optimiza¬ 
tion  problem.  The  pose  determination  problem  may  be 
significantly  simplified  if  the  feature  projection  model  is 
linear  in  the  parameters  of  the  transformation  (the  pose 
vector). The system described in this paper uses a projection
model having this property; this enables solving
the embedded optimization problem using least squares.
Linear projection models may be written in the following
form:

$$\eta_i = \mathcal{P}(M_i, \beta) = M_i \beta . \tag{5}$$

The pose of the object is represented by $\beta$, a column
vector, and the object model feature by $M_i$, a matrix. $\eta_i$, the
projection of the model feature into the image by pose
$\beta$, is a column vector.

Several linear projection models were described in [12],
and in [11].

Figure 2: Feature Correspondences


4.1  2-D  Oriented-Range  Feature  Model 

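A 2-D projection and feature model that incorporates
local information about the coordinates, normal, and
range at a point along a curve of range discontinuity
is defined by

$$\eta_i = \begin{bmatrix} p'_{ix} \\ p'_{iy} \\ c'_{ix} \\ c'_{iy} \end{bmatrix} , \quad M_i = \begin{bmatrix} p_{ix} & -p_{iy} & 1 & 0 \\ p_{iy} & p_{ix} & 0 & 1 \\ c_{ix} & -c_{iy} & 0 & 0 \\ c_{iy} & c_{ix} & 0 & 0 \end{bmatrix} , \quad \beta = \begin{bmatrix} \mu \\ \nu \\ t_x \\ t_y \end{bmatrix} .$$

The coordinates of model point i are $p_{ix}$ and $p_{iy}$. The
coordinates of model point i, projected into the image
by pose $\beta$, are $p'_{ix}$ and $p'_{iy}$. $c_{ix}$ and $c_{iy}$ are the components of a vector
whose direction is perpendicular to the range discontinuity,
pointing away from the discontinuity, while the
length of the vector is the inverse of the range at the
discontinuity. The counterparts in the image are given
by $c'_{ix}$ and $c'_{iy}$.

This transformation is equivalent to rotation by $\theta$,
scaling by $s$, and translation by $T$, where

$$T = \begin{bmatrix} t_x \\ t_y \end{bmatrix} , \quad s = \sqrt{\mu^2 + \nu^2} , \quad \theta = \arctan\left( \frac{\nu}{\mu} \right) .$$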

The  aggregate  feature  translates,  rotates  and  scales 
correctly  when  used  with  imaging  models  where  the  ob¬ 
ject  features  scale  according  to  the  inverse  of  the  distance 
to  the  object.  This  holds  under  perspective  projection 
with  attached  range  data  when  the  object  is  small  com¬ 
pared  to  the  distance  to  the  object. 
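
A short numerical check makes the linearity concrete. The sketch below assumes one linear parameterization that is consistent with the description in this section: a pose vector (mu, nu, t_x, t_y) with mu = s cos(theta) and nu = s sin(theta), and a 4x4 feature matrix built from the point coordinates p and the inverse-range normal vector c. The matrix layout and all names are assumptions made for illustration.

```python
import math


def feature_matrix(px, py, cx, cy):
    """Assumed 4x4 matrix for one oriented-range model feature."""
    return [[px, -py, 1.0, 0.0],
            [py, px, 0.0, 1.0],
            [cx, -cy, 0.0, 0.0],
            [cy, cx, 0.0, 0.0]]


def project(matrix, beta):
    """Linear projection: matrix times pose vector."""
    return [sum(m * b for m, b in zip(row, beta)) for row in matrix]


def pose_from_similarity(scale, theta, tx, ty):
    """Pose vector equivalent to rotation by theta, scaling, and translation."""
    return [scale * math.cos(theta), scale * math.sin(theta), tx, ty]


beta = pose_from_similarity(scale=2.0, theta=math.pi / 2.0, tx=3.0, ty=4.0)
projected = project(feature_matrix(1.0, 0.0, 0.0, 0.5), beta)
```

The point (1, 0) is rotated to (0, 1), scaled to (0, 2) and translated to (3, 6); the vector (0, 0.5) is rotated and scaled to (-1, 0) but not translated.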

This  projection  model  was  inspired  by  a  method  used 
by  Faugeras  and  Ayache  in  their  vision  system  HYPER 
[13],  and  it  is  used  in  the  experiments  described  in  Sec¬ 
tion  8. 


5  MAP  Model  Matching 

This section outlines MAP Model Matching (see footnote 1), a means
of evaluating joint hypotheses of match and pose using
the MAP criterion.

Briefly, probability densities of image features, conditioned
on the parameters of match and pose (“the parameters”),
are combined with prior probabilities on the
parameters using Bayes’ rule. The result is a posterior
probability density on the parameters, given an observed
image. An estimate of the parameters is then formulated
by choosing them so as to maximize their a posteriori
probability (hence the term MAP). MAP estimators are
especially practical when used with normal probability
densities.

Footnote 1: An early version of this work appeared in [12] and [14].

Figure 3: Observed Cumulative Distribution with Fitted Normal and Uniform Cumulative Distributions

The  parameters  to  be  estimated  in  matching  are  the 
correspondences  between  image  and  object  features,  and 
the  pose  of  the  object  in  the  image. 

The probabilistic models of image features, conditioned
on match and pose, that are described in Equations
3 and 4 may be written as follows:

$$p(Y_i \mid \Gamma, \beta) = \begin{cases} \frac{1}{W_1 W_2 \cdots W_v} & \text{if } \Gamma_i = \perp \\ G_{\psi_{ij}}(Y_i - \mathcal{P}(M_j, \beta)) & \text{if } \Gamma_i = M_j \end{cases} \tag{6}$$

Here $\psi_{ij}$ is the covariance matrix associated with image
feature i and object model feature j. Thus image
features arising from the background are uniformly distributed
over the space of the image (the width of the image
space along dimension i is given by $W_i$), and matched
image features are normally distributed about their predicted
locations in the image. In some applications $\psi$
could be a constant - an assumption that the feature
statistics are stationary in the image - or $\psi$ may depend
only on i, the image feature index. The latter is the case when
the oriented stationary statistics model is used (see Section
3.3).

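Assuming independent features, the joint probability
density on image feature coordinates may be written as
follows:

$$\begin{aligned} p(Y \mid \Gamma, \beta) &= \prod_i p(Y_i \mid \Gamma, \beta) \\ &= \prod_{i : \Gamma_i = \perp} \frac{1}{W_1 W_2 \cdots W_v} \prod_{ij : \Gamma_i = M_j} G_{\psi_{ij}}(Y_i - \mathcal{P}(M_j, \beta)) . \end{aligned} \tag{7}$$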


Next, a joint prior on correspondences and pose is constructed.
Prior information on the pose is assumed to be
supplied as a normal density,

$$p(\beta) = G_{\psi_\beta}(\beta - \beta_0) = (2\pi)^{-z/2} \, |\psi_\beta|^{-1/2} \exp\left( -\tfrac{1}{2} (\beta - \beta_0)^T \psi_\beta^{-1} (\beta - \beta_0) \right) . \tag{8}$$

Here $\psi_\beta$ is the covariance matrix of the pose prior and
z is the dimensionality of the pose vector, $\beta$. With this
choice for the form of the pose prior the system is closed
in the sense that the resulting pose estimate will also
be normal. This is convenient for coarse-fine schemes. If little is
known about the pose a priori, the prior may be made
quite broad. This is expected to be often the case. If
nothing is known about the pose beforehand, the pose
prior may be left out. In that case the resulting criterion
for evaluating hypotheses will be based on maximum
likelihood for pose, and on MAP for correspondences.

Assuming independence of the correspondences and
the pose (before the image is compared to the object
model), and using the independent correspondence
model of Equations 1 and 2, a mixed joint prior probability
function may be written as follows:

$$p(\Gamma, \beta) = G_{\psi_\beta}(\beta - \beta_0) \prod_{i : \Gamma_i = \perp} B_i \prod_{i : \Gamma_i \neq \perp} \frac{1 - B_i}{m} .$$

The variable B has been generalized here to have an image
feature dependent subscript, i. In the experiments
reported in Section 8, $B_i = B$ for all i. This generalization
is explored more fully in [11]. This probability
function on match and pose is now used with Bayes’ rule
as a prior for obtaining the posterior probability of $\Gamma$ and
$\beta$:

$$p(\Gamma, \beta \mid Y) = \frac{p(Y \mid \Gamma, \beta) \, p(\Gamma, \beta)}{p(Y)} , \tag{9}$$

where $p(Y)$ is the probability of the image being analyzed
- a constant in terms of the parameters being
estimated.

The MAP strategy is used to obtain estimates of the
correspondences and pose by maximizing their posterior
density with respect to $\Gamma$ and $\beta$, as follows:

$$\hat{\Gamma}, \hat{\beta} = \arg\max_{\Gamma, \beta} \; p(\Gamma, \beta \mid Y) .$$

The  search  for  maximizing  joint  hypotheses  of  match 
and  pose  may  be  ordered  as  either  a  search  in  correspon¬ 
dence  space,  or  a  search  in  pose  space.  Recognition  ex¬ 
periments  in  correspondence  based  search  were  reported 
in [12], where heuristic beam search was used. In [11]
it  is  shown  that  searching  a  specialization  of  this  MAP 
criterion  in  pose  space  is  equivalent  to  robust  chamfer 
matching. 
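
To show how a joint hypothesis of match and pose is scored, the following sketch evaluates the logarithm of the posterior of Equation 9 up to the constant p(Y). It is a toy under stated assumptions: features are 1-D, the pose is a scale and a shift, the pose prior is omitted (flat), a single variance is used for all matches, and the data values are invented for illustration.

```python
import math


def log_gaussian(x, var):
    return -0.5 * x * x / var - 0.5 * math.log(2.0 * math.pi * var)


def log_posterior(ys, xs, assignment, scale, shift, B, W, var):
    """Unnormalized log p(Gamma, beta | Y) for a 1-D toy with a flat pose prior."""
    m = len(xs)
    total = 0.0
    for y, gamma_i in zip(ys, assignment):
        if gamma_i is None:
            # Background feature: prior B times the uniform density 1 / W.
            total += math.log(B) - math.log(W)
        else:
            # Matched feature: prior (1 - B) / m times the normal density.
            predicted = scale * xs[gamma_i] + shift
            total += math.log((1.0 - B) / m) + log_gaussian(y - predicted, var)
    return total


xs = [0.0, 10.0, 20.0]        # hypothetical model features
ys = [5.2, 14.9, 25.1, 60.0]  # hypothetical image features; the last is clutter
args = dict(scale=1.0, shift=5.0, B=0.5, W=100.0, var=1.0)

correct = log_posterior(ys, xs, [0, 1, 2, None], **args)
all_background = log_posterior(ys, xs, [None, None, None, None], **args)
swapped = log_posterior(ys, xs, [1, 0, 2, None], **args)
```

The correct interpretation scores above both the all-background interpretation and one with two matches swapped, which is the behaviour a search over correspondences relies on.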

6  Posterior  Marginal  Pose  Estimation 

In  the  previous  section  the  object  recognition  problem 
was  posed  as  an  optimization  problem  resulting  from  a 
statistical  theory.  In  that  formulation,  a  hypothesis  was 
the  position  and  orientation,  or  pose,  of  the  object  -  as 
well  as  the  correspondences  between  object  and  image 
features. 

The formulation of recognition that is described in this
section, Posterior Marginal Pose Estimation (PMPE; see
footnote 2), builds on MAP model matching. It provides a smooth
objective  function  for  evaluating  hypotheses  that  are 
simply  the  pose  of  the  object.  The  pose  is  the  most  im¬ 
portant  aspect  of  the  problem,  in  the  sense  that  knowing 
the  pose  enables  grasping  or  other  interaction  with  the 
object. 

PMPE  was  motivated  by  the  observation  that  in 
heuristic  searches  over  correspondences  with  the  objec¬ 
tive  function  of  MAP  model  matching,  hypotheses  hav¬ 
ing  implausible  matches  scored  poorly  in  the  objective 
function.  Additional  motivation  was  provided  by  the 
work of Yuille, Geiger and Bülthoff on stereo [1]. They
discussed  computing  disparities  in  a  statistical  theory  of 
stereo  where  a  marginal  is  computed  over  matches. 

Here we use the same strategy as MAP Model Matching
for evaluating hypotheses now consisting only of pose;
they are evaluated by their posterior probability, given
an image: $p(\beta \mid Y)$. The posterior probability density
of the pose may be computed from the joint posterior
probability on pose and match, by formally taking the
marginal over possible matches:

$$p(\beta \mid Y) = \sum_{\Gamma} p(\Gamma, \beta \mid Y) .$$

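In Section 5, Equation 9, $p(\Gamma, \beta \mid Y)$ was obtained via
Bayes’ rule from probabilistic models of image features,
correspondences, and the pose. Substituting for
$p(\Gamma, \beta \mid Y)$, the posterior marginal may be written as

$$p(\beta \mid Y) = \sum_{\Gamma} \frac{p(Y \mid \Gamma, \beta) \, p(\Gamma, \beta)}{p(Y)} .$$

Using Equations 1 (the independent feature model)
and 7, we may express the posterior marginal of $\beta$ in
terms of the component densities:

$$p(\beta \mid Y) = \frac{p(\beta)}{p(Y)} \sum_{\Gamma_1} \sum_{\Gamma_2} \cdots \sum_{\Gamma_n} \prod_i \left[ p(Y_i \mid \Gamma_i, \beta) \, p(\Gamma_i) \right] .$$

Breaking one factor out of the product gives

$$p(\beta \mid Y) = \frac{p(\beta)}{p(Y)} \left[ \sum_{\Gamma_1} \cdots \sum_{\Gamma_{n-1}} \prod_{i=1}^{n-1} p(Y_i \mid \Gamma_i, \beta) \, p(\Gamma_i) \right] \sum_{\Gamma_n} p(Y_n \mid \Gamma_n, \beta) \, p(\Gamma_n) .$$

Continuing in similar fashion yields

$$p(\beta \mid Y) = \frac{p(\beta)}{p(Y)} \prod_i \left[ \sum_{\Gamma_i} p(Y_i \mid \Gamma_i, \beta) \, p(\Gamma_i) \right] .$$

This may be written as

$$p(\beta \mid Y) = \frac{p(\beta)}{p(Y)} \prod_i p(Y_i \mid \beta) , \tag{10}$$

since

$$p(Y_i \mid \beta) = \sum_{\Gamma_i} p(Y_i \mid \Gamma_i, \beta) \, p(\Gamma_i) . \tag{11}$$

Splitting the $\Gamma_i$ sum into its cases,

$$p(Y_i \mid \beta) = p(Y_i \mid \Gamma_i = \perp, \beta) \, p(\Gamma_i = \perp) + \sum_{M_j} p(Y_i \mid \Gamma_i = M_j, \beta) \, p(\Gamma_i = M_j) .$$

Substituting the densities assumed in the model of Section
5 in Equations 6 and 2 yields

$$p(Y_i \mid \beta) = \frac{B_i}{W_1 W_2 \cdots W_v} + \frac{1 - B_i}{m} \sum_{M_j} G_{\psi_{ij}}(Y_i - \mathcal{P}(M_j, \beta)) . \tag{12}$$

Installing this into Equation 10 leads to

$$p(\beta \mid Y) = \frac{B_1 B_2 \cdots B_n}{(W_1 W_2 \cdots W_v)^n} \, \frac{p(\beta)}{p(Y)} \prod_i \left[ 1 + \frac{W_1 W_2 \cdots W_v}{m} \, \frac{1 - B_i}{B_i} \sum_{M_j} G_{\psi_{ij}}(Y_i - \mathcal{P}(M_j, \beta)) \right] .$$

Footnote 2: An early version of this work appeared in [15].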
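
The step from a sum over all interpretations to a product of per-feature sums can be checked numerically on a small case. The sketch below enumerates every interpretation of three image features against two model features and compares the result with the factored form. It assumes a 1-D toy in which the pose is a shift only, and it leaves out the common factor p(beta) / p(Y); all values are illustrative.

```python
import itertools
import math


def gaussian(x, var):
    return math.exp(-0.5 * x * x / var) / math.sqrt(2.0 * math.pi * var)


def term(y, gamma_i, xs, shift, B, W, var):
    """p(Y_i | Gamma_i, beta) p(Gamma_i) for one image feature."""
    if gamma_i is None:
        return B / W
    return (1.0 - B) / len(xs) * gaussian(y - (xs[gamma_i] + shift), var)


def marginal_brute_force(ys, xs, shift, B, W, var):
    """Sum over every interpretation Gamma of the product over features."""
    labels = [None] + list(range(len(xs)))
    total = 0.0
    for assignment in itertools.product(labels, repeat=len(ys)):
        p = 1.0
        for y, gamma_i in zip(ys, assignment):
            p *= term(y, gamma_i, xs, shift, B, W, var)
        total += p
    return total


def marginal_factored(ys, xs, shift, B, W, var):
    """Product over features of the per-feature sum over its labels."""
    labels = [None] + list(range(len(xs)))
    p = 1.0
    for y in ys:
        p *= sum(term(y, gamma_i, xs, shift, B, W, var) for gamma_i in labels)
    return p


xs = [0.0, 10.0]
ys = [5.2, 14.9, 60.0]
brute = marginal_brute_force(ys, xs, 5.0, 0.5, 100.0, 1.0)
factored = marginal_factored(ys, xs, 5.0, 0.5, 100.0, 1.0)
```

The brute-force sum has (m + 1)^n terms while the factored form needs only n(m + 1) evaluations, which is what makes the posterior marginal practical to compute.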

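The objective function for posterior marginal pose estimation
may be defined as the scaled logarithm of the
posterior marginal probability of the pose, as follows,

$$L(\beta) \equiv \ln \left[ \frac{p(\beta \mid Y)}{K} \right] ,$$

where K is a constant that has the following definition:

$$K \equiv \frac{B_1 B_2 \cdots B_n}{(W_1 W_2 \cdots W_v)^n} \, (2\pi)^{-z/2} \, |\psi_\beta|^{-1/2} \, \frac{1}{p(Y)} .$$

K has been chosen to simplify the form of the objective
function. This leads to the following expression for the
objective function (use of a normal pose prior is assumed,
as in Equation 8):

$$L(\beta) = -\frac{1}{2} (\beta - \beta_0)^T \psi_\beta^{-1} (\beta - \beta_0) + \sum_i \ln \left[ 1 + \frac{W_1 W_2 \cdots W_v}{m} \, \frac{1 - B_i}{B_i} \sum_j G_{\psi_{ij}}(Y_i - \mathcal{P}(M_j, \beta)) \right] . \tag{13}$$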

This  objective  function  for  evaluating  pose  hypotheses 
is  a  smooth  function  of  the  pose.  Methods  of  continuous 
optimization  may  be  used  to  search  for  local  maxima, 
although  starting  values  are  an  issue. 

The first term in the PMPE objective function (Equation
13) is due to the pose prior. It is a quadratic penalty
for  deviations  from  the  nominal  pose.  The  second  term 
essentially  measures  the  degree  of  alignment  of  the  ob¬ 
ject  model  with  the  image.  It  is  a  sum  taken  over  im¬ 
age  features  of  a  smooth  non-linear  function  that  peaks 
up  positively  when  the  pose  brings  object  features  into 
alignment  with  the  image  feature.  The  logarithmic  term 
will  be  near  zero  if  there  are  no  model  features  close  to 
the  image  feature  in  question. 
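
The behaviour of the two terms can be seen in a small sketch of the objective function. It assumes a 1-D toy in which the pose is a single shift, one variance for all matches, the same background probability for every image feature, and an optional normal prior on the shift; the data are invented for illustration.

```python
import math


def gaussian(x, var):
    return math.exp(-0.5 * x * x / var) / math.sqrt(2.0 * math.pi * var)


def pmpe_objective(ys, xs, shift, B, W, var, prior_mean=None, prior_var=None):
    """1-D toy version of the PMPE objective for a shift-only pose."""
    total = 0.0
    if prior_var is not None:
        # Quadratic penalty for deviation from the nominal pose.
        total -= 0.5 * (shift - prior_mean) ** 2 / prior_var
    k = (W / len(xs)) * (1.0 - B) / B
    for y in ys:
        # Smooth alignment term: large when some model feature lands near y.
        total += math.log(1.0 + k * sum(gaussian(y - (x + shift), var) for x in xs))
    return total


xs = [0.0, 10.0, 20.0]
ys = [5.2, 14.9, 25.1, 60.0]
at_truth = pmpe_objective(ys, xs, 5.0, B=0.5, W=100.0, var=1.0)
at_zero = pmpe_objective(ys, xs, 0.0, B=0.5, W=100.0, var=1.0)
at_forty = pmpe_objective(ys, xs, 40.0, B=0.5, W=100.0, var=1.0)
isolated = pmpe_objective([500.0], xs, 5.0, B=0.5, W=100.0, var=1.0)
```

The shift that aligns three model features scores highest, a shift that aligns only one scores lower, and an image feature far from every projected model feature contributes essentially nothing.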

When  the  MRF  correspondence  model  mentioned  in 
Section  2.1  is  used  in  PMPE,  a  simple  closed  form  ex¬ 
pression  for  the  estimator  no  longer  obtains.  However,  a 
tractable  dynamic  programming  method  for  evaluating 
the  resulting  PMPE  objective  function  at  a  specific  pose 
is described in [11].

7  Expectation  -  Maximization 
Algorithm 

The Expectation - Maximization (EM) algorithm was
introduced in its general form by Dempster, Laird and
Rubin in 1977 [16]. It is often useful for computing estimates
in domains having two sample spaces, where
the  events  in  one  are  unions  over  events  in  the  other. 
This  situation  holds  among  the  sample  spaces  of  poste¬ 
rior  marginal  pose  estimation  (PMPE)  and  MAP  model 
matching.  In  the  original  paper,  the  wide  generality  of 
the  EM  algorithm  is  discussed,  along  with  several  previ¬ 
ous  appearances  in  special  cases,  and  convergence  results 
are  described. 

In  this  section,  a  specific  form  of  the  EM  algorithm 
is  described  for  use  with  PMPE.  It  is  used  for  hypoth¬ 
esis  refinement  in  the  recognition  experiments  that  are 
described  in  Section  8.  A  linear  model  is  assumed  for 
feature  projection. 

In PMPE, the pose of an object, $\beta$, is estimated by
maximizing its posterior probability, given an image:

$$\hat{\beta} = \arg\max_{\beta} \; p(\beta \mid Y) .$$

A necessary condition for the maximum is that the
gradient of the posterior probability with respect to the
pose be zero, or equivalently, that the gradient of the
logarithm of the posterior probability be zero:

$$0 = \nabla_\beta \ln p(\beta \mid Y) . \tag{14}$$


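Imposing the condition of Equation 14 on the posterior
probability of the pose of an object, given an image, of
Equation 10 yields the following,

$$0 = \frac{\nabla_\beta \, p(\beta)}{p(\beta)} + \sum_i \frac{\nabla_\beta \, p(Y_i \mid \beta)}{p(Y_i \mid \beta)} . \tag{15}$$

Using the linear projection model, Equation 12 may
be re-written as

$$p(Y_i \mid \beta) = \frac{B_i}{W_1 W_2 \cdots W_v} + \frac{1 - B_i}{m} \sum_j G_{\psi_{ij}}(Y_i - M_j \beta) .$$

The zero gradient condition of Equation 15 may now be
expressed as follows,

$$0 = \frac{\nabla_\beta \, p(\beta)}{p(\beta)} + \sum_i \frac{\frac{1 - B_i}{m} \sum_j \nabla_\beta \, G_{\psi_{ij}}(Y_i - M_j \beta)}{\frac{B_i}{W_1 W_2 \cdots W_v} + \frac{1 - B_i}{m} \sum_j G_{\psi_{ij}}(Y_i - M_j \beta)} .$$

Using the normal pose prior, and differentiating the normal
densities yields

$$0 = \psi_\beta^{-1} (\beta_0 - \beta) + \sum_i \frac{\frac{1 - B_i}{m} \sum_j G_{\psi_{ij}}(Y_i - M_j \beta) \, M_j^T \psi_{ij}^{-1} (Y_i - M_j \beta)}{\frac{B_i}{W_1 W_2 \cdots W_v} + \frac{1 - B_i}{m} \sum_j G_{\psi_{ij}}(Y_i - M_j \beta)} .$$

Finally, the zero gradient condition may be expressed
compactly as follows,

$$0 = \psi_\beta^{-1} (\beta_0 - \beta) + \sum_{ij} W_{ij} \, M_j^T \psi_{ij}^{-1} (Y_i - M_j \beta) , \tag{16}$$

with the following definition:

$$W_{ij} = \frac{G_{\psi_{ij}}(Y_i - M_j \beta)}{\frac{B_i}{1 - B_i} \, \frac{m}{W_1 W_2 \cdots W_v} + \sum_{j'} G_{\psi_{ij'}}(Y_i - M_{j'} \beta)} . \tag{17}$$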

Equation 16 has the appearance of being a linear equation
for the pose estimate $\beta$ that satisfies the zero gradient
condition for being a maximum. Unfortunately, it
is not a linear equation, because the $W_{ij}$ (the “weights”) are
not constants; they are functions of $\beta$. To find solutions
to Equation 16, the EM algorithm iterates the following
two steps:

•  Treating the weights, $W_{ij}$, as constants, solve Equation
16 as a linear equation for a new pose estimate
$\beta$. This is referred to as the M step.

•  Using the most recent pose estimate $\beta$, re-evaluate
the weights, $W_{ij}$, according to Equation 17. This is
referred to as the E step.

The M step is so named because, in the exposition
of the algorithm in [16], it corresponds to a maximum
likelihood estimate. As discussed there, the algorithm
is also amenable to use in MAP formulations (like this
one). Here the M step corresponds to a MAP estimate of
the pose, given that the current estimate of the weights
is correct.

The E step is so named because calculating the $W_{ij}$
corresponds to taking the expectation of some random
variables, given the image data and assuming that the most
recent pose estimate is correct. These random variables have
value 1 if the i’th image feature corresponds to the j’th
object feature, and 0 otherwise. Thus, after the iteration
converges, the weights provide a continuous-valued
estimate of the correspondences that varies between 0
and 1.

It  seems  somewhat  ironic  that,  having  abandoned  the 
correspondences  as  being  part  of  the  hypothesis  in  the 
formulation  of  PMPE,  a  good  estimate  of  them  has  re¬ 
appeared  as  a  byproduct  of  a  method  for  search  in  pose 
space.  This  estimate,  the  posterior  expectation,  is  the 
minimum  variance  estimator. 

Being  essentially  a  local  method  of  non-linear  opti¬ 
mization,  the  EM  algorithm  needs  good  starting  values 
in  order  to  converge  to  the  right  local  maximum.  It  may 
be  started  on  either  step.  If  it  is  started  on  the  E  step, 
an initial pose estimate is required. When started on the
M  step,  an  initial  set  of  weights  is  needed. 
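
The iteration can be sketched compactly for a toy problem. The sketch assumes 1-D features, a pose consisting of a scale and a shift (so that each model feature x_j acts as the 1x2 matrix [x_j, 1]), a single variance, the same background probability for every image feature, and a flat pose prior, so that the prior term drops out and the M step reduces to weighted least squares. It is started on the E step from an initial pose. The data and parameter values are invented for illustration and are not those of the experiments in Section 8.

```python
import math


def gaussian(x, var):
    return math.exp(-0.5 * x * x / var) / math.sqrt(2.0 * math.pi * var)


def e_step(ys, xs, beta, B, W, var):
    """Re-evaluate the weights for the current pose (toy form of Equation 17)."""
    scale, shift = beta
    background = (B / (1.0 - B)) * (len(xs) / W)
    weights = []
    for y in ys:
        g = [gaussian(y - (scale * x + shift), var) for x in xs]
        denom = background + sum(g)
        weights.append([gj / denom for gj in g])
    return weights


def m_step(ys, xs, weights):
    """Solve the weighted normal equations for (scale, shift), weights fixed."""
    sxx = sx = sw = sxy = sy = 0.0
    for y, row in zip(ys, weights):
        for x, w in zip(xs, row):
            sxx += w * x * x
            sx += w * x
            sw += w
            sxy += w * x * y
            sy += w * y
    det = sxx * sw - sx * sx
    return ((sxy * sw - sx * sy) / det, (sxx * sy - sx * sxy) / det)


def em(ys, xs, beta, B, W, var, iterations=20):
    for _ in range(iterations):
        weights = e_step(ys, xs, beta, B, W, var)
        beta = m_step(ys, xs, weights)
    return beta, e_step(ys, xs, beta, B, W, var)


xs = [0.0, 10.0, 20.0, 30.0]
ys = [5.1, 15.8, 27.2, 38.1, 70.0]  # last value plays the role of clutter
beta, weights = em(ys, xs, (1.0, 4.0), B=0.3, W=100.0, var=1.0)
```

Starting from a scale of 1.0 and a shift of 4.0, the estimate moves toward the scale of about 1.1 and shift of about 5 that generated the first four values, and the final weights assign the clutter feature almost no correspondence to any model feature.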

An initial set of weights can be obtained from a partial
hypothesis of correspondences in a simple manner.
The  weights  associated  with  each  set  of  corresponding 
features  in  the  hypothesis  are  set  to  1,  the  rest  to  0. 
Indexing  methods  are  one  source  of  such  hypotheses. 
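
Constructing the initial weights from a partial hypothesis is direct. In this sketch the hypothesis is assumed to be a list of (image feature index, object feature index) pairs, which is an illustrative representation rather than one specified in the paper.

```python
def initial_weights(num_image_features, num_object_features, hypothesis):
    """Weights of 1 for hypothesized correspondences and 0 elsewhere."""
    weights = [[0.0] * num_object_features for _ in range(num_image_features)]
    for i, j in hypothesis:
        weights[i][j] = 1.0
    return weights


start = initial_weights(3, 2, [(0, 1), (2, 0)])
```

When the algorithm is started on the M step in this way, the hypothesis must contain enough correspondences to determine the pose.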

8  Range  Feature  Recognition 
Experiment 

In this section, an experiment is described where the EM
algorithm is used in a coarse-fine scheme to estimate
the pose of a vehicle appearing in synthetic range images,
without the need for feature correspondence information.
The region of convergence of the coarse-fine algorithm
is explored. The object has four degrees of freedom:
translation, rotation, and scaling in the plane. Similar
experiments, with full freedom of motion in 3D, are
described in [11].

8.1  Preparation  of  Features 

The  preparation  of  the  features  used  in  the  experiment 
is  summarized  in  Figure  4.  The  features  were  oriented- 
range  features,  as  described  in  Section  4.1.  Two  sets  of 
features  were  prepared,  the  “model  features”,  and  the 
“image features”.

The  object  model  features  were  derived  from  a  syn¬ 
thetic  range  image  of  an  M35  truck  that  was  created 
using  the  ray  tracing  program  associated  with  the  BRL 
CAD  Package  [17].  The  ray  tracer  was  modified  to  pro¬ 
duce  range  images  instead  of  shaded  images.  The  syn¬ 
thetic  range  image  appears  in  the  first  image  of  Figure 
5. 

In  order  to  simulate  a  laser  radar,  the  synthetic  range 
image  described  above  was  corrupted  with  simulated 
laser  radar  sensor  noise,  using  a  sensor  noise  model 
that  is  described  by  Shapiro,  Reinhold,  and  Park  [18]. 
In  this  noise  model,  measured  ranges  are  either  valid 
or  anomalous.  Valid  measurements  are  normally  dis¬ 
tributed,  and  anomalous  measurements  are  uniformly 
distributed. The corrupted range image appears as the
second image in Figure 5. To simulate post-sensor
processing, the corrupted image was “restored” via a
statistical restoration method of Menon and Wells [19]. The
restored range image appears as the third image of
Figure 5.
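The valid/anomalous noise model described above can be sketched as a two-component mixture. This is an illustrative sketch only: the anomaly probability, the standard deviation, and the range interval are hypothetical values chosen for the example, not parameters taken from [18].

```python
import random


def corrupt_range(true_range, sigma, p_anomalous, r_min, r_max, rng):
    """Return one simulated range measurement.

    With probability p_anomalous the measurement is anomalous and is drawn
    uniformly over [r_min, r_max]; otherwise it is valid and is drawn from
    a normal distribution centered on the true range.
    """
    if rng.random() < p_anomalous:
        return rng.uniform(r_min, r_max)
    return rng.gauss(true_range, sigma)


def corrupt_image(image, sigma=0.1, p_anomalous=0.2,
                  r_min=0.0, r_max=100.0, seed=0):
    """Apply the noise model independently to every pixel of a range image."""
    rng = random.Random(seed)
    return [[corrupt_range(r, sigma, p_anomalous, r_min, r_max, rng)
             for r in row] for row in image]


# Hypothetical 3 x 4 range image of a flat surface at range 50.
clean = [[50.0] * 4 for _ in range(3)]
noisy = corrupt_image(clean)
```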

Oriented-range features, as described in Section 4.1,
were extracted from the synthetic range image, for use
as model features, and from the restored range image;
the latter are called the noisy features. The features
were extracted from the range images in the following
manner. Range discontinuities were located by thresholding
differences between neighboring pixels, yielding range
discontinuity curves. These curves were then segmented into
approximately 20-pixel-long segments via a process of line
segment approximation. The line segments (each representing
a fragment of a range discontinuity curve) were then
converted into oriented-range features as follows.
The X and Y coordinates of the feature were
obtained from the mean of the pixel coordinates. The
normal vector to the pixels was obtained via least-squares
line fitting. The range to the feature was estimated by
taking the mean of the pixel ranges on the near side of
the discontinuity. This information was packaged into an
oriented-range feature, as described in Section 4.1. The
model features are shown in the first image of Figure 6.
Each line segment represents one oriented-range feature;
the ticks on the segments indicate the near side of the
range discontinuity. There are 113 such object features.
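The conversion of one line segment into an oriented-range feature can be sketched as below. This is a sketch under stated assumptions: the line fit is done by orthogonal (total) least squares, the sign of the normal is left unresolved, and the dictionary layout is invented for the example; the packaging actually used is the one defined in Section 4.1.

```python
import math


def oriented_range_feature(pixels, near_side_ranges):
    """Summarize one range-discontinuity segment.

    pixels: list of (x, y) pixel coordinates along the segment.
    near_side_ranges: range values of the pixels on the near side.
    """
    n = len(pixels)
    mx = sum(x for x, _ in pixels) / n           # feature position: mean of
    my = sum(y for _, y in pixels) / n           # the pixel coordinates
    sxx = sum((x - mx) ** 2 for x, _ in pixels)
    syy = sum((y - my) ** 2 for _, y in pixels)
    sxy = sum((x - mx) * (y - my) for x, y in pixels)
    # Direction of the best-fit line (principal axis of the scatter matrix).
    theta = 0.5 * math.atan2(2.0 * sxy, sxx - syy)
    normal = (-math.sin(theta), math.cos(theta))  # perpendicular to the line
    mean_range = sum(near_side_ranges) / len(near_side_ranges)
    return {"x": mx, "y": my, "normal": normal, "range": mean_range}


# Hypothetical 20-pixel horizontal segment at y = 3.
segment = [(float(x), 3.0) for x in range(20)]
feature = oriented_range_feature(segment, [10.0, 12.0])
```

For a horizontal segment the fitted normal points along the Y axis, which is a quick sanity check on the fit.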

The noisy features, derived from the restored range
image, appear in the second image of Figure 6. There
are 62 noisy features. Some features have been lost
due to the corruption and restoration of the range image.
The set of image features was prepared from the
noisy features by randomly deleting half of the features,
transforming the survivors according to a test pose, and
adding sufficient randomly generated features so that 1/8
of the features are due to the object. The 248 image
features appear in the third image of Figure 6.
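The counts in this paragraph are consistent with one another: half of 62 is 31 surviving features, and 31 of 248 is one eighth. The sketch below reproduces that bookkeeping. It is a simplification, not the procedure used in the experiment: features are reduced to 2D points, and the pose parameterization, clutter extent, and random seed are hypothetical.

```python
import math
import random


def prepare_image_features(noisy, pose, object_fraction, extent, seed=0):
    """Delete half the features, move the rest by a pose, then add clutter.

    pose: (tx, ty, theta, scale), a 4-degree-of-freedom planar similarity.
    object_fraction: desired fraction of output features due to the object.
    extent: clutter is drawn uniformly from [0, extent] x [0, extent].
    """
    rng = random.Random(seed)
    survivors = rng.sample(noisy, len(noisy) // 2)
    tx, ty, theta, scale = pose
    c, s = scale * math.cos(theta), scale * math.sin(theta)
    moved = [(c * x - s * y + tx, s * x + c * y + ty) for x, y in survivors]
    total = round(len(moved) / object_fraction)
    clutter = [(rng.uniform(0, extent), rng.uniform(0, extent))
               for _ in range(total - len(moved))]
    features = moved + clutter
    rng.shuffle(features)
    return features


# 62 hypothetical noisy feature positions, as many as in the experiment.
noisy_points = [(float(i), float(2 * i)) for i in range(62)]
image_points = prepare_image_features(
    noisy_points, pose=(10.0, 5.0, 0.3, 1.2),
    object_fraction=1 / 8, extent=200.0)
```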

8.2  Coarse-Fine  Method 

A  coarse-fine  search  method  was  used  for  finding  max¬ 
ima  of  the  pose-space  objective  function.  Two  levels  of 
smoothing the objective function were used. Peaks, initially
located at the coarsest scale, are used as starting
values for a search at the next (less smooth) scale.
Finally,  results  of  the  second  level  search  are  used  as 
the  initial  values  for  search  in  the  un-smoothed  objec¬ 
tive  function.  This  coarse-fine  method  combines  the  ac¬ 
curacy  of  the  un-smoothed  objective  function  with  the 
larger  region  of  convergence  of  the  smoothed  objective 
function. 

Figure 5: Synthetic Range Image, Noisy Range Image, and Restored Range Image

Figure 6: Model Features, Noisy Features, and Image Features

The objective function was smoothed by replacing the
stationary feature covariance matrix $\hat{\psi}$ in the following
manner:

$$\hat{\psi} \leftarrow \hat{\psi} + \psi_s .$$

The effect of the smoothing matrix $\psi_s$ is to increase
the spatial scale of the covariance matrices that appear
in the objective function. The smoothing matrices used
in the experiment were as follows,

$$\mathrm{DIAG}\left((2.0)^2, (2.0)^2, (.01)^2, (.01)^2\right)$$

and

$$\mathrm{DIAG}\left((5.0)^2, (5.0)^2, (.025)^2, (.025)^2\right),$$

where DIAG( ) constructs diagonal matrices from its arguments.
These smoothing matrices were determined empirically.
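The smoothing schedule can be sketched as a list of matrices added to the stationary covariance, one per level. Two assumptions are made here: the matrix entries are read as squared scales, and the matrix with the larger entries is taken to be the coarsest level. The example covariance is hypothetical.

```python
def diag(*entries):
    """Construct a diagonal matrix (list of rows) from its arguments."""
    n = len(entries)
    return [[entries[i] if i == j else 0.0 for j in range(n)]
            for i in range(n)]


def mat_add(a, b):
    """Element-wise sum of two matrices of equal shape."""
    return [[x + y for x, y in zip(row_a, row_b)]
            for row_a, row_b in zip(a, b)]


# Coarsest level first; the last entry is the un-smoothed objective.
SMOOTHING_LEVELS = [
    diag(5.0 ** 2, 5.0 ** 2, 0.025 ** 2, 0.025 ** 2),
    diag(2.0 ** 2, 2.0 ** 2, 0.01 ** 2, 0.01 ** 2),
    diag(0.0, 0.0, 0.0, 0.0),
]


def smoothed_covariances(psi):
    """Return the covariance to use at each level of the coarse-fine search."""
    return [mat_add(psi, smoothing) for smoothing in SMOOTHING_LEVELS]


# Hypothetical stationary feature covariance.
psi = diag(1.0, 1.0, 0.001, 0.001)
levels = smoothed_covariances(psi)
```

A coarse-fine search would run the local optimizer once per entry of `levels`, passing the pose found at one level as the starting value for the next.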

8.3  Results 

Figure 7 illustrates the approximate region of convergence
of the coarse-fine search with the EM algorithm
described above, when used with the image and model
features  described  in  Section  8.1.  Eight  poses  are  dis¬ 
played  with  the  truck  shown  with  light  lines  -  the  true 
pose  of  the  truck  is  shown  for  reference  with  heavy  lines. 
These  eight  poses  are  displacements  from  the  true  pose 
in  both  directions  along  the  four  coordinate  axes  of  the 
pose  space.  They  represent,  approximately,  the  region 
of  convergence  of  the  coarse-fine  method  -  the  algorithm 
converges  from  these  poses,  but  not  from  much  farther 
away  from  the  true  pose. 

An  image  displaying  a  sequence  of  poses  from  an  EM 
iteration  at  the  coarsest  scale  appears  in  Figure  8.  The 
algorithm  converged  in  21  iterations. 

9  Related  Work 

Green [20] and Shapiro and Green [21] describe a theory
of maximum-likelihood laser radar range profiling. The
research focuses on statistically optimal detectors and
recognizers. The single pixel statistics are described by
a mixture of uniform and normal components. Range
profiling is implemented using the EM algorithm. Under
some circumstances, least squares provides an adequate
starting value. A continuation-style variant is described,
where a range accuracy parameter is varied between EM
convergences from a coarse value to its true value.

Cass  [22]  describes  an  approach  to  visual  object  recog¬ 
nition  that  searches  in  pose  space  for  maximal  align¬ 
ments  under  the  bounded-error  model.  The  pose-space 
objective  function  used  there  is  piecewise  constant,  and 
is  thus  not  amenable  to  continuous  search  methods.  The 
search  is  based  on  a  geometric  formulation  of  the  con¬ 
straints  on  feasible  transformations. 

There  are  some  connections  between  PMPE  and  stan¬ 
dard  methods  of  robust  pose  estimation,  like  that  de¬ 
scribed  by  Haralick  [23].  Both  can  provide  robust  es¬ 
timates  of  the  pose  of  an  object,  based  on  an  observed 
image. The main difference is that the standard methods
require specification of the feature correspondences,
while PMPE does not, because it considers all possible
correspondences. PMPE requires a starting value for the
pose (as do standard methods of robust pose estimation
that use non-convex objective functions).

As mentioned above, Yuille, Geiger and Bülthoff [1]
discussed computing disparities in a statistical theory
of stereo where a marginal is computed over matches.
Yuille extends this technique [24] to other domains of
vision and neural networks, among them winner-take-all
networks, stereo, long-range motion, the traveling salesman
problem, deformable template matching, learning,
content addressable memories, and models of brain
development. In addition to computing marginals over
discrete fields, the Gibbs probability distribution is used.
This facilitates continuation-style optimization methods
by variation of the temperature parameter. There
are some similarities between this approach and using
coarse-fine with the PMPE objective function.

There  is  a  similarity  between  posterior  marginal 
pose  estimation  and  Hough  transform  (HT)  methods. 
Roughly  speaking,  HT  methods  evaluate  parameters  by 
accumulating  votes  in  a  discrete  parameter  space,  based 
on  observed  features.  (See  the  survey  paper  by  Illing¬ 
worth  and  Kittler  [25]  for  a  discussion  of  Hough  meth¬ 
ods.) 

In  a  recognition  application,  as  described  here,  the  HT 
method  would  evaluate  a  discrete  pose  by  counting  the 
number  of  feature  pairings  that  are  exactly  consistent 
somewhere  within  the  cell  of  pose  space.  As  stated,  the 
HT  method  has  difficulties  with  noisy  features.  This 
is  usually  addressed  by  counting  feature  pairings  that 
are  exactly  consistent  somewhere  nearby  the  cell  in  pose 
space. 

The  utility  of  the  HT  as  a  stand-alone  method  for 
recognition  in  the  presence  of  noise  is  a  topic  of  some 
controversy. This is discussed by Grimson in [26], p. 220.
Perhaps this is due to an unsuitable noise model
implicit  in  the  Hough  Transform. 

PMPE  evaluates  a  pose  by  accumulating  the  logarithm 
of  posterior  marginal  probability  of  the  pose  over  image 
features. 

The  connection  between  HT  methods  and  parameter 
evaluation  via  the  logarithm  of  posterior  probability  has 
been  described  by  Stephens  [27].  Stephens  proposes  to 
call the posterior probability of parameters given image
observations “The Probabilistic Hough Transform.”
He  provided  an  example  of  estimating  line  parameters 
where  image  feature  point  probability  densities  were  de¬ 
scribed  as  having  uniform  and  normal  components.  He 
also  states  that  the  method  has  been  used  to  track  3D 
objects,  referring  to  his  thesis  [28]  for  definition  of  the 
method  used. 

Lipson [29] describes a system that does refinement of
alignments under Linear Combination of Views. The
method iterates solving linear systems. It differs by
matching model features to the nearest image feature
under the current pose hypothesis, while the method
described here entertains matches to all of the image
features, weighted by their probability.

Scalable Data Parallel Geometric Hashing: Experiments on MasPar
MP-1 and on Connection Machine CM-5

Ashfaq Khokhar and Viktor K. Prasanna *

(Summary of Results)


Abstract 

Geometric  hashing  has  been  recently  proposed  as  a  tech¬ 
nique  for  model  based  object  recognition  in  occluded 
scenes.  However,  this  technique  is  computationally  de¬ 
manding;  a  probe  of  the  recognition  phase  on  a  serial 
machine  can  take  several  minutes  to  complete.  In  this 
paper,  we  present  parallel  algorithms  and  several  fast 
parallel  implementations  on  MasPar  MP-1,  a  Single  In¬ 
struction  Multiple  Data  (SIMD)  machine,  and  on  the 
Connection  Machine  CM-5  operating  in  Single  Program 
Multiple  Data  (SPMD)  mode.  Based  on  the  results,  we 
compare  the  merits  of  the  above  classes  of  architectures 
for  vision.  In  earlier  parallel  implementations,  the  num¬ 
ber  of  processors  employed  is  independent  of  the  size  of 
the  scene  but  depends  on  the  size  of  the  model  database 
which  is  usually  very  large.  We  design  new  parallel 
algorithms  and  map  them  onto  MP-1  and  onto  CM-5. 
These techniques significantly reduce the number
of processors employed while at the same time achieving
much superior time performance. Earlier implementations
claim 700 to 1300 msec for a probe of the recognition
phase, assuming 200 feature points in the scene
on an 8K processor CM-2. Our implementations run on
a P processor machine, such that $1 \le P \le S$, where S
is the number of feature points in the scene. Our results
show that a probe of the recognition phase for a
scene consisting of 1024 feature points takes less than
50 msec on a 1K processor MP-1 and less than
10 msec on a 256 processor CM-5. The model database
used in these implementations contains 1024 models and
each model is represented by 16 feature points. The
implementations developed in this paper require a number of
processors independent of the size of the model database
and are scalable with the machine size. Results of
concurrent processing of multiple probes of the recognition
phase are also reported.

1  Introduction 

This paper continues our efforts in studying
parallel techniques for solving high level vision problems
[10; 9; 11; 12; 16]. In this paper, we present scalable
parallel  techniques  to  implement  the  object  recognition 
problem  using  geometric  hashing.  Object  recognition,  a 
high  level  vision  task,  is  a  key  step  in  an  integrated  vi¬ 
sion  system.  Most  model  based  recognition  systems  work 
by  hypothesizing  matches  between  scene  features  and 
model  features,  predicting  new  matches,  and  verifying 
or  changing  the  hypotheses  through  a  search  process  [3; 
4;  5;  6;  19].  Geometric  hashing  [20]  offers  a  different  and 
more  parallelizable  paradigm.  However,  parallel  tech¬ 
niques  are  needed  to  use  geometric  hashing  in  real  time 
applications. 

In geometric hashing, given a set of models and their
feature points, for each model, all possible pairs of the
feature points are designated as a basis set. The coordinates
of the feature points of a model are computed relative
to each member of its basis set. These coordinates
are then used as indices into a hash table. The records
in the hash table consist of (model, basis) pairs. In the
recognition phase, an arbitrary pair of feature points in
the scene is chosen as a basis and the coordinates of the
remaining feature points in the scene are computed relative
to it. The new coordinates are used to hash into the hash
table and the corresponding entries of the hashed bin are
accessed. The (model, basis) pair winning the maximum
number of votes is chosen as a candidate for matching.

There have been two prior efforts in parallelizing the
geometric hashing algorithm [1; 18]. Both implementations
have been performed on SIMD hypercube based
machines. These implementations are among the early
efforts in using parallel techniques to solve high-level
vision problems. One of the major problems in both
implementations is the requirement of a large number of
processors. In our results, we exploit the fact that the
number of votes cast in an iteration of the recognition
phase is bounded by S, the number of feature points in
the scene. Therefore, no more than S locations of the
hash table are accessed during the execution of the
recognition algorithm. This allows us to reduce the number
of processors employed to at most S. Previous
implementations used $O(Mn^3)$ processors, i.e., the number
of bins in the hash table, where M is the number of models
in the database and n is the number of feature points in
each model. In addition, the implementation by Bourdon
and Medioni [1] suffers due to inefficiency of the routing
algorithm. This inefficiency limits the scope of their
implementation to a small model database. In [18],
Rigoutsos and Hummel suggest using radix sort to implement
histogramming, a technique used in their implementation to
count the votes for each (model, basis) pair. The use of radix
sort in histogramming is advantageous only if the number
of levels in the histogram is much less than the number
of data points [15]. In the case of geometric hashing, this
is not true. In this paper, we present several fast parallel
techniques to implement the object recognition problem
using geometric hashing on SIMD and SPMD machines.
We also provide implementation results of the techniques
developed in this paper on the Connection Machine CM-5
operating in SPMD mode, and on MasPar MP-1, an
SIMD machine.

In our implementations, we provide various partitioning,
mapping and routing techniques to address the
above issues. These lead to significantly fewer
processors being used, while achieving much superior
time performance. Earlier implementations [1;
18] claim 700 to 1300 msec for one probe of the recognition
phase, assuming a scene of 200 feature points, on
an 8K processor CM-2. We provide techniques to implement
the recognition phase on a P processor array,
such that $1 \le P \le S$, where S is the number of feature
points in the scene. Our results show that one probe
of the recognition phase for a scene consisting of 1024
feature points takes less than 50 msec on a 1K processor
MP-1 and less than 10 msec on a 256 processor
CM-5. The model database used in the implementations
contains 1024 models and each model is represented using
16 feature points. The implementations developed in
this paper require a number of processors independent of
the size of the model database and are scalable with the
machine size. Results of concurrent processing of multiple
probes of the recognition phase are also reported.
Based on the implementation results, we compare the
merits of classes of machines used for vision algorithms.

The organization of the paper is as follows. The
geometric hashing technique for object recognition is
outlined in Section 2. Section 3 discusses parallelism
in geometric hashing. In Section 4, implementation
details are shown and experimental results are tabulated
and compared. Conclusions are presented in
Section 5. This paper summarizes our work. Related
results and additional details can be found in [11;
7].

2  Object  Recognition  Using  Geometric 
Hashing 

In a model-based recognition system, a set of objects is
given and the task is to find instances of these objects
in a given scene. The objects are represented as sets of
geometric features, such as points or edges, and their
geometric relations are encoded using a minimal set of such
features. The task becomes more complex if the objects
overlap in the scene and/or other occluded unfamiliar
objects exist in the scene.

Many  model  based  recognition  systems  are  based  on 
hypothesizing  matches  between  scene  features  and  model 
features,  predicting  new  matches,  and  verifying  or  chang¬ 
ing  the  hypotheses  through  a  search  process.  Geometric 
hashing,  introduced  by  Lamdan  and  Wolfson  [20],  of¬ 
fers  a  different  and  more  parallelizable  paradigm.  It  can 
be  used  to  recognize  flat  objects  under  weak  perspec¬ 
tive.  For  the  sake  of  completeness,  we  briefly  outline  the 
geometric  hashing  technique  in  Section  2.1.  Additional 
details  can  be  found  in  [20]. 

2.1  Geometric  Hashing  Algorithm 

The  algorithm  consists  of  two  procedures,  preprocessing 
and  recognition.  These  are  shown  in  Figures  1  and  2 
respectively. 

Preprocessing: 

The preprocessing procedure is executed off-line
and only once. In this procedure, the model features
are encoded and are stored in a hash table
data structure. However, the information is stored
in a highly redundant multiple-viewpoint way.
Assume each model in the database has n feature
points. For each ordered pair of feature points
in the model chosen as a basis, the coordinates of
all other points in the model are computed in the
orthogonal coordinate frame defined by the basis
pair. Each such coordinate is quantized and is used
as an entry to a hash table, where the (model, basis)
pair at which the coordinate was obtained
is recorded. The complexity of this preprocessing
procedure is $O(n^3)$ for each model, hence $O(Mn^3)$
for M models.
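As a worked example of this count, the snippet below computes the number of records stored for the database size quoted in the abstract (1024 models, 16 feature points each). It assumes every ordered basis pair is used and every remaining point contributes one record; the function name is illustrative.

```python
def hash_table_records(num_models, points_per_model):
    """Number of (model, basis) records written during preprocessing.

    Each model has n * (n - 1) ordered basis pairs, and for each basis
    the other n - 2 points each produce one record.
    """
    n = points_per_model
    return num_models * n * (n - 1) * (n - 2)


records = hash_table_records(1024, 16)  # 1024 * 16 * 15 * 14 = 3,440,640
```

Under these assumptions the table holds about 3.4 million records, which is why the size of the model database dominates the processor count in the earlier implementations.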

Recognition: 

In the recognition procedure, a scene consisting of
S feature points is given as input. An arbitrary
ordered pair of feature points in the scene is chosen.
Taking this pair as a basis, the coordinates of
the remaining feature points are computed. Each
such coordinate is used as a key to enter the hash
table (constructed in the preprocessing phase), and
for every recorded (model, basis) pair at the
corresponding location, a vote is collected for that pair.
The pair winning the maximum number of votes
is taken as a matching candidate. The execution
of the recognition phase corresponding to one basis
pair is termed a probe. Finally, edges of the
matching candidate model are verified against the
scene edges. If no (model, basis) pair scores high
enough, another basis from the scene feature points
is chosen and another probe is executed. Therefore, the
worst case time complexity of the recognition procedure
is $O(S^3)$. However, if some classification for
choosing a basis from the scene is available, the
complexity can be reduced to $O(S)$ [13].

The time taken per probe depends on the hash function
employed. The vision community has experimented

Preprocessing()

for each model i such that 1 ≤ i ≤ M do
    Extract n feature points from the model;
    for j = 1 to n
        for k = 1 to n
            - Compute the coordinates of all other feature
              points in the model by taking this pair as basis.
            - Quantize each of the above computed coordinates
              and use it as a key to enter into a hash table
              where the pair (model, basis), i.e., (i, jk),
              is recorded.
        next k
    next j
next i
end

Figure 1: A sequential procedure to construct a hash
table consisting of (model, basis) pairs.


Recognition()

1. Extract S feature points from the scene;

2. Selection:
   Select a pair of feature points as basis.

3. Probe:
   a. Compute the coordinates of all other feature
      points in the scene relative to the selected basis.
   b. Quantize each of the above computed coordinates
      and use it as a key to access the hash table
      containing the entries of the (model, basis) pairs.
   c. Vote for the entries in the hash table.
   d. Select the (model, basis) pair with the maximum
      votes as the matched model in the scene.

4. Verification:
   Verify the candidate model edges against the
   scene edges.

5. If the model wins the verification process, remove the
   corresponding feature points from the scene.

6. Repeat steps 2, 3, 4, and 5 until some specified
   condition is met.

end

Figure 2: Outline of the steps in sequential recognition.
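The procedures of Figures 1 and 2 can be sketched together in a few lines. This is a minimal sequential sketch under several assumptions that are not taken from the paper: the basis frame has its origin at the first basis point and its unit X axis along the basis pair (which makes the coordinates invariant to similarity transformations), quantization rounds to a grid of hypothetical cell size 0.25, and verification (step 4) is omitted.

```python
from collections import Counter, defaultdict


def basis_coordinates(p, b0, b1):
    """Coordinates of p in the frame with origin b0 and unit X axis b1 - b0."""
    ux, uy = b1[0] - b0[0], b1[1] - b0[1]
    d2 = ux * ux + uy * uy
    vx, vy = p[0] - b0[0], p[1] - b0[1]
    return ((vx * ux + vy * uy) / d2, (vy * ux - vx * uy) / d2)


def quantize(coord, cell):
    """Map a coordinate pair to an integer grid cell used as the hash key."""
    return (round(coord[0] / cell), round(coord[1] / cell))


def preprocess(models, cell=0.25):
    """Figure 1: record (model, basis) under every quantized coordinate."""
    table = defaultdict(list)
    for i, points in enumerate(models):
        n = len(points)
        for j in range(n):
            for k in range(n):
                if j == k:
                    continue
                for m in range(n):
                    if m in (j, k):
                        continue
                    coord = basis_coordinates(points[m], points[j], points[k])
                    table[quantize(coord, cell)].append((i, (j, k)))
    return table


def probe(table, scene, basis, cell=0.25):
    """Figure 2, step 3: vote and return the best ((model, basis), votes)."""
    b0, b1 = scene[basis[0]], scene[basis[1]]
    votes = Counter()
    for idx, p in enumerate(scene):
        if idx in basis:
            continue
        key = quantize(basis_coordinates(p, b0, b1), cell)
        for record in table.get(key, ()):
            votes[record] += 1
    return votes.most_common(1)[0] if votes else None


# Hypothetical model, and a scene holding a rotated (90 degrees), scaled (x2),
# translated copy of it followed by one clutter point.
model = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 1)]
scene = [(5 - 2 * y, 7 + 2 * x) for x, y in model] + [(40.0, -3.0)]
table = preprocess([model])
winner, votes = probe(table, scene, basis=(0, 1))
```

With the scene basis taken from the first two scene points, the three remaining object points each vote for model 0 with basis (0, 1), and the clutter point falls in an empty bin.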


with various hash functions, and hash functions that distribute
the feature points uniformly over the hash table are
known [18]. We will be using these hash functions in our
implementations. Assuming that the S feature points of the
input scene lead to $O(S)$ votes in total, the voting
process in a probe of the recognition phase can be implemented
in $O(S \log S)$ time using sorting. Other parts
of the computation are time consuming, even though
they do not contribute to the time complexity. Note
that the total number of (model, basis) pairs is $O(Mn^2)$.
The voting time can be reduced to $O(S + Mn^2)$ by employing
$O(Mn^2)$ boxes to collect the votes. Throughout
this paper, we assume $S \ll Mn^2$.
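The two ways of tallying votes mentioned here can be contrasted in a short sketch. It assumes, for illustration only, that each (model, basis) pair has been given an integer identifier; neither function is taken from the paper.

```python
def winner_by_sorting(votes):
    """Tally by sorting: time O(S log S) for S votes, no per-pair storage."""
    best, best_count = None, 0
    run_value, run_length = None, 0
    for v in sorted(votes):
        if v == run_value:
            run_length += 1
        else:
            run_value, run_length = v, 1
        if run_length > best_count:
            best, best_count = run_value, run_length
    return best, best_count


def winner_by_boxes(votes, num_pairs):
    """Tally with one box per pair: time O(S + num_pairs)."""
    boxes = [0] * num_pairs
    for v in votes:
        boxes[v] += 1
    best = max(range(num_pairs), key=boxes.__getitem__)
    return best, boxes[best]


# Hypothetical votes for pair identifiers drawn from range(10).
cast = [7, 3, 7, 1, 3, 7]
by_sort = winner_by_sorting(cast)
by_box = winner_by_boxes(cast, num_pairs=10)
```

When the number of pairs is far larger than S, scanning all the boxes dominates, which is the situation the assumption at the end of the paragraph describes.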


3  Scalable  Data  Parallel  Geometric 
Hashing 

In  this  section,  we  present  parallel  techniques  to  imple¬ 
ment  the  recognition  phase  on  a  P  processor  machine. 
Algorithms  presented  in  this  section  are  implemented 
on  Connection  Machine  CM-5  operating  in  SPMD  mode 
and  on  MasPar  MP-1,  an  SIMD  array.  In  an  SIMD 
machine,  each  processor  executes  a  stream  of  instruc¬ 
tions  in  a  lock-step  mode  on  the  data  available  in  its 
local  memory.  The  instructions  are  broadcast  by  the 
control  unit.  The  SPMD  mode  of  execution  combines 
the  characteristics  of  SIMD  and  MIMD  modes.  In  this 
mode,  the  control  processor  broadcasts  a  section  of  the 
data  parallel  program  to  the  processing  nodes,  rather 
than  broadcasting  an  instruction  at  a  time  (as  in  a  typ¬ 
ical  SIMD  machine).  At  the  start  of  the  execution  of  a 
program,  the  complete  program  is  sent  to  all  the  nodes 
with  pseudo  synchronization  instructions  embedded  in 
the code. Each node executes the program independently
of the others until an embedded synchronization instruction
is  reached.  It  resumes  the  execution  of  the  program  only 
after  all  the  nodes  have  reached  the  synchronization  bar¬ 
rier.  The  control  unit  assists  in  enforcing  the  synchro¬ 
nization  barriers  embedded  in  the  program.  This  op¬ 
eration  mode  is  also  referred  to  as  synchronized  MIMD 
mode  [2]. 
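The synchronization pattern described above can be illustrated with a barrier. In this sketch Python threads stand in for processing nodes, which is purely an analogy and not how CM-5 nodes are programmed; the function name and data are hypothetical.

```python
import threading


def run_spmd(num_nodes, local_data):
    """Run the same program on every node, with one synchronization barrier."""
    partial = [0] * num_nodes
    totals = [0] * num_nodes
    barrier = threading.Barrier(num_nodes)

    def node(rank):
        # Phase 1: each node works independently on its own local data.
        partial[rank] = sum(local_data[rank])
        # No node proceeds until every node has reached this point.
        barrier.wait()
        # Phase 2: all partial results are now safe to read.
        totals[rank] = sum(partial)

    threads = [threading.Thread(target=node, args=(r,))
               for r in range(num_nodes)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return totals


result = run_spmd(4, [[1, 2], [3], [4, 5], [6]])
```

Without the barrier, a fast node could read `partial` before a slower node had written its entry.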

We  will  not  elaborate  on  parallelizing  the  preprocess¬ 
ing  phase,  since  it  is  a  one  time  process  and  can  be  car¬ 
ried  out  off-line.  However,  details  of  that  procedure  can 
be found in [7]. The size of the model database (the
number of hash bins) is $O(Mn^3)$.

3.1  Partitioned  Implementation  of  the 
Recognition  Procedure 

We use P processors such that $1 \le P \le S$, where S is
the number of feature points in a scene. Each Processing
Element (PE) in the array is assumed to have $O(Mn^3/P)$
memory.

The input is a scene consisting of S feature points. In
the recognition phase, possible occurrence of the models
(stored in the database) in the scene is checked. The
models are available in a hash table created during
preprocessing. All the models are allowed to undergo rigid
and/or similarity transformations. An arbitrary ordered
pair of feature points in the scene is chosen. Taking
this pair as a basis, a probe of the model database is
performed. The main steps of a parallel algorithm to
process a single probe of the recognition phase are given
in Figure 3.

As we are using fewer processors than the
size of the hash table, each PE will have several hash
table bins stored in its local memory. Two issues arise
during the execution of the procedure Parallel_Probe().

1. More than one feature point in the scene may cast
its vote to the same location in the hash table,
resulting in contention for a single memory location
in a PE (see Figure 4).

2. More than one feature point in the scene may cast
votes to different bins stored in a PE, resulting
in congestion at a PE (see Figure 5).

Parallel_Probe(S, P)

/* S is the number of features in the input scene
   and P is the number of processors.
   Initially each PE is assumed to have S/P distinct
   scene feature points stored in a local array FP[]. */

- Choose an arbitrary pair of feature points in the
  scene as a basis and broadcast it to all the PEs.
- Compute_Keys()
- Vote()
- Compute_Winner()
end

Figure 3: A parallel procedure to process a probe of the
recognition phase.
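The two issues that arise during a probe, contention for one bin and congestion at one PE, can be made concrete with a small counting sketch. The mapping of bins to PEs by `key % num_pes` is a hypothetical placement chosen for the example; the paper's mapping is not assumed.

```python
from collections import Counter


def contention_and_congestion(keys, num_pes):
    """Return (worst contention for a bin, worst congestion at a PE).

    keys: integer hash-bin indices, one per vote cast by a scene point.
    Bin k is assumed to be stored on PE number k % num_pes.
    """
    votes_per_bin = Counter(keys)
    votes_per_pe = Counter(key % num_pes for key in keys)
    return max(votes_per_bin.values()), max(votes_per_pe.values())


# Hypothetical keys: bin 5 is hit three times, and bins 5, 9, and 13
# all live on PE 1 when there are 4 PEs.
worst = contention_and_congestion([5, 5, 5, 9, 13, 2], num_pes=4)
```

Here three votes contend for bin 5, while PE 1 receives five of the six votes, so most of the work is serialized at one processor.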


Figure 4: Contention for a single hash bin.

Figure 5: Congestion at a PE while accessing different
hash bins stored in a PE.

The worst case in both cases is $O(S)$ contention
and congestion. Such a scenario can lead to no
speedup at all. On the other hand, other researchers
have used a large number of processors to avoid congestion
and contention problems. However, in their implementations
the processor utilization is extremely low. Also,
such solutions result in enormous communication overheads
in performing global operations, such as global
max and histogramming, as evident in the implementations
proposed in [1; 18]. In the following, we address
these issues and present efficient mapping and routing
techniques to resolve the contention and congestion
problems arising in performing a probe, while using a
small number of processors.

In order to eliminate the memory contention problem
in the array, we introduce a Merge_Keys() procedure
shown in Figure 6. This procedure sorts the hash table
keys corresponding to the input scene. The keys having
the same key value reside in a block of PEs and in each
block the least indexed PE holds the leader key. The
leader key holds the number of elements in its
block. During the voting process, each leader key
accesses the PE holding the corresponding location of the
hash table and casts a vote on behalf of all the keys in its
block, i.e., if there are m elements in the block, m votes
will be registered for the corresponding location in the
hash table. This reduces the number of accesses to the
hash table stored in a PE and thus reduces the traffic
over the network.

Similarly, in order to address the processor congestion
problem, we propose to store multiple copies of the hash
table in the array. In the worst case, a copy of the hash
table can be assigned to each sub-array of suitable size.
This increases the size of the memory required within
each processor. Each PE restricts its search for the hash
table bins to the corresponding sub-array. This solution
localizes the congestion problem to the sub-arrays.

Merge_Keys(INPUT)

-  Sort(INPUT)

In parallel, each PE_i, 0 ≤ i ≤ P − 1
for each distinct key j, 0 ≤ j ≤ S/P − 1,

Identify the leader key and mark it in the array
INPUT[j].

In parallel, each PE_i, 0 ≤ i ≤ P − 1
for each leader key,

Count the number of keys same as the leader key
and store it in OUTPUT[].

end

Figure  6:  A  parallel  procedure  to  merge  quantized  coor¬ 
dinates  of  feature  points  of  the  scene. 
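
The merge step of Figure 6 can be illustrated with a minimal serial sketch. It assumes that keys are small non-negative integers indexing hash table locations; the names `merge_keys` and `cast_votes` are hypothetical and do not come from the paper, and the parallel sort is replaced by an ordinary serial sort.

```python
from itertools import groupby


def merge_keys(keys):
    """Sort the keys and describe each block of equal keys.

    Returns (ordered, leaders), where each leader is a tuple
    (position_of_leader, key_value, block_size). The leader is the
    least-indexed element of its block, as in Figure 6.
    """
    ordered = sorted(keys)
    leaders = []
    position = 0
    for key, block in groupby(ordered):
        size = len(list(block))
        leaders.append((position, key, size))
        position += size
    return ordered, leaders


def cast_votes(leaders, table_size):
    """Each leader casts one additive vote of weight m for its whole block."""
    votes = [0] * table_size
    for _, key, size in leaders:
        votes[key] += size  # one access instead of `size` accesses
    return votes
```

With this arrangement a hash table location is accessed once per distinct key rather than once per scene point, which is the reduction in contention the text describes.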

In  the  following  analysis,  we  ignore  the  initialization 
costs,  such  as  loading  the  scene  points  to  the  processor 
array,  loading  hash  table  locations  to  the  processor  array, 
and  initialization  of  memory  locations  used  inside  each 
PE.  These  assumptions  are  also  made  in  the  previous 
implementations reported in [1; 18].

For asymptotic time analysis, we employ a model of the
CM-5 operating in SIMD mode. The fat tree [14] is the
underlying interconnection network of the CM-5. For the
MP-1, we assume an SIMD mesh array, shown in Figure 8.

Theorem 1 Given a fat tree architecture consisting of
P leaf nodes, one probe of the recognition phase can be
processed in O((S/P) log S) time on a scene consisting of S
feature points, where log* P < log S. □

Figure 8: A mesh array model.

The restriction on P can be relaxed if a sufficient number
of copies of the hash table is available. Based on the
above theorem, the algorithm for the recognition phase
is  processor  time  optimal  and  scales  linearly  with  P  for 
1  <  P  <  5*/*. 

Theorem 2 Given a mesh array of √P × √P processors,
such that P = S, one probe of the recognition phase
can be processed in O(√P) time on a scene consisting of
S feature points. □

Detailed  analysis  of  the  running  time  on  these  archi¬ 
tectures  can  be  found  in  [7]. 

4  Implementation  Details  and 
Experimental  Results 

In  this  section,  first,  we  describe  the  underlying  mod¬ 
els  of  the  machines  used  in  our  implementations  and 
then  present  data  mapping  and  partitioning  strategies 
employed  on  these  machines. 

4.1  The  Connection  Machine  CM-5 

A Connection Machine Model CM-5 system contains between
32 and 16,384 processing nodes. Each node is a
32 MHz SPARC processor with up to 32 Mbytes of local
memory. A 64-bit floating point vector processing unit
is optional with each node. Each processing node is a
general purpose computer that can fetch and interpret
its own instruction stream. System administration tasks
and serial user tasks are performed by control processors.
Input and output are performed via high-bandwidth
I/O interfaces.


4.2  The  MasPar  MP-1 

The MP-1 is a massively parallel SIMD computer system
with up to 16K Processing Elements (PEs). The system
consists of a high performance Unix workstation as Front
End (FE) and a Data Parallel Unit (DPU). The DPU
consists of PEs, each with up to 64 Kbytes of memory
and 192 bytes of register space. All PEs execute instructions
broadcast by an Array Control Unit (ACU) in lock
step. PEs have indirect addressing capability and can
be selectively disabled.

The machines used in the implementations differ
with respect to mode of operation, processing speed of the
PEs, and interprocessor communication bandwidth.

In the CM-5, a coarse grain SPMD machine, each processing
element (PE) is a powerful SPARC processor.
The high computing power of each PE and the relatively
expensive communication among processors motivate
partitioning the data such that the algorithms exhibit
less communication among PEs at the cost of redundant
computation within each PE. However, a balance
between these two is needed to attain speed-ups.

In the MP-1, a fine grain massively parallel SIMD machine,
each PE is a 4-bit processor. Interprocessor
communication is supported through two communication
networks: 1) the Xnet for regular communication and
2) the router for random communication. Because the
communication pattern in the geometric hashing algorithm is
irregular, the router network provides superior performance
over the Xnet [17]. We have also observed that the
ratio of unit floating-point computation time to unit
communication time (through the Xnet or through a
contention-free router) is approximately 1 [8]. This suggests
partitioning the data carefully so that both the computation and
communication capabilities of the architecture are fully
utilized.

4.3  Partitioning  and  Mapping 

Three procedures, Compute_Keys(), Vote(), and
Compute_Winner(), shown in Figures 9, 10, and 11 respectively,
correspond to the steps described in the
Parallel_Probe() procedure (i.e. Figure 3). The
Compute_Keys() procedure computes the transformed coordinates
of the scene points and quantizes them according
to a hash function f(). The transformed and quantized
coordinates are stored in NEWFP[]. We use the same
hash function as in [18]. This hash function distributes
the data uniformly over all the hash bins. The transformed
coordinates are then used as keys to access the
data in the hash table. The Vote() procedure routes
the keys to their corresponding hash locations stored in
PE_g(key). The function g() defines the mapping of the
hash table entries onto the processor array. The locations
in the hash table accessed during voting are stored
in the CANDID[] array. This array is used in computing the
final winner. The size of this array is much smaller than
the size of the hash table stored in each PE.

Next, the Compute_Winner() procedure determines
the model-basis pair having the maximum number of
votes. The winning pair is then sent to the control
processor to perform the final verification.

Based on the above described subtasks of the
Parallel_Probe() procedure, several data mapping and
partitioning strategies are developed, which affect the overall
execution time of the recognition phase.

Compute_Keys(FP, P)

In parallel for all PE_i, 0 ≤ i ≤ P − 1, do
for r = 0 to S/P − 1

-  Compute the transformed coordinate of the
feature point FP[r] relative to the basis
and store it in NEWFP[r].

-  Quantize NEWFP[r] using hash function f().
next r

Parallel-end

end

Figure 9: A parallel procedure to compute the coordinates
of feature points of the scene.


Vote(FPNEW, P)

In parallel, each PE_i, 0 ≤ i ≤ P − 1, do
k := 0;

for j = 0 to S/P − 1

-  Send a vote (additive write) to processor and
location specified by g(NEWFP[j]).

-  If a vote is received, copy the contents of the
corresponding hash-table entry in the array
CANDID[k++].

next j
Parallel-end
end

Figure 10: A parallel procedure to vote for the possible
presence of a model in the scene.
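
The voting step of Figure 10 can be sketched as a serial simulation of the processor array. This is only an illustration under stated assumptions: the mapping g() is taken to be `key % num_pes`, the hash table is a Python dict from integer keys to lists of (model, basis) pairs, and the function name `vote` and its return layout are hypothetical, not the authors' implementation.

```python
def vote(new_fp, hash_table, num_pes):
    """Simulate Vote(): new_fp[i] is the list of keys held by PE i.

    Returns candid, where candid[p] lists ((model, basis), vote_count)
    for every hash table entry stored on PE p that received a vote.
    """
    def g(key):
        # Assumed mapping of hash table locations onto PEs.
        return key % num_pes

    # Additive write: each owner PE accumulates a counter per location.
    counters = [dict() for _ in range(num_pes)]
    for pe_keys in new_fp:
        for key in pe_keys:
            owner = g(key)
            counters[owner][key] = counters[owner].get(key, 0) + 1

    # Every voted location contributes its (model, basis) pairs to CANDID.
    candid = [[] for _ in range(num_pes)]
    for owner, per_key in enumerate(counters):
        for key, count in sorted(per_key.items()):
            for pair in hash_table.get(key, []):
                candid[owner].append((pair, count))
    return candid
```

If all keys map to the same owner, a single counter dict receives every write, which is the serial analogue of the contention shown in Figure 4.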

In the following we present four algorithms with which we
have experimented. These algorithms differ with respect
to the partitioning and mapping of the hash table onto
the processor array. Various strategies are employed to
take into account practical considerations, such as available
memory in each PE, processor speed, and I/O speed
of the machines. In Algorithm A and Algorithm B,
we assume that each processor is assigned distinct
hash table locations. In Algorithm C, each sub-array
of processors is assigned a complete copy of the hash
table. Each processor in a sub-array of size s², where
1 ≤ s ≤ √P, has distinct entries of the hash table.

The case of a large number of processors is considered
next. Algorithm D performs concurrent processing of
multiple probes of the recognition phase. The array
is divided into disjoint sets of S processors. Each set of
PEs processes a probe using a basis (a different basis for
each set).
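
The sub-array mapping used in Algorithm C can be sketched as follows. The sketch assumes a √P × √P mesh tiled by s × s sub-arrays, each holding one complete copy of the hash table, with entries dealt out to the s² PEs of a sub-array by `key % (s * s)`. That distribution rule and the function names are assumptions made for illustration.

```python
def owner_in_subarray(row, col, key, s):
    """Return the (row, col) of the PE, inside the same s x s sub-array
    as PE (row, col), that stores hash table location `key`."""
    base_row = (row // s) * s          # top-left corner of the sub-array
    base_col = (col // s) * s
    local = key % (s * s)              # assumed distribution within a copy
    return base_row + local // s, base_col + local % s


def num_table_copies(num_pes, s):
    """Number of complete hash table copies held by the array."""
    return num_pes // (s * s)
```

Because the returned PE always lies in the sender's own sub-array, a key never travels farther than the sub-array diameter, at the cost of storing `num_table_copies` copies of the table.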

Algorithm A:

•  Execute the Comp_Keys() procedure serially in the
control processor and broadcast the encoded points (keys)
to each processor in the processor array.

•  Each processor scans through all the keys and accumulates
votes for the keys which correspond to hash table
locations stored in its local memory.

•  Execute the Compute_Winner() procedure.


Compute_Winner()

/*  Each PE is assigned distinct model-basis
pairs to compute the number of votes cast to them. */

In parallel, in each PE_i, 0 ≤ i ≤ P − 1
Send every element of the CANDID[] array to the PE
assigned for computing the total number of votes for
that element.

In parallel, in each PE_i, 0 ≤ i ≤ P − 1
Count the total number of votes for each distinct
(model, basis) pair received and store it in
VCOUNT[model, basis].

In parallel, in each PE_i, 0 ≤ i ≤ P − 1,

Compute the local maximum of the VCOUNT[] array and
store it in local_max.

Compute the maximum of local_max over the
entire processor array.

/*  The (model, basis) pair with the maximum number
of votes is the matched model in the scene. */

end

Figure 11: A parallel procedure to compute the winning
(model, basis) pair.
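
The three stages of the Compute_Winner() procedure (route, count locally, reduce globally) can be sketched serially as below. The sketch assumes integer model and basis identifiers and assigns each pair to a counting PE by `(model + basis) % num_pes`; the paper does not specify this assignment, so it is a stand-in, as is the function name `compute_winner`.

```python
from collections import Counter


def compute_winner(candid, num_pes):
    """candid[i] is the list of (model, basis) pairs collected by PE i.

    Returns ((model, basis), votes) for the pair with the most votes,
    or None if no votes were cast.
    """
    # Stage 1: send every CANDID element to the PE assigned to count it.
    inbox = [[] for _ in range(num_pes)]
    for pairs in candid:
        for model, basis in pairs:
            inbox[(model + basis) % num_pes].append((model, basis))

    # Stage 2: each PE counts its pairs and keeps a local maximum.
    local_max = []
    for received in inbox:
        vcount = Counter(received)
        if vcount:
            pair, votes = max(vcount.items(), key=lambda kv: (kv[1], kv[0]))
            local_max.append((votes, pair))

    # Stage 3: global maximum over all PEs.
    if not local_max:
        return None
    votes, pair = max(local_max)
    return pair, votes
```

Since each distinct pair is counted on exactly one PE, the global maximum over the local maxima is the overall winner; on the real machines stage 3 is the global max operation whose cost grows with network diameter.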

Algorithm B:

•  The control processor broadcasts S/P scene points
to each processor along with a basis pair.

•  Execute the Comp_Keys() procedure.

•  Execute the Sort_Keys() procedure.

•  Execute the Vote() procedure.

•  Execute the Compute_Winner() procedure.

Algorithm C:

In this algorithm, we assume multiple copies of the hash
table stored in the processor array.

•  The control processor broadcasts S/P scene points
to each processor along with a basis pair.

•  Execute the Comp_Keys() procedure.

•  Execute the Sort_Keys() procedure.

•  Execute the Vote() procedure such that the data
search is bounded within its sub-array.

•  Execute the Compute_Winner() procedure.

Algorithm D:

In this algorithm, we assume that the number of PEs
is larger than the number of feature points in the scene.

•  The control processor broadcasts S/P scene points
to each PE.

•  The control processor broadcasts a basis pair to all
PEs in each subarray of size S.

•  Execute the Comp_Keys() procedure.

•  Execute the Sort_Keys() procedure.

•  Execute the Vote() procedure.

•  Execute the Compute_Winner() procedure.

4.4  Experiments  and  Summary  of  Performance 
Results 

We have used a synthesized model database containing
1024 models, each model consisting of 16 randomly
generated points. These points are generated according
to a Gaussian distribution of zero mean and unit standard
deviation. The models are allowed to undergo only
rigid transformations. However, other transformations
do not affect the performance of the parallel
algorithm. Similarly, scene points are synthesized using
a normal distribution. We apply the equalization techniques
given in [18] to the transformed coordinates,
i.e., for each transformed point (u, v), the following
hash function is applied.

f(u, v) = (1 − e^{…}, atan2(v, u))

The above hash function uniformly distributes the
data over the hash space such that the average hash bin
length is constant. We assume a database of 1024 models
with 16 points in each model. This gives a hash table
size of 4M entries. Each entry may consist of several
(model, basis) pairs. We have experimented with various
data granularities in the hash table, with average
bin lengths of 1, 4, 8, 16 and 32. These granularities
can be chosen according to the local memory available
within each PE. We have executed these algorithms on
various sizes of CM-5 and MP-1, both in terms of the number
of PEs in the array and the local memory available within
each PE. Contrary to the results reported in [18], we
claim that for a single probe of the recognition phase,
machine sizes larger than S would degrade the time
performance. This is because interconnection
networks with larger diameter take more time to
perform global operations. We also show, in Algorithm
D, that larger machines can be used for concurrent
processing of multiple probes of the recognition phase.
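
The text does not state how the 4M figure is derived from 1024 models of 16 points. The short calculation below shows two possible countings as an assumption-laden check: a loose count of 16³ entries per model, which gives exactly 4M = 2²², and a count over ordered basis pairs with the remaining points, which gives about 3.4M. The function names are hypothetical.

```python
MODELS = 1024
POINTS_PER_MODEL = 16


def loose_entry_count(models, n):
    """Upper-bound count: n choices for each basis point and for the
    hashed point, without excluding repeats (an assumption)."""
    return models * n ** 3


def ordered_basis_entry_count(models, n):
    """Ordered basis pair (n * (n - 1)) times the n - 2 remaining points."""
    return models * n * (n - 1) * (n - 2)


def entries_per_pe(total_entries, num_pes):
    """Entries each PE must store when locations are spread evenly
    (ceiling division)."""
    return -(-total_entries // num_pes)
```

Under the loose count, a 256-processor array with distinct locations per PE would hold 16,384 entries per PE before multiplying by the average bin length.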

In the following, we tabulate our results for partitioning
algorithms A, B, C, and D. Raw timing data are
included in Appendix A. Table 1 presents execution
times of various subtasks using the partitioning algorithms
described in the previous section. Algorithm A
addresses the congestion and contention problem by
computing the keys in the control processor/array control
unit at the cost of redundant processing in each
processor in the array. As shown in Table 1, for Algorithm A,
the computation time in the control processor becomes
the dominating factor in the overall execution time. We
did not execute Algorithm A on MP-1 because of the
large computation time and insufficient memory available
within the control unit. We were unable to execute
Algorithm B on various sizes of MP-1, as the smallest
MP-1 consists of 1K processors. The performance of
Algorithms A, B, and C on various sizes of CM-5 and MP-1
is shown in Table 2.


Larger MP-1s have been used for concurrent processing
of multiple probes (Algorithm D), and results are
reported in Table 3. Several interesting observations on
the interplay between various components of the MP-1
architecture can be made from this table. As the number
of PEs increases with the number of concurrent probes,
it affects various components of the execution time. For
example, larger machines have a larger diameter,
implying more time for global operations. On the other
hand, a larger machine reduces the load on each
processor, hence less time is spent on local operations.

Figures 12 and 14 show the performance of Algorithms
B and C. The bin access time and voting time decrease
linearly as the number of copies of the hash table in
the processor array increases. The hash bin access time
refers to the time taken to access hash bins corresponding
to feature points in the scene. The voting time
corresponds to routing the information within each voted
bin, the (model, basis) pairs, to compute the local maximum for
each pair. In Figs. 13 and 15, we simulate worst-case and
semi-worst-case scenarios. For the worst case, we assume
that all the scene points hash to locations stored in a
single PE, and in the semi-worst case, all the keys hash
to locations stored in a small subset of PEs. The results
show the performance of various partitioning strategies
adopted in Algorithms B and C. In the case of hash bin
access, the access time decreases linearly with the
increase in the number of hash table copies resident in
the processor array. On the other hand, beyond a certain
number of copies of the hash table, the voting time
starts increasing (see Fig. 15). This is due to the
increased network traffic generated by the larger number of
copies.


Figure  12:  Hash  bin  access  time  vs  Number  of  hash  table 
copies  for  Algorithm  B  and  Algorithm  C  on  a  512  PE 
CM-5. 

In Table 4, we compare our results with those reported
in [1; 18]. We assume no hash table folding, symmetries,
or partial histogramming on the hash table data.
Our serial implementation shows that one probe of the
recognition phase takes about 13.4 seconds on a SUN
SPARC2 operating at 25 MHz with 32 Mbytes of on-board
RAM.

Figure 13: Hash bin access time vs Number of hash table
copies for Algorithm B and Algorithm C simulating
worst-case scenario on a 512 PE CM-5.


Figure  14:  Voting  time  vs  Number  of  hash  table  copies 
for  Algorithm  B  and  Algorithm  C  on  a  512  PE  CM-5. 


4.4.1  Performance Comparison: MP-1 vs CM-5

During the implementation of the geometric hashing algorithm,
we experimented with various aspects of the MP-1
and CM-5. Usually, SIMD machines employ fine grained
massive parallelism and computationally less powerful
processing elements. On the MP-1, we could access machines
with up to 16K processors. However, each processor
has a 4-bit ALU, and it takes 2.51 msec to encode a
scene point. The encoding process comprises approximately
7 floating point operations and 5 integer arithmetic
operations. On the other hand, SPMD (synchronized
MIMD) machines employ coarse grain parallelism
with powerful processors as processing nodes. On the CM-5,
it takes 0.0825 msec to encode a scene point. However,
communication intensive subtasks perform poorly on CM-5.
For example, computing the maximum of data elements, one
element per processor, takes 0.32 msec on a 512 processor
CM-5, and it takes 0.055 msec on a 1024 processor
MP-1. If the communication pattern is irregular, such as
in the voting process, performance of SIMD machines
degrades drastically. Such computation and communication
characteristics suggest the use of SIMD and
SPMD machines in applications with varied characteristics.
Traditionally, SIMD machines are known to be
well-suited for low level vision operations. However, the
additional router network of the MP-1 motivates its use
for applications with moderate computational
needs and regular global communication patterns. Several
heuristic techniques in high level vision fall in this
category. SPMD machines, such as CM-5, are suitable
for applications with high computational needs and moderate
global communication requirements. In addition,
in the absence of efficient data partitioning and routing
techniques, the performance of such machines degrades
for applications with local neighborhood communication
requirements. The memory available with each processor
also affects the usage of the underlying architecture.
Traditionally, due to limitations of VLSI and cost
considerations, the memory available within each processor is
smaller in SIMD machines than in SPMD
machines. Limited memory can affect the performance
of applications which require storage of a large volume of
on-line data.

Figure 15: Voting time vs Number of hash table copies
for Algorithm B and Algorithm C simulating worst-case
scenario on a 512 PE CM-5.

Figure 16: Total execution time vs CM-5 Machine size
for various algorithms.
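
The trade-off quoted in this subsection can be made concrete with a small calculation using only the four timings stated in the text. The dictionary and function names are illustrative, and the two global-max timings were measured on machines of different sizes (512 versus 1024 processors), so the second ratio is indicative rather than a like-for-like comparison.

```python
# Timings in msec, as quoted in the text.
ENCODE_MSEC = {"MP-1": 2.51, "CM-5": 0.0825}
GLOBAL_MAX_MSEC = {"CM-5 (512 PEs)": 0.32, "MP-1 (1024 PEs)": 0.055}


def ratio(slower, faster):
    """How many times longer the slower operation takes."""
    return slower / faster


# Per-point encoding: the CM-5 node is roughly 30 times faster.
encode_ratio = ratio(ENCODE_MSEC["MP-1"], ENCODE_MSEC["CM-5"])

# Global maximum: the MP-1 is roughly 6 times faster.
global_max_ratio = ratio(GLOBAL_MAX_MSEC["CM-5 (512 PEs)"],
                         GLOBAL_MAX_MSEC["MP-1 (1024 PEs)"])
```

The opposite signs of the two advantages are what motivate trading redundant computation for less communication on the CM-5 and balancing both on the MP-1.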

5  Conclusion 

We have presented scalable data parallel algorithms for
geometric hashing. Based on these algorithms, we have
obtained fast parallel implementations. The implementations
achieve much better time performance than
those known in the literature. These implementations
were developed after carefully studying the characteristics
of the underlying architectures of CM-5 and MP-1,
i.e. fat tree and mesh array, respectively. Various
experiments were conducted to fine tune the partitioning and
the mapping strategies to suit the communication and
the computation capabilities of these machines. Based
on these experiments, data parallel algorithms were
designed to efficiently utilize the architectural and
programming features. This experimentation has assisted
in achieving a uniform distribution of work load in the
machines during the execution of the algorithms, leading to fast
and scalable implementations. Our early work in using
CM-5 and MP-1 for high level vision is very encouraging
and promises well for the application of
parallel processing techniques in realizing real time
integrated vision systems.
Partitioning Algorithm | Machine Type | Encoding Scene Points | Hash Bin Access | Voting | Computing Local Maximum of Votes | Computing Global Maximum of Votes | Total Time

CM-5

CM-5

CM-5

MP-1


Table 1: Execution times (in msec) of various subtasks in a probe using different partitioning algorithms for a scene
consisting of 1024 feature points on a 256-processor CM-5 and a 1K MP-1.


Machine Size/Type | Algorithm A (in msec) | Algorithm B (in msec) | Algorithm C (in msec)


25.48 


64/CM-5  f  61.04  19.75  16.02 


50.02  10.77  8.96 


41.45  6.68  5.74 


51.74  4.41 


XX  48.76  32772 


Table  2:  Execution  times  (in  msec)  of  various  algorithms  on  a  scene  consisting  of  1024  feature  points. 


Number of Probes | Machine Size | Encoding Scene Points | Hash Bin Access | Voting | Computing Local Max of Votes | Computing Global Max of Votes | Total Time


55 


Table 3: Execution times (in msec) of Algorithm C on a scene consisting of 1024 feature points with concurrent
processing of multiple probes on various sizes of MP-1. Average bin size is 8.


# of Models (16 points/model) | Size/Type | # of Scene Points

Medioni et al.

8K/CM-2

8K/CM-2

6.68 msec

5.74 msec

800 msec

2.0-3.0 sec


Table  4:  Comparison  with  previous  implementations. 

‘Each  processor  in  CM-5,  CM-2,  and  MP-1  operates  at  32MHz,  7MHz,  and  12.5MHz  respectively. 
Abstract

This paper presents a new robust low-computational-cost
system for recognizing free-form objects in 3D range
data or in 2D curve data in the image plane. Objects are
represented by implicit polynomials (i.e., 3D algebraic
surfaces or 2D algebraic curves) of degrees greater than
2, and are recognized by computing and matching vectors
of their algebraic invariants (which are functions of
their coefficients that are invariant to translations, rotations,
and general linear transformations). Implicit polynomials
of 4th degree can represent complicated asymmetric
free-form shapes. This paper deals with the design
of Bayesian (i.e., minimum probability of error) recognizers
for these models and their invariants, resulting
in low computational cost recognizers that are robust to
noise, partial occlusion, and other perturbations of the
data sets. This work extends the work in [4] by developing
and using new invariants for 3D surface polynomials
and applying the Bayesian recognizer to operate
on invariants. The recognizer seems to be ideally suited
to robot vision, handprinted character recognition, ATR
when used with Kimia's partitioning algorithms [10,
5], and other applications.

1  Introduction 

The simplest 2D or 3D recognition problem is that a
boundary model is stored in a database for each of L
rigid objects. (By an object boundary, we mean the
3D object surface, and external and internal boundary
curves for a 2D object.) Data along the entire
boundary or over a portion of the boundary of an
object to be recognized is collected from a sensor.
Object recognition is to be realized by determining
the stored boundary model that fits the sensed data
the best. By best, we mean in the sense of minimum
mean squared distance from the data points to
a boundary model. This should produce the recognizer
functioning with the highest relative frequency
of correct recognition. What are the drawbacks to
this approach? There are two, both computational.
First, if there are N data points, on the order of
NL computations must be made for checking the
mean square fits of L stored object boundaries to N
data points. This can be considerable if L is large.
Second, the position of the object being sensed
will be different from the position of the object in the
database, and the sensed data may also have undergone
a general linear transformation, due, e.g., to the
viewing of a target boundary on the ground from an
arbitrary aerial viewing direction. Hence, an object
in the database has to be rotated and translated in
checking its match to the sensed data. This usually
means checking a boundary model fit to the data for
the boundary model moved to many different positions
and linearly transformed, thus incurring a huge
amount of computation. The approach presented in
this paper avoids both these drawbacks.

2  Recognition  Approach 

The approach in this paper is to model 3D and 2D
objects of interest by algebraic surfaces or curves,
respectively, i.e., by the zero sets of implicit polynomials.
The zero set is the set of points (x, y) in 2D
(or (x, y, z) in 3D) for which the polynomial function
f(x, y) on 2D (or f(x, y, z) on 3D) is zero. Then, a
stored model is simply the set of coefficients for the
polynomial model. These are global 3D models, unlike
explicit polynomials where z is given as an explicit
function of x and y as in a depth map. Most
of the early work on implicit polynomial curves and
surfaces was limited to quadrics, thus dealing with
representations that had modest expressive power.
Implicit polynomials of degree greater than 2, on the
other hand, have great modeling power for complicated
objects and can be fit to data very well. In [?],
Taubin, formerly of our laboratory, has presented a
very well organized and understandable introduction
to these polynomials and some of their properties,
and developed very effective approaches to low
computational cost algorithms for fitting these polynomials.
Hence we can now use these high degree implicit
polynomials for representing complicated objects.
In addition, we have developed a technique for
fitting polynomials with bounded zero sets, which
results in better and more stable description of objects
[3].
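
As a small illustration of the representation described above, the sketch below stores a 2D implicit polynomial as a dictionary from exponent pairs (i, j) to coefficients a_ij and tests whether a point lies on its zero set. The dictionary layout, the function names, and the tolerance are assumptions made for the example; the unit circle is used only because its zero set is easy to check by hand.

```python
def evaluate(coeffs, x, y):
    """Evaluate f(x, y) = sum of a_ij * x**i * y**j over the stored terms."""
    return sum(a * x ** i * y ** j for (i, j), a in coeffs.items())


def on_zero_set(coeffs, x, y, tol=1e-9):
    """True if (x, y) satisfies f(x, y) = 0 up to a numerical tolerance."""
    return abs(evaluate(coeffs, x, y)) <= tol


# Stored model for the unit circle: f(x, y) = x^2 + y^2 - 1.
UNIT_CIRCLE = {(2, 0): 1.0, (0, 2): 1.0, (0, 0): -1.0}
```

The stored model is just the coefficient dictionary, which is what makes comparison against L database entries cheap once a polynomial has been fitted to the data.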

For the implicit polynomial models, checking the
fit of a stored surface or curve to data involves fitting
an implicit polynomial to the data, and then comparing
the resulting polynomial coefficients with the
L coefficient vectors (one for each object) stored in
the database. If the object to be recognized is in a
different position than the object in the database,
the coefficients for the best fitting polynomial to the
data will be different from the coefficients for the
same object in the database. Our solution to this
problem is to use a vector of algebraic invariants for
a recognizer. An algebraic invariant is a function of
the implicit polynomial coefficients that is invariant
to rotations and translations for 3D surfaces and is
invariant to translations and general linear transformations
for 2D curves. The required computation
for comparing the set of coefficients or the set of
invariants is roughly of the order of the number of data
points (which is the amount of computation required
to fit an implicit polynomial to the data set).
Unfortunately, the situation is not quite so simple, and
a problem arises here. This paper presents the solution
to the problem and the resulting recognizer.

The  problem  that  had  to  be  solved  is  that  small 
changes  in  a  data  set  often  result  in  large  changes 
in  the  coefficients  of  the  best  fitted  polynomial,  and, 
hence,  large  changes  in  the  algebraic  invariants.  The 
reason  for  this  variability  is  due  to  the  fact  that  the 
data  used  in  fitting  the  polynomials  provides  con¬ 
straints  among  the  coefficients  of  a  fitted  polyno¬ 
mial,  but  the  data  may  be  insufficient  to  uniquely 
determine  the  coefficients.  Hence,  since  the  fitted 
curve  and  stored  curve  coefficients  may  differ  greatly, 
we  cannot  compare  the  curves  over  the  local  region 
of  interest  based  on  their  coefficients  or  the  invari¬ 
ants,  which  are  functions  of  these  coefficients.  Our 
solution  to  this  problem  is  to  treat  recognition  as 
Bayesian  statistical  recognition  in  the  presence  of 
noisy  data,  and  to  use  certain  asymptotic  results 
whi^  permit  a  computationally  low  cost  recognizer. 
The  resulting  recognizer  involves  comparison  of  the 
measured  vector  of  invariants,  but  not  using  the  Eu¬ 
clidean  distance.  Rather,  the  error  measure  requires 
the  use  of  a  weighting  matrix  which  is  a  function  of 
the  specific  data  set  being  recognized.  The  required 
computation  for  computing  this  matrix  is  of  the  or¬ 
der  of  the  number  of  data  points,  and  hence,  modest 
and  suitable  for  read  time  recognition.  The  beauty 
of  this  approau:h  is  that  even  though  it  uses  globad 
models  —  implicit  polynomials  and  coefficient  vec¬ 
tors  —  it  behaves  as  though  recognition  is  based 
on  the  matching  of  a  local  data  set  to  a  boundary 
model.  Hence,  it  works  excellently  even  if  the  data 
set  is  over  only  a  portion  of  the  object  boundary, 
which  will  be  the  case  due  to  self  occlusion  if  range 
data is taken for a 3D object from one direction, or
which  may  be  the  case  if  one  or  more  objects  are 
partially  occluding  the  object  to  be  recognized. 

The  known  invariants  in  the  mathematical  liter¬ 
ature  are  affine  invariants  (i.e.,  quantities  that  are 
invariant  under  translations,  rotations  and  stretch¬ 
ings  along  the  x,  y  and  z  axes)  that  are  functions  of 
only  the  leading  form.  The  leading  form  is  the  part 
of the polynomial that contains the terms of the highest
degree. For example, $a_{20}x^2 + a_{11}xy + a_{02}y^2$ is
the leading form of the second degree implicit polynomial
$f(x,y) = a_{20}x^2 + a_{11}xy + a_{02}y^2 + a_{10}x + a_{01}y + a_{00}$
in 2D. We extend this in [8, 2, 4,
9]  where  new  large  classes  of  affine  and  Euclidean 
invariants  of  all  the  coefficients  in  a  polynomial 
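are introduced. An example of a new affine invariant
for a fourth degree polynomial in x, y, z,
$f(x,y,z) = \sum_{0 \le i+j+k \le 4} a_{ijk}\, x^i y^j z^k$, found using our
symbolic method for discovering new algebraic invariants [4, 2], is

\[
\begin{aligned}
& 216\,a_{121}^2 a_{202} - 648\,a_{112}a_{130}a_{202} + 1296\,a_{040}a_{202}^2 \\
& - 108\,a_{112}a_{121}a_{211} + 972\,a_{103}a_{130}a_{211} - 648\,a_{031}a_{202}a_{211} \\
& + 216\,a_{022}a_{211}^2 + 216\,a_{112}^2 a_{220} - 648\,a_{103}a_{121}a_{220} \\
& + 432\,a_{022}a_{202}a_{220} - 648\,a_{013}a_{211}a_{220} + 1296\,a_{004}a_{220}^2 \\
& - 3888\,a_{040}a_{103}a_{301} + 972\,a_{031}a_{112}a_{301} - 648\,a_{022}a_{121}a_{301} \\
& + 972\,a_{013}a_{130}a_{301} + 972\,a_{031}a_{103}a_{310} - 648\,a_{022}a_{112}a_{310} \\
& + 972\,a_{013}a_{121}a_{310} - 3888\,a_{004}a_{130}a_{310} + 1296\,a_{022}^2 a_{400} \\
& - 3888\,a_{013}a_{031}a_{400} + 15552\,a_{004}a_{040}a_{400}
\end{aligned}
\]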

Experimental  results  on  using  these  invariants  for 
object  recognition  are  given  in  section  4. 


3 Asymptotic parameter distribution and Bayesian Recognition

Let $\alpha$ denote the vector of coefficients of the polynomial
$f(x,y,z)$ that describes the given object. We
assume that the range data points $Z_1, Z_2, \ldots, Z_N$
are statistically independent, with $Z_i$ having probability
density function (pdf)

\[
p(Z_i \mid \alpha) = \frac{1}{\sqrt{2\pi\sigma^2}}
\exp\left\{ -\frac{1}{2\sigma^2}\, \frac{f^2(Z_i)}{\|\nabla f(Z_i)\|^2} \right\}. \tag{1}
\]


The assumption is that $Z_i$ is a noisy Gaussian measurement
of the object surface in the direction perpendicular
to the boundary at its closest point.
This model is introduced and discussed in [1, 6,
4]. Thus, the joint probability of the data points
$Z^N = \{Z_1, \ldots, Z_N\}$ is

\[
p(Z^N \mid \alpha) = \frac{1}{(2\pi\sigma^2)^{N/2}}
\exp\left\{ -\frac{1}{2\sigma^2} \sum_{i=1}^{N} \frac{f^2(Z_i)}{\|\nabla f(Z_i)\|^2} \right\}. \tag{2}
\]

Being  able  to  write  this  joint  probability  for  a  data 
set  in  terms  of  a  complicated  curve  or  surface  is  an 
important  result  and  permits  the  application  of  a 
large  range  of  tools  from  statistics  and  probability 
theory. The maximum likelihood estimate $\hat{\alpha}_N$ of $\alpha$
given the data points is the value of $\alpha$ that maximizes
(2). A very useful tool for solving the problems
of object recognition and parameter estimation
is an asymptotic approximation to the joint likelihood
function, (2), which can be shown to have a
Gaussian shape in $\alpha$ [1, 6], i.e.,

\[
p(Z^N \mid \alpha) \approx p(Z^N \mid \hat{\alpha}_N)\,
\exp\left\{ -\tfrac{1}{2} (\alpha - \hat{\alpha}_N)^t\, \Psi_N\, (\alpha - \hat{\alpha}_N) \right\} \tag{3}
\]

where $\Psi_N$ is the second derivative matrix having
$i,j$th component $-\frac{\partial^2}{\partial \alpha_i\, \partial \alpha_j} \ln p(Z^N \mid \alpha)$.

Hence, all the useful information about $\alpha$ is summarized
in the quadratic form in the exponent of
equation (3). If $\Psi_N$ is not singular, then it is the
inverse covariance matrix of $\hat{\alpha}_N$. The matrix $\Psi_N$ is
called the Fisher Information matrix of $\hat{\alpha}_N$. Various
extremely useful generalizations of (3) are developed
in [6].

The  asymptotic  approximation  (3)  gives  an  under¬ 
standing  of  the  extent  to  which  the  data  constrains 
the  coefficients  of  the  best  fitting  polynomial  [4]. 


The  next  section  deals  with  using  this  approxima¬ 
tion  for  designing  a  metric  based  on  the  geometric 
invariants  for  comparing  two  polynomial  zero  sets 
over  the  region  where  the  data  exists. 
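
As a rough illustration of equation (2), the sketch below evaluates the log of the joint likelihood for a 2D implicit curve. The unit circle used as the model, the function names, and the noise level `sigma` are assumptions chosen for illustration; they are not taken from the paper.

```python
import numpy as np

def circle_f(points):
    # Hypothetical model: f(x, y) = x^2 + y^2 - 1 (unit circle).
    return points[:, 0] ** 2 + points[:, 1] ** 2 - 1.0

def circle_grad(points):
    # Gradient of f at each point: (2x, 2y).
    return 2.0 * points

def log_likelihood(points, f, grad, sigma):
    # Log of equation (2): every data point contributes a Gaussian term in
    # the first-order distance |f| / ||grad f|| to the zero set of f.
    d2 = f(points) ** 2 / np.sum(grad(points) ** 2, axis=1)
    n = len(points)
    return -0.5 * n * np.log(2.0 * np.pi * sigma ** 2) - d2.sum() / (2.0 * sigma ** 2)

# Points on the circle score higher than points displaced off it.
t = np.linspace(0.0, 2.0 * np.pi, 50, endpoint=False)
on_curve = np.stack([np.cos(t), np.sin(t)], axis=1)
off_curve = 1.1 * on_curve
ll_on = log_likelihood(on_curve, circle_f, circle_grad, sigma=0.05)
ll_off = log_likelihood(off_curve, circle_f, circle_grad, sigma=0.05)
```

The matrix $\Psi_N$ of equation (3) is the negative Hessian of this function with respect to the polynomial coefficients; the sketch fixes the coefficients and so does not compute it.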

3.1 Mahalanobis distance between two sets of Invariants

The scenario for recognition that we consider in this
paper is one where we have a set of objects labeled
$l = 1, 2, \ldots, L$ in the database, all modeled by polynomials
of the same degree. Let $G_l$ denote the
vector of invariants for the polynomial describing
object $l$. Then, given a set of range data points,
$Z^N = \{Z_1, Z_2, \ldots, Z_N\}$, the optimum recognition
rule is 'choose $l$ for which $p(Z^N \mid G_l)$ is maximum'.
Thus, the recognition problem reduces to computing
the likelihood of the data given $G$. In [6], we have
shown that

\[
p(Z^N \mid G) \approx \left[ p(Z^N \mid \hat{\alpha}_N) \right] (2\pi)^{d_\nu/2}\, |\Psi^\nu_N|^{-1/2}
\exp\left\{ -\tfrac{1}{2} (G - \hat{G}_N)^t\, \Psi^G_N\, (G - \hat{G}_N) \right\} \tag{4}
\]

where $\Psi^G_N$ and $\Psi^\nu_N$ are the Information matrices
of the vector of invariants and a vector of nuisance
parameters, respectively, and $d_\nu$ is the number of
nuisance parameters.

Using (4) for the simplest case of recognition,
the optimum recognition rule becomes 'choose
$l$ for which the Mahalanobis distance,
$(G_l - \hat{G}_N)^t\, \Psi^G_N\, (G_l - \hat{G}_N)$, is minimum.' This is because
the only part of (4) that is a function of $l$
is $\exp\{ -\tfrac{1}{2} (G_l - \hat{G}_N)^t\, \Psi^G_N\, (G_l - \hat{G}_N) \}$.

The  beauty  of  this  recognizer  is  that  the  computa¬ 
tional  cost  is  negligible,  but  the  recognizer  is  equiv¬ 
alent  to  checking  how  well  the  data  fits  the  models 
stored  in  the  database  for  different  linear  transfor¬ 
mations  of  the  models  for  which  the  computational 
cost  is  enormous. 

In  summary,  object  recognition  using  invariants  is 
done  as  follows. 

1.  Fit  the  best  polynomial  to  the  data  set. 

2. Compute the invariants $\hat{G}_N$, which are functions
of the coefficients of the polynomial.

3. Compute the Mahalanobis distance,
$(G_l - \hat{G}_N)^t\, \Psi^G_N\, (G_l - \hat{G}_N)$, to each object in the
database and pick the $l$ for which it is a minimum.
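
Step 3 of the procedure above can be sketched as follows. This is a minimal illustration, assuming the invariants of the fitted polynomial and the weighting matrix have already been computed; the function name `recognize` and the toy numbers are hypothetical.

```python
import numpy as np

def recognize(g_hat, psi_g, database):
    """Pick the database object with the smallest Mahalanobis distance.

    g_hat    : invariants computed from the polynomial fitted to the data
    psi_g    : weighting (information) matrix for the invariants
    database : dict mapping an object label to its stored invariant vector
    """
    def mahalanobis(g_model):
        diff = np.asarray(g_model, dtype=float) - g_hat
        return float(diff @ psi_g @ diff)

    distances = {label: mahalanobis(g) for label, g in database.items()}
    best = min(distances, key=distances.get)
    return best, distances

# Toy case: the data constrain the first invariant tightly and the second
# one only weakly, so a large error in the second invariant costs little.
g_hat = np.array([0.0, 0.0])
psi_g = np.diag([100.0, 0.01])
database = {"A": [0.5, 0.0], "B": [0.0, 3.0]}
best, distances = recognize(g_hat, psi_g, database)
```

In this toy case the Euclidean distance would prefer "A" (0.5 against 3.0), whereas the weighted distance prefers "B", because the weighting matrix discounts the direction that the data leave poorly determined.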

The computational cost for step 1 is linear in the
number of data points, and typically is a fraction of
a second on a SPARC 10 for 200 data points and
a 4th degree implicit polynomial curve. In step 2,
invariants such as that in section 2 must be computed.
Computation time for 5 to 10 of these is less
than that for step 1. The only time-consuming computation
in step 3 is the computation of $\Psi^G_N$. This
matrix is given by

\[
\Psi^G_N = \left( (DG)^{\dagger} \right)^t\, \Psi_N\, (DG)^{\dagger}
\]

where $\dagger$ denotes the pseudo-inverse and $DG$ is the Jacobian
of the transformation from $\alpha$ to $G$. The
costly computation here is that for $\Psi_N$, which is linear
in the number of data points and is of the order
of the computation in step 1. Hence, the computation
of the Mahalanobis distance for 200 data points
and a 4th degree curve is a fraction of a second on
a SPARC 10 and can be sped up by orders of magnitude
with parallel architectures.
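
The weighting matrix for the invariants can be formed from the coefficient information matrix and the Jacobian as sketched below. This is an illustrative sketch only: the central-difference Jacobian is an assumption (an analytic Jacobian could be used instead), and the function names are hypothetical.

```python
import numpy as np

def numerical_jacobian(invariants_fn, alpha, eps=1e-6):
    # Central-difference Jacobian DG of the map alpha -> G(alpha).
    alpha = np.asarray(alpha, dtype=float)
    g0 = np.asarray(invariants_fn(alpha), dtype=float)
    jac = np.zeros((g0.size, alpha.size))
    for j in range(alpha.size):
        step = np.zeros_like(alpha)
        step[j] = eps
        g_plus = np.asarray(invariants_fn(alpha + step), dtype=float)
        g_minus = np.asarray(invariants_fn(alpha - step), dtype=float)
        jac[:, j] = (g_plus - g_minus) / (2.0 * eps)
    return jac

def invariant_information(jac, psi_alpha):
    # Psi^G = (DG^+)^t Psi_N (DG^+), with ^+ the Moore-Penrose pseudo-inverse.
    jac_pinv = np.linalg.pinv(jac)
    return jac_pinv.T @ psi_alpha @ jac_pinv

# Sanity check on an invertible case: the result should equal the inverse
# of the propagated covariance DG * inv(Psi_N) * DG^t.
jac = np.array([[2.0, 0.0], [0.0, 1.0]])
psi_alpha = np.diag([4.0, 9.0])
psi_g = invariant_information(jac, psi_alpha)
```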

Experimental  results  illustrating  the  use  of  the 
Mahalanobis distance for recognition are given in the
next  section. 

4  Experimental  Results 

The  experiments  illustrate  the  use  of  the  Maha¬ 
lanobis  distance  in  the  space  of  invariants  for  recog¬ 
nizing  2D  and  3D  objects  from  real  data  that  may 
be  partial  and  that  is  noisy.  The  experiments  also 
illustrate  the  fact  that  the  Mahalanobis  distance  has 
better  discriminatory  power  than  does  the  Euclidean 
distance. 

The first set of experiments illustrates the performance
of the recognizer for 3D objects. The objects
in this experiment are keyboard mice. Figure
1 shows the four mice used in this experiment. Figures
2(a)-(d) are the data sets and the polynomial
fits for the mice in standard position. (The polynomial
fits were obtained using our approach for fitting
bounded polynomials.) The data sets were obtained
using the Brown and Sharpe Microval manual coordinate
measuring machine. All the data sets are well
fit by fourth degree polynomials in x, y, z. These
are the four objects in the database.

Figures 3(a)-(d) are the data sets and polynomial
fits for the rotated and translated versions of the
mice in the database. We used 7 invariants for a
fourth degree polynomial in x, y, z. All of them
are listed in [2]. The goal in this experiment is to
recognize the mice in Figures 3(a)-(d) using the Mahalanobis
distance measure and compare the results
with those using the Euclidean distance.

Tables 1 and 2 show the Mahalanobis and the Euclidean
distances, respectively, between the vector of
invariants for the polynomial fits to the rotated mice
in Figure 3 and the vectors of invariants for the four
mice in the database. The Mahalanobis distance
measure discriminates clearly between
the right object and the rest. Also, the Mahalanobis
distance has much better discriminatory power than
does the Euclidean distance.

The  next  experiment  illustrates  the  use  of  the  Ma¬ 
halanobis  distance  for  recognizing  2D  and  3D  objects 
from  partial  data. 

Figure  4  shows  the  partial  data  (with  the  polyno¬ 
mial  fit  superimposed)  for  the  mouse  in  Figure  2(a). 
The  partial  data  in  this  experiment  is  what  a  stereo 
sensor  would  see  when  looking  at  the  mouse  from 
a point near the bottom left corner. The Mahalanobis
and Euclidean distances between the vector
of invariants for the polynomial fit to the occluded
object and the stored vectors of invariants are:

Mahalanobis distance:
Mouse1 : 1.0     Mouse2 : 1065
Mouse3 : 30.31   Mouse4 : 1.004

Euclidean distance:
Mouse1 : 1.0     Mouse2 : 18.39
Mouse3 : 1.619   Mouse4 : 0.901

The Mahalanobis distance to Mouse1 is the smallest.
However, the Mahalanobis distance to Mouse4
is almost the same as that to Mouse1. This is because
the occluded data does not contain the curved
front part of Mouse1, and since that is the part that
really distinguishes Mouse1 from Mouse4, it is hard
to distinguish between them based on the partial
data. The distances to Mouse2 and Mouse3 are large
compared to those to Mouse1 and Mouse4. The Euclidean
distance does not give good recognition results
with partial data. In fact, the Euclidean distance
from the occluded object to Mouse4 is smaller
than that to Mouse1.

The  data  sets  for  the  2D  examples  are  handwrit¬ 
ten  characters.  The  objects  in  the  database  are  the 
handwritten  characters,  ‘a’,  ‘q’,  ‘g’  and  ‘w’,  shown  in 
Figures  5(a)-(d).  The  data  sets  are  well  fit  by  fourth 
degree  polynomials  in  x,  y.  Figure  6(a)  is  another 
instance of the handwritten character ‘w’ that is a
rotated,  translated,  occluded  and  noisy  version  of 
the  one  in  the  database.  We  fit  a  fourth  degree  poly¬ 
nomial  to  the  occluded  object  in  order  to  compare 
its  invariants  with  those  for  the  unoccluded  database 
objects.  Figure  6(b)  is  the  fourth  degree  polynomial 
fit to the data set in 6(a). Three invariants for a
fourth degree polynomial in x, y obtained using our
approach are

1. $3a_{13}^2 - 8a_{04}a_{22} + 2a_{13}a_{31} + 3a_{31}^2 - 32a_{40}a_{04} - 8a_{22}a_{40}$,

2. $3a_{04}^2 + 2a_{04}a_{22} + a_{13}a_{31} + 2a_{04}a_{40} + 2a_{22}a_{40} + 3a_{40}^2$,

3. $a_{22}^2 - 3a_{13}a_{31} + 12a_{04}a_{40}$.


Since the invariants should be independent of multiplication
of the coefficients by a constant, these three
functions yield only two invariants. One set of two
invariants is given by two ratios of these functions. Thus, for object recognition,
we use the Mahalanobis distance between the ratios
of invariants. The Mahalanobis distances from the
letter in Figure 6(a) to the letters ‘a’, ‘g’, ‘q’ and ‘w’
in the database are:

‘a’: 1.00   ‘q’: 12.9   ‘g’: 12.2   ‘w’: 0.81

(The distance to ‘a’ in the database is normalized
to have value 1.0.) The distance to ‘w’ is minimum.
However, the distance is small to ‘a’ and ‘q’. This
is because the data set in 6(a) also fits the model
for ‘a’ and ‘q’, as shown in Figures 6(c) and 6(d).
The experiment illustrates that even under a large
amount of occlusion, the recognizer comes up with
the best possible results.
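
The rotation invariance of the three quartic functions used in the 2D experiment can be checked numerically. The sketch below assumes the reading of the three functions as quadratic expressions in the leading-form coefficients $a_{40}, a_{31}, a_{22}, a_{13}, a_{04}$; the helper names and the test angle are hypothetical.

```python
import numpy as np

def rotate_quartic(a, theta):
    """Coefficients of the leading form after rotating the coordinates.

    a = [a40, a31, a22, a13, a04], where entry m multiplies x^(4-m) y^m.
    A homogeneous form is stored as its coefficient list, so multiplying
    two forms is a convolution of their lists.
    """
    c, s = np.cos(theta), np.sin(theta)
    x_new = np.array([c, -s])   # x -> c*x - s*y
    y_new = np.array([s, c])    # y -> s*x + c*y
    out = np.zeros(5)
    for m, coeff in enumerate(a):
        term = np.array([1.0])
        for _ in range(4 - m):
            term = np.convolve(term, x_new)
        for _ in range(m):
            term = np.convolve(term, y_new)
        out += coeff * term
    return out

def invariants(a):
    a40, a31, a22, a13, a04 = a
    i1 = 3*a13**2 - 8*a04*a22 + 2*a13*a31 + 3*a31**2 - 32*a40*a04 - 8*a22*a40
    i2 = 3*a04**2 + 2*a04*a22 + a13*a31 + 2*a04*a40 + 2*a22*a40 + 3*a40**2
    i3 = a22**2 - 3*a13*a31 + 12*a04*a40
    return np.array([i1, i2, i3])

rng = np.random.default_rng(0)
coeffs = rng.normal(size=5)
before = invariants(coeffs)
after = invariants(rotate_quartic(coeffs, theta=0.7))
```

Scaling all coefficients by a constant multiplies each of the three functions by the square of that constant, which is why only ratios of them are usable as invariants of the zero set.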
TABLE 1 (Mahalanobis distances):

                   Mouse1      Mouse2      Mouse3      Mouse4
Mouse1 (rotated)   1.000e+00   2.463e+01   2.622e+02   2.872e+00
Mouse2 (rotated)   3.517e+02   1.000e+00   3.717e+03   2.818e+03
Mouse3 (rotated)   1.379e+01   1.560e+02   1.000e+00   6.096e+01
Mouse4 (rotated)   3.519e+00   2.109e+01   1.489e+02   1.000e+00

TABLE 2 (Euclidean distances):

                   Mouse1      Mouse2      Mouse3      Mouse4
Mouse1 (rotated)   1.000e+00   1.822e+00   1.351e+01   1.446e+00
Mouse2 (rotated)   2.535e+00   1.000e+00   2.322e+00   1.327e+00
Mouse3 (rotated)   9.614e+00   1.125e+00   1.000e+00   2.435e+01
Mouse4 (rotated)   2.741e+00   1.547e+00   7.568e+00   1.000e+00
Abstract

Neighboring  points  on  a  smoothly  curved  surface  have 
similar  surface  orientations  and  illumination  conditions. 
Hence,  their  brightness  values  can  be  used  to  compute 
the  ratio  of  their  reflectance  coefficients.  Based  on  this 
observation, we develop an efficient algorithm that estimates
a reflectance ratio for each region in an image with
respect  to  its  background.  The  region  reflectance  ratio 
represents  a  physical  property  of  a  region  that  is  invari¬ 
ant  to  the  illumination  conditions.  The  ratio  invariant  is 
used  to  recognize  objects  from  a  single  brightness  image 
of  a  scene.  We  conclude  with  experimental  results  that 
demonstrate  the  power  of  using  reflectance  and  geomet¬ 
ric  properties  of  objects,  simultaneously. 

1  Introduction 

Object  recognition  has  been  an  active  area  of  machine 
vision research for the past two decades [1]. The traditional
approach has been to recover geometric features
from  images  and  then  use  these  features  to  hypothesize 
and  verify  the  existence  of  three-dimensional  objects  in 
the  image.  Edges  and  vertices  are  examples  of  geometric 
features  often  used  by  recognition  systems.  In  the  past, 
little  attention  has  been  given  to  the  use  of  other  phys¬ 
ical  properties  of  objects  for  recognition.  In  addition  to 
its  geometry,  an  object  may  be  characterized  by  intrinsic 
properties  such  as  reflectance,  roughness,  and  material 
type.  Clearly,  the  representation  of  an  object  using  all  of 
these  intrinsic  properties  is  useful  only  if  the  recognition 
system  is  able  to  compute  the  properties  from  images. 

In  this  paper,  we  present  a  method  for  computing  the 
reflectance  of  regions  in  a  scene,  with  respect  to  their 
backgrounds,  from  a  single  image.  The  result  is  a  phys¬ 
ical  property  of  each  scene  region  that  is  invariant  to 
the  intensity  and  direction  of  illumination.  This  photo¬ 
metric  invariant,  referred  to  as  the  reflectance  ratio,  pro¬ 
vides  valuable  information  for  recognition  tasks.  The  re¬ 
flectance  ratios  (photometric  features)  of  object  regions 
and  the  spatial  configuration  (geometric  features)  of  the 
regions  are  used  to  represent  the  object. 

•This  research  was  supported  in  part  by  DARPA  Contract  No. 
DACA  76-92-C-0007  and  in  part  by  the  David  and  Lucile  Packard 
Fellowship.  R.  M.  Bolle  is  supported  by  the  IBM  T.J.  Watson 
Research  Center,  Yorktown  Heights,  N.Y.  10598,  U.S.A. 


The  problem  of  computing  the  reflectance  of  regions 
in  a  scene  was  first  addressed  by  Land  [2].  In  general, 
image  brightness  is  the  product  of  surface  reflectance 
and  illumination.  Land  developed  the  retinex  theory 
that  suggests  computational  steps  for  recovering  the  re¬ 
flectance  (or  lightness)  of  scene  regions  in  the  presence 
of  varying  illumination.  Subsequently,  several  hardware 
implementations  for  the  retinex  theory  were  proposed 
[3],  [4].  The  main  idea  underlying  Land’s  lightness  com¬ 
putation  is  global  consistency.  The  lightness  value  com¬ 
puted  for  any  particular  region  must  be  consistent  with 
those computed elsewhere in the image. However, realistic
images include shadows, occlusions, and noise. Each
one  of  these  factors  can  cause  a  region  boundary  to  go 
undetected  or  the  computed  lightness  of  a  region  to  be 
erroneous.  Such  errors  can  greatly  affect  the  lightness 
values  computed  for  all  other  regions  in  the  image.  For 
this  reason,  Land’s  global  method  is  not  applicable  to 
most  real  images. 

In  this  paper,  we  develop  an  alternative  scheme  for 
computing  the  ratio  of  the  reflectance  of  a  region  to  that 
of  its  background.  The  image  is  first  segmented  into  re¬ 
gions  of  constant  (but  unknown)  reflectance.  Next,  a  re¬ 
flectance  ratio  is  computed  for  each  region  and  its  back¬ 
ground  using  only  points  that  lie  close  to  the  region’s 
boundary.  In  this  approach,  the  reflectance  ratio  com¬ 
puted  for  any  particular  region  is  not  affected  by  those 
computed  for  regions  elsewhere  in  the  image.  Land’s 
analysis  [2]  was  restricted  to  planar  (two-dimensional) 
scenes  with  patches  of  constant  reflectance.  In  contrast, 
our  derivation  of  the  reflectance  ratio  is  based  on  the 
analysis  of  regions  that  lie  on  curved  surfaces.  In  the 
case of curved surfaces, image brightness variations result
from both illumination variations and surface
normal changes. For curved surfaces, our reflectance ratio
invariant is valid when a region and its background
have  the  same  distribution  (scattering)  function  but  dif¬ 
ferent  reflectance  coefficients  (albedo). 

Recently,  Finlayson  [5]  proposed  computing  his¬ 
tograms  using  ratios  in  different  color  channels  for  object 
recognition.  Histograms,  however,  are  in  general  sensi¬ 
tive  to  the  scale  and  rotation  of  objects  in  the  scene 
and  hence  are  not  effective  for  three-dimensional  object 
recognition and pose estimation. Here, we use the reflectance
ratio invariant to recognize objects from a single
image. This approach is very effective in the case of
man-made  objects  that  have  printed  characters  and  pic¬ 
tures.  Each  object  is  assumed  to  have  a  set  of  regions, 
each  with  constant  reflectance.  The  reflectance  ratio  and 
center  of  each  region  are  used  to  represent  objects  using  a 
hash  table.  Recognition  and  pose  estimation  algorithms 
are  presented  that  use  the  reflectance  ratios  of  scene  re¬ 
gions  to  index  the  hash  table.  The  result  is  a  hypothesis 
for  the  existence  of  an  object  in  the  image.  This  hypoth¬ 
esis  is  verified  using  the  reflectance  ratios  and  locations 
of  other  regions  in  the  scene.  Recognition  results  are  pre¬ 
sented  for  realistic  scenes  with  occlusion,  shadows,  and 
illumination  variations.  These  results  show  the  simulta¬ 
neous  use  of  reflectance  and  geometry  to  be  a  powerful 
approach  to  object  recognition  problems. 

2  Reflectance  Ratio  Invariant 

The  reflectance  of  a  surface  depends  on  its  roughness 
and  material  properties.  In  general,  incident  light  is 
scattered  by  a  surface  in  different  directions.  This  dis¬ 
tribution  of  reflected  light  can  be  described  as  a  function 
of the angle of incidence, the angle of emittance, and the
wavelength of the incident light. Consider an infinitesimal
surface patch with normal $\mathbf{n}$, illuminated with
monochromatic light of wavelength $\lambda$ from the direction
$\mathbf{s}$, and viewed from the direction $\mathbf{v}$. The reflectance of
the surface patch can be expressed as $r(\mathbf{s}, \mathbf{v}, \mathbf{n}, \lambda)$. Now
consider an image of the surface patch. If the spectral
distribution of the incident light is $e(\lambda)$ and the spectral
response of the sensor is $s(\lambda)$, the image brightness value
produced by the sensor is:

\[
I = \int s(\lambda)\, e(\lambda)\, r(\mathbf{s}, \mathbf{v}, \mathbf{n}, \lambda)\, d\lambda \tag{1}
\]

If we assume the surface patch is illuminated by “white”
light and the spectral response of the sensor is constant
within the visible-light spectrum, then $s(\lambda) = s$ and
$e(\lambda) = e$. We have:

\[
I = s\, e\, \rho\, R(\mathbf{s}, \mathbf{v}, \mathbf{n}) \tag{2}
\]

where $\rho\, R(\mathbf{s}, \mathbf{v}, \mathbf{n})$ is the integral of $r(\mathbf{s}, \mathbf{v}, \mathbf{n}, \lambda)$ over the
visible-light spectrum. We have decomposed the result
into $R(\cdot)$, which represents the dependence of surface reflectance
on the geometry of illumination and sensing,
and $\rho$, which may be interpreted as the fraction of the
incident light that is reflected in all directions by the surface.
Incident light that is not reflected by the surface is
absorbed or transmitted through the surface. Two surfaces
with the same distribution function $R(\cdot)$ can have
different reflectance coefficients $\rho$.

As a result of the white-light assumption, the reflectance
coefficient $\rho$ is independent of wavelength. This
enables us to represent the reflectance of the surface element
with a single constant. The same can be achieved
by using an alternative approach which does not require
making assumptions about the spectral distribution of
the incident light and the spectral response of the sensor.
Consider a narrow-band filter with spectral response
$f(\lambda)$, placed in front of the sensor. Image brightness is
then:

\[
I = \int f(\lambda)\, s(\lambda)\, e(\lambda)\, r(\mathbf{s}, \mathbf{v}, \mathbf{n}, \lambda)\, d\lambda \tag{3}
\]

Since the filter is narrow-band, it essentially passes a single
wavelength $\lambda'$ of reflected light. Its spectral response
can therefore be expressed as:

\[
f(\lambda) = \delta(\lambda' - \lambda) \tag{4}
\]

The image brightness measured with such a filter is:

\[
I = s'\, e'\, r(\mathbf{s}, \mathbf{v}, \mathbf{n}, \lambda') \tag{5}
\]

where $s' = s(\lambda')$ and $e' = e(\lambda')$. Once again, the reflectance
function can be decomposed into a geometrical
function and a reflectance coefficient:

\[
I = s'\, e'\, \rho'\, R'(\mathbf{s}, \mathbf{v}, \mathbf{n}) \tag{6}
\]

In this case, $R'(\cdot)$ represents the distribution of reflected
light for a particular wavelength of incident light. On
the other hand, for white-light illumination, $R(\cdot)$ represents
the distribution computed as an average over the
entire visible-light spectrum. However, the individual
terms in both (2) and (6) represent similar effects. Since
we have used the white-light illumination assumption in
our experiments, we will use the following expression for
image brightness in our discussion:

\[
I = k\, \rho\, R(\mathbf{s}, \mathbf{v}, \mathbf{n}) \tag{7}
\]

The constant $k = s\,e$ accounts for the brightness of the
light source and the response of the sensor. The exact
functional form of $R(\mathbf{s}, \mathbf{v}, \mathbf{n})$ is determined to a great extent
by the microscopic structure of the surface; generally
$R(\cdot)$ includes a diffuse component and a specular component
[6]. Once again, the reflection coefficient $\rho$ is the
fraction of incident light that is reflected by the surface.
It represents the reflective power of the surface and is
sometimes referred to as surface albedo.

Consider two neighboring points on a surface (Figure
1). For a smooth continuous surface, the two points may
be assumed to have the same surface normal vectors.
Further, the two points have the same source and sensor
directions. Hence, the brightness values, $I_1$ and $I_2$, of
the two points may be written as:

\[
I_1 = k\, \rho_1\, R_1(\mathbf{s}, \mathbf{v}, \mathbf{n}) \tag{8}
\]

\[
I_2 = k\, \rho_2\, R_2(\mathbf{s}, \mathbf{v}, \mathbf{n}) \tag{9}
\]

Figure 1: Neighboring points on a surface.

The main assumption made in computing the reflectance
ratio is that the two points have the same reflectance
functions ($R_1 = R_2 = R$) but their reflectance coefficients
$\rho_1$ and $\rho_2$ may differ. An example is that of two neighboring
Lambertian points that have different albedo values
because they lie in regions that have different shades
or colors. Then, the image brightness values produced
by the two points are:

\[
I_1 = k\, \rho_1\, R(\mathbf{s}, \mathbf{v}, \mathbf{n}), \qquad
I_2 = k\, \rho_2\, R(\mathbf{s}, \mathbf{v}, \mathbf{n}) \tag{10}
\]

The ratio of the reflectance coefficients of the two points
is:

\[
p = I_1 / I_2 = \rho_1 / \rho_2 \tag{11}
\]

Note  that  p  is  independent  of  the  reflectance  function, 
illumination  direction  and  intensity,  and  the  surface  nor¬ 
mal  of  the  two  points.  It  is  a  photometric  invariant  that 
is  easy  to  compute  and  does  not  vary  with  the  position 
and  orientation  of  the  surface  with  respect  to  the  sen¬ 
sor  and  the  source.  Further,  it  represents  an  intrinsic 
surface  property  that  can  be  used  for  object  recognition. 

We  have  assumed  that  the  scene  is  illuminated  by  a 
single  light  source.  Now  consider  the  same  scene  illu¬ 
minated  by  several  light  sources.  The  brightness  of  any 
point  can  be  written  as: 

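\[
I = \rho\, \left[ k_1 R(\mathbf{s}_1, \mathbf{v}, \mathbf{n}) + k_2 R(\mathbf{s}_2, \mathbf{v}, \mathbf{n}) + \cdots + k_n R(\mathbf{s}_n, \mathbf{v}, \mathbf{n}) \right] \tag{12}
\]

where $\mathbf{s}_1, \mathbf{s}_2, \ldots, \mathbf{s}_n$ are the directions of the $n$ sources
that are visible to the surface point under consideration
and $k_1, k_2, \ldots, k_n$ are proportional to the brightness of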
the  n  sources.  Since  the  reflectance  ratio  is  computed 
using  neighboring  points,  it  can  be  assumed  that  both 
points  are  illuminated  by  the  same  set  of  sources.  Then, 
from  (11)  and  (12)  we  see  that  the  reflectance  ratio  p  is 
unaffected  by  the  presence  of  multiple  light  sources. 

Note that by definition $p$ is unbounded; if the second
surface point is black, $I_2 = 0$, then $p = \infty$. From a computational
perspective, this poses implementation problems.
Hence, we use a different definition for $p$ to make
it a well-behaved function of the reflectance coefficients
$\rho_1$ and $\rho_2$:

\[
p = (I_1 - I_2)/(I_1 + I_2) = (\rho_1 - \rho_2)/(\rho_1 + \rho_2) \tag{13}
\]

Now, we have $-1 \le p \le 1$. We will use this definition of
the reflectance ratio in the following sections.
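
The invariance of the bounded ratio in (13) to the illumination, including the multiple-source case of (12), can be illustrated with a short sketch. The function names, the numeric values, and the convention of returning 0 when both points are black are assumptions made for this example.

```python
def brightness(rho, sources):
    # Equation (12): `sources` is a list of (k_j, R_j) pairs, where R_j
    # stands for the value of R(s_j, v, n) for source j.
    return rho * sum(k * r for k, r in sources)

def reflectance_ratio(i1, i2):
    # Equation (13). Returning 0.0 for two black points is an assumption;
    # the ratio is undefined in that case.
    total = i1 + i2
    if total == 0:
        return 0.0
    return (i1 - i2) / total

# Two neighboring points with albedos 0.8 and 0.2 under two different
# lighting conditions; both points see the same sources in each case.
lighting_a = [(100.0, 0.5), (40.0, 0.25)]
lighting_b = [(10.0, 0.9)]
ratio_a = reflectance_ratio(brightness(0.8, lighting_a), brightness(0.2, lighting_a))
ratio_b = reflectance_ratio(brightness(0.8, lighting_b), brightness(0.2, lighting_b))
```

Both `ratio_a` and `ratio_b` come out as (0.8 - 0.2)/(0.8 + 0.2) = 0.6, since the common illumination factor cancels.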

3  Reflectance  Ratio  of  a  Region 

To  this  point,  we  have  focused  on  two  neighboring 
points. Now consider a surface region that has constant
reflectance coefficient $\rho_1$ and is surrounded by a
background region with constant reflectance coefficient
$\rho_2$. We are interested in computing the reflectance ratio
$P(S)$ of the surface region $S$ with respect to its background.
The brightness of the entire region cannot be
assumed constant for the following two reasons. (a) The
surface may be curved and hence the surface normal can
vary substantially over the region. (b) While the illumination
may be assumed to be locally constant, it may
vary  over  the  region.  These  factors  can  cause  bright¬ 
ness  variations,  or  shading,  over  the  region  and  its  back¬ 
ground  as  well.  However,  the  reflectance  ratio  can  be  ac¬ 
curately  estimated  using  neighboring  (or  nearby)  points 
that  lie  on  either  side  of  the  boundary  between  the  region 
and  the  background.  The  reflectance  ratio  for  a  region 
can  then  be  determined  as  an  average  of  the  reflectance 
ratios  computed  along  the  boundary  of  the  region.  The 
computed  ratio  is  also  a  photometric  invariant;  it  is  in¬ 
dependent  of  the  three-dimensional  shape  of  the  surface 
and  the  illumination  conditions. 

Details  of  the  reflectance  ratio  algorithm  are  given  in 
[8].  Due  to  space  limitations,  we  will  simply  outline  the 
main  steps  of  the  algorithm.  The  algorithm  can  be  di¬ 
vided  in  two  parts.  First,  a  sequential  labeling  algorithm 
[7]  is  used  to  segment  the  image  into  connected  regions. 
During  sequential  labeling,  the  reflectance  ratio  of  neigh¬ 
boring  pixels  is  used  as  a  measure  of  the  “connectivity” 
between  the  pixels.  In  the  second  stage,  a  reflectance  ra¬ 
tio  for  each  segmented  region  is  computed  as  the  mean 
of  the  ratios  computed  for  all  points  on  the  boundary  of 
the  region.  The  algorithm  is  computationally  efficient  in 
that  reflectance  ratios  of  all  scene  regions  are  computed 
in  just  two  raster  scans  of  the  image  [8]. 
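
The second stage can be sketched as a brute-force average over boundary pixel pairs. This is not the two-raster-scan algorithm of [8]; it assumes the segmentation is already available as a boolean mask, uses 4-connected neighbors, and the function name and test image are hypothetical.

```python
import numpy as np

def region_reflectance_ratio(image, mask):
    """Mean of the point ratios of equation (13) along a region boundary.

    image : 2D array of brightness values
    mask  : 2D boolean array, True inside the region
    """
    ratios = []
    rows, cols = image.shape
    for r in range(rows):
        for c in range(cols):
            if not mask[r, c]:
                continue
            # Pair each region pixel with its 4-neighbors in the background.
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < rows and 0 <= cc < cols and not mask[rr, cc]:
                    inside, outside = float(image[r, c]), float(image[rr, cc])
                    if inside + outside > 0:
                        ratios.append((inside - outside) / (inside + outside))
    return float(np.mean(ratios)) if ratios else 0.0

# Synthetic scene: albedo 0.8 inside a square, 0.2 outside, with smooth
# shading that varies slowly across the image.
mask = np.zeros((10, 10), dtype=bool)
mask[3:7, 3:7] = True
albedo = np.where(mask, 0.8, 0.2)
shading = 1.0 + 0.001 * np.arange(10)[None, :]
ratio_dim = region_reflectance_ratio(50.0 * albedo * shading, mask)
ratio_bright = region_reflectance_ratio(200.0 * albedo * shading, mask)
```

Because only neighboring pixels are compared, the slowly varying shading nearly cancels in each pair, and both estimates stay close to 0.6 regardless of the overall brightness.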

4  Object  Recognition 

In  this  section,  we  apply  reflectance  ratios  to  the  prob¬ 
lem  of  object  recognition.  The  recognition  methods  pre¬ 
sented  here  are  effective  for  objects  that  have  markings 
with  different  reflectance  coefficients.  Man-made  objects 
with  pictures  and  text  printed  on  them  are  good  exam¬ 
ples  of  such  objects. 

Learning  Object  Models: 

Since our objective is to recover the three-dimensional
pose of an object from a single brightness image, the object
model must include reflectance ratios of the object
as well as the three-dimensional coordinates of the centroids
of each region. This is done using a range finder.
We use the image sensor of the range finder to also obtain
a brightness image of the object. As a result, the
range and brightness images of the object are registered.
The reflectance ratio algorithm is applied to the brightness
image and the ratios ($P_m$) and centroids ($x_m$) (in
the image) of the object’s regions are determined. Next,
the range map is used to obtain the three-dimensional
coordinates ($X_m$) of points of the object surface that
correspond to the region centroids in the image. We
assume that though the object surface may be curved,
each constant reflectance region is small compared to
the size of the object and hence can be assumed to be
planar. Under this assumption, centroids of regions in
the image correspond to centroids of the regions in the
3-D scene. Using the above approach, a ratio-centroid
list $L_a = ((X_1, P_1), (X_2, P_2), \ldots, (X_m, P_m), \ldots)$ is obtained
for each object. Here, $X_m$, $m = 1, \ldots, M$ are the
3-D centroids of the regions and $P_m$, $m = 1, \ldots, M$ are
the reflectance ratios.

Next,  a  h£ish  table  [9]  is  initialized.  All  object  mod¬ 
els  are  stored  in  the  same  hash  table.  The  indices  in 
the  hash  table  are  invariants  that  can  be  computed  from 
a  single  image  of  the  scene.  There  are  no  useful  geo¬ 
metric  invariants  that  can  be  computed  from  the  spatial 
arrangement  of  the  region  centroids  [10].  This  is  because 
object  rotation  in  the  scene  changes  the  relative  config¬ 
uration  of  the  region  centroids  in  the  image.  Hence,  we 
rely  on  the  photometric  invariance  of  reflectance  ratios 
for indexing into the hash table. We select three regions,
i, j, and k on the object and use their reflectance
ratios to obtain an index <P_i, P_j, P_k>. Indices are
formed using only those region triplets (i, j, k) whose
centroids in 3-D space lie within the radius of coherence
D_A. This ensures that the number of indices generated
is O(N), with N the number of visible regions on the
object, and not combinatorial in N. Associated with
each index in the hash table is an entry. The entry
stores the object identifier M_l and the 3-D coordinates
of the centroids (X_i, X_j, X_k) of the three regions used
in the index. The entry also includes the ratio-centroid
pairs (X_m, P_m) of other object regions that are used for
object verification and pose estimation.

The above procedure is applied to all sets of three regions
in the list L_A. Each object is typically represented
by several indices and entries in the hash table. This
process is repeated for all objects M_l, l = 1, ..., O, of
interest to the recognition system. The resulting hash table
represents the complete object-model database, which
is ready for use by the recognition system.
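
The model-acquisition step above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions: the function names, the dictionary layout of an entry, and the quantization step `step` (used so that nearly equal ratios share a key) are hypothetical and do not come from the paper. Only one ordering of each triplet is stored here, so a lookup would have to try the permutations of a scene triplet.

```python
from itertools import combinations
from math import dist


def quantize(ratio, step=0.05):
    """Map a reflectance ratio to an integer bin (assumed quantization)."""
    return round(ratio / step)


def build_hash_table(models, radius, step=0.05):
    """Build {index: [entry, ...]} from {object_id: [(centroid_3d, ratio), ...]}."""
    table = {}
    for object_id, regions in models.items():
        for i, j, k in combinations(range(len(regions)), 3):
            (Xi, Pi), (Xj, Pj), (Xk, Pk) = regions[i], regions[j], regions[k]
            # Use only triplets whose 3-D centroids lie within the
            # radius of coherence of the first region.
            if dist(Xi, Xj) > radius or dist(Xi, Xk) > radius:
                continue
            index = (quantize(Pi, step), quantize(Pj, step), quantize(Pk, step))
            # Remaining regions are kept for verification and pose estimation.
            others = [r for n, r in enumerate(regions) if n not in (i, j, k)]
            entry = {"object": object_id,
                     "centroids": (Xi, Xj, Xk),
                     "verify": others}
            table.setdefault(index, []).append(entry)
    return table
```

Because every model is written into the same dictionary, one lookup with a scene triplet can return hypotheses for several objects at once.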

Recognition  and  Pose  Estimation 

Though model acquisition requires the use of both a
brightness and a range image of each object, recognition
and pose estimation are accomplished using a single
brightness image. The reflectance ratio algorithm
is applied to the scene image to obtain the list
L_R = ((x_1, P_1), (x_2, P_2), ...). For recognition, a set of three
regions is selected from the list L_R. Consider the three
regions  (i,j,k).  This  set  is  used  only  if  the  image  cen¬ 
troids  of  the  regions  j  and  k  lie  within  the  radius  of 
coherence D_R from the centroid of the region i. The
ratios of the three regions are used to form the index
<P_i, P_j, P_k>. If this index does not have an entry
in the hash table, the next set of three regions is
selected from L_R. If an entry does exist, we have a
hypothesis for the object (say M_R). The entry includes
the 3-D centroids of the regions (i, j, k) and a
set of centroid-ratio pairs for other regions on the object
M_R. Assuming the object hypothesis is correct,
we have a correspondence between the image centroids
(x_i, x_j, x_k) and the 3-D centroids (X_i, X_j, X_k) in the entry.
Under the weak-perspective assumption, the transformation
T from the 3-D scene points to 2-D image
points  can  be  computed  from  the  three  corresponding 
3-D  centroids  and  image  centroids  using  the  alignment 
technique  proposed  by  Huttenlocher  and  Ullman  [11].  In 
general,  however,  there  exist  two  solutions  to  the  trans¬ 
formation  [11]: 

\[ x = T_{R1}(X) \quad \text{and} \quad x = T_{R2}(X) \tag{14} \]

Weinshall [12] has shown that instead of computing these
two transformations the inverse of the Grammian of the
points X_i, X_j, and X_k can be used to predict the image
coordinates x_o of a fourth 3-D point X_o in the entry.
Again, two solutions to x_o exist but if the initial object
hypothesis is correct, one of the two solutions is likely to
be close to one of the centroids in the list L_R. Further,
the reflectance ratio P_o in the entry and the ratio of the
matching region in the list L_R must be similar. The point
x_o is not guaranteed to be in the list L_R since it may not
be visible to the sensor or it may be occluded by other
objects in the scene. In any case, for the object to be
verified, one or more projections of the 3-D regions in the
entry must match in location and ratio with regions in the
list L_R. If so, the object M_R has been recognized and its
pose is given by either T_{R1} or T_{R2}.

At this point, all regions used as indices and those
that are verified are removed from the list L_R. A new
set of three regions is selected from the list and used
to form the next index. This process is repeated until
either all objects in the hash table have been recognized
or all regions in the list L_R have been explained.
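
The indexing part of the recognition loop can be sketched as follows. This is a hedged illustration, not the authors' implementation: the function names, the quantization step, and the layout of the table are assumptions, and the alignment-based verification and the removal of explained regions are deliberately left out.

```python
from itertools import permutations
from math import dist


def quantize(ratio, step=0.05):
    """Map a reflectance ratio to an integer bin (assumed quantization)."""
    return round(ratio / step)


def hypotheses(scene, table, radius, step=0.05):
    """Yield ((i, j, k), entry) for every scene triplet whose index hits the table.

    scene: [(image_centroid, ratio), ...]
    table: {index: [entry, ...]} as produced at model-acquisition time
    """
    for i, j, k in permutations(range(len(scene)), 3):
        (xi, Pi), (xj, Pj), (xk, Pk) = scene[i], scene[j], scene[k]
        # Regions j and k must lie within the radius of coherence of region i.
        if dist(xi, xj) > radius or dist(xi, xk) > radius:
            continue
        index = (quantize(Pi, step), quantize(Pj, step), quantize(Pk, step))
        for entry in table.get(index, []):
            yield (i, j, k), entry
```

Each yielded pair is only a hypothesis; it would still have to be verified by projecting the remaining model regions into the image and comparing location and ratio.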

5  Experiments 

In  this  section,  we  present  experimental  results  related 
to  the  invariance  of  reflectance  ratios  as  well  as  the  appli¬ 
cation  of  ratios  to  object  recognition.  Figure  2(a)  shows 
a brightness image of an object with several constant reflectance
regions. The image was obtained under ambient
lighting conditions. Figure 2(b) shows the result obtained
by applying the reflectance ratio algorithm. Ratio
values between -1.0 and 1.0 are offset and scaled to lie
between 0 and 255 image brightness levels. Note that all
letters in the word “KRYLON” have similar ratio values.
The  white  dot  in  the  center  of  each  region  represents  the 
centroid  of  the  region. 
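
The display mapping mentioned above can be written as a one-line helper. The text only says that ratios are "offset and scaled"; the linear form, the clipping, and the function name below are assumptions made for illustration.

```python
def ratio_to_gray(ratio):
    """Map a ratio in [-1.0, 1.0] linearly onto the grey levels 0..255."""
    clipped = max(-1.0, min(1.0, ratio))   # guard against out-of-range input
    return round((clipped + 1.0) * 127.5)  # offset by 1, scale by 255/2
```

Under this mapping, regions with similar ratios receive nearly the same brightness in the displayed ratio image.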

The  invariance  of  computed  ratios  to  the  illumination 
direction  is  illustrated  in  Figure  3.  The  direction  of  a 
single  light  source  is  varied  (about  the  axis  of  the  ob¬ 
ject)  from  -70  degrees  (left  of  the  object)  to  20  degrees 
(right  of  the  object)  in  increments  of  10  degrees.  As 
seen  from  the  figure,  the  reflectance  ratio  for  region  “K” 
demonstrates  remarkable  invariance  to  illumination  di¬ 
rection. 

The  recognition  experiments  were  conducted  on  man¬ 
made  objects  with  letters  and  pictures  printed  on  them. 
The  printed  regions  have  reflectance  coefficients  that  de¬ 
pend  on  the  shade  or  color  of  the  paint  used  to  print 
them.  Figure  4(a)  shows  the  model  acquisition  results 
for  a  3-D  object.  The  range  image  was  obtained  using 
a  light  stripe  range  finder.  The  vertices  of  the  trian¬ 
gle  displayed  are  the  centroids  of  three  regions  whose 
reflectance  ratios  were  used  as  indices  in  the  hash  ta¬ 
ble.  Other  nearby  regions  that  are  included  in  the  hash 
table  entry  for  object  verification  and  pose  estimation 
and  their  centroids  are  indicated  by  black  boxes.  All 
regions  used  for  model  acquisition  and  recognition  are 
assumed  to  be  planar  though  the  objects  they  lie  on  may 
be  curved.  Under  this  assumption,  the  centroid  of  an  im¬ 
age  region  corresponds  to  the  projection  of  the  centroid 
of  the  region  in  3-D  space.  The  planarity  assumption 
is  generally  found  to  be  reasonable  for  regions  that  are 
small  compared  to  the  size  of  the  object. 

Recognition and pose estimation are done using a single
brightness image of the scene (see Figure 4(b)). The
scene  consists  of  several  3-D  objects  in  different  orien¬ 
tations  and  positions.  It  includes  occlusions,  shadows, 
and  non-uniform  illumination.  The  reflectance  ratio  al¬ 
gorithm  was  applied  to  the  scene  image  and  a  total  of 
18 constant reflectance regions were detected. The index
triangle shown in the model image is found and verified
in the scene image. The set of three regions in the
scene image produces a hypothesis for the object. Other
regions  in  the  object  model  are  used  to  verify  this  hy¬ 
pothesis  using  the  alignment  technique  [11].  The  actual 
and  projected  centroids  of  the  verification  regions  are  in¬ 
dicated  by  black  and  white  boxes,  respectively.  Some  of 
the  verification  regions  are  not  found  in  the  scene  image 
since  they  are  occluded  by  other  objects. 

6  Conclusion 

We  conclude  with  the  main  points  of  this  paper: 

•  We  presented  a  photometric  invariant,  called  region 
reflectance  ratio,  that  is  computed  from  a  single 
brightness  image. 


•  We have used reflectance ratios to recognize objects
from images. This approach is in contrast to previous
recognition methods that rely solely on geometric
features for recognition and pose estimation.

•  The recognition technique presented here is capable
of automatically learning models of the objects of
interest.

Figure 3: Invariance of reflectance ratios to the direction
of illumination. (Axes: reflectance ratio versus source
direction in degrees.)

Figure 4: Model acquisition and object recognition results
obtained for a three-dimensional recognition problem.
(b) Object recognition and pose estimation.


Robotic  Three  Dimensional  Dual-Drive  Hybrid  Control  for 
Tactile  Sensing  and  Object  Recognition* 

Kelly  A.  Korzeniowski  and  William  A.  Wolovich 

Laboratory  for  Engineering  Man/Machine  Systems 
Division  of  Engineering, 

Brown  University, 

Providence,  RI  02912 


1  Abstract 

This  work  describes  the  development  and  testing  of 
a  dual-drive  surface  tracking  controller  that  en¬ 
ables  a  robot  to  track  along  any  specified  path  on 
the  surface  of  an  object.  The  dual-drive  controller 
computes  the  normal  and  tangent  vectors  relative 
to  movement  along  the  path.  The  result  is  con¬ 
trolled  movement  in  3-D  on  the  surface  of  an  object. 
This  tactile  data  collection  method  is  referred  to  as 
“object-dependent”  sensing  because  the  location  of 
the  sensing  paths  is  driven  by  comparisons  made  by 
the  recognizer  to  a  model  data  base.  It  is  assumed 
that  the  path  is  generated  by  an  external  recognizer 
in such a way that the data points collected by tactile
sensing along the path will maximize the probability
of correctly identifying the object. The application
for such a data collection system is object
recognition  tasks  in  environmental  exploration  and 
manipulation. 

2  Overview 

In  the  complete  “object-dependent”  tracking  sys¬ 
tem,  we  envision  an  external  recognition  program 
that  uses  partial  data  sets  collected  by  tactile  sen¬ 
sors  on  the  object’s  surface  to  attempt  an  identifi¬ 
cation  and  to  direct  future  data  collection  to  limit 
uncertainty in making an identification. “Object-dependent”
tactile sensing with the 3-D dual-drive
controller is an improvement over the general 3-D
object tracking algorithm [Korz91] with the 2-D
dual-drive controller. The 3-D dual-drive controller
allows the robot to track along any path with the
end effector in any orientation, whereas the 2-D
dual-drive controller limits tracking to the horizontal
and vertical planes in the base frame. By adding
recognizer  information  to  the  tracking  scheme,  tac¬ 
tile  sensing  is  no  longer  limited  to  general  tracking 
methods  that  may  be  spending  time  collecting  ex¬ 
traneous  data  or  ignoring  features  on  an  object  that 
are  critical  to  its  identification. 

*This work was partially funded by the NSF in conjunction
with the Advanced Research Projects Agency
of the Department of Defense under Contract No.
IRI-8905436.


Many  applications  of  tactile  sensing  for  data  col¬ 
lection  concentrate  on  driving  the  position  of  the 
robot  by  machine  vision  in  order  to  collect  discrete 
data  points  on  the  surface  of  an  object.  The  work 
presented  here  differs  in  the  respect  that  the  dual¬ 
drive  controller  allows  the  robot  to  stay  in  contact 
with  a  surface  and  in  general  the  result  is  that  more 
data  points  can  be  collected  in  a  shorter  period  of 
time.  This  paper  will  show  that  tactile  force  sensing 
can  be  used  to  control  the  robot  and  could  be  the 
main  data  collection  method.  Machine  vision  would 
be  used  as  a  secondary  data  collection  method  that 
supplements  tactile  surface  tracking  by  providing  in¬ 
formation  about  the  bounding  outline,  and  an  esti¬ 
mate  for  the  surface  normal  and  thus  the  approach 
vector  for  tactile  surface  contact  points  at  the  be¬ 
ginning  of  each  path. 

This  controller  is  implemented  and  tested  using 
an  IBM  7565  Cartesian  robot  equipped  with  strain 
gauges  on  the  end  effector.  The  tracing  controller 
is  designed  so  that  the  robot  will  be  able  to  track 
any  free  form  complex  object.  Experimental  results 
are  presented  to  show  that  with  the  3-D  dual-drive 
controller,  the  robot  is  able  to  track  an  arbitrary 
path  on  a  complex  3-D  object. 

3  Prior  Results  -  Dual-Drive  in  2-D 

The 2-D dual-drive force/velocity controller is described
in [Kaza88]. Given the value of the magnitude
for the desired force and velocity, the action of
the controller is to zero both the force and velocity
errors. The result is that the robot maintains
contact  with  an  object,  moving  along  the  surface. 

In  this  hybrid  controller,  separate  servo  loops  han¬ 
dle  the  force  and  position  calculations.  The  force 
servo loop is a damping controller with saturation.
Basically,  the  force  error,  the  difference  in  magni¬ 
tude  between  the  desired  and  actual  force  vectors,  is 
converted  to  a  velocity  value  by  a  damping  constant. 
The  resulting  velocity  is  compared  to  the  desired  ve¬ 
locity  and  integrated  to  obtain  a  reference  position 
for  the  robot. 
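
One update of such a damping loop might look like the sketch below. It is a simplified one-dimensional illustration along the surface normal: the function name, the sign convention, the saturation limit `v_max`, and the time step `dt` are assumptions, and the comparison with the desired tangential velocity is omitted.

```python
def force_servo_step(position, f_desired, f_actual, damping, v_max, dt):
    """Return the next reference position along the surface normal."""
    force_error = f_desired - f_actual             # magnitude error
    velocity = force_error / damping               # damping constant: force -> velocity
    velocity = max(-v_max, min(v_max, velocity))   # saturation
    return position + velocity * dt                # integrate to a reference position
```

With a large force error the commanded velocity is clipped at `v_max`, which keeps the reference position from jumping when contact is first made.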

Figure 1: Determining the Sign of Force and Velocity

The dual-drive controller assumes that the normal
and tangent vectors will always be orthogonal. This
relation between the velocity and the force can be
maintained if the tracking surface is sufficiently stiff
and the friction between the surface and the end effector
is negligible. If the surface is frictionless, the
normal will be in the same direction as the force vector
F (see Figure 1). Using the force sensor readings,
the outward unit normal and unit tangent vectors
are given by,

\[ N = \frac{1}{|F|}(F_x, F_y) \tag{1} \]

\[ T = \frac{1}{|F|}(-F_y, F_x) \tag{2} \]

In the case of a stiff surface, the velocity due to
force corrections is small and V is tangential. The
outward unit normal and unit tangent vectors are,

\[ N = \frac{1}{|V|}(V_y, -V_x) \tag{3} \]

\[ T = \frac{1}{|V|}(V_x, V_y) \tag{4} \]

where $|F| = \sqrt{F_x^2 + F_y^2}$ and $|V| = \sqrt{V_x^2 + V_y^2}$.

These four equations for the unit normal and unit
tangent vectors give four different dual-drive implementations.
See [Kaza88] for more details.
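
As a small illustration, the two ways of obtaining the unit vectors can be coded directly. The sketch assumes the reading of equations (1) through (4) in which N = (F_x, F_y)/|F|, T = (-F_y, F_x)/|F| for the force form and N = (V_y, -V_x)/|V|, T = (V_x, V_y)/|V| for the velocity form; the function names are hypothetical.

```python
from math import hypot


def normal_tangent_from_force(fx, fy):
    """Unit normal along the force vector, unit tangent perpendicular to it."""
    mag = hypot(fx, fy)
    normal = (fx / mag, fy / mag)
    tangent = (-fy / mag, fx / mag)
    return normal, tangent


def normal_tangent_from_velocity(vx, vy):
    """Unit tangent along the velocity vector, unit normal perpendicular to it."""
    mag = hypot(vx, vy)
    normal = (vy / mag, -vx / mag)
    tangent = (vx / mag, vy / mag)
    return normal, tangent
```

When the force is normal to the surface and the velocity is tangential, both functions return the same pair of vectors, which is the orthogonality assumption the controller relies on.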


sign      +                -
xsgn      state 1 or 4     state 2 or 3
ysgn      state 1 or 2     state 3 or 4

Table 1: Combinations for the Sign of Force and
Velocity

In  the  tracking  application  considered  here,  the 
complete tracking trajectory, the shape of the object,
is not known prior to tracking. As the robot
moves around an object, the signs of the x-y components
of the actual force and velocity change as
shown in Figure 1. These changes are summarized
in Table 1. The equations for determining the sign
of the actual force and velocity vectors are,

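\[ \mathrm{sgn}(V_{act}) = \mathrm{sgn}[\,\mathrm{xsgn}(V_y) - \mathrm{ysgn}(V_x)\,] \tag{5} \]
\[ \mathrm{sgn}(F_{act}) = \mathrm{sgn}[\,\mathrm{xsgn}(F_x) - \mathrm{ysgn}(F_y)\,] \tag{6} \]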

The configuration of the dual-drive controller implied
by equations (1) through (6) was used to
track and collect data points to identify 2-D objects.
See [Brad90] for more details.


Figure  2:  Top  View  -  Cross  Section  of  an  Object 


A general 3-D object tracking algorithm was developed
using the 2-D dual-drive controller [Korz91].
The  robot  tracked  around  an  unknown  object,  col¬ 
lecting  horizontal  planar  slices  of  data  for  use  in 
recognition.  The  subject  of  the  remainder  of  this 
report  is  the  expansion  of  the  2-D  dual-drive  control 
equations  for  use  in  tracking  arbitrary  paths  in  3-D 
space. 

4 General Surface Tracking
Dual-Drive  Control  in  3-D 

The  dual-drive  controller  operates  on  the  assump¬ 
tion  that  the  actual  normal  and  tangent  vectors  will 
be  orthogonal.  One  of  the  requirements  for  this  is 
that surface friction must be negligible. A low friction
point for surface tracking occurs when the end
effector  is  aligned  near  the  surface  normal.  It  has 
been  found  experimentally  that  this  restriction  can 
be relaxed so that the angle between the end effector
and the surface normal, θ, may actually be within a
determined range, i.e. |θ| < θ_threshold. If the surface
normal greatly varies along the tracking line
and |θ| > θ_threshold, the end effector is reoriented,
then the dual-drive equations are recalculated and
tracking continues.

The  extension  from  dual-drive  control  in  2-D  to 
3-D  can  be  illustrated  in  the  following  way.  A  recog¬ 
nition  program  generates  a  bounded  path  that  will 
be  projected  onto  a  surface.  The  path  is  projected 
by  passing  a  plane  between  the  bounds  on  the  path 
generated  by  the  recognizer.  The  orientation  of  the 
plane  should  be  near  the  surface  normal.  This  will 
become  the  orientation  of  the  end  effector,  and  the 
robot  will  track  the  path  moving  in  the  plane.  Fig¬ 
ure  2  illustrates  the  determination  of  the  tracking 
path. 

The  direction  of  the  vector  connecting  the 
bounded  points  of  the  path  is  the  tangential  direc¬ 
tion  of  motion  for  the  robot.  As  the  surface  changes 
locally,  the  normal  and  tangent  vectors  adjust  and 
the  result  is  that  the  robot’s  movement  occurs  in 
the  2-D  plane.  Dual-drive  tracking  occurs  relative 
to the 2-D plane in the dual-drive D_a frame. This
plane can be oriented in 3-space and, given the appropriate
mapping between the two frames, the result
is  dual-drive  tracking  in  the  world  frame  O. 

Figure 3: Frame Axes Between Base and Dual-Drive

By using the orientation coordinate transformation
matrices for yaw, pitch, and roll rotations
[Wolo87] to find the mapping for the force/velocity
and position components between the O and D_a
frames, one is able to make dual-drive 2-D
force/velocity calculations in reference to the directed
tracking path. The components of force and
velocity in the D_a frame are a function of the 3-D
components of the force and velocity in the O
frame. The resulting 2-D motion in the dual-drive
frame is mapped to 3-D motion in the base frame.

5  Mapping  Vector  Quantities 
Between Frame Axes


Table 2: Relating Orientation Matrices to Angles
(columns: transformation matrix; function of angle).
Each orientation matrix is a function of one of the
rotation angles: the yaw angles ψ_1 and ψ_2, the pitch
angle ρ, or the roll angle γ.

For general yaw angle rotations ψ, the calculation
of the mappings between frame axes is performed
in two steps. Velocity readings are given by
the system’s sensors in reference to the O frame,
(v_xO, v_yO, v_zO). The force strain gauges are assigned
to associated axes in frame O_a, (f_xOa, f_yOa, f_zOa). In
Figure 3 (a), the x_Oa axis corresponds to the yaw of
the end effector. The tool is oriented along the x_Oa
axis. In order to make the necessary calculations,
the velocities are needed in the O_a frame. First the
x_Oa axis in frame O_a is yawed by an amount ψ_1
to rotate the axes to line up with
the X, Y, or Z axis in the base frame O. This allows
the force and velocity along the x_Oa and y_Oa
axes to be expressed in terms of the forces and velocities
in the x_O and y_O axes. The orientation rotation
matrix used in this mapping is a yaw
matrix. The next yaw rotation, ψ_2 = ψ − ψ mod 90°,
is applied to frame O to align the axes in frame O'
with the base frame. This is a necessary calculation
to move the axes to a common starting point in
order to generalize the equations and correctly apply
the following rotations. The rotation between
the O' and T' frames, a pitch rotation ρ, brings
the x_T' axis in alignment with the surface normal.
The rotation between the T' and D' frames, a roll
rotation γ, brings the y_D' axis in alignment with
the vector that connects the bounding points on the
tracking path. This is the direction of motion for
the robot. A final yaw rotation of ψ_1 degrees is
made between the D' and D_a frames. The dual-drive
calculations in frame D_a are made with
F_Da = f(f_xOa, f_yOa, f_zOa) = (f_xDa, f_yDa) and
V_Da = f(v_xO, v_yO, v_zO) = (v_xDa, v_yDa) along with the
associated xsgn and ysgn for the vectors. The matrix
equations express F_Da as a product of orientation
matrices applied to F_Oa, and V_Da as a product of
orientation matrices applied to V_O; the products are
written out in Section 6. In Figure 3
(b), a similar calculation is made between the D_a
and O frames for mapping the calculated displacements
back to the base O frame. The yaw rotation
that is applied first is ψ_1, then ψ_2. The displacements
are computed as P_O = f(p_xDa, p_yDa) =
(p_xO, p_yO, p_zO), again through a product of orientation
matrices applied to P_Da. The matrices and the
respective transposes are functions of the rotation angles
as listed in Table 2.

Figure 4: 3-D Dual-Drive Controller - Block Diagram

The range on the total yaw angle rotation is
0° < ψ < 360°. This allows the robot to track
around an object. The pitch rotation is related to
the orientation of the end effector. The range on the
pitch rotation is −90° < ρ < 90°. The roll rotation
is related to the direction of motion given by the vector
connecting the end points of the tracking path
generated by the recognition program. The range
on the roll rotations is −90° < γ < 90°. This range
along  with  positive  and  negative  velocity  specifica¬ 
tions  for  the  robot’s  direction  of  motion,  allows  the 
robot  to  track  along  any  given  direction  of  motion. 

The complete block diagram for the 3-D dual-drive
controller is shown in Figure 4. The orientation
mapping matrices are included in the diagram.

6 Matrix Multiplication

It was decided that the yaw and roll angles will be
expressed as positive and the pitch angle will be expressed
as negative. The matrices reflect this choice.

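The force readings are taken in reference to the
O_a frame according to the way the end effector
is aligned along the axes. These force readings,
F_Oa = (f_xOa, f_yOa, f_zOa), are mapped to the D_a
frame, F_Da = (f_xDa, f_yDa). Note that by definition,
only the X and Y components of the forces in the
D_a frame are used in the dual-drive calculations.
Writing c and s for cosine and sine,

\[
F_{D_a} =
\begin{pmatrix} c\psi_2 & -s\psi_2 & 0 \\ s\psi_2 & c\psi_2 & 0 \\ 0 & 0 & 1 \end{pmatrix}
\begin{pmatrix} 1 & 0 & 0 \\ 0 & c\gamma & s\gamma \\ 0 & -s\gamma & c\gamma \end{pmatrix}
\begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}
\begin{pmatrix} c\psi_2 & s\psi_2 & 0 \\ -s\psi_2 & c\psi_2 & 0 \\ 0 & 0 & 1 \end{pmatrix}
F_{O_a}
\]

The resulting equations for F_Da are,

\[
\begin{aligned}
f_{xD_a} ={}& (c^2\psi_2 + s^2\psi_2\, c\gamma)\, f_{xO_a} + (c\psi_2\, s\psi_2 - c\psi_2\, s\psi_2\, c\gamma)\, f_{yO_a} + (-s\psi_2\, s\gamma)\, f_{zO_a} \\
f_{yD_a} ={}& (s\psi_2\, c\psi_2 - s\psi_2\, c\gamma\, c\psi_2)\, f_{xO_a} + (s^2\psi_2 + c^2\psi_2\, c\gamma)\, f_{yO_a} + (c\psi_2\, s\gamma)\, f_{zO_a}
\end{aligned}
\]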

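The velocity readings are given by the sensors in
reference to the O frame. These velocity readings
V_O = (v_xO, v_yO, v_zO) are mapped to the D_a frame,
V_Da = (v_xDa, v_yDa). In the D_a frame by definition
there is no velocity along the z_Da axis. The velocity
equations to be used in the dual-drive calculations
are given by the following matrix multiplications.

\[
V_{D_a} =
\begin{pmatrix} c\psi_2 & -s\psi_2 & 0 \\ s\psi_2 & c\psi_2 & 0 \\ 0 & 0 & 1 \end{pmatrix}
\begin{pmatrix} 1 & 0 & 0 \\ 0 & c\gamma & s\gamma \\ 0 & -s\gamma & c\gamma \end{pmatrix}
\begin{pmatrix} c\rho & 0 & -s\rho \\ 0 & 1 & 0 \\ s\rho & 0 & c\rho \end{pmatrix}
\begin{pmatrix} c\psi_2 & s\psi_2 & 0 \\ -s\psi_2 & c\psi_2 & 0 \\ 0 & 0 & 1 \end{pmatrix}
\begin{pmatrix} c\psi_1 & s\psi_1 & 0 \\ -s\psi_1 & c\psi_1 & 0 \\ 0 & 0 & 1 \end{pmatrix}
V_O
\]

The resulting equations for V_Da are,

\[
\begin{aligned}
v_{xD_a} ={}& v_{xO}\,\bigl[c\psi_1 (c^2\psi_2\, c\rho - s\psi_2 (s\gamma\, s\rho\, c\psi_2 - c\gamma\, s\psi_2)) - s\psi_1 (c\psi_2\, c\rho\, s\psi_2 - s\psi_2 (s\gamma\, s\rho\, s\psi_2 + c\gamma\, c\psi_2))\bigr] \\
&+ v_{yO}\,\bigl[s\psi_1 (c^2\psi_2\, c\rho - s\psi_2 (s\gamma\, s\rho\, c\psi_2 - c\gamma\, s\psi_2)) + c\psi_1 (c\psi_2\, c\rho\, s\psi_2 - s\psi_2 (s\gamma\, s\rho\, s\psi_2 + c\gamma\, c\psi_2))\bigr] \\
&+ v_{zO}\,(-c\psi_2\, s\rho - s\psi_2\, s\gamma\, c\rho) \\
v_{yD_a} ={}& v_{xO}\,\bigl[c\psi_1 (s\psi_2\, c\rho\, c\psi_2 + c\psi_2 (s\gamma\, s\rho\, c\psi_2 - c\gamma\, s\psi_2)) - s\psi_1 (s^2\psi_2\, c\rho + c\psi_2 (s\gamma\, s\rho\, s\psi_2 + c\gamma\, c\psi_2))\bigr] \\
&+ v_{yO}\,\bigl[s\psi_1 (s\psi_2\, c\rho\, c\psi_2 + c\psi_2 (s\gamma\, s\rho\, c\psi_2 - c\gamma\, s\psi_2)) + c\psi_1 (s^2\psi_2\, c\rho + c\psi_2 (s\gamma\, s\rho\, s\psi_2 + c\gamma\, c\psi_2))\bigr] \\
&+ v_{zO}\,(-s\psi_2\, s\rho + c\psi_2\, s\gamma\, c\rho)
\end{aligned}
\]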
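
The velocity mapping can be checked numerically by multiplying the five rotation matrices instead of expanding them by hand. The sketch below assumes the factor order yaw(ψ_2), roll(γ), pitch(ρ), transposed yaw(ψ_2), transposed yaw(ψ_1) read from the matrices of this section; the function names are hypothetical, and angles are taken in radians.

```python
from math import cos, sin


def matmul(a, b):
    """Multiply two 3x3 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def velocity_map(psi1, psi2, rho, gamma):
    """Return the 3x3 matrix that maps V_O to the dual-drive frame."""
    c1, s1 = cos(psi1), sin(psi1)
    c2, s2 = cos(psi2), sin(psi2)
    cr, sr = cos(rho), sin(rho)
    cg, sg = cos(gamma), sin(gamma)
    factors = [
        [[c2, -s2, 0], [s2, c2, 0], [0, 0, 1]],   # yaw psi2
        [[1, 0, 0], [0, cg, sg], [0, -sg, cg]],   # roll gamma
        [[cr, 0, -sr], [0, 1, 0], [sr, 0, cr]],   # pitch rho
        [[c2, s2, 0], [-s2, c2, 0], [0, 0, 1]],   # transposed yaw psi2
        [[c1, s1, 0], [-s1, c1, 0], [0, 0, 1]],   # transposed yaw psi1
    ]
    m = factors[0]
    for f in factors[1:]:
        m = matmul(m, f)
    return m


def to_dual_drive(v_o, psi1, psi2, rho, gamma):
    """Return (v_x, v_y) in the dual-drive frame; the z component is dropped."""
    m = velocity_map(psi1, psi2, rho, gamma)
    vx = sum(m[0][j] * v_o[j] for j in range(3))
    vy = sum(m[1][j] * v_o[j] for j in range(3))
    return vx, vy
```

Computing the product numerically avoids transcription errors in the long closed-form expressions, and individual entries of the product can be compared against the corresponding coefficients term by term.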

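The displacements for the robot must be specified
in reference to the O frame. The dual-drive
calculations produce the displacements
P_Da = (p_xDa, p_yDa). These are mapped to the O frame,
P_O = (p_xO, p_yO, p_zO).

\[
P_O =
\begin{pmatrix} c\psi_2 & -s\psi_2 & 0 \\ s\psi_2 & c\psi_2 & 0 \\ 0 & 0 & 1 \end{pmatrix}
\begin{pmatrix} c\rho & 0 & s\rho \\ 0 & 1 & 0 \\ -s\rho & 0 & c\rho \end{pmatrix}
\begin{pmatrix} 1 & 0 & 0 \\ 0 & c\gamma & -s\gamma \\ 0 & s\gamma & c\gamma \end{pmatrix}
\begin{pmatrix} c\psi_2 & s\psi_2 & 0 \\ -s\psi_2 & c\psi_2 & 0 \\ 0 & 0 & 1 \end{pmatrix}
\begin{pmatrix} c\psi_1 & -s\psi_1 & 0 \\ s\psi_1 & c\psi_1 & 0 \\ 0 & 0 & 1 \end{pmatrix}
P_{D_a}
\]

The resulting equations for P_O are

\[
\begin{aligned}
p_{xO} ={}& p_{xD_a}\,\bigl[c\psi_1 (c\rho\, c^2\psi_2 - s\psi_2 (s\gamma\, s\rho\, c\psi_2 - c\gamma\, s\psi_2)) + s\psi_1 (c\rho\, c\psi_2\, s\psi_2 + c\psi_2 (s\gamma\, s\rho\, c\psi_2 - c\gamma\, s\psi_2))\bigr] \\
&+ p_{yD_a}\,\bigl[-s\psi_1 (c\rho\, c^2\psi_2 - s\psi_2 (s\gamma\, s\rho\, c\psi_2 - c\gamma\, s\psi_2)) + c\psi_1 (c\rho\, c\psi_2\, s\psi_2 + c\psi_2 (s\gamma\, s\rho\, c\psi_2 - c\gamma\, s\psi_2))\bigr] \\
p_{yO} ={}& p_{xD_a}\,\bigl[c\psi_1 (c\rho\, s\psi_2\, c\psi_2 - s\psi_2 (s\gamma\, s\rho\, s\psi_2 + c\gamma\, c\psi_2)) + s\psi_1 (c\rho\, s^2\psi_2 + c\psi_2 (s\gamma\, s\rho\, s\psi_2 + c\gamma\, c\psi_2))\bigr] \\
&+ p_{yD_a}\,\bigl[-s\psi_1 (c\rho\, s\psi_2\, c\psi_2 - s\psi_2 (s\gamma\, s\rho\, s\psi_2 + c\gamma\, c\psi_2)) + c\psi_1 (c\rho\, s^2\psi_2 + c\psi_2 (s\gamma\, s\rho\, s\psi_2 + c\gamma\, c\psi_2))\bigr] \\
p_{zO} ={}& p_{xD_a}\,\bigl[c\psi_1 (-s\rho\, c\psi_2 - s\psi_2\, s\gamma\, c\rho) + s\psi_1 (-s\rho\, s\psi_2 + s\gamma\, c\rho\, c\psi_2)\bigr] \\
&+ p_{yD_a}\,\bigl[-s\psi_1 (-s\rho\, c\psi_2 - s\psi_2\, s\gamma\, c\rho) + c\psi_1 (-s\rho\, s\psi_2 + s\gamma\, c\rho\, c\psi_2)\bigr]
\end{aligned}
\]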


7  Experiments  and  Applications 

The  3-D  dual-drive  controller  was  implemented 
and  tested  using  an  IBM  7565  Cartesian  robot. 
Presently,  the  actions  of  the  recognition  system  are 
simulated.  The  location  and  bounds  on  the  tracking 
path  and  the  orientation  of  the  end  effector  are  sup¬ 
plied  manually.  Given  the  end  effector  orientation 
and the approach vector, the robot moves toward
the surface until contact is made as indicated by the
force sensing strain gauges. At this point surface
tracking and data collection commences.

Figure 5: Velocity Data Plots, |V_desired| = 1.0 in/sec

Figure 6: Force Data Plots, |F_desired| = 1.1 lbs


Controller modifications are added to monitor θ
and the force at the end effector. If surface contact
is lost during tracking, the dual-drive controller
is temporarily deactivated until contact is regained
via a surface searching routine. The robot moves
in an outward-turning square spiral until contact is
regained, then surface following continues.
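
A square-spiral search of this kind can be sketched as a generator of offsets from the point where contact was lost. The step size, the counter-clockwise turning direction, the `max_steps` limit, and the `in_contact` callback are assumptions made for illustration; the paper does not give these details.

```python
def square_spiral(step=1.0):
    """Yield (x, y) offsets of an outward square spiral that starts at the origin."""
    x = y = 0.0
    dx, dy = step, 0.0
    leg = 1
    while True:
        for _ in range(2):               # two legs share each side length
            for _ in range(leg):
                x, y = x + dx, y + dy
                yield (x, y)
            dx, dy = -dy, dx             # turn 90 degrees counter-clockwise
        leg += 1


def search_for_contact(in_contact, max_steps=100, step=1.0):
    """Walk the spiral until in_contact(x, y) is true; return that offset or None."""
    spiral = square_spiral(step)
    for _ in range(max_steps):
        x, y = next(spiral)
        if in_contact(x, y):
            return (x, y)
    return None
```

In use, `in_contact` would wrap the strain-gauge reading; once it reports contact, the dual-drive controller can be reactivated from the returned offset.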

A reading lamp was chosen to show the performance
of the 3-D dual-drive controller. Figure
5 and Figure 6 show the velocity components v_xDa, v_yDa,
|V_Da| and force components f_xDa, f_yDa, |F_Da|
relative to the D_a frame. This data was collected as
the robot moved along a typical tracking path. In the
force magnitude |F_Da| and the velocity magnitude |V_Da|
graphs, the magnitude of the desired force and velocity
is indicated on each of the graphs. Again the
performance of this controller is comparable to the
2-D case as reported in [Kaza88].

8  Conclusions  and  Future  Work 

This  report  has  explained  the  development  of  a  3-D 
dual-drive  controller.  The  application  for  this  con¬ 
troller  is  tracking  line  segments  on  an  object’s  sur¬ 
face  in  order  to  recognize  the  object.  It  is  assumed 
that  an  external  recognition  program  generates  the 
location  of  the  tracking  line.  Data  points  that  are 
collected  as  the  robot  moves  along  the  line  will  limit 
the  uncertainty  with  which  an  identification  can  be 
made. 

Given  a  desired  force  and  velocity,  the  dual-drive 
controller  zeroes  both  force  and  velocity  errors  so 
that  the  robot’s  end  effector  moves  across  the  sur¬ 
face.  The  result  is  3-D  controlled  tracking  motion. 
It  has  been  demonstrated  that  the  controller  can  be 
successfully  applied  to  tracking  real-world  objects. 

Work  is  currently  underway  to  develop  a  3-D 
tracking algorithm. The chief purpose of the algorithm
will be to combine continuous and discontinuous
tracking methods in order to more intelligently
apply  the  controllers  to  handle  changes  in  surface 
along  the  tracking  path  interval.  Prior  knowledge 
about  surfaces  that  appear  in  the  environment  may 
also  be  included  depending  on  the  requirements  of 
the  tracking  algorithm  and  object  recognition.  The 
purpose of this work will be to develop an algorithm
that will enable the robot to track complex real-world
objects on a specified tracking interval using
the 3-D dual-drive controller.

Abstract

We  attempt  to  solve  the  problem  of  imperfect  data 
produced  by  state-of-the-art  edge  detectors  through  the 
implementation of laws of Perceptual Grouping, derived
from the psychology field.

We  introduce  a  saliency-enhancing  operator  capa¬ 
ble  of  highlighting  features  (edges,  junctions  etc.) 
which  are  considered  ‘important’  psychologically.  It 
also  infers  features  which  are  not  detected  by  low-level 
detectors.  We  show  how  to  extract  salient  curves  and 
junctions and generate a description ranking these features
by the likelihood of them occurring accidentally.

We  also  treat  the  problem  of  illusory  contours  ap¬ 
parent  in  end-point  formations.  The  scheme  is  particu¬ 
larly  useful  as  a  gap  filler  and  in  the  presence  of  a  large 
amount  of  noise.  It  is  interesting  to  note  that  all  opera¬ 
tions  are  parameter-free,  non-iterative  and  are  linear 
with  the  number  of  edges  in  the  input  image. 

1  Introduction 

An  area  which  is  likely  to  improve  results  in  com¬ 
puter  vision  is  the  one  of  perceptual  grouping.  Percep¬ 
tual  Grouping  can  be  classified  as  a  mid-level  process 
directed  toward  closing  the  gap  between  what  is  pro¬ 
duced  by  state-of-the-art  low-level  algorithms  (such  as 
edge  detectors)  and  what  is  desired  as  input  to  high  level 
algorithms  (perfect  contours,  no  noise,  no  fragmenta¬ 
tion,  etc.).  Many  researchers  resort  to  using  synthetic 
data  as  their  input  because  of  these  weaknesses. 

Methods  for  edge  labeling  (like  [Clowes  1971], 
[Waltz  1975])  assume  perfect  segmentation  and  connec¬ 
tivity,  and  define  constraints  which  are  only  valid  under 
these  assumptions.  These  methods  cannot  work  on 
‘real’  edges.  Other  methods,  like  shape  from  contour 
[Ulupinar  1991],  and  representation  of  objects  ([Rom 

*  This  research  was  supported  by  the  Advanced  Research  Projects 
Agency  of  the  Department  of  Defense  and  was  monitored  by  the  Air 
Force  Office  of  Scientific  Research  under  Contract  No.  F49620-90- 
C-0078.  The  United  States  Government  is  authorized  to  reproduce 
and  distribute  reprints  for  governmental  purposes  notwithstanding 
any  copyright  notation  hereon. 


and  Medioni  1993],  [Sugihara  1984]),  also  rely  heavily 
on the connectedness of the edges, and can benefit from
the removal of noise (erroneous segments). Pattern recognition
schemes (like [Stein and Medioni 1992]) rely
on (at least) partial connectedness of the edges, and cannot
function if the edge image is very fragmented. Also,
the computational cost of finding ‘real’ objects in a scene
grows with the amount of noise.

Using global perceptual considerations when attempting
to connect fragmented edge images can alleviate
many of the above problems, as shown later in this paper.

In a previous paper ([Guy and Medioni 1992a]), we
have introduced a general algorithm capable of highlighting
features due to co-curvilinearity and proximity.
We suggested the Extension Field as a ‘voting pattern’
representing a large family of smooth curves, all at
once. Here we further explore the properties of the Extension
Field. We also suggest specialized fields that can
be used within the same computational paradigm to reveal
perceptual phenomena such as end-point formations
and straight lines. Experimental results with different
types of distortion applied to the input are also
presented. We start by explaining the original algorithm,
then describe some of the properties of the Extension
Field. We follow with a comparison with the classical
Hough transform and conclude by offering a set of other
fields for special purposes.

1.1  Perceptual  Grouping 

Perceptual Grouping refers to a class of visual phenomena
where clustering of physically non-connected
elements in the image occurs. This task is equivalent to
a figure-ground discrimination when patterns are embedded
in noise. Figure 1 depicts examples of perceptual
groupings which are of interest to us, and considered
to be the result of a pre-attentive process. Such processes
are known to take several hundred milliseconds
(200-500 ms) to complete, and are thus not likely to utilize
any high-level reasoning mechanism in the brain
[Boff et al. 1986].

The  circle  in  the  middle  of  figure  1(a)  is  easily  dis¬ 
tinguishable  from  its  noisy  background.  Furthermore, 
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we  tend  to  fill  the  gaps,  or  more  precisely,  we  are  able  to 
complete  the  circle  mentally.  The  same  holds  for  the 
geometrical  patterns  in  1(b).  Figure  1(c)  depicts  a  dot 
formation.  Again,  grouping  of  certain  dots  is  possible, 
and  salient  curves  are  noticeable.  A  more  striking  exam¬ 
ple  of  illusory  contours  is  found  in  the  Kanizsa  illusion 
[Kanizsa  1976]  shown  in  Figure  1(d).  Here  we  perceive 
edges  which  have  no  physical  support  whatsoever  in  the 
original  signal. 

The Gestalt psychologists (e.g. [Boff et al. 1986])
were among the first to address the issues of pre-attentive
perception. Many ‘laws of grouping’ were formulated,
but none put in any computational (or algorithmic)
language. Furthermore, the rules tend to supply
conflicting  explanations  to  many  stimuli.  This  makes 
the  computational  implementation  of  such  laws  non¬ 
trivial.  With  input  in  the  form  of  edges,  the  laws  most 
relevant  to  our  work  relate  to  proximity  and  good  con¬ 
tinuation. 

Lowe  [Lowe  1987]  discusses  the  Gestalt  notions  of 
co-linearity, co-curvilinearity and simplicity as important
in perceptual grouping. Ahuja and Tuceryan [Ahuja
and Tuceryan 1989] suggest methods for clustering and
grouping  sets  of  points  having  an  underlying  perceptual 
pattern. 

Dolan and Weiss [Dolan and Weiss 1989] demonstrate
a hierarchical approach to grouping relying on
compatibility measures such as proximity and good
continuation. Mohan and Nevatia [Mohan and Nevatia


Figure 1 (a) & (b) Two instances of perceptual arrangements.
(c) A dot formation. (d) The Kanizsa
Square.


1989a]  assume  a-priori  knowledge  of  the  contents  of  the 
scene  (i.e.  aerial  images).  A  model  of  the  desired  fea¬ 
tures  is  then  defined,  and  groupings  are  performed  ac¬ 
cording  to  that  model.  In  a  later  work  [Mohan  and 
Nevatia 1989b], groupings based explicitly on symmetries
are suggested, but the first connectivity steps are
performed  locally. 

Sha’ashua and Ullman [Sha’ashua and Ullman
1988] suggest the use of a saliency measure to guide the
grouping process, and to eliminate erroneous features in
the image. The scheme prefers long curves with low total
curvature, and does so by using an incremental optimization
scheme (similar to dynamic programming).

The main features of some of the more important
works are summarized in table 1, and contrasted with
our scheme. It is interesting to note that virtually all proposed
algorithms use local operators to infer more global
structures. Also note that many of the schemes are
iterative, relying on one relaxation (or minimization)
scheme or another, and are similar in that sense. The
main differences are in the choice of the compatibility
measures or the function to minimize.

2  Overview  of  Our  Approach 

As  was  demonstrated  before,  the  physical  evidence 
extracted locally from images (e.g. through edge detectors)
is in many cases ambiguous and does not fully correspond
to the human perception of the image. It is thus
necessary,  we  believe,  to  impose  global  perceptual  con¬ 
siderations  at  the  low-level  process. 

In our method, each site (pixel or other cell) collects
votes from every segment in the image. These votes
contain orientation and strength information preferred
by the voting segment. A measure of ‘agreement’ (in
terms of orientation) is now computed, and sites which
have high agreement values are considered salient. In
more technical terms, a vector field¹ is generated by
each segment, and a function over the whole space determines
points of saliency. A subsequent step links areas
of high saliency to produce a description in terms of
curves and junctions.

Our voting scheme is somewhat related to the
Hough transform approach [Hough 1962], but can detect
shapes defined by their properties (smoothness etc.)
rather than by their exact shape (lines, circles, etc.). A
study of the relations between our scheme and the
Hough transform is given in section 11.

The  process  is  likely  to  produce  features  more  sim¬ 
ilar  to  what  we  perceive,  both  in  terms  of  saliency  and 
connectivity.  Also,  since  noise  is  not  likely  to  produce 
high  ‘agreement’  values,  it  becomes  attenuated  and  thus 
reduces  the  complexity  of  the  image  (e.g.  in  terms  of  the 
number  of  useful  edges). 


1. Which we later call the Extension Field.
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Table  1:  Comparison  of  different  grouping  techniques 


Technique | [Lowe 1987] | [Ahuja and Tuceryan 1989] | [Dolan and Weiss 1989] | [Mohan and Nevatia 1989b] | [Sha’ashua and Ullman 1988] | Our scheme
----------------------------------------------------------------------------------------------------------------------------------------------------
Operator | Local | Local | Local | Local | Extensible (local) | Global
Primitives | Straight lines | Dots | Straights | Curves | Straight lines and curves | Dots, lines and curves
Control | One pass | Iterative | Iterative | Relaxation, progressive | Parallel-progressive (iterative) | One-pass convolution
Noise immunity | Not clear | Good | Good | Good | Moderate | Very good
Scale | One | One | Hierarchy | One | One | One
Parameters | Yes | No | Yes | Yes | Yes | None
Pre-attentive (domain free) | Yes | Yes | Yes | No (yes) | Yes | Yes
Special feature | First | Dot clustering, parameter free | Multi-resolution | Symmetry, high-level con. | Saliency map | Saliency map, unified, parameter-free
Sensitive computations | Yes | No | Yes? | Yes | Yes | None

2.1  Model  of  the  Input 

We would like to associate with each site of an image
a direction, strength and a degree of uncertainty for
that direction. This can be nicely approximated by an ellipse
(as illustrated in figure 2). With such an input model,
one site could be classified as being a part of a curve
with known orientation and no uncertainty, while another
as being a point with completely uncertain orientation.

Figure 2 A simple input site model. Every site is
associated with a preferred direction, strength
and eccentricity (or uncertainty).


We  can  thus  use  as  input  either  a  thresholded  output 
of any edge detector (with no linking) or even an unthresholded
version of the edge detector output. It can
be verified experimentally that our system yields almost
the same results with different choices of this threshold,
as long as a sufficient number of useful features are
present [Guy and Medioni 1992b].


Uncertainty with respect to orientation is inherent in
the process of edge detection since, in many cases, a discrete
set of oriented masks is convolved with the image,
and the one with the largest response determines
the direction of the local edge. This value is accurate
only up to the resolution of the set of masks, and can be
used as an uncertainty measure in our scheme.

Obviously, our ability to extract useful features is
directly proportional to certainty, because the information
content is reduced when uncertainty is introduced
in the input.

2.2 Model of the Output

Our model of the output is related to Ullman and
Sha’ashua’s [Sha’ashua and Ullman 1988] in the sense
that a saliency map is first constructed from an edge image,
and higher-level features are inferred later. The saliency
map assigns a value and a direction to every position
in the image.

Ideally, such a saliency map should assign large values
of likelihood along illusory lines (as well as along
physical curves), and also specify a direction of most
probable continuation of any given segment. This will
enable us, at a later stage, to group features by following
the salient connections between the primitives. The map
should assign a value of zero to areas of no saliency.
Furthermore, curves with strong saliency should be assigned
larger values than weaker curves.

2.3 Rationale for the Extension Field


In  order  to  define  saliency  qualitatively,  we  start  by 
writing  down  the  major  constraints  which  govern  our 
mechanisms  of  saliency  (at  least  according  to  the  theory 
of  the  Gestalt). 


2.3.1 The perceptual constraints

Our  underlying  goal  is  to  keep  the  interpretation  as 
simple  as  possible  in  the  'Gestalt'  sense.  This  translates 
into four major constraints:

1) Co-curvilinearity - In the absence of other cues, smooth
continuation is the only interpretation, and so is
co-curvilinearity.

2) Constancy of curvature - We tend to extend a curve
of some constant curvature with the same curvature,
keeping the interpretation as simple and regular as
possible, yet consistent with our sensory information.
This principle is called Prägnanz by Gestaltists
(see Figure 3).


Figure 3 An obscured figure (a) triggers the perception
of simple shapes (b), instead of the more
complex (c) and (d).


3)  Favoring  low  curvatures  over  large  ones  -  Humans 
seem  to  connect  fragmented  line  segments  in  a  way 
that  the  increase  in  total  curvature  is  minimum  (see 
[Sha’ashua  and  Ullman  1988]). 

4) Proximity - Closer segments influence each other
more than distant ones.


With  that  in  mind,  we  have  devised  a  technique  that 
implicitly imposes the above constraints in the form of
an  Extension  Field  emanating  from  each  edge  segment, 
as  discussed  in  the  next  few  sub-sections. 


2.3.2 Extending a Curve

Given  a  line  segment  we  ask  the  question:  What  is 
the  shape  of  the  most  ‘natural’  extension,  based  on  the 
mentioned  constraints? 


Figure 4 What is the shape of the most ‘natural’ extension
to a given curve?


Many approaches in the past (e.g. [Dolan and Weiss
1989], [Lowe 1987]) used the tangent of the end-point to
determine  the  best  extension.  This  approach  cannot  al¬ 
ways  work  properly  for  three  reasons: 

1)  The  tangent  is  very  sensitive  to  noise  and  may  intro¬ 
duce  large  errors. 

2)  The  end-point  may  not  be  determined  uniquely  if 
the  curve  in  question  is  fragmented. 

3)  The  extension  can  only  be  a  straight  line  (first  or¬ 
der),  thus  not  taking  into  consideration  the  global 
shape  of  the  curve. 


We would thus want to consider the whole curve in
such a way that the extension is smooth, influenced
most by the behavior of close-by points of the curve, but
can assume any form. Note that such an extension can be
constructed even if the curve is fragmented, and is robust,
being the ‘average’ of many segments.

We go further than that. Rather than having only the
best extension, we would like to list all possible extensions
in the order of their likelihood. This suggests the
use of some kind of a field (or flow) emanating from the
end-point of a segment. The idea of a field plays a major
role in our scheme, as we will show later.

2.3.3 Best Connection between Two Line Segments

The situation occurs whenever a ‘compatibility figure’²
of two close-by segments is computed in order to
decide  whether  these  segments  should  be  grouped  to¬ 
gether.  When  determining  the  compatibility  of  two  lines 
we  would  like  to  consider  for  each  line  its  best  extension 
(or  extension  field,  as  discussed  in  3.1)  and  arrive  at 
some  compromise  as  to  the  best  path  between  them.  In 


Figure 5 What is the best path between these two
curves? 

other  words,  each  of  the  curves  votes  (using  its  field)  for 
a  family  of  curves.  If  the  curves  should  really  be  con¬ 
nected  then  some  extension  from  curve  1  would  align 
with  another  extension  of  curve  2,  to  form  the  compro¬ 
mise.  The  idea  of  a  compromise  between  extension 
fields  is  also  central  to  the  approach,  and  will  be  pre¬ 
sented  in  a  more  formal  way  later. 

It  is  not  reasonable  to  expect  that  extension  curves 
from  two  different  extension  fields  will  align  through¬ 
out  their  extent.  It  is  more  likely  that  such  extensions 
align  locally  in  many  places.  For  that  reason,  the  exten¬ 
sion  field  will  consist  of  local  best  candidates  for  exten¬ 
sions.  In  the  next  section  we  define  the  exact  shape  and 
usage  of  the  Extension  Field. 

3 Extension, Point, and Intermediary Fields
3.1  The  Extension  Field 

Definition:  An  Extension  Field  is  a  non-normal- 
ized  probability  directional  vector  field  describing  the 
contribution  of  a  single  unit-length  edge  element  to  its 
environment in terms of length and direction.

In  other  words,  it  votes  on  the  preferred  direction 


2.  A  measure  of  'agreement'  between  two  curves,  based  on  the  dif¬ 
ference  of  the  end-point  tangents  and  separation,  (as  used  by  [Dolan 
and Weiss 1989], [Lowe 1987], [Mohan and Nevatia 1989a]).

3.  Which  is  not  necessarily  the  best  continuation  of  any  of  the  exist¬ 
ing  curves,  or  even  a  connection  between  the  given  end-points. 
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and  the  likelihood  of  existence  of  every  point  in  space 
to  share  a  curve  with  the  original  segment.  The  field  is 
of  infinite  extent,  although  in  practice  it  disappears  at  a 
predefined distance from the edge. Figure 1 depicts the
Extension  Field. 

3.2 Design of the Extension Field (Orientation
and strength)

Since  we  favor  small  and  constant  curvature,  field 
direction  at  a  given  point  in  space  is  chosen  to  be  tan¬ 
gent  to  the  osculating  circle  passing  through  the  edge 
segment  and  that  point,  while  its  strength  is  proportional 

Chosen  orientation 


Figure  6  Assigning  a  direction  for  every  point  in 
space 


to  the  radius  of  that  circle.  Also,  the  strength  decays 
with  the  distance  from  the  origin  (the  edge  segment). 
The  choice  of  a  circular  extension  agrees  with  the  con¬ 
straint  of  smallest  total  curvature. 
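
The geometry described above can be sketched in a few lines. The sketch assumes a frame that the text does not fix: the voting edge element sits at the origin and points along the x axis. The function name `extension_vote` is hypothetical. It returns only the voted orientation and the radius of the connecting circle, and it casts no vote beyond the two main diagonals, following the restriction the paper states for the field. The strength values, which the paper obtains from a separate calibration, are not reproduced here.

```python
import math

def extension_vote(x, y):
    """Orientation voted at (x, y) by a unit edge at the origin along +x.

    Returns (orientation, radius), with orientation in [0, pi), or None when
    the point lies beyond the two main diagonals (more than 90 degrees along
    the connecting circle), where the field is zero.
    """
    if abs(y) > abs(x):
        return None
    # Angle of the chord from the edge element to the point.
    phi = math.atan2(y, x)
    # A circle tangent to the x axis at the origin reaches the point with a
    # tangent rotated by twice the chord angle; orientations are modulo pi.
    orientation = (2.0 * phi) % math.pi
    # Radius of that circle; a point on the axis gives a straight line.
    radius = math.inf if y == 0 else (x * x + y * y) / (2.0 * abs(y))
    return orientation, radius
```

A point straight ahead of the element, such as (1, 0), receives a vote parallel to the element with an infinite radius, while a point on the diagonal, such as (1, 1), receives a vote perpendicular to it.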

In  trying  to  computationally  evaluate  the  various 
constraints  over  a  given  curve  we  find  that  a  (somewhat 


Edge  element 
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Figure  1  The  basic  Extension  Field,  (a)  Direction, 
and  (b)  Strength. 


revised⁴) measure of Total Curvature (as used by
[Sha’ashua and Ullman 1988]) encompasses most of the
desired constraints. We define the Total Curvature (TC)
to be:

TC = \int_{\gamma} \left| \frac{d\theta}{ds} \right|^{\alpha} ds    (1)

This is the integral of the curvature (brought to
some power) along the curve, where θ is the tangent
along the curve γ. The variable α is traditionally taken
to be equal to 2, but it can be shown that the choice of a
circle as the connecting curve in the scenario shown in
Figure 6 minimizes the TC for all values of α greater
than 2 [Guy and Medioni 1992b]. A larger α would just
penalize sharp turns more than a smaller one. We thus
set the orientations of the field elements to be tangent to
a circular arc connecting the origin and that point in
space.

Figure 7 Five arrangements of two segments, each
with a different separation angle. Angles much
smaller than 90 degrees suggest a corner, while
angles much larger than 90 degrees suggest a
smooth connection.

Although different grouping laws compete, we
claim that by finding the correct weighting values
once, and given a sufficient amount of useful data⁵,
we can resolve most conflicts. The best way to determine
these values is by considering an intentionally ambiguous
or undecidable case. The assignment of actual
probabilities to the field is thus performed as follows:
We consider two short edge segments, perpendicular to
each other and apart⁶ (see center of Figure 7). This scenario,
we claim, is a middle point between a clear choice
of a connection by a sharp junction and a connection by
a smooth curve. We regard this to be a most competitive
scenario in terms of grouping of the two line segments.
We thus assign probabilities to the field elements in such
a way that all paths connecting these segments are assigned
roughly the same saliency, and there does not exist
any single best path between the two. More precisely,
we set the field element strengths such that all values
within the marked triangular region are the same. Such


4. In [Sha’ashua and Ullman 1988], α=2 is used, and the absolute
value is redundant.

5. A more precise definition of 'sufficient' is given in section 10.5,
while addressing the issue of noise.

6.  This  scenario  is  termed  'the  maximum  undecidability  arrange¬ 
ment'. 
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a  scenario,  when  repeated  for  all  scales,  removes  all  de¬ 
grees  of  freedom  as  to  the  choice  of  values  for  the  field. 

The  reason  all  values  beyond  the  two  main  diago¬ 
nals  are  zeros,  is  a  technical  one.  Having  a  segment  vote 
for  a  point  in  space  which  is  more  than  90  degrees  away 
(along  a  circle)  could  potentially  cause  unrelated  seg¬ 
ments  to  vote  for  the  same  curve,  even  though  such  a 
curve  should  not  connect  them.  Also,  it  is  quite  obvious 
that  extending  a  curve  beyond  the  90  degree  point  does 
not  satisfy  the  minimum  curvature  constraint. 

3.3  The  Point  Field 

A dot image is a degenerate case of an edge image,
where the edges have no direction. Such maximum
uncertainty⁷ in the input fits our input model well, and
allows  us  to  handle  such  cases  in  a  uniform  way.  Obvi¬ 
ously,  perception  is  weakened  by  the  loss  of  orientation 
data,  and  we  are  only  able  to  handle  cases  with  a  mod¬ 
erate  amount  of  noise.  The  only  applicable  perceptual 
law  is  that  of  proximity. 

A  suitable  field  must  have  circular  symmetry,  and  in 
practice is constructed by convolving our original extension
field with a ‘multi-directional’ edge segment. A
typical input is shown in Figure 1(c), where a broken
sine  wave  and  a  random  set  of  points  on  a  circle  are  em¬ 
bedded  in  noise. 

Another  scenario  where  the  point  field  could  be  use¬ 
ful  is  in  images  where  co-curvilinear  formations  be¬ 
tween  features  other  than  dots  are  present.  In  such  cases 
we  would  like  to  treat  the  image  as  if  it  was  made  out  of 
non-directional  tokens  (or  dots),  and  apply  the  point 
field  to  it. 

We  unify  the  maximum  certainty  (Extension)  field 
and  the  Point  field  (and  all  fields  in  between)  by  consid¬ 
ering  a  continuum  of  eccentricities  associated  with  the 
multi-directional edge. The Extension field is constructed
from a degenerate instance of an ellipse (a line),
while the Point field is created from a ‘circular’ edge
point. This makes full use of our input model (in section
2.1) and allows treatment of mixed images⁸ in a consistent
and unified way.

4  Computation  of  the  Saliency  Map 
4.1  Directional  Convolution 

The process of computing the saliency map can be
thought of as a directional convolution with the above
field (mask). The resulting map is then a function of a
collection of fields, each oriented along a corresponding
short segment. Each site accumulates the ‘votes’ for its
own preferred direction and strength from every other
site in the image. These values are combined at a site as
described next. When long curves are present in the input,
they are ‘cut’ into unit length segments and each
convolved with the field. We will discuss (in section 6)
the properties of the resulting field when applied to a
long curve (either straight or curved). The strength of
the field is proportional to the strength⁹ of the segment,
so that stronger segments have stronger votes, throughout
the space.

7. w.r.t. orientation

8. i.e. having varying certainty measures (or a mixture of dots and
lines)

Note that, although the process is local in essence,
the fields impose some global order, and one line segment
can implicitly ‘vote’ for a large curve without any
explicit global reasoning involved.
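
A rough sketch of this accumulation is given below. It is not the implementation used in the paper: the Gaussian decay with width `sigma` is an assumed stand-in for the calibrated field strengths, the names `accumulate_votes` and `eigenvalue_gap` are hypothetical, and angles are measured from the column axis toward the row axis. Each segment casts, at every site inside its two main diagonals, a vote tangent to the circular arc joining segment and site, and each site keeps only the second-order moments of the votes it receives.

```python
import math
import numpy as np

def accumulate_votes(segments, shape, sigma=5.0):
    """Accumulate second-order moments of the votes at every site.

    segments: iterable of (row, col, angle, strength) unit edge elements.
    shape:    (rows, cols) of the output maps.
    sigma:    width of the ASSUMED Gaussian decay (not from the paper).
    """
    rows, cols = shape
    m20 = np.zeros(shape)
    m11 = np.zeros(shape)
    m02 = np.zeros(shape)
    for seg_r, seg_c, angle, strength in segments:
        ca, sa = math.cos(angle), math.sin(angle)
        for r in range(rows):
            for c in range(cols):
                dx, dy = c - seg_c, r - seg_r
                # Site position in the frame aligned with the segment.
                lx = ca * dx + sa * dy
                ly = -sa * dx + ca * dy
                if abs(ly) > abs(lx):
                    continue  # beyond the diagonals: no vote
                # Tangent of the circular arc joining segment and site.
                vote_angle = angle + 2.0 * math.atan2(ly, lx)
                a = strength * math.exp(-(dx * dx + dy * dy) / (2.0 * sigma ** 2))
                vx, vy = a * math.cos(vote_angle), a * math.sin(vote_angle)
                m20[r, c] += vx * vx
                m11[r, c] += vx * vy
                m02[r, c] += vy * vy
    return m20, m11, m02

def eigenvalue_gap(m20, m11, m02):
    """Difference of the two eigenvalues of the 2x2 moment matrix per site."""
    return np.sqrt((m20 - m02) ** 2 + 4.0 * m11 ** 2)
```

With two collinear horizontal segments separated by a gap, the site in the middle of the gap receives two agreeing votes and ends up with a larger eigenvalue difference than a site off to the side. The cost is one pass over the sites per segment, in line with the linear dependence on the number of edges claimed in the abstract.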

4.2  Combination  at  each  Site 

Ideally,  we  would  want  an  averaged  majority  vote 
regarding  the  preferred  orientation  of  a  given  position. 
In  practice,  we  treat  the  contributions  to  a  site  as  being 
vector  weights,  and  compute  moments  of  the  resulting 
system.  Such  a  physical  model  behaves  in  the  desired 
way, giving both the preferred direction and some measure
of the agreement. We use the direction of the principal
axis (EVmin) of that physical model as the chosen
orientation (See equation (2)).


This  acts  as  an  approximation  to  the  desired  major¬ 
ity  vote,  without  the  need  to  consider  the  individual 
votes, but rather the statistics of the ensemble.


Figure  8  The  principal  axis  of  the  votes  collected  at  a 
site  is  taken  as  an  approximation  of  the  preferred 
direction. 


The saliency map strength values are taken as the
values of the corresponding λmax at each site. So, large
values would indicate that a curve is likely to pass
through this point. This map can be further enhanced (as
shown later) by considering the eccentricity, or
1 - (λmin/λmax). When that value is multiplied by the
previous saliency map we achieve better selectivity, and
only curves are highlighted. This results in a map defined
by λmax - λmin.
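
The combination step at a single site can be illustrated as follows. This is a minimal sketch under an assumption: the votes are summed into a 2x2 matrix of second-order moments, and the dominant orientation is read from the eigenvector of its largest eigenvalue. The paper refers to the principal axis (EVmin) of its physical model; how that axis maps onto this matrix is assumed here, not taken from the text. The name `combine_votes` is hypothetical.

```python
import numpy as np

def combine_votes(votes):
    """Combine (strength, angle) votes collected at one site.

    Returns (lam_max, lam_min, orientation), with orientation in [0, pi).
    """
    m = np.zeros((2, 2))
    for a, theta in votes:
        v = np.array([a * np.cos(theta), a * np.sin(theta)])
        m += np.outer(v, v)  # accumulate second-order moments
    # eigh returns eigenvalues in ascending order for a symmetric matrix.
    evals, evecs = np.linalg.eigh(m)
    lam_min, lam_max = float(evals[0]), float(evals[1])
    major = evecs[:, 1]  # eigenvector of the largest eigenvalue
    orientation = float(np.arctan2(major[1], major[0]) % np.pi)
    return lam_max, lam_min, orientation
```

Two votes of strength 1 and 2 that agree on the same angle give λmax = 5 and λmin = 0, whereas two equal votes at right angles give λmax = λmin, so the difference λmax - λmin vanishes.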

4.3  Justification 

Basically, what we are looking for is a function that
accepts  positive  vectors  as  input  and  results  in  a  mea¬ 
sure  of  the  agreement  in  their  orientation.  The  result 
should  satisfy  several  criteria: 


9. in terms of contrast.
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•  We  want  the  result  to  be  normalizable,  so  that  we  can 
compare  different  sites  on  a  standard  scale. 

•  The  measure  needs  to  be  monotonically  increasing 
with  the  addition  of  positive  contributions. 

•  It  should  give  higher  values  to  ‘better’  (more  direct¬ 
ed)  spatial  arrangements  of  vectors. 

• We want the effect of proximity to be independent of
the effect of agreement.

It is easy to show how the model behaves when a
single vector is added to it. Assume the variance-covariance
matrix is as follows at state t:

C^{t} = \begin{bmatrix} m_{20}^{t} & m_{11}^{t} \\ m_{11}^{t} & m_{02}^{t} \end{bmatrix}    (3)

The sum of the eigenvalues is the trace of the matrix:

\lambda_{min}^{t} + \lambda_{max}^{t} = m_{20}^{t} + m_{02}^{t}    (4)

Now adding a new vector v = [a cos θ, a sin θ]^T to
the system will result in a new state t+1:

\lambda_{min}^{t+1} + \lambda_{max}^{t+1} = m_{20}^{t} + (a\cos\theta)^2 + m_{02}^{t} + (a\sin\theta)^2 = m_{20}^{t} + m_{02}^{t} + a^2    (5)

Note that the angle θ has disappeared on the r.h.s. of
(5). This means that the sum of eigenvalues is independent
of the orientations of the voting vectors and can
hence be used as an indicator of proximity (a wider
sense of proximity of course), and as a primitive saliency
measure.

Equation (5) can obviously be written as:

\lambda_{min} + \lambda_{max} = \sum_{i=1}^{N} a_i^2    (6)

where N is the number of segments in the original image.
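
The independence of the eigenvalue sum from the vote orientations is easy to check numerically. The snippet below is only an illustrative check, with a hypothetical helper `eigenvalue_sum` and arbitrary example strengths and angles; it builds the moment matrix for the same strengths under two different sets of angles and compares the sum of eigenvalues with the sum of squared strengths.

```python
import numpy as np

def eigenvalue_sum(strengths, angles):
    """Sum of the two eigenvalues of the accumulated 2x2 moment matrix."""
    m = np.zeros((2, 2))
    for a, theta in zip(strengths, angles):
        v = np.array([a * np.cos(theta), a * np.sin(theta)])
        m += np.outer(v, v)
    return float(np.linalg.eigvalsh(m).sum())

strengths = [1.0, 2.0, 0.5]
aligned = eigenvalue_sum(strengths, [0.2, 0.2, 0.2])
scattered = eigenvalue_sum(strengths, [0.0, 1.1, 2.5])
expected = sum(a * a for a in strengths)  # 1 + 4 + 0.25
```

Both `aligned` and `scattered` agree with `expected` up to rounding, whatever angles are chosen.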

We define the eccentricity E = 1 - λmin/λmax as a
measure of agreement. Obviously this value is between
0 and 1 (see footnote 10). Our intuitive notion of ‘agreement’, or of a
majority vote on a continuous scale, is consistent with
the above definition. This means that in all cases where
we feel that collection A has better ‘agreement’ than
collection B, the corresponding eccentricity values will
share the same relationship (i.e. E(A) > E(B)). This is not
to say that both functions are equal, but merely that both
are monotonic.

Eccentricity values by themselves cannot perform
as saliency measures since sites with very little voting
strength can produce high eccentricity values. In fact,
consider a site far away from where the ‘action’ is,
which accepts exactly one vote (this can happen in
practice). The eccentricity value is 1, but the site is of no
importance.

10. Since λmin ≤ λmax and they are both non-negative.

However, consider λmax itself. Obviously,

\frac{\lambda_{min} + \lambda_{max}}{2} \le \lambda_{max} \le \lambda_{min} + \lambda_{max}    (7)

By (7) it is bounded from both sides by the proximity
measure in (6) and has the eccentricity coded into it.
When the value leans towards the left side of (7), eccentricity
is low and vice-versa.

Thus, λmax is chosen as the raw saliency measure
in our scheme.


This choice, however, may still amplify locations
which are very strong in terms of number of votes, but
weak in eccentricity. The product of E and λmax produces
the desired result, termed the enhanced saliency
measure SM, or:

SM = \lambda_{max} (1 - \lambda_{min}/\lambda_{max}) = \lambda_{max} - \lambda_{min}    (8)

Thus, λmax - λmin is chosen as the enhanced saliency
measure.

It  is  important  to  note  that  other  functions  of  the 
eigenvalues  can  also  satisfy  the  same  conditions  of 
monotonicity,  but  the  ones  chosen  seem  to  be  the  sim¬ 
plest  possible  indicators  of  the  desired  behavior. 
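
To make the difference between the three measures concrete, the sketch below evaluates them for three made-up eigenvalue pairs. The numbers are illustrative only and do not come from the paper; the function name `saliency_measures` and the guard for a site with no votes are additions for the example.

```python
def saliency_measures(lam_max, lam_min):
    """Return (eccentricity, raw_saliency, enhanced_saliency) for one site."""
    # A site with no votes has no defined eccentricity; 0.0 is an assumed
    # convention for this sketch.
    eccentricity = 1.0 - lam_min / lam_max if lam_max > 0 else 0.0
    return eccentricity, lam_max, lam_max - lam_min

examples = {
    "one weak vote":            saliency_measures(0.01, 0.0),
    "many agreeing votes":      saliency_measures(10.0, 0.5),
    "many disagreeing votes":   saliency_measures(10.0, 9.0),
}
```

The single weak vote has eccentricity 1 but negligible raw and enhanced saliency. The two strongly voted sites share the same raw saliency of 10, and only the enhanced measure separates them (9.5 against 1).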


5  Detection  of  Junctions 

A  junction  is  defined  as  a  salient  point  having  a  low 
eccentricity  value. 

Regular  (non-junction)  points  along  a  curve  are  ex¬ 
pected  to  have  high  eccentricity  values.  On  the  other 
hand,  junction  points  are  expected  to  have  low  eccen¬ 
tricity,  since  votes  were  accumulated  from  several  dif¬ 
ferent  directions.  By  combining  the  eccentricity  and  the 
eigenvalue  at  a  point,  we  acquire  a  continuous  measure 
of the likelihood of that site being a junction. We redefine
our previous definition of eccentricity slightly, so
that low eccentricity scores high:

E = \lambda_{min}/\lambda_{max} \Rightarrow 0 \le E \le 1    (9)

The  product  of  our  new  eccentricity  measure  and 
the  raw  saliency  measure  \nax  yields  the  junction  sa¬ 
liency  operator: 


E  X  =  (X  .  /X  )  X  =  X  . 

mix  '  mm  mix'  mix  mm 


(10) 


This process creates a Junction Saliency map. Interestingly
enough, this map evaluates to just $\lambda_{min}$ at every
site (as shown in (10)), which simply means that the
largest non-eccentric sites are good candidates for junctions.
By finding all local maxima of the junction map
we localize junctions (see results in Figure 15).
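The localization step can be sketched in Python as follows. The sketch assumes the junction map is available as a list of rows of $\lambda_{min}$ values, uses an 8-neighbourhood, and requires a strict maximum with a positive value; these choices, and the function name `local_maxima`, are assumptions made for illustration rather than details given in the paper.

```python
def local_maxima(junction_map):
    """Return (row, col) sites that strictly exceed all of their 8 neighbours."""
    rows, cols = len(junction_map), len(junction_map[0])
    peaks = []
    for r in range(rows):
        for c in range(cols):
            value = junction_map[r][c]
            if value <= 0:
                continue  # sites without support cannot be junctions
            neighbours = [
                junction_map[rr][cc]
                for rr in range(max(0, r - 1), min(rows, r + 2))
                for cc in range(max(0, c - 1), min(cols, c + 2))
                if (rr, cc) != (r, c)
            ]
            if all(value > n for n in neighbours):
                peaks.append((r, c))
    return peaks

# Toy lambda_min map: the weaker response at (1, 1) is suppressed
# by its stronger neighbour at (2, 2).
lam_min = [
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.5, 0.0],
    [0.0, 0.5, 3.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]
print(local_maxima(lam_min))
```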


11. For example, accumulation points and junctions (where
$\lambda_{min} \approx \lambda_{max}$).
6 Properties of the extension field

6.1  A  longer  line  implies  a  stronger  and  more 
directed  field,  but  up  to  a  point. 

Using a simple example we demonstrate the behavior
of the field when extending a straight line. Figure 9
shows a cross-section of a saliency map computed on a
series of straight lines with increasing lengths. Clearly,
the saliency grows as a function of the length, and the
map becomes more directed (thinner ridge). Also, saliency
converges to some finite value which is just the
infinite integral along the main axis of the Extension
Field. This observation can be used to estimate absolute
saliency.

Figure 9 Saliency of a line as a function of length. It
converges to some value which is the infinite straight
integral of the extension field.
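The convergence can be checked numerically under an assumed decay law. The sketch below supposes, only for illustration, that the field strength along the main axis falls off as exp(-(x/sigma)^2); the actual field profile is defined elsewhere in the paper, and `midpoint_saliency` and `sigma` are hypothetical names.

```python
import math

def midpoint_saliency(length, sigma=10.0):
    """Integral of exp(-(x/sigma)**2) over [-length/2, length/2].

    Models the support gathered at the midpoint of a straight line
    of the given length under the assumed Gaussian decay.
    """
    return sigma * math.sqrt(math.pi) * math.erf(length / (2.0 * sigma))

limit = 10.0 * math.sqrt(math.pi)  # value of the infinite integral for sigma = 10
for length in (5, 10, 20, 40, 80):
    print(length, midpoint_saliency(length), limit)
```

Under this assumption the saliency increases with length and saturates at the value of the infinite integral, which is the behavior described for Figure 9.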

6.2 The Non-maximum Suppression Phenomenon

We have mentioned the superior selectivity of the
Enhanced Saliency map. To illustrate this behavior, we
look at the eccentricity-only map of a straight line
(Figure 10(a)). Note how low the eccentricity is close to
where the real line passes. In Figure 10(b) we see the
raw saliency map of the same line. The Enhanced Saliency
map is simply the product of the two maps, point
by point, and will obviously sharpen the edges of the
correct curves, thus creating a non-maximum suppression
effect (Figure 10(c)). Similar maps for a perfect circle
are depicted in Figure 10(d)-(f).

Figure 10 (a) Eccentricity-only map of a saliency map
of a straight line. (b) Raw saliency map. (c) Product
of (a) and (b). (d)-(f) Same for a perfect circle.

The low eccentricity in the vicinity of the correct
curve is due to the large variance of votes in these areas.
These votes along the correct curve, when convolved,
all vote for all sites, thus making the sites almost
non-directional.

Contamination 

It  may  look  at  first  sight  that  the  field,  while  voting 
for the correct curve, contaminates the environment by
voting  for  many  other  cells  in  the  image.  This  can  be  re¬ 
garded  as  noise,  and  is  inherent  to  the  process.  However, 
while  the  fields  voting  for  a  curve  agree  along  the  curve, 
they  disagree  in  any  other  area  of  the  image.  This  means 
that  the  contribution  of  a  complete  curve  to  the  environ¬ 
ment  is  almost  isotropic  and  cannot  affect  the  direction 
of  true  votes  in  any  area.  It  has  been  shown  in  the  previ¬ 
ous  section  that  close-by  areas  get  low  agreement  val¬ 
ues,  and  far  away  areas  get  low  voting  weight.  In  both 
cases,  the  interference  is  small. 

This is true for all smooth curves in the image. Random
segments¹² do contaminate the environment, and
their effect is reduced through the robustness of the voting
scheme. In the experimental results section we
empirically test for allowable levels of noise.

7  Special  Purpose  Fields:  The  Straight  Field 

Special  purpose  fields  are  fields  synthesized  to  en¬ 
hance  a  special  feature  in  an  image.  For  example,  in 
aerial  images,  a  desired  feature  could  be  rectangular 
rooftops.  It  is  possible  to  construct  a  saliency  operator 
to  enhance  all  straight  line  formations  in  the  image,  and 
at  the  same  time  suppress  other  smooth  curves. 

We now derive the shape of a field capable of finding
all straight, or almost straight, line formations in a
cluttered directional edge image. This can be done in
two ways.

The first method would just use the part of our original
Extension Field that applies to straight lines. This
simply means cutting off all field elements that are outside
some pre-defined angle. The actual angle is of
course a function of the amount of error one wishes to
allow for a straight line.
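The first method can be sketched in a few lines of Python. The sketch assumes the field is stored as a mapping from integer offsets (x, y), relative to an edgel at the origin oriented along the x-axis, to a strength value; this storage layout and the name `straight_field` are assumptions made for the example.

```python
import math

def straight_field(field, max_angle_deg):
    """Keep only field elements within max_angle_deg of the main (x) axis.

    Both the forward and the backward direction along the axis are kept.
    """
    limit = math.radians(max_angle_deg)
    kept = {}
    for (x, y), strength in field.items():
        if x == 0 and y == 0:
            kept[(x, y)] = strength  # the edgel's own site
            continue
        # Angle between the offset and the axis: 0 on the axis, pi/2 across it.
        angle = math.atan2(abs(y), abs(x))
        if angle <= limit:
            kept[(x, y)] = strength
    return kept

field = {(5, 0): 1.0, (5, 1): 0.8, (3, 3): 0.5, (-4, 0): 0.9, (0, 2): 0.1}
print(sorted(straight_field(field, 15)))
```

A smaller cutoff angle tolerates less deviation from a straight line, at the cost of discarding support from gently curved continuations.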

A second and better motivated method would be to
generate a new field (mask) by convolving with a piece
of a straight line. This last method is more in the spirit
of our previous special fields, and also retains the notions
of feature fuzziness and continuum of saliency
values. Again, the length of the straight line with which
we convolve will determine the amount of error allowed
in our definition of straightness.


12. Segments that should not belong to any curve.


8  Analysis  of  the  End-point  field 

In many cases we tend to interpret end-points as
belonging to a partially occluded line. If enough end-points
are available, they support the hypothesis of a shape
occluding a collection of lines. Illusory edges appear to
outline the said occluding shape. Figure 11 illustrates an
end-point scenario.


Figure 11 An end-point formation. (a) A center egg-like
shape is not only perceived but also looks
whiter (after [Boff, et al. 1986]). (b) An invisible
circle occludes lines. No sensation of a circle is
evident, because angles of intersection are not
suitable. (c) The inner circle is perceived, but the
outer one is not!


8.1  Straight  angles  in  T-Junctions  are  more 
likely  than  any  other 

We derive the distribution of T-junction angles in a
random world and show that close-to-straight T-junction
angles are more likely to appear than any other angle.
We claim that our human perception uses this property,
and thus attempts to perform perceptual tasks on
end-point stimuli only when the illusory intersection
angles justify it. In Figure 11(a) all lines meet the illusory
contour with almost straight angles, thus making the
shape 'visible'. In Figure 11(b) the lines are occluded by
an exact circle, but the angles are much more acute. No
perception of shape is evident, even though the end-points
trace an exact circle.

Claim: Given two unit size segments, we independently
drop each one of them on a finite board, with uniform
probability with respect to position and angle. We
then look at the distribution of the intersection angles
between the two segments. We claim that angles close
to 90 degrees are more likely to appear than acute
intersection angles. (We do not count cases where the
segments do not intersect.) The proof can be found in [Guy
and Medioni 1992b].
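The claim can also be explored empirically with a small Monte Carlo simulation. The Python sketch below is not the proof referenced above; it assumes that "position" means the segment's center, drawn uniformly on a square board of side 3, and that the orientation is uniform on [0, pi). The function name, board size and seed are illustrative choices.

```python
import math
import random

def sample_intersection_angles(trials, board=3.0, seed=0):
    """Drop pairs of unit segments at random; return angles (degrees) of those that cross."""
    rng = random.Random(seed)
    angles = []
    for _ in range(trials):
        # Center and orientation of each segment, drawn independently.
        px, py, pa = rng.uniform(0, board), rng.uniform(0, board), rng.uniform(0, math.pi)
        qx, qy, qa = rng.uniform(0, board), rng.uniform(0, board), rng.uniform(0, math.pi)
        ux, uy = math.cos(pa), math.sin(pa)
        vx, vy = math.cos(qa), math.sin(qa)
        cross = ux * vy - uy * vx          # sine of the angle between directions
        if abs(cross) < 1e-12:
            continue                        # parallel segments: no single crossing
        dx, dy = qx - px, qy - py
        # Solve p + t*u = q + s*v for the parameters along each segment.
        t = (dx * vy - dy * vx) / cross
        s = (dx * uy - dy * ux) / cross
        if abs(t) <= 0.5 and abs(s) <= 0.5:
            angles.append(math.degrees(math.asin(min(1.0, abs(cross)))))
    return angles

angles = sample_intersection_angles(200_000)
near_right = sum(a > 60 for a in angles) / len(angles)
acute = sum(a < 30 for a in angles) / len(angles)
print(len(angles), near_right, acute)
```

In this setup, crossings with angles above 60 degrees should be noticeably more frequent than crossings below 30 degrees, in line with the claim; for a board much larger than the segments one would expect the angle density to be roughly proportional to the sine of the angle.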

Curved objects can be approximated by linear segments
to any degree of accuracy¹³ and should thus satisfy
the same distribution. The same is true for lines of
different  lengths.  It  is  easy  to  see  that  the  length  of  the 
segment  does  not  change  the  probability  of  a  given  in¬ 
tersection  angle. 


13.  Non  fractal  curves. 


8.2  Convex  T-junctions  are  more  common 

We believe that the a priori probability of having
convex T-junctions is larger than that of concave T-junctions.
This observation can be illustrated through a simple
example, as shown in Figure 12. Many other experiments
(such as in [Boff, et al. 1986]) indicate that end-point
formations that require concave T-junctions are not
normally perceived. Also, real objects tend to have longer
convex boundaries than concave ones, and since the
probability of an intersection is proportional to boundary
length, convex T-junctions are more common. We
claim that humans use this property and tend to
hypothesize mainly convex T-junctions.

Figure 12 Convex and concave T-junctions. The
Kanizsa square has two valid 3-D interpretations,
but only one of them is perceived (b). The other one
in (a) requires us to imagine concave T-junctions.

8.3  Building  the  End-Point  Field 

The  angle  distribution  derived  previously  suggests 
convolving  our  original  Extension  field  with  a  multi-di¬ 
rectional  edge  having  a  diameter  function  of  a  sinusoid, 
as  was  done  for  the  point  field,  and  shown  in  Figure  13. 


Figure  13  A  multi-directional  edge  constructor  for 
the  End-Point  Field,  (envelope  shown) 


This  kind  of  field  will  vote  strongly  for  straight  angle  T- 
junctions,  and  weakly  for  other  directions.  This  merely 
means  that  more  support  is  needed  for  more  acute  an¬ 
gles.  Results  are  given  in  section  10. 

9  Complexity  Issues 

A naive way to implement the algorithm requires
$O(n^2 k)$ operations, where n is the side size of the image,
and k is the number of edge elements in the input
image. In practice, the local density of edgels restricts
the useful scope of the field. This means that a smaller
finite field can be used. The complexity then becomes
$O(k)$. This last modification has the disadvantage of not
being able to bridge gaps larger than the size of the field.

Alternatively, instead of computing a dense saliency
map, we can compute the saliency of existing edgels
only. This results in complexity of $O(k^2)$, and can be
useful as a focus of attention map. This mode then allows
for a second pass on the salient features only.


10  Results 


10.1  On  Synthetic  Images 


We have tested our approach with the synthetic data
shown earlier in Figure 1. The saliency map produced is
shown (strength only) as a grey-level image in Figure 14,
and the result of following the path of highest saliency
produces a "clean" circle. Figure 14 also shows the result
of the same procedure for the other scenes. Figure
15 shows an example of the steps involved in producing
a high-level description of a given image, using the
junction map in conjunction with the saliency map.

Figure 14 The saliency maps of images in Figure 1.

Figure 15 Extracting the most salient features. (a) Eccentricity
enhanced map. (b) Junction saliency map.
(c) Linking.

10.2  Real  Images 

In Figure 16 we show a real image example. The
original image was processed with a simple edge detector
(5x5 step masks), without any linking. Note that the
edge image is fragmented and has a lot of noisy segments.
Figure 16(c) shows the resulting Saliency map.

Figure 16 Example of a real image. (a) A tape dispenser
image. (b) All edges. (c) Eccentricity enhanced
saliency map.


10.3  Point  Field 

We  tested  our  system  on  the  image  in  figure  17(a). 
Initially,  the  system  was  run  using  the  Point  field.  This 
resulted  in  a  saliency  map  with  orientation  data.  A  sec¬ 
ond  phase  of  computation  was  then  performed,  using 
the  directional  Extension  field  (Figure  1).  That  stage 
produced  the  final  saliency  map  as  shown  in  figure 
17(b). 


Figure 17 (a) A non-directional input image. (b) Saliency
map, after employing the Point field and the
directional extension field.

10.4  Straight  Field 

We tested our straight line operator on the previous
example, as shown in Figure 17. Clearly the straight
line appears to be salient. The ellipse is still vaguely visible,
since it can be viewed as being straight to some degree.

Figure 18 The enhanced saliency map of the
straight field.

10.5  Noise  Breakdown  Point 

Since  we  want  our  algorithm  to  mimic  human  per¬ 
ceptual  capabilities,  we  actually  prefer  to  fail  where  hu¬ 
mans  fail,  unlike  the  Hough  transform,  for  example, 
which  may  find  line  formations  even  in  a  very  cluttered 
image,  where  such  formations  are  not  visible,  and  could 
be  a  result  of  accidental  alignments. 

We have performed a controlled experiment to determine
the amount of random noise which still allows
for a correct following of a curve, given a constant factor
of existing edge. The test is whether all points along
the perceptual circle belong to the local maxima, as defined
before. Only noise uniform in space and orientation
was applied to the original image. The circle has a
radius of 25 pixels and consists of 33 segments, which
means that about a quarter of the circle edge exists. The
percentages of noise applied are 2, 4, and 6 percent. Up
to 4 percent of noise, the circle (or large parts of it) is
recoverable. With 5 percent noise and more, the saliency
map degrades, and it is no longer possible to apply the
following algorithm. More detailed results can be found
in [Guy and Medioni 1992b].

10.6  The  End-Point  field 

We tested the end-point field with the synthetic image
in Figure 11. It is clear that the outer circle receives
relatively low saliency, and can be completely removed
if we consider single votes in the plane as wrong hypotheses
(see Figure 20).

Figure 19 Results of applying the end-point field.
Note that the outer circle is not highlighted! (a)
Original image. (b) Saliency map. (c) After thresholding
single votes.

11 Comparison with the Hough Transform

The classical Hough transform ([Duda and Hart
1972], [Hough 1962]) was invented to detect co-linear
formations within a dot image in the presence of noise.

11.1  The  classical  Hough  transform 

Consider the Hough transform for straight lines. A
2D array of accumulators is constructed and each token
in the image votes for a family of straight lines which
may pass through that dot. Using the common (θ, d) line
parameterization it is easy to show that the voting pattern
is a sinusoid.
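For reference, the classical voting step can be sketched in Python. The sketch uses the standard normal form d = x cos θ + y sin θ, so each dot votes along a sinusoid in the (θ, d) accumulator; the bin counts, the range `d_max`, and the function names are arbitrary choices made for the example.

```python
import math

def hough_lines(points, n_theta=180, n_d=200, d_max=100.0):
    """Accumulate votes in a (theta, d) array for every line through each point."""
    acc = [[0] * n_d for _ in range(n_theta)]
    for x, y in points:
        for i in range(n_theta):
            theta = math.pi * i / n_theta
            d = x * math.cos(theta) + y * math.sin(theta)  # sinusoid in theta
            j = int(round((d + d_max) / (2 * d_max) * (n_d - 1)))
            if 0 <= j < n_d:
                acc[i][j] += 1
    return acc

def strongest_line(acc, d_max=100.0):
    """Return (theta, d, votes) for the first accumulator cell with the most votes."""
    n_theta, n_d = len(acc), len(acc[0])
    best = (0, 0)
    for i in range(n_theta):
        for j in range(n_d):
            if acc[i][j] > acc[best[0]][best[1]]:
                best = (i, j)
    i, j = best
    theta = math.pi * i / n_theta
    d = -d_max + 2 * d_max * j / (n_d - 1)
    return theta, d, acc[i][j]

# Ten dots on the horizontal line y = 30.
dots = [(x, 30) for x in range(0, 50, 5)]
theta, d, votes = strongest_line(hough_lines(dots))
print(math.degrees(theta), d, votes)
```

The recovered peak gives only the line parameters, quantized to the bin size; it carries no information about where along the line the contributing dots lie.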

The next phase in the process is to search for peaks
in the parameter space array. A line found this way is
only defined in terms of the above two parameters, and
further processing is needed to localize the actual segment
in the image plane. Note that this process is absolutely
global, as it ignores distances between contributing
candidates. If some orientation data is known at a site,
fewer candidates are voted for, and the ability to reveal
the desired shape gets better.

The  same  scheme  can  be  extended  for  other  shapes 
[Ballard  1981],  by  extending  the  parameter  space. 

11.2  Our  scheme  as  a  Hough  transform 

Our system can be viewed as a Hough Transform
where the parameter space is the image plane itself.
This allows for many more degrees of freedom in the
choice of shapes, and in the basic definition of the desired
shapes. This choice of the parameter space allows
us to define other voting patterns¹⁴ which enable us to
encode the constraints (see 2.3.1).


Our  scheme  is  thus  capable  of  finding  shapes  de¬ 
scribed  by  their  properties  (smoothness  etc.)  rather  than 
by  their  exact  analytical  parameters  (as  in  the  Hough 
Transform). 

For  example,  to  implement  a  circle  finder  using  our 
scheme,  we  would  simply  use  our  Extension  Field,  but 
with  all  field  strengths  set  to  the  value  1.  Coding  of  the 
other  constraints  is  where  the  strengths  of  the  individual 
field  elements  come  into  play.  This  is  not  possible  with 
the  original  Hough  Transform. 

The  down  side  is  that  the  result  is  not  found  at  an 
isolated  peak  of  the  parameter  space,  but  rather  as  a  con¬ 
tinuous  ridge  of  peaks  (in  the  case  of  the  Extension 
Field).  Also,  note  that  the  parameter  field  we  use  is  al¬ 
ways  the  image  plane,  which,  in  many  cases,  is  smaller 
than  a  classical  Hough  parameter  space.  This  ‘Compac¬ 
tion’  of  data  is  the  cause  for  the  low  selectivity  com¬ 
pared  to  Hough  transforms. 


Table 2: Comparison between the Hough Transform and Our Scheme

Property                               | Our scheme                          | Hough Transform
---------------------------------------|-------------------------------------|----------------------------------------------------------
Complexity                             | Same for all shapes and properties  | Grows with the number of parameters
Space requirements                     | Constant [O(image size)]            | Grows with the number of parameters and resolution
Properties (families of shapes) coding | Yes, by defining suitable fields    | Not in any obvious way
Analytical shapes                      | Yes, by the same method             | Yes
Voting pattern (a)                     | 2-D                                 | 1-D (b)
Localization of shape                  | Yes                                 | Not always. Further processing is needed to find location
Means of output                        | A ridge of maxima                   | A peak in parameter space
Choice of bin size                     | Always image resolution             | Hard (may be crucial)
Selectivity                            | Low (because of compaction)         | High
Transformation                         | Non-linear                          | Linear

a. Our nomenclature. Refer to text for definition.

b. Or at least one dimension less than that of the problem.


14.  Similar  to  the  sinusoid  defined  for  the  straight  line  detector 
However, for Perceptual Grouping this loss of selectivity
seems to be an advantage. Note that the Hough
Transform could find formations (e.g. straight lines)
even when they are not salient to humans, because the
notion of interfering features does not exist¹⁵. This
makes the Hough Transform a linear transformation
which is not suitable for perceptual grouping. Our
scheme, on the other hand, is highly non-linear and
takes into account the interference by computing the eccentricity
measure at each site. Table 2 summarizes the
main differences and similarities of the two methods.

12  Summary  And  Conclusion 

We  have  introduced  a  unified  way  to  extract  percep¬ 
tual  features  in  edge  images.  By  ‘unified’  we  mean  that 
all  low-level  features  (edgels,  points)  are  treated  in  a 
uniform  way,  and  no  special  cases  exist.  The  scheme  is 
threshold-free and non-iterative. It is especially suitable
for  parallel  implementation,  since  computations  of  the 
saliency  maps  are  independent  for  each  site,  and  parallel 
algorithms  for  line  following  are  known  and  can  easily 
be  adapted.  Also,  calculations  are  simple  and  stable,  as 
no  curvatures  or  any  other  derivatives  need  to  be  com¬ 
puted  on  the  digital  curves. 

The  system  can  rank  features  based  on  their  percep¬ 
tual  importance.  This  allows  a  real-time  application  to 
process  as  many  features  as  time  permits. 

Some issues have not been addressed, such as
the resolution dependency of the description. At this
time, only one level of description is possible. Also, we
have not tried to localize end-points of curves ending
abruptly. Since all computations are performed on a discrete
grid, quantization and rounding errors restrict the
selectivity and amount of clutter the system can handle.
Abstract

This  paper  describes  a  new  transform  to  extract 
the  edge  contours  and  skeletons  of  image  regions  at 
multiple  scales.  The  application  of  the  transform 
to  detecting  edge  structure  is  explained  in  detail.  It 
is  argued  that  linear  processing  based  approaches, 
such  as  convolution  and  matching,  have  the  funda¬ 
mental  deficiency  of  using  a  priori  models  of  edge 
geometry.  The  proposed  transform  avoids  this  lim¬ 
itation  by  letting  the  structure  “emerge,”  bottom- 
up,  from  interactions  among  pixels,  in  analogy  with 
statistical  mechanics  and  particle  physics.  The 
transform  involves  global  computations  on  pairs  of 
pixels  followed  by  vector  integration  of  the  results, 
rather  than  scalar  local  linear  processing.  An  at¬ 
traction  force  field  is  computed  over  the  image. 
Pixels  belonging  to  the  same  region  are  mutually 
attracted  whereas  those  across  edges  repel  each 
other.  Scale  is  an  integral  parameter  of  the  force 
computation.  The  resulting  groupings  of  pixels  rep¬ 
resent  multiscale  image  structure.  The  properties 
desired  in  multiscale  edge  detection  are  given,  and 
it  is  theoretically  and  experimentally  shown  that 
the  transform  possesses  these  properties.  Along 
with their contours, the transform also extracts
skeletons of multiscale regions. Experimental results
with synthetic and real images are given to
demonstrate  the  properties  of  the  transform. 


1  Introduction 

This paper is concerned with the problem of detecting
low level structure in images. It introduces a transform
to facilitate integrated edge and region detection at all
geometric and photometric scales at which structure is
present in a given image. At each scale, the transform
helps identify groupings of pixels which comprise homogeneous
regions surrounded by edges by distinguishing
the locations of region edges and skeletons. It is argued
that the usual convolution and matching approach has
inherent limitations in edge detection. The main motivation
and contribution of the approach is to avoid a priori
models of edge geometry but still allow detection of
region boundaries with arbitrary curvature. Instead of
searching for and fitting spatial models of edges and regions
over image neighborhoods to select edge locations
and orientations, the structure is allowed to "emerge"
from "interactions" among the pixels. A pixel's interaction
with all other pixels is considered instead of testing
specific neighborhoods for a specific structure.

*This work was supported by the Defense Advanced Research
Projects Agency and the National Science Foundation
under grant IRI-8902728. Thanks to Mark Tabb who
produced the experimental results included in this paper.

A central feature of the proposed transform is to compute
affinities of pixels for grouping with other pixels,
such that the groupings reflect the image structure. For
pixels on either side of a region boundary, the transformation
yields high affinities, but there is little affinity
between pixels across regions. The strength of interaction,
and therefore affinities, between pixels depend
upon their distances and contrasts, and this relates the
computed affinities to the different scales present. Due
to the global interpixel interaction allowed, the transform
can be viewed as collecting spatially distributed
evidence for edges and skeletons and making it available
at their locations. In this sense, the transform performs
Gestalt analysis. The transform brings out the
image structure which makes its subsequent detection
easier. The objective of this paper is to introduce the
transform; algorithms for structure detection from the
transformed image are beyond its scope.
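The idea of pairwise interaction followed by vector summation can be illustrated on a one-dimensional signal. The Python sketch below is not the transform proposed in this paper; it assumes a Gaussian falloff in both distance and gray-level difference (parameters `sigma_s` and `sigma_g`, both hypothetical) and models only attraction between similar pixels, with no explicit repulsion across edges.

```python
import math

def force_field_1d(signal, sigma_s=5.0, sigma_g=10.0):
    """Net signed attraction on each pixel from all other pixels (assumed weights)."""
    forces = []
    for i, fi in enumerate(signal):
        total = 0.0
        for j, fj in enumerate(signal):
            if i == j:
                continue
            w_dist = math.exp(-((j - i) ** 2) / (2 * sigma_s ** 2))   # nearer pulls harder
            w_gray = math.exp(-((fj - fi) ** 2) / (2 * sigma_g ** 2))  # similar pulls harder
            direction = 1.0 if j > i else -1.0
            total += direction * w_dist * w_gray
        forces.append(total)
    return forces

# Step edge between pixels 19 and 20.
step = [0.0] * 20 + [100.0] * 20
forces = force_field_1d(step)
print(forces[19], forces[20])
```

Under these assumptions the pixels just left and just right of the step are pulled in opposite directions, away from the edge and toward their own regions, so the edge shows up as a location where the force vectors diverge, without any template of edge geometry.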

In  the  rest  of  this  paper,  we  will  first  illustrate  in  de¬ 
tail  the  use  of  the  transform  for  edge  detection,  and  then 
briefly  discuss  its  use  for  region  skeleton  extraction.  For 
concreteness,  we  will  assume  that  image  regions  have 
uniform  gray  levels  and  are  surrounded  by  step  edges, 
although  the  approach  extends  to  other  types  of  re¬ 
gions  and  edges.  Section  2  discusses  some  basic  desired 
characteristics  of  edge  detection  and  the  motivation  for 
the  proposed  approach.  Section  3  proposes  a  family 
of  transforms  intended  to  achieve  the  above  character¬ 
istics,  and  describes  some  of  its  properties  of  interest. 
Section 4 proposes a specific transform from the family
and examines its performance for edge detection in
greater detail. Section 5 briefly discusses the application
of the transform to extracting skeletons of regions.
Section  6  describes  some  experiments  conducted  to  test 
the  performance  of  the  transform. 

2  Background  and  Motivation 
This  section  explains  the  motivation  behind  the  ap¬ 
proach  presented  in  this  paper.  For  concreteness,  we 
focus  on  the  problem  of  edge  detection.  The  arguments 
extend  to  the  dual  case  of  region  detection  which  is  dis¬ 
cussed  in  Sec.  6.  We  first  examine  the  critical  aspects 
of  the  performance  of  an  edge  detector  (Sec.  2.1)  which 
suggests  certain  basic  desired  characteristics  of  multi¬ 
scale  edge  detection  (Sec.  2.2).  Then  we  explain  salient 
design  features  of  the  approach  proposed  in  this  paper, 
and  how  they  help  meet  the  desired  performance  char¬ 
acteristics. 

2.1  Two  Aspects  of  Edge  Detection 

There  are  two  major  aspects  of  edge  detection  which  are 
of  central  importance  to  its  performance.  The  first  one 
has  to  do  with  the  photometric  and  geometric  model  of 
an  edge.  An  edge  separates  two  different  regions  and 
thus  two  different  types  of  gray  level  populations.  It 
is  common  to  treat  the  problem  of  edge  detection  as 
mainly that of selecting a point along the intensity profile
across the edge, assuming such a profile can be extracted
from the image. Accordingly, a model of the intensity
profile is used to precisely define an edge and to optimally
detect its location. Different types of intensity
models of an edge have been proposed, according to the
nature of the two populations and the spatial profile
of the transition from one to the other across the edge
[13, 12, 3, 5]. To meet the assumption that the edge profile
through a pixel can be identified, it is common in edge
detection  work  to  use  a  model  of  edge  curvature.  The 
requirement  for  the  identification  of  edge  profile  may 
be  explicit  or  implicit.  The  geometric  model  constrains 
the  number  of  possible  ways  in  which  subdivisions  of  the 
pixel  neighborhood  into  two  regions  can  be  made.  Each 
subdivision  must  be  implicitly  or  explicitly  tested  for 
the  presence  of  an  edge  profile  in  some  direction.  For 
example,  the  assumption  of  local  straightness  of  edge 
is  common  which  makes  it  very  easy  to  select  neighbor¬ 
hoods  on  the  two  sides  of  an  edge.  [10]  assumes  that  the 
edge  is  locally  straight  (and  that  the  intensity  changes 
linearly  along  a  direction  parallel  to  the  edge.)  Nalwa 
and  Binford  [11]  assume  straightness  to  extract  a  sample 
edge  profile.  Even  the  computation  of  gradient  which  is 
common  to  many  edge  detectors  [9]  implicitly  assumes 
local  edge  straightness.  The  same  can  be  said  about 
Laplacian  based  edge  detectors.  The  use  of  straightness 
is  very  explicit  in  the  different  types  of  discrete  edge 
masks  each  of  which  is  meant  to  detect  a  different  edge 
orientation  [14].  To  detect  intensity  facets  meeting  at 
an  edge  [6],  a  model  of  edge  geometry  is  required  so 
candidate  neighborhoods  from  each  side  of  the  edge  can 
be identified. The work on optimal edge detection (e.g.,
[4]) is thus also subject to the validity of the assumed
model of the edge geometry.

The  issue  of  the  validity  of  the  models  of  edge  pro¬ 
file  has  been  addressed  and  different  models  of  edge 
(step,  ramp,  roof)  have  been  mentioned.  However,  the 
limitations  and  impact  of  the  assumptions  made  about 
edge  geometry  have  received  much  less  attention.  These 
assumptions  cause  significant  errors  in  the  results  as 
can  be  seen,  for  example,  from  the  performance  of  the 
Laplacian-of-Gaussian  for  different  edge  geometries  [2]. 

The  second  major  aspect  of  edge  detection  is  related 
to  scale  [15,  16,  8].  There  are  two  important  ways  in 
which edges are associated with scale: geometric and
photometric.  One  example  of  geometric  scale  is  struc¬ 
ture  size.  Only  large  structures  may  be  visible  at  a 
coarse  geometric  scale  while  smaller  sizes  and  features 
may  be  detected  at  finer  scales.  An  analogous  process 
characterizes  contrast  across  an  edge.  An  edge  contour 
which  separates  two  regions  of  a  given  contrast  at  a 
given  scale  may  not  be  detected  at  a  higher  scale.  Thus, 
there  are  scale  variations  associated  with  both  geometric 
and  photometric  sensitivity  to  detail.  The  exact  num¬ 
ber  and  parameters  of  scales  present  in  a  given  image 
is  a  priori  unknown.  Therefore,  for  an  edge  detector  to 
work  at  multiple  scales,  it  should  automatically  iden¬ 
tify  the  scales  present  and  detect  edges  corresponding 
to  each  scale. 

2.2  Desired  Characteristics  of  Multiscale 
Edge  Detection 

The  above  discussion  leads  us  to  the  following  desired 
characteristics  of  an  edge  detector. 

A.  Shape  Invariance:  The  edge  should  be  correctly 
detected regardless of the local curvature. Thus, an
edge  point  must  be  detected  at  only  one  and  correct 
location,  regardless  of  whether  the  edge  in  the  vicinity 
of  the  point  is  straight,  curved  or  even  contains  a  corner. 

B.  Contrast  Scaling:  It  should  be  possible  to  detect 
edges  according  to  their  contrast.  For  example,  as  scale 
increases,  the  required  contrast  of  detectable  edges  may 
increase.

C.  Geometric  Scaling:  It  should  be  possible  to  detect 
edges  of  regions  according  to  their  sizes.  For  example,  as 
scale  increases  the  required  size  of  detectable  geometric 
features  may  increase. 

D.  Stability  and  Automatic  Scale  Selection:  Im¬ 
age  structures  at  different  scales  correspond  to  locally 
invariant  (stable)  descriptions  with  respect  to  geometric 
and  contrast  sensitivities.  Since  for  an  arbitrary  image 
these  scales  are  a  priori  unknown,  they  should  be  iden¬ 
tified  automatically. 

2.3  Limitations  of  Convolution,  and  the 
Proposed  Approach 

The approach we present in this paper has been motivated
by the desire to satisfactorily address both aspects
of edge detection discussed in Sec. 2.1 and to achieve
the desired characteristics listed in Sec. 2.2.

The basic limitations put on the performance of an
edge detector by the use of models of edge geometry suggest
that no linear, convolution based approach could
be satisfactory. This is because the convolution kernel
represents a template for the expected edge geometry.
In the digital case, one could attempt to circumvent
this problem by exhaustively considering all possible
edge geometries in a neighborhood. But the number
of resulting kernels increases rapidly with neighborhood
size and becomes prohibitively large for any neighborhood
of reasonable size.

The  approach  presented  in  this  paper  uses  a  method 
involving  computations  on  pairs  of  pixels  followed  by 
vector  integration  of  the  results,  rather  than  scalar, 
weighted  averaging  over  pixel  neighborhoods.  This 
avoids  making  assumptions  about  edge  geometry.  The 
inspiration  for  the  proposed  solution  comes  from  physics 
where  microscopic  homogeneity  of  physical  properties 
leads  to  islands  of,  say,  similar  particles  or  molecules. 
An  island  shape  is  congruent  with  the  space  occupied  by 
a  set  of  contiguous,  similar  particles,  whatever  the  com¬ 
plexity  of  the  boundary!  The  particles  group  together 
and  coalesce  into  regions  based  on  the  similarity  of  their 
physical  property  only.  The  homogeneity  of  the  phys¬ 
ical  property  then  characterizes  the  resulting  regions. 
As  an  alternate  analogy,  the  grouping  process  is  like 
the  alignment  of  microscopic  domains  in  large  areas  of 
a  ferromagnetic  material,  with  a  reversal  of  magnetiza¬ 
tion  direction  taking  place  across  the  boundary  between 
two  similar  poles  facing  each  other.  The  key  process  is 
that  of  interaction  among  particles,  which  leads  to  bind¬ 
ings  among  them  based  on  their  physical  properties. 

The  problem  of  edge  detection  has  similarities  to  the 
above  physical  process.  The  goal  is  to  find  a  partition 
of  the  image,  regardless  of  the  boundary  complexity, 
such  that  each  cell  of  the  partition  is  homogeneous,  say, 
in  gray  level.  The  physical  analogy  suggests  a  formu¬ 
lation  of  the  edge  detection  process  in  terms  of  a  suit¬ 
ably  defined  process  of  interpoint  interaction  -  a  process 
that  would  group,  bottom-up,  each  set  of  points  of  the 
same  property,  say  gray  level,  corresponding  to  a  region. 
Since points are the primitives of structure, they could
group together to follow any region boundary. The degree
of acceptable homogeneity within a region is a parameter
of the grouping process; different degrees would yield
groupings over different regions, making scale an integral
part of structure detection.

In  the  next  section,  we  propose  a  family  of  transforms 
which  achieves  the  above  grouping.  These  transforms 
can  be  viewed  as  yielding  an  attraction-force  field  in 
which  all  points  are  mutually  attracted  except  those 
across  edges. 

3  A  Family  of  Transforms  for 
Multiscale  Edge  Detection 
In this section we first describe a family of transforms
and discuss how these transforms possess the desired
characteristics listed in Section 2. The discussion
throughout this section will address general properties,
in qualitative terms, since the family of transforms has
only been characterized in general terms. A specific
transform from this family will be examined in Section
4 with more quantitative performance analysis. To simplify
the presentation, the regions are defined as being
homogeneous in gray level. We will consider the case of
uniform gray level regions whenever doing so simplifies
analysis without loss of applicability of the results to
the case of regions having only statistical homogeneity.

3.1  A  Family  of  Transforms 

Consider  a  transform  over  the  image  which  computes 
an  attraction-force  field  wherein  the  force  at  each  point 
denotes  its  net  affinity  to  the  rest  of  the  image.  The 
force  vector  points  in  the  direction  in  which  the  point 
experiences  a  net  attraction  from  the  points  in  the  rest 
of  the  image.  For  example,  a  point  inside  a  region  would 
experience  a  force  towards  the  interior  of  the  region. 
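This force is computed as the resultant of attraction-forces
due to all other image points. Let $F(\mathbf{p}, \mathbf{q})$ denote
the magnitude of the force vector $\mathbf{F}(\mathbf{p}, \mathbf{q})$ with which a
pixel P at position $\mathbf{p}$ is attracted by another pixel Q at
location $\mathbf{q}$. Thus, $\mathbf{F}(\mathbf{p}, \mathbf{q}) = F(\mathbf{p}, \mathbf{q})\,\mathbf{f}_{pq}$, where $\mathbf{f}_{pq}$
denotes the unit vector in the direction from P to Q,
i.e.,

$$\mathbf{f}_{pq} = \frac{\mathbf{q} - \mathbf{p}}{\|\mathbf{q} - \mathbf{p}\|}$$

In the continuous image plane, an image is transformed
into a continuous vector field. The resultant
attractive force vector $\mathbf{F}_p$ at P is given by

$$\mathbf{F}_p = \int_{\text{Image}} F(\mathbf{p}, \mathbf{q})\,\mathbf{f}_{pq}\,d\mathbf{q} \qquad (1)$$

In the discrete case,

$$\mathbf{F}_p = \sum_{\mathbf{q} \in \text{Image}} F(\mathbf{p}, \mathbf{q})\,\mathbf{f}_{pq}$$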
We must now specify what forms the force function
F(p,q) could take. We will do so by considering the
properties that F must possess. These properties define
a family of transforms: any choice of F that has
them will suffice, and a specific choice of F yields
one member of the family.
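As an illustration of the discrete sum above, the following Python sketch accumulates the resultant force at one pixel for an arbitrary force magnitude. The function names, the image-as-list-of-rows layout, and the particular `example_magnitude` are assumptions made for this example only; they are not taken from the paper, and any function that decreases with distance and gray level difference could be substituted.

```python
import math

def resultant_force(image, py, px, magnitude):
    """Sum F(p, q) * f_pq over all pixels q != p and return (Fx, Fy).

    `magnitude(distance, gray_difference)` plays the role of F(p, q).
    x is the column index, y is the row index.
    """
    gp = image[py][px]
    fx = fy = 0.0
    for qy, row in enumerate(image):
        for qx, gq in enumerate(row):
            dx, dy = qx - px, qy - py
            if dx == 0 and dy == 0:
                continue  # the unit vector f_pq is undefined for q == p
            d = math.hypot(dx, dy)
            f = magnitude(d, abs(gq - gp))
            fx += f * dx / d  # x-component of F(p, q) * f_pq
            fy += f * dy / d  # y-component of F(p, q) * f_pq
    return fx, fy

def example_magnitude(d, dg):
    # Hypothetical choice: decreases with distance and with gray level difference.
    return 1.0 / ((1.0 + d) * (1.0 + dg))

uniform = [[50] * 5 for _ in range(5)]
center = resultant_force(uniform, 2, 2, example_magnitude)
corner = resultant_force(uniform, 0, 0, example_magnitude)
print(center, corner)
```

In a uniform image the attractions on the center pixel cancel by symmetry, while a corner pixel is pulled toward the rest of the image.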

Since the presence of an edge must be determined by
its immediate vicinity (adjoining regions) rather than by
distant points across other intervening regions, the force
should be a decreasing function of distance. This is
accomplished by making the force exerted on a given
pixel P by another pixel Q decrease with the distance
between P and Q. Further, a pixel should
be attracted more to a pixel within its own region than
to one in a different region. This is accomplished by
making the force decrease with the difference between
the gray levels of P and Q. The rate
at which the force decreases with distance determines
the geometric scale (size) of the regions whose structure
is reflected in the force field. Similarly, the rate at which
the force decreases with gray level difference determines
the photometric scale captured by the force field. We
will use $\sigma_s$ and $\sigma_g$ as the geometric and gray level scale
parameters. The larger their values, the larger will be
the force on a pixel due to another pixel having a given
distance and gray level difference, respectively.

We will now examine some basic properties of the
force field F. These properties serve to illustrate qualitatively
the basic motivation behind the use of the family
of transforms of Eqn. (1). More quantitative and detailed
evaluation of the performance will be presented
for a specific transform in Section 4.

1.  Magnitude  and  Directionality:  The  magnitude 
of  Fp  increases  as  P  moves  closer  to  the  region  bound¬ 
ary.  The  direction  of  Fp  points  toward  region  interior. 

Proof: 

Consider a homogeneous gray level region W surrounded
by another region X of gray level contrast C.
Now consider an arbitrary point P inside W and a disk
of radius r centered at P. For sufficiently small values of
r, the disk is contained inside W, and because of gray
level isotropy around P, the resultant force on P due
to the pixels inside the disk is 0. Figure 1a illustrates
this for the simple case where W is a half plane. As r
increases to some value R, the disk will touch the
boundary of W. For r > R, the force on P due to the
pixels inside the disk will be nonzero because of
asymmetry. Due to the
gray level difference between P and the pixels outside
W, there will be a net reduction in the force on P from
the direction of intersection with X, and therefore, the
net force on P will point away from the region of
intersection (Fig. 1a). That is, the resultant force Fp
on P and the direction PQ will satisfy the condition
Fp · fpq < 0. For a region boundary having a simple
shape around the point nearest to P, the larger (r - R)
is, the larger will be the anisotropy, and hence, the
larger the magnitude of Fp. Therefore, for any fixed
r > R, the closer the point P is to the border (i.e.,
the smaller R is), the larger the magnitude of Fp. As
P moves closer to the boundary the force continues to
point away from X and its magnitude increases.

For regions having more complicated shapes than a
rectangle, as r increases there will in general be multiple
connected components of intersection between the disk
and the surrounding regions, due to the complex shape
of the region boundary. Therefore, the rate of decrease
in the value of Fp will depend on the exact shape of
W’s boundary. Fig. 1 illustrates Fp for the case of
a rectangular region (Fig. 1b) and an arbitrary region
(Fig. 1c).
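Property 1 can be checked numerically on a half-plane image. The sketch below assumes a Gaussian force magnitude (the specific choice made later in Section 4), and the image dimensions, gray levels and scale values are hypothetical; it is a sanity check under those assumptions, not part of the original analysis.

```python
import math

def force(image, py, px, sigma_s, sigma_g):
    """Force (Fx, Fy) on pixel (py, px), assuming a Gaussian falloff in
    both distance and gray level difference (an assumed member of the family)."""
    gp = image[py][px]
    fx = fy = 0.0
    for qy, row in enumerate(image):
        for qx, gq in enumerate(row):
            dx, dy = qx - px, qy - py
            if dx == 0 and dy == 0:
                continue  # no self-attraction
            d = math.hypot(dx, dy)
            w = math.exp(-d * d / (2.0 * sigma_s ** 2)
                         - (gq - gp) ** 2 / (2.0 * sigma_g ** 2))
            fx += w * dx / d
            fy += w * dy / d
    return fx, fy

# Half-plane image: W (gray 0) fills columns 0..19, X (gray 100) columns 20..39.
HEIGHT, WIDTH, BOUNDARY = 41, 40, 20
image = [[0 if x < BOUNDARY else 100 for x in range(WIDTH)] for _ in range(HEIGHT)]

row = HEIGHT // 2
results = {}
for col in (10, 17, 19):  # P moves toward the W/X boundary
    fx, fy = force(image, row, col, sigma_s=2.0, sigma_g=10.0)
    results[col] = (fx, math.hypot(fx, fy))
    print(col, fx, math.hypot(fx, fy))
```

Under these assumptions the magnitude grows as P approaches the boundary, and the x-component is negative, i.e., the force points away from X and into W.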

2. Orthogonality: Consider a point P just inside the
boundary of a region W, such that the boundary within
a disk D of radius r centered at P, $r \gg \sigma_s$, is symmetric
about P. Then the direction of Fp points into the region
and is normal to the boundary at P.

Proof: 

Since $r \gg \sigma_s$ by assumption, the force Fp at P gets
little influence from the points outside disk D. Consider
a point Q inside the region as well as D, and along the
normal to the boundary at P. Since the boundary of
W within D is symmetric about P (Fig. 2), the line
through P and Q divides the region of intersection of
W and D into two symmetric halves. For each point M
in one half, there exists another point M’ in the other
half such that the components of the force at Q, orthogonal
to PQ, due to M and M’ are equal and opposite.
Thus, the net force at all points along PQ, including P,
is along PQ and, from the directionality property, away
from the boundary.
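The orthogonality property can be illustrated on a boundary that is not aligned with the pixel grid. The following sketch assumes a Gaussian force magnitude (the choice made later in Section 4) and a hypothetical 21 x 21 image split along its diagonal; the names and parameter values are illustrative only.

```python
import math

def force(image, py, px, sigma_s, sigma_g):
    """Force (Fx, Fy) on pixel (py, px) with an assumed Gaussian falloff
    in distance and in gray level difference."""
    gp = image[py][px]
    fx = fy = 0.0
    for qy, row in enumerate(image):
        for qx, gq in enumerate(row):
            dx, dy = qx - px, qy - py
            if dx == 0 and dy == 0:
                continue
            d = math.hypot(dx, dy)
            w = math.exp(-d * d / (2.0 * sigma_s ** 2)
                         - (gq - gp) ** 2 / (2.0 * sigma_g ** 2))
            fx += w * dx / d
            fy += w * dy / d
    return fx, fy

N = 21
# W (gray 0) is the set of pixels with x < y; X (gray 100) is the rest,
# so the boundary runs along the diagonal direction (1, 1).
image = [[0 if x < y else 100 for x in range(N)] for y in range(N)]

fx, fy = force(image, 11, 10, sigma_s=2.0, sigma_g=10.0)  # P just inside W
normal = (fy - fx) / math.sqrt(2.0)      # component along (-1, +1)/sqrt(2), into W
tangential = (fx + fy) / math.sqrt(2.0)  # component along the boundary
print(normal, tangential)
```

With $\sigma_s$ much smaller than the image size, the tangential component is negligible next to the normal component, which is positive (pointing into W).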

3.  Smoothness:  If  F(p,  q)  is  a  continuous  (or  differ¬ 
entiable)  function  for  any  p  and  q,  and  the  intensity 
value  within  any  image  region  is  continuous  (or  differ¬ 
entiable),  then  so  is  F  at  all  nonboundary  points  of  the 
region. 

Proof: 

Consider a nonboundary point. Then $\mathbf{F}_p$ is given by

$$\mathbf{F}_p = \int F(\mathbf{p}, \mathbf{q})\,\mathbf{f}_{pq}\,d\mathbf{q}$$

Consider a point at $\mathbf{p} + d\mathbf{p}$ in the vicinity of $\mathbf{p}$. Then
the differential change $d\mathbf{F}$ from location $\mathbf{p}$ to location
$\mathbf{p} + d\mathbf{p}$ is determined by the corresponding differential
changes $dF(\mathbf{p}, \mathbf{q})$ and $d\mathbf{f}_{pq}$. Now, $dF(\mathbf{p}, \mathbf{q})$ is given,
from its definition, in terms of the change $dd(\mathbf{p}, \mathbf{q})$ in
distance to $\mathbf{q}$, and the change $dg(\mathbf{p}, \mathbf{q})$ in gray level
relative to $\mathbf{q}$; $d\mathbf{f}_{pq}$ simply depends on $\mathbf{p}$. Since $g(\mathbf{p}, \mathbf{q})$
is given to be continuous (differentiable) away from the
region boundary, and $d(\mathbf{p}, \mathbf{q})$ and $\mathbf{f}_{pq}$ are certainly
continuous (differentiable) functions of $\mathbf{p}$, $\mathbf{F}$ is also
continuous (differentiable) with respect to $\mathbf{p}$ away from the
region boundary. Across a region boundary, the gray
level variation is discontinuous, and therefore, so is $\mathbf{F}$.

3.2  Multiscale  Edge  Detection 

In this section, we show that the field defined by F
makes the interregion edges explicit and this makes the
task of edge detection easier. Consider an edge point
P on the border of a region W with another region X,
and a neighborhood around P of radius r. Further assume
that the conditions of Property 2 are met, namely,
$r \gg \sigma_s$ and W’s boundary is symmetric about P within
the neighborhood. Then, from Property 2 above, as P
is approached from the side of W, the direction of F
is orthogonal to the edge at P and points toward W.
Similarly, from Property 2, as P is approached from
the side of X, the direction of F is orthogonal to the
edge and points toward X. Thus P is a point of directional
discontinuity, and the direction changes by $\pi/2$
while traversing from W into X. Further, recall that
from Property 3, F varies smoothly on either side of
the edge. Therefore, for each point of directional
discontinuity, there is another point P’ in the vicinity of P
which is also a point of directional discontinuity. Since
with appropriate choice of scale parameter each such
point is along the region border (true edge), and regions
have closed borders, the points of directional discontinuity
found using appropriate scale parameter form closed
curves.

Figure 1. The force at a point P on the region boundary
depends on the region boundary complexity contained
within the disk used: (a) a long linear boundary, (b) a
boundary with corners, and (c) a boundary with
arbitrary curvature.

Figure 2. If the boundary is symmetric about
the point P, the force direction at P is inward
and perpendicular to the boundary.

Figure 3. Stability and scale selection in the
scale space defined by gray level scale parameter
(x-axis) and spatial scale parameter (y-axis).
Hashed areas denote transitions from one stable
scale to another.

Figure 4. Closed form computation of force field at
an edge point near a corner. Point P is d away from
the corner point B’. The force is computed from
within a disk of radius R. For Case 1 in the text,
forces at P due to disk parts ABC and A’B’C’ cancel
out. The force due to part ABC is symmetric to that
due to part A’B’C. The force due to part ABC’ is
computed as the sum of those due to ABP and APC.

These properties also hold near vertices where more
than two regions meet. This is true because the force at
each point in the vicinity of a vertex points toward the
interior of one of the regions, and the above arguments
hold for each region. Further, any variation in the value
of $\sigma_s$ results in incremental changes in F due to the
smoothness property. These observations lead us to the
following result:

RESULT: Region borders are characterized by a discontinuity
in the direction of F. The magnitude of the
discontinuity is $\pi/2$ for optimal choice of scale parameters
at each point and decreases gradually for suboptimal
choices.

3.3  Performance  Analysis 

We will now examine the performance of F with respect
to the desired characteristics listed in Section 2.2. First,
the very motivation for proposing F comes from characteristic
A, namely, invariance to local edge geometry.
Sec. 3.2 explains how this characteristic is possessed by
F. With regard to desired characteristic B, scale parameter
$\sigma_g$ provides a mechanism to accomplish contrast
scaling. As $\sigma_g$ increases, adjacent regions may merge.
This is because the attraction of a point in one region
from another point across the region boundary may increase
sufficiently so that the directional discontinuity in F
responsible for the edge may vanish. Thus changing $\sigma_g$
achieves region split and merge based on contrast values
[7], and therefore, desired contrast scaling. Analogously,
scale parameter $\sigma_s$ helps achieve geometric scaling. As
$\sigma_s$ increases, the force field at a point starts to get
influenced by increasingly global structure. Therefore, as $\sigma_s$
increases, the sensitivity to fine local details is reduced,
resulting in a characterization of more global shape.
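The merging behavior described for $\sigma_g$ can be illustrated on a small synthetic image. The sketch below assumes a Gaussian force magnitude (the choice made later in Section 4) and a hypothetical image of three vertical stripes. A divergent sign change of the horizontal force component along a scan line is used here as a simple stand-in for the directional discontinuity at an edge; that criterion, like the names and parameter values, is an assumption of this example.

```python
import math

def force(image, py, px, sigma_s, sigma_g):
    """Force (Fx, Fy) on pixel (py, px) with an assumed Gaussian falloff
    in distance and in gray level difference."""
    gp = image[py][px]
    fx = fy = 0.0
    for qy, row in enumerate(image):
        for qx, gq in enumerate(row):
            dx, dy = qx - px, qy - py
            if dx == 0 and dy == 0:
                continue
            d = math.hypot(dx, dy)
            w = math.exp(-d * d / (2.0 * sigma_s ** 2)
                         - (gq - gp) ** 2 / (2.0 * sigma_g ** 2))
            fx += w * dx / d
            fy += w * dy / d
    return fx, fy

def divergent_flips(image, row, sigma_s, sigma_g):
    """Columns c where Fx < 0 at c and Fx > 0 at c + 1, i.e. where the
    forces on horizontally adjacent pixels point away from each other."""
    fx = [force(image, row, c, sigma_s, sigma_g)[0] for c in range(len(image[0]))]
    return [c for c in range(len(fx) - 1) if fx[c] < 0 < fx[c + 1]]

# Three vertical stripes of gray level 0, 20 and 100, six columns each.
image = [[0] * 6 + [20] * 6 + [100] * 6 for _ in range(15)]

fine = divergent_flips(image, 7, sigma_s=2.0, sigma_g=5.0)
coarse = divergent_flips(image, 7, sigma_s=2.0, sigma_g=200.0)
print(fine, coarse)
```

With these assumed values, the low-contrast border between the first two stripes is marked at the small $\sigma_g$ but not at the large one, while the high-contrast border between the last two stripes is marked at both.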

We now consider the performance with respect to desired
characteristic D. First, recall that the region edges
are represented by discontinuities of F. From the above
discussion, the structures at different scales must be detected
by appropriate values of $(\sigma_s, \sigma_g)$. Since there
could be only a small number of scales associated with
any specific area of the image, the parts of the
$(\sigma_s, \sigma_g)$ space associated with the different scale
structures in any given area must be small in number. For any
arbitrary $\sigma_s$ and $\sigma_g$, the force field should have a direction
pattern in which borders of only certain regions coincide
with direction discontinuities. This force field direction
pattern should not change except when a change
in $\sigma_s$ leads to the emergence/loss of a structure due to
changed geometric sensitivity, or a change in $\sigma_g$ leads
to a split/merge of regions due to changed contrast
sensitivity. Consequently, changes over much of the range
of $\sigma_s$ and $\sigma_g$ should preserve the force direction pattern
in any given image area. This implies that F should be
stable and the significant scales in any image area could
be automatically identified as corresponding to those
$\sigma_s$, $\sigma_g$ values (unhashed areas in Fig. 3) for which there is
little change in the direction pattern. Such scale identification
should be quite robust since it is based upon
qualitative changes in F.

4  A  Proposed  Transform 

In this section we describe a specific transform belonging
to the family defined by Eqn. (1), and examine its
performance as a multiscale edge detector.

4.1  Transform 

We define $F(\mathbf{p}, \mathbf{q})$ in Eqn. (1) as a product of two Gaussians,
one a decreasing function of the distance $d(\mathbf{p}, \mathbf{q})$
between P and Q, and the other a decreasing function of
the gray level difference $\Delta g(\mathbf{p}, \mathbf{q})$ between P and Q. The
standard deviation for the spatial variation is the spatial
scale parameter $\sigma_s$, and the standard deviation for the
gray level variation is the scale parameter for gray level
difference, $\sigma_g$. The choice of Gaussian for each part is
made mainly because of its optimal localization properties
in both spatial and transform domains, although
other properties such as separability are also desirable.

Therefore, the proposed transform $\mathbf{F}_p$ at an image
point P at location $\mathbf{p}$ is defined by:

$$\mathbf{F}_p = \sum_{\mathbf{q} \in \text{Image}} e^{-\frac{d^2(\mathbf{p}, \mathbf{q})}{2\sigma_s^2}}\, e^{-\frac{\Delta g^2(\mathbf{p}, \mathbf{q})}{2\sigma_g^2}}\, \mathbf{f}_{pq} \qquad (2)$$

As we will see later, empirically the performance of
the transform for edge detection seems to be quite insensitive
to the nature of the decreasing function; the
formulation of interpixel interaction as a vector integration
of pairwise pixel similarities, rather than scalar
weighted averaging over neighborhoods, appears to be
the key factor in capturing the image structure.
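A minimal sketch of the transform of Eqn. (2) follows. It assumes unnormalized Gaussians of the form exp(-x^2 / (2 sigma^2)), which is one reading of "standard deviation" in the text; the exact constants of the original formula may differ. The function names and the tiny test image are hypothetical.

```python
import math

def gaussian_magnitude(d, dg, sigma_s, sigma_g):
    """F(p, q) as a product of two Gaussians: one in the distance d and
    one in the gray level difference dg (assumed unnormalized)."""
    return (math.exp(-d * d / (2.0 * sigma_s ** 2))
            * math.exp(-dg * dg / (2.0 * sigma_g ** 2)))

def transform_at(image, py, px, sigma_s, sigma_g):
    """Vector sum of gaussian_magnitude * unit vector over all q != p."""
    gp = image[py][px]
    fx = fy = 0.0
    for qy, row in enumerate(image):
        for qx, gq in enumerate(row):
            dx, dy = qx - px, qy - py
            if dx == 0 and dy == 0:
                continue  # unit vector undefined for q == p
            d = math.hypot(dx, dy)
            f = gaussian_magnitude(d, gq - gp, sigma_s, sigma_g)
            fx += f * dx / d
            fy += f * dy / d
    return fx, fy

# Two-pixel image: the left pixel is attracted by its only neighbor,
# at distance 1 and gray level difference 10.
fx, fy = transform_at([[0, 10]], 0, 0, sigma_s=1.0, sigma_g=10.0)
print(fx, fy)
```

For a fixed distance and gray level difference, a larger scale parameter yields a larger force, e.g. `gaussian_magnitude(1, 0, 2.0, 10.0)` exceeds `gaussian_magnitude(1, 0, 1.0, 10.0)`.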


4.2  Performance  Analysis 

In  this  section,  we  will  examine  the  performance  of  the 
transform  of  Eqn.  (2)  for  edge  detection  with  respect 
to  the  various  desirable  characteristics  listed  in  Section 
2.  Specifically,  we  will  discuss  the  performance  of  the 
proposed  transform  with  respect  to  (A)  sensitivity  to 
edge  geometry,  (B)  sensitivity  to  intensity  model,  and 
(C)  multiscale  behavior. 


A. Sensitivity to Edge Geometry: For an edge
through a point P which is smooth in the vicinity of
P, we have seen that the edge is well represented as a
directional discontinuity in F. That is, if the edge is
smooth around P and $\sigma_s$ is small relative to the radius
of curvature of the edge contour, the transform captures
the edge information. However, if the curvature
assumes a high value or is undefined near P, e.g., when a
corner is present along the edge close to P, the behavior
of the transform needs to be examined. This is because
there may be a qualitative change in Fp as $\sigma_s$ increases
and begins to include significant contributions from
the parts of the edge beyond the corner (or high curvature
part). Consider the corner of angle $\theta$ shown
in Fig. 4. Point P is located a distance d away from the
corner. As $\sigma_s$ increases, F changes significantly, from
that corresponding to a long linear edge, to that corresponding
to the boundary of a wedge shaped region.

To study the performance of the transform on a corner,
we attempted to derive a closed form expression for
F as a function of d, $\theta$, $\sigma_s$ and $\sigma_g$. Due to the use of
the Gaussian, some of the integrals did not appear to
have closed form solutions. However, we could derive
a closed form expression for the case of a linear decay
(instead of Gaussian) in attractive force with distance.
It is shown in [1] that for the case $R < \sigma_s$ and $d < R$
(Fig. 4), the components Fx and Fy of F are given by
-Fx  =  (Fot-t-Fol)(l-e^) 
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These  components  determine  the  magnitude  and  the 
direction  of  the  force  at  P’,  a  point  just  inside  X  and 
infinitesimally  far  from  P.  The  direction  of  Fp  is  given 
by 


tan  6  = 


(F^l  4-  Fjf,') 

(Fot  +  F'J 


where $\delta$ is the clockwise angle that Fp makes with the -x
axis. The force at a point just inside W and infinitesimally
far from P is equal and opposite to that at P'.
Thus, the force direction has a discontinuity of $\pi/2$ at
P but the absolute direction $\delta$ of the force relative to
the edge direction, which is desired to be $\pi/2$, depends
upon the parameters of the corner per the above equation.


B. Sensitivity to Intensity Model: There are two
aspects of the intensity model with regard to which the
sensitivity could be evaluated: presence of noise in intensity
values, and deviation from the homogeneity assumption
we have made about region gray level. The first
case is simple. If the regions contain independently
distributed, zero-mean, additive noise, then the expected
deviation in force at any point P is 0 compared to the
case without noise. This is because changes in region
intensities due to additive noise are spatially symmetric
with respect to P; thus the deviation in attraction by
one noisy pixel is cancelled on average by another
pixel on the opposite side of P. Therefore, the expected
value of Fp is the same with or without noise. Second,
if the region is not homogeneous in intensity then the
force at a point in the region will depend upon the rate
of change of intensities away from P. For the simple case
of an intensity ramp, the changes in intensity around
P are still symmetric, although nonzero, unlike the
homogeneous case. Since force depends on the absolute
intensity difference only, P still experiences equal and
opposite forces from the up and down directions of the
ramp, resulting in a zero force as for the homogeneous
case. This continues to hold for a ramp containing
independent, zero-mean, additive noise for the same reason
as for noisy homogeneous regions. For regions characterized
by more complex distributions, e.g., polynomial
variation in intensity, Fp may show direction discontinuities
which reflect the structure. Such detected structure
must be computed using Eqn. (2) for the specific
case at hand. We have verified this edge preserving
property of the transform about a corner (details in
Section 6).
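The ramp argument can be checked with a short computation. The sketch below assumes the Gaussian form of Eqn. (2) with unnormalized Gaussians, and evaluates the force at the center of a hypothetical 11 x 11 intensity ramp, where the image window is symmetric about P; the names and parameter values are illustrative.

```python
import math

def force(image, py, px, sigma_s, sigma_g):
    """Force (Fx, Fy) on pixel (py, px) with an assumed Gaussian falloff
    in distance and in gray level difference."""
    gp = image[py][px]
    fx = fy = 0.0
    for qy, row in enumerate(image):
        for qx, gq in enumerate(row):
            dx, dy = qx - px, qy - py
            if dx == 0 and dy == 0:
                continue
            d = math.hypot(dx, dy)
            w = math.exp(-d * d / (2.0 * sigma_s ** 2)
                         - (gq - gp) ** 2 / (2.0 * sigma_g ** 2))
            fx += w * dx / d
            fy += w * dy / d
    return fx, fy

N = 11
# Linear ramp: gray level increases by 10 per column.
ramp = [[10 * x for x in range(N)] for _ in range(N)]

# Pixels at (dx, dy) and (-dx, -dy) have the same distance and the same
# squared gray level difference, so their attractions cancel.
fx, fy = force(ramp, N // 2, N // 2, sigma_s=2.0, sigma_g=10.0)
print(fx, fy)
```

The result is zero up to rounding error. Away from the image center the finite image border breaks the symmetry, so this check applies only where the neighborhood that contributes significantly lies inside the image.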

C. Multiscale Behavior: Variation of scale parameters
$\sigma_s$ and $\sigma_g$ leads to a stable direction pattern of F, as
was discussed for the family of transforms in the previous
section. These parameters change from point to
point, and therefore further processing is necessary to
obtain a multiscale description of the image structure.
This further processing is beyond the scope of the current
paper; therefore we will not address it any further
for the specific transform proposed.

5  Multiscale  Shape  Description 
There are two ways in which the proposed approach
yields multiscale description of region shape along with
multiscale edges. First, we have seen that the detected
edges form closed contours surrounding the corresponding
regions. Therefore, multiscale edge detection implies
detection of multiscale regions. The second way
in which the transform extracts multiscale region shape
information is by making explicit their skeletons. Recall
from Property 1 above that in the vicinity of an
edge point the magnitude of F decays away from the region
boundary and its direction points away from the boundary
point. This implies that for the appropriate values
of the scale parameters at each point, as one moves
away from the region boundary towards the interior there is
a curve across which the force changes direction, from
facing one side of the region boundary to another. If
scale parameters not specific to the different points are
used, this and other properties may not be observed.
For example, if a value of $\sigma_s$ which is too small is used
inside a region, then the force may become 0 because
of the homogeneity throughout the neighborhood over
which the transform value significantly depends.
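The appearance of a skeleton as a change of force direction inside a region can be sketched on a vertical bar. The example assumes the Gaussian form of Eqn. (2) with unnormalized Gaussians and fixed, hypothetical scale values. A convergent sign change of the horizontal force component along a scan line (forces on adjacent pixels pointing toward each other) is used as a simple proxy for the skeleton; that criterion is an assumption of this sketch.

```python
import math

def force(image, py, px, sigma_s, sigma_g):
    """Force (Fx, Fy) on pixel (py, px) with an assumed Gaussian falloff
    in distance and in gray level difference."""
    gp = image[py][px]
    fx = fy = 0.0
    for qy, row in enumerate(image):
        for qx, gq in enumerate(row):
            dx, dy = qx - px, qy - py
            if dx == 0 and dy == 0:
                continue
            d = math.hypot(dx, dy)
            w = math.exp(-d * d / (2.0 * sigma_s ** 2)
                         - (gq - gp) ** 2 / (2.0 * sigma_g ** 2))
            fx += w * dx / d
            fy += w * dy / d
    return fx, fy

def convergent_flips(image, row, sigma_s, sigma_g):
    """Columns c where Fx > 0 at c and Fx < 0 at c + 1, i.e. where the
    forces on horizontally adjacent pixels point toward each other."""
    fx = [force(image, row, c, sigma_s, sigma_g)[0] for c in range(len(image[0]))]
    return [c for c in range(len(fx) - 1) if fx[c] > 0 > fx[c + 1]]

# A bright vertical bar (columns 6..13) on a dark background, 20 columns wide.
image = [[0] * 6 + [100] * 8 + [0] * 6 for _ in range(15)]
skeleton = convergent_flips(image, 7, sigma_s=2.0, sigma_g=10.0)
print(skeleton)
```

Under these assumptions, one flip falls between columns 9 and 10, the center line of the bar; the other two fall at the centers of the background strips on either side, which the image border closes off as regions.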

6  Experimental  Results 

This section describes the experiments we have done to
test the performance of the proposed transform. Since
the combination of scale information from multiple pixels
is outside the scope of this paper, the experiments
were carried out to demonstrate the various properties
of the transform, instead of obtaining edges or skeletons
for images. To demonstrate each property, we have
chosen appropriate parameters of the transform. These
parameters would be selected automatically in the subsequent
processing as outlined earlier. Our experiments
demonstrated that the transform is fairly insensitive to
the choice of the decay function. The results are rather
similar for different choices, including step, linear,
exponential, and Gaussian functions. Some of these functions
were used to produce the results shown. This empirically
confirms the power of the inherent directional
sensitivity aspect of the transform, and our argument
that the usual linear, convolution based edge detection
has basic limitations.

Figure  5  shows  the  direction  of  the  transform  for  an 
image,  detected  using  fixed  values  of  erg  and  (Xg  at  all 
pixels.  Different  directions  of  F  are  shown  coded  by 
different  gray  levels.  This  coding  does  lead  to  artifact 
edges  when  the  directions  corresponding  to  the  dark 
and  bright  gray  levels  occur  at  adjacent  image  loca¬ 
tions.  The  corresponding  discontinuities  should  be  ig¬ 
nored.  Edges  and  region  skeletons  can  be  seen  as  direc¬ 
tional  discontinuities.  The  extracted  structure  would 
improve  after  the  subsequent  automatic  processing  of 
the  transformed  image  not  done  in  this  paper.  Fig  6 
demosntrates  the  sensitivity  of  the  transform  to  edge 


geometry.  The  synthetic  image  was  designed  to  cont2un 
linear  and  curved  segments  as  well  as  corners  formed 
by  them.  The  directional  discontinuities  coincide  with 
edges  for  all  parts  of  boundary  without  any  smooth¬ 
ing.  The  skeletons  of  the  regions  are  also  clear  from  the 
directional  discontinuities  present  along  them.  Fig.  7 
shows  the  directions  of  F  for  the  corner  image  of  Fig. 
4.  This  image  was  used  in  Sec.  4  to  cuialytically  evalu¬ 
ate  the  sensitivity  of  the  transform  to  edge  geometry  by 
deriving  closed  form  expressions  for  directional  discon¬ 
tinuities  along  the  edges.  The  experiments  were  con¬ 
ducted  with  corners  of  angles  15,  30,  45  and  90  degrees, 
with  clean  (binary)  images  as  well  as  after  adding  zero- 
mean  Gaussian  noise  having  standard  deviations  of  10 
and  30.  In  all  cases,  the  edges  were  mostly  preserved  as 
there  is  a  large  directional  discontinuity  across  edges. 
The  magnitudes  of  the  directional  discontinuities  cre¬ 
ated  away  from  the  edges  due  to  noise  are  comparatively 
very  small  and  easily  distinguishable  from  those  along 
the  edges.  Fig.  7  shows  F  for  only  two  of  the  cases  for 
brevity,  including  the  sharpest  corner  (15  degrees).  The 
vectors  are  shown  as  line  segments  at  each  point  in  the 
vicinity  of  the  corner.  The  length  of  the  line  segment 
represents  the  vector  magnitude  and  the  tail  of  the  vec¬ 
tor  is  indicated  by  the  end  having  a  small  filled  circle. 
Such  corners  present  a  challenge  to  the  common  meth¬ 
ods  of  edge  detection.  However,  the  results  of  the  trans¬ 
form  show  that  the  directional  discontinuities  coincide 
with  even  the  edges  of  the  15  degree  corner  without  any 
rounding.  Notice  that  the  magnitude  of  the  force  away 
from  the  boundary  decreases  as  expected  and  hence  it 
is  more  susceptible  to  noise  even  though  the  expected 
response value is still the correct one. This will not happen
after processing of the transform results with automatically
chosen scale parameters, since the value of σ would be large
towards the region interior, giving a larger force magnitude
and thus eliminating the increased susceptibility to noise.
Figure 5. (a) A gray level image and (b) the force direction computed from the transform; the brightness
at each pixel is proportional to the direction angle. When the directions corresponding to the dark and
bright gray levels occur on adjacent image locations, artifact edges are visible which should be ignored.

Figure 6. The sensitivity of the transform to edge geometry. (a) A binary synthetic image designed to test the
performance of the transform near edges of different curvatures and their junctions. (b) The force directions
computed by the transform shown as in Fig. 5. The edges and skeletons are visible as directional discontinuities.

Figure 7. The force directions near a corner shown as vectors with length proportional to magnitude
and the tail of the vector shown by a filled circle. Left column shows directions for a [?]-degree
corner with: (top) no noise, (middle) zero-mean Gaussian additive noise having standard deviation
10, (bottom) same as (middle) but standard deviation 30. Right column shows analogous results but for a
Abstract

We present an approach for solving the figure-ground
problem and computing volumetric descriptions in
complex real images for an important class of objects:
straight homogeneous generalized cylinders (SHGCs). Past
work on shape description and recovery from contours
assumed that perfect contours are given or that the scene
has already been segmented. We address the problems
of scene segmentation and shape recovery together. Our
method is based on exploiting mathematical invariant
properties of the contours of generalized cylinders in a
multi-level perceptual grouping approach. The method
handles SHGCs in complex scenes with occlusion and
markings. We demonstrate the application of our method
on complex real images. We also demonstrate the usage
of the obtained descriptions for recovery of complete
3-D object centered descriptions of viewed objects.

1  Introduction 

One of the fundamental problems in computer vision
is the recovery of the shape of viewed objects in a scene
from a monocular intensity image. A basic source of
difficulty is the ambiguity caused by projection. Human
vision does show that substantial information can be
inferred from a single intensity image. This includes both
segmentation of the scene into different objects and
perception of their 3-D shape. To achieve such ability in
machines has continued to be one of the most challenging
problems in computer vision.

Among the cues that can be used for shape recovery
from an intensity image, contour is the richest in geometric
information and most robust to changes in viewing
conditions. Using contour as a cue for shape description
and recovery has received the attention of many
researchers since the early days of the field. However,
most previous work on inferring 3-D shape from 2-D
contours assumes that the problem of object (surface)
segmentation has been solved, whereas this is a key and
difficult step in monocular scene analysis. Real images
produce contours with many imperfections such as
distortions, breaks and occlusion. Further, not just 'real'
image contours are present in an image. Surface markings,
shadows and noise also produce contours. Figure 1 shows
an example of a real image and its extracted edges (by a
Canny edge detector [Canny 1986]). The difficulty in
dealing with such imperfections is that it is impossible, by
just looking at the contours individually, to tell which
constitute real contours and of what objects and which do
not, or simply how many objects there are in the scene.
This problem is known, in psychology, as the figure-ground
problem and is more difficult in the presence of occlusion
as substantial object contours may be invisible. To address
this problem, it is necessary to use a grouping process to
collect relevant features together and discard the
irrelevant ones.

* This research was supported by the Advanced Research Projects
Agency of the Department of Defense and was monitored by the Air
Force Office of Scientific Research under Contract No. F49620-90-
C-0078. The United States Government is authorized to reproduce
and distribute reprints for governmental purposes notwithstanding
any copyright notation hereon.


Figure  1  A  real  image  and  its  extracted  contours 


Given the hierarchical nature of object and scene features,
it is important to address the figure-ground problem
by using feature grouping at different levels of this
hierarchy. In order to be of interest for imperfect contours,
the grouping constraints should be locally applicable
so as to handle occlusion and gaps. Yet, they
should provide global criteria that discriminate between
true instances and accidental ones.

In this paper, we describe our approach to the
figure-ground problem based on these observations. It
addresses the problem of shape description and scene
segmentation in the presence of multiple objects, broken
contours, markings and occlusion for SHGCs, an important
subclass of GCs [Binford 1971]. SHGCs do capture
a large number of objects such as industrial parts
and tools. Our method exploits projective invariant
properties of SHGCs in order to guide the segmentation
and description processes. We believe that invariant
properties of generic shapes greatly help in solving the
figure-ground problem. In fact, an essential characteristic
of the segmentation of an image of a 3D scene should
be its viewpoint invariance. The method we propose
consists of a bottom up, perceptual grouping approach
handling a hierarchy of three levels of features: curves,
symmetries and SHGC surface patches. Throughout
this hierarchy, the invariants are used for local feature
detection, grouping of those features into consistent
global ones and completion of missing features using only
information from the image itself.

We  have  implemented  a  system  that  produced  satis¬ 
factory  results  on  rather  complex  scenes  by  the  stan¬ 
dards  of  currently  developed  methods.  We  have  used 
these results for 3D shape recovery and we believe that
they  also  have  application  in  recognition. 

The remainder of this paper is organized as follows.
In section 2, we discuss related previous work. In section 3,
we discuss the projective invariant properties of
SHGCs we use. In section 4 we give an overview of our
approach. Sections 5 and 6 discuss in detail the upper
two levels of our hierarchy. Examples of obtained results
are also given. Section 7 includes a discussion of
robustness issues and a comparison of our method with
others proposed in the literature. In section 8, we
demonstrate the usage of those results for 3-D shape
recovery. We conclude this paper in section 9.

2  Previous  Work 

Previous methods on generic shape detection and recovery
can be classified into two broad classes: those
assuming perfect data and those using real images.

Methods of shape recovery from perfect data focus on
the recovery of 3D shape and assume that the problem
of image segmentation has been solved. The method of
Ulupinar and Nevatia [Ulupinar & Nevatia 1990a and b,
Ulupinar & Nevatia 1991] addresses 3D recovery of
certain classes of surfaces such as zero Gaussian curvature
surfaces, SHGCs and planar right constant generalized
cylinders. Their method is based on the observation
that  certain  types  of  symmetries  provide  strong  con¬ 
straints  on  3D  shape.  The  method  of  Gross  and  Boult 
[Gross & Boult 19??] addresses 3D recovery of SHGCs
using  a  combination  of  constraints  from  contour  and  in¬ 
tensity  information. 

Methods of shape from real images explicitly address
the problems of real image imperfections. They can be
classified into two classes: specific model based and
generic model based, although there is no clear dividing
line between them. An example of a specific model
based approach is Acronym [Brooks 1983], which uses
stored models to predict and match image features. That
system has been used for the detection of airplane models
(straight GCs) from a top view. Generic model based
methods use 2D properties of generic 3D shapes, without
using specific objects as models. Ponce et al. [Ponce
et al. 1989] have derived projective invariants of the
contours of SHGCs and used them in a simple method
to detect their axes. Richetin et al. [Richetin et al. 1991]
also used properties of the contours of SHGCs for pose
estimation from an image of a single object. These last
two methods do not address the segmentation problem.
Rao and Nevatia [Rao & Nevatia 1989] and Mohan and
Nevatia [Mohan & Nevatia 1989] have proposed two
different methods for the figure-ground problem in complex
scenes with occlusion, in the context of ribbons.
Both methods address the problem of selecting symmetries
and ribbons, a necessary task for the figure-ground
problem. Those methods use rather intuitive constraints.
Sato and Binford [Sato & Binford 1992a and b]
have recently proposed a method to detect SHGCs.
Their method and ours are quite similar in the principle
of using projective properties. However, they differ in
the way those properties are used and in the complexity
of the scenes they can handle. Most notably, their system
does not handle occlusion. We will compare it to
ours in more detail in section 7.

3  Properties  of  SHGCs 

Projective invariant properties provide strong constraints
for the detection of generic shapes. Thus, detection
of contours satisfying those constraints is an
important step of the figure-ground problem and ensures
that image segmentation is itself a viewpoint invariant
process. Here, we include relevant properties
from previous work and new properties we have derived.
First, we give the definition of an SHGC.

Definition  1:  An  SHGC  [Shafer  &  Kanade  1983] 
(straight  homogeneous  generalized  cylinder)  is  the  sur¬ 
face  obtained  by  sweeping  a  planar  cross-section  curve 
C  along  a  straight  axis  A  while  scaling  it  by  a  function  r. 

Let $C(t) = (u(t), v(t))$ be a parametrization of C, $r(s)$ the
scaling function and $\alpha$ the angle between the cross-section
plane and the SHGC axis (z-direction). Then the surface
of the SHGC can be parameterized as follows
(using the formulation of [Shafer & Kanade 1983]):

$$S(t,s) = \bigl(u(t)\,r(s)\sin\alpha,\;\; v(t)\,r(s),\;\; s + u(t)\,r(s)\cos\alpha\bigr) \qquad (1)$$

When $\alpha = \pi/2$, we obtain a right SHGC (RSHGC).
Figure 2 shows an RSHGC and the chosen configuration
of the axes. For an LSHGC the scaling function is
linear, i.e. $r(s) = a(s - s_0)$. Curves of constant t are called
meridians and curves of constant s are called cross-sections
(also parallels).
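
As an illustration of equation (1), the following minimal Python sketch evaluates points on an SHGC surface. The function names (`shgc_point`, `unit_circle`, `linear_scaling`) and the particular cross-section and scaling values are assumptions chosen for the example; they are not part of the original formulation.

```python
import math

def shgc_point(t, s, cross_section, r, alpha=math.pi / 2):
    """Evaluate equation (1): the point S(t, s) on an SHGC surface.

    cross_section(t) returns (u(t), v(t)); r(s) is the scaling function;
    alpha is the angle between the cross-section plane and the axis.
    """
    u, v = cross_section(t)
    scale = r(s)
    return (u * scale * math.sin(alpha),
            v * scale,
            s + u * scale * math.cos(alpha))

def unit_circle(t):
    # Hypothetical cross-section curve C(t): a unit circle.
    return (math.cos(t), math.sin(t))

def linear_scaling(s, a=0.5, s0=-2.0):
    # Linear scaling r(s) = a (s - s0), i.e. the LSHGC case.
    return a * (s - s0)

# A right LSHGC (alpha = pi/2) sampled at t = 0, s = 2.
x, y, z = shgc_point(0.0, 2.0, unit_circle, linear_scaling)
```

Fixing t and varying s traces a meridian; fixing s and varying t traces a cross-section.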

Figure 2 SHGC coordinate system and terminology

In [Ulupinar & Nevatia 1990a], a class of symmetry
called parallel symmetry is defined that is present in
SHGC contours under certain conditions (discussed
later). Its definition is given here.

Definition 2: Two planar unit speed curves¹ $C_1(w_1)$
and $C_2(w_2)$ are said to be parallel symmetric if there exists
a continuous and monotonic function f, such that
$T_1(w_1) = T_2(f(w_1))$, where $T_i(w_i)$ is the unit tangent
vector of $C_i(w_i)$. Thus, corresponding points have parallel
tangent vectors.


Figure  3  Example  of  parallel  symmetric  curves 

The correspondence is said to be linear if f is a linear
function. In this case the two curves are similar up to
scale and translation. The axis is the locus of midpoints
of lines of symmetry (correspondence lines). Figure 3
gives an example. A property of linear parallel symmetric
curves is that lines of correspondence are either mutually
parallel (for a unit scaling) or all intersect at one
point (apex).

Now we state projective invariant properties of
SHGCs. Some have been derived in previous work
[Ponce et al. 1989, Shafer & Kanade 1983, Ulupinar &
Nevatia 1990a]; we state those without proofs. Others
are introduced here; proofs of these are contained in the
appendix. Figure 4 illustrates those properties.

Property P1: Cross-section curves of an SHGC are mutually
parallel symmetric with a linear correspondence.
This property holds in 3-D and in the 2-D (orthographic)
projection. The proof can be found in theorem 4 and
its corollary in [Ulupinar & Nevatia 1990a].

Property P2: Contour generators (limbs) of an LSHGC
are straight (they are meridians). This property holds
also for the 2-D projection of limbs which are projections
of those meridians. Therefore, in 2-D, the tangent
line and any correspondence line at each limb point are
colinear. The proof can be found in Section 4 of [Shafer
& Kanade 1983].

Property P3: In 3-D, tangents to the surface in the direction
of the meridians at points on the same cross-section,
when not parallel, intersect at a common point on
the axis of the SHGC [Shafer & Kanade 1983]. In 2-D,
tangents to the projections of limbs intersect on the
projection of the axis at a common point [Ponce et al. 1989,
Ulupinar & Nevatia 1990a].

¹ A curve is unit speed if it is parameterized by arc length.

The properties we add have been reported, without
proofs, in an overview of this work in [Nevatia et al.
1992]. Equivalent ones have been independently derived
by [Sato & Binford 1992a and b]. Here, we state
the new properties and give their proofs.

Property P4: We give this property in the form of a theorem
and its corollary.

Theorem P4: Lines of correspondence between any pair
of cross-section curves are either parallel to the axis or
intersect on the axis at the same point. The proof of this
theorem is given in appendix A.1.1.

Corollary P4: In 2-D (orthographic projection), lines of
parallel symmetry between any pair of projected cross-sections
are either parallel to the projection of the axis
or intersect on it at a common point. The proof of this
corollary is given in appendix A.1.2.

Property P5: Let $C_1(u)$ and $C_2(v)$ be two unit speed parallel
symmetric curves with a linear correspondence
$f(u) = au + b$. Then for all u and u' the vectors
$V_1 = C_1(u') - C_1(u)$ and $V_2 = C_2(au' + b) - C_2(au + b)$ are
parallel and $|V_2|/|V_1| = a$ (i.e. the ratio of their lengths is
constant and equal to the scaling of the correspondence).
The proof is given in appendix A.1.3.
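
Property P5 can be checked numerically on a simple case. The sketch below is only an illustration under assumed inputs: the first curve is a unit circle parameterized by arc length, and the second is a scaled and translated copy of it, re-parameterized by its own arc length so that the correspondence is f(u) = au + b. All function names and numeric values are hypothetical.

```python
import math

def c1(u):
    # Unit circle parameterized by arc length (unit speed).
    return (math.cos(u), math.sin(u))

def make_c2(a, b, shift):
    # Copy of c1 scaled by a and translated by shift, parameterized by
    # its own arc length v = a*u + b, so it is also unit speed.
    def c2(v):
        x, y = c1((v - b) / a)
        return (a * x + shift[0], a * y + shift[1])
    return c2

def chord(curve, p0, p1):
    # Vector from curve(p0) to curve(p1).
    x0, y0 = curve(p0)
    x1, y1 = curve(p1)
    return (x1 - x0, y1 - y0)

def check_p5(a=2.5, b=0.7, shift=(4.0, -1.0), u=0.3, u_prime=1.9):
    """Return (cross product, length ratio) of V1 and V2 from property P5."""
    c2 = make_c2(a, b, shift)
    v1 = chord(c1, u, u_prime)
    v2 = chord(c2, a * u + b, a * u_prime + b)
    cross = v1[0] * v2[1] - v1[1] * v2[0]  # zero when V1 and V2 are parallel
    ratio = math.hypot(*v2) / math.hypot(*v1)
    return cross, ratio

cross, ratio = check_p5()
```

For this construction the cross product vanishes and the ratio equals the scaling a, up to floating-point error.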

[Figure 4 labels: SHGC axis; tangents at corresponding points (Property P3); corresponding segments (Property P5); parallel symmetric cross-sections (Property P1); lines of parallel symmetry (Property P4).]

Figure 4 Projective invariant properties of SHGCs

The  usage  of  these  properties  will  be  discussed 
throughout  the  description  of  the  method  in  the  next 
sections. 

4  Overview  of  the  approach 

Our method operates at three perceptual grouping levels:
the curve level, the parallel symmetry level and the
SHGC patch level. The curve level grouping is aimed at
forming contours from edges detected in a real image.
The input to that level is an edge image and the output
is a set of global contours. Contours are first formed by an
edge linking process based on simple contiguity criteria.
Those contours are then segmented at corners. Obtained
contours are usually discontinuous. At this level, a
conservative co-curvilinearity based process is used for
bridging short breaks. A method similar to [Mohan &
Nevatia 1989] is used here, which consists of using an
energy function measuring zeroth order and first order
continuity between curve ends.

The parallel symmetry level grouping is aimed at
forming global linear parallel symmetries between contours
formed at the curve level. A hypothesize-verify
process of several steps is used for that purpose. Linear
parallel symmetries are used to hypothesize cross-sections
of SHGCs.

The SHGC patch level grouping is aimed at forming
global SHGC surface descriptions whenever possible.
Another hypothesize-verify process is used that starts
by detecting local SHGC patches (defined later),
generates grouping hypotheses of those patches and
produces verified global SHGC hypotheses.

The perceptual grouping nature of our method allows it
to handle discontinuities in descriptions caused by
occlusion and large gaps. The constraints for detection of
features (parallel symmetries, SHGC patches) and
grouping of those features are based on the projective
invariant properties discussed in section 3. Those properties
are also used for completion of missing features
such as broken contours and surfaces.

For lack of space, we will not discuss the curve level
grouping. In section 5, we discuss the parallel symmetry
level and in section 6, the SHGC patch level.

5  Parallel  Symmetry  Level  Grouping 

To detect parallel symmetries we use a hypothesize-verify
process of several steps, from detection of local
correspondences to forming consistent global ones. The
steps are discussed below.

5.1  Detection  of  local  correspondences 

Local parallel symmetry correspondences are detected
using the method of [Saint-Marc & Medioni 1990].
This method consists, first, of fitting quadratic B-splines
to curves, then finding correspondences analytically.
The correspondences obtained are generally noisy,
sparse due to breaks and may involve both desired and
undesired symmetries (involving markings, for example).
Further, the correspondences may not be linear,
which is a requirement for cross-sections of SHGCs
(property P1). Figure 5 gives some examples of
correspondences given by that method. The second example
has on the order of a thousand such symmetries. Grouping
and selection of relevant correspondences are the
objective of the next steps.

5.2  Grouping  of  parallel  symmetries 

The purpose of this step is to generate grouping
hypotheses (connections) between symmetry elements.
For this, we use a grouping method based on a local
compatibility constraint derived from property P3. As
shown in Figure 6, two symmetry elements ps1 and ps2
are considered for grouping if the vectors s and r defined
by the end points of the symmetries are parallel
and the ratio of their lengths $|s|/|r|$ (local scale) is
similar to the scale $|S|/|R|$ suggested by the grouped
symmetries (global scale). In practice, we also use a
connection measure for each grouping hypothesis. For
this, we distinguish two cases, continuous connection
and discontinuous connection.

Figure 5 Local parallel symmetry correspondences
(axes are in thick lines)


1) Continuous connection: in this case the two symmetry
elements have a common curve (Figure 6.a). The
connection measure is based on the relative gap,
$E = |l|/|s|$, between the non connected curves. A
grouping hypothesis is generated if E is less than a fixed
threshold. This measure has been introduced so as to
penalize distant symmetry elements, a case that might
occur between unrelated symmetries (involving markings,
for example).


[Figure 6 panels: a. continuous connection, $E = |l|/|s|$; b. discontinuous connection, $E = (|l|/|s|)(\alpha^2 + \beta^2)$; local scale $= |s|/|r|$, global scale $= |S|/|R|$.]

Figure 6 Grouping of parallel symmetry elements

2) Discontinuous connection: in this case, the symmetry
elements do not share a common curve. Another
connection measure is used in this case,
$E = (|l|/|s|)(\alpha^2 + \beta^2)$, where $\alpha$ and $\beta$ are as shown in
Figure 6.b. It controls both the gap and the continuity of
the symmetric curves. Gaps that involve change in
curvature sign are not considered.


This local compatibility constraint prevents grouping
of symmetry elements such as the ones of Figure 7.a
and Figure 7.b. Notice that grouping of symmetry elements
implies grouping of curves involved in the symmetries.
Therefore, this step implicitly handles (cross-section)
curve groupings that have not been generated at
the curve level due to large gaps or non smooth
connections.
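
The local compatibility test and the two connection measures described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the tolerance values, the function names, and the exact form of the discontinuous measure (reconstructed here as (|l|/|s|)(α² + β²)) are assumptions. The parallelism test uses the cross product only, so anti-parallel vectors are also accepted.

```python
import math

def norm(v):
    return math.hypot(v[0], v[1])

def are_parallel(s, r, angle_tol=math.radians(5.0)):
    # Parallel (or anti-parallel) within an assumed angular tolerance.
    denom = norm(s) * norm(r)
    if denom == 0.0:
        return False
    cross = s[0] * r[1] - s[1] * r[0]
    return abs(cross) / denom <= math.sin(angle_tol)

def scales_similar(local_scale, global_scale, rel_tol=0.15):
    # Assumed similarity criterion: relative difference below rel_tol.
    return abs(local_scale - global_scale) <= rel_tol * global_scale

def compatible(s, r, global_scale):
    """Local compatibility: s parallel to r and |s|/|r| close to the global scale."""
    if not are_parallel(s, r):
        return False
    return scales_similar(norm(s) / norm(r), global_scale)

def continuous_measure(gap_len, s_len):
    # Relative gap E = |l| / |s| for a continuous connection.
    return gap_len / s_len

def discontinuous_measure(gap_len, s_len, alpha, beta):
    # Assumed form E = (|l| / |s|) * (alpha^2 + beta^2); alpha and beta are
    # the continuity angles of the two gaps, in radians.
    return (gap_len / s_len) * (alpha ** 2 + beta ** 2)
```

A grouping hypothesis would then be kept when `compatible(...)` holds and the relevant measure falls below a fixed threshold.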

5.3  Selection  of  hypotheses 
The previous step may produce a large number of
connection hypotheses, some of them conflicting. Conflicts
arise when there is more than one connection hypothesis
involving the same curve at the same end.
Figure 8 gives an example. At this level, it is difficult to
decide which, among competing hypotheses, is the right
one. Consequently, all alternative combinations involving
non competing hypotheses (conflict free sets) are
investigated.

Figure 7 Non grouped symmetries. a. non parallel
connections, b. non similar local and global scales.


[Figure 8 annotation: the gap in the middle curve is
differently influenced by the upper and lower curves.]


Figure  8  Competing  hypotheses.  Grouping  of  psj  and 
ps2  competes  with  that  of  psj  and  ps4. 
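
One way to picture the enumeration of conflict free sets is the brute-force sketch below. It assumes each connection hypothesis is described by the set of (curve, end) pairs it uses, and that two hypotheses compete when they share such a pair; the representation, names, and the choice to return only maximal sets are assumptions made for illustration. The exhaustive search is exponential and is meant only for small examples.

```python
from itertools import combinations

def conflict_free_sets(hyps):
    """Return the maximal sets of mutually non-competing hypotheses.

    hyps maps a hypothesis name to the set of (curve, end) pairs it uses.
    Two hypotheses compete if they use the same curve at the same end.
    """
    names = sorted(hyps)
    maximal = []
    # Visit larger subsets first, so that any smaller conflict-free subset
    # contained in an already found set can be discarded as non-maximal.
    for size in range(len(names), 0, -1):
        for subset in combinations(names, size):
            free = all(not (hyps[a] & hyps[b])
                       for a, b in combinations(subset, 2))
            candidate = set(subset)
            if free and not any(candidate < kept for kept in maximal):
                maximal.append(candidate)
    return maximal

# Hypothetical example: h_a and h_b both extend the left end of curve "mid".
example = {
    "h_a": {("mid", "left")},
    "h_b": {("mid", "left")},
    "h_c": {("top", "right")},
}
alternatives = conflict_free_sets(example)
```

Each returned set is one alternative combination to be passed on to the verification step.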


5.4 Verification of global correspondences

In this step, hypothesized connections are checked for
geometric consistency. The objective is to retain only
those groupings that produce global parallel symmetries
with linear correspondences. The verification consists
of checking the similarity between the scale given by
the global correspondence and the scales of each of its
component parallel symmetry elements and connections
(property P3). This is necessary because the local
compatibility constraint only ensures scale similarity of
two neighboring local correspondences².
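
The need for a global check can be illustrated with a small numeric sketch. The similarity criterion (a relative tolerance) and the way the global scale is supplied are assumptions; the text does not specify either. The example shows a chain of local scales in which every pair of neighbors is similar, yet the chain as a whole is inconsistent with a single global scale.

```python
def similar(a, b, rel_tol=0.15):
    # Assumed similarity test: relative difference below rel_tol.
    return abs(a - b) <= rel_tol * max(a, b)

def neighbors_similar(local_scales, rel_tol=0.15):
    # What the local compatibility constraint alone guarantees.
    return all(similar(a, b, rel_tol)
               for a, b in zip(local_scales, local_scales[1:]))

def verify_global(global_scale, local_scales, rel_tol=0.15):
    # Global verification: every component scale must match the global one.
    return all(similar(s, global_scale, rel_tol) for s in local_scales)

drifting = [1.0, 1.12, 1.25, 1.40]   # hypothetical local scales
locally_ok = neighbors_similar(drifting)
globally_ok = verify_global(1.2, drifting)
```

Here the drifting chain passes the neighbor test but fails the global one, which is the situation the verification step is meant to reject.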

5.5  Boundary  completion 

Selected global correspondences are used in order to
fill in the gaps. Since symmetries are similarity relationships,
missing boundaries of a curve can be inferred
from corresponding boundaries of a symmetric curve.
Boundary completion is different for the two types of
connections discussed in 5.2. We discuss them separately.

1) Continuous connection: in this case, the common
curve of the connected symmetry elements is used as a
model for the missing boundary of the connection. This
is done as follows.

   1) the part of the model curve that corresponds to the
   gap is detected. For this, the global scale is used.

   2) the missing boundary is obtained by scaling and
   translating the previous part so that it fills the gap.

This is shown by the dashed curve in Figure 6.a. This
operation is done efficiently by the use of B-splines. The
cross-section gaps in Figure 9.a and b have been so
completed.

² Due to the usage of similarity measures, the relation may not be
transitive.

2) Discontinuous connection: in this case, there are
gaps on both sides of the connection. The completion is
done in two steps.

   1) boundaries are inferred up to the extremities of the
   continuing curves (dashed curves in Figure 6.b).
   The same procedure as the one discussed previously
   is used here.

   2) the two remaining gaps are filled in each by a
   quadratic B-spline (dotted curves in Figure 6.b).

The two filled in boundaries are parallel symmetric,
thus producing a boundary completion consistent with
the symmetry requirements. Figure 9.c is an example of
such completion.
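
The scale-and-translate operation used to infer a missing boundary from its model curve can be sketched as below. The sketch works on sampled polylines rather than B-splines, and it assumes the part of the model curve corresponding to the gap has already been identified; the function name and the sample values are hypothetical.

```python
def complete_gap(model_part, scale, gap_start):
    """Scale and translate the model part so that it starts at gap_start.

    model_part: list of (x, y) samples of the model curve over the gap.
    scale: scale of the correspondence from the model curve to the
           curve being completed.
    gap_start: (x, y) end point of the curve where the gap begins.
    """
    x0, y0 = model_part[0]
    return [(gap_start[0] + scale * (x - x0),
             gap_start[1] + scale * (y - y0))
            for x, y in model_part]

# Hypothetical model segment copied at half size into a gap starting at (10, 5).
filled = complete_gap([(0.0, 0.0), (1.0, 1.0), (2.0, 0.0)], 0.5, (10.0, 5.0))
```

Because the inferred boundary is a scaled and translated copy of the model part, the two remain parallel symmetric with a linear correspondence.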

Finally, only symmetries involving closed curves are
selected. Closure is defined by the existence of a cycle
of curves connecting both extremities of a given curve.
Gaps between adjacent curves in the cycle are completed
by B-splines. This method has produced satisfactory
results for all tested examples. Closed curves involved
in parallel symmetries are likely to correspond to
cross-sections of SHGCs (Property P1).

Figure 9 shows the results obtained in this level on
some objects. Figure 9.d shows the completed
cross-section of the cone of Figure 5.b.


Figure 9 Results obtained in level 2. Original
contours and completed correspondences.


6 SHGC Patch Level

At this level, the objective is to produce complete
SHGC descriptions. Due to large gaps, usually caused
by occlusion, a single surface may not be detected simply
by searching for closed contours, or by expecting
connectivity between surface extremities as in [Sato &
Binford 1992a]. Furthermore, junctions and corners
may not be reliable as cues for surface segmentation in
a real image, as those features are themselves difficult to
detect and sensitive to image imperfections. However,
even under heavy occlusion, local surface patches, or
surface segments, can still be detected. A fragmented
surface can then be recovered by grouping those surface
patches whenever there is evidence that they project
from the same global surface. In this level, we use a
hypothesize-verify process of several steps in order to
detect local SHGC patches and generate grouping
hypotheses of these patches. The constraints used in this
process are based on projective invariant properties
discussed in section 3.

6.1  Detection  of  local  SHGC  patches 

Definition  3:  A  local  SHGC  patch  is  given  by  a  hypoth¬ 
esized  closed  cross-section  and  a  pair  of  corresponding 
limb  curves  (limb  patches)  satisfying  the  projective 
properties  P2  or  P4;  i.e.  the  limb  patches  are  either 
straight  (for  a  local  LSHGC)  or  have  the  property  that 
lines  of  symmetry  between  any  pair  of  projected  cross- 
sections  intersect  on  a  straight  line  (projection  of  the 
axis).  Figure  10  shows  sample  local  SHGC  patches. 


Figure 10 Sample local SHGC patches. a. cylindrical
patch, b. conical patch, c. non-linear patch

For each hypothesized cross-section, the method consists
of finding limb curves having such a correspondence.
Curve segments lying between the two curves of
a parallel symmetry (involving the hypothesized
cross-section) are classified in two sets lying on the two sides
of the parallel symmetry, say “left” and “right” side. For
each pair of such candidate limbs we check whether
they form corresponding limb patches. Using the given
cross-section, the correspondence can be found using a
method that minimizes the scale of the cross-section³
joining corresponding points [Ulupinar & Nevatia
1990a] (call it limb based cross-section recovery). Pairs
of candidate limbs having such a correspondence are
then checked to see whether they form local SHGC patches:

•  If  the  limb  patches  are  straight,  then  a  local  LSHGC 
patch  is  hypothesized.  The  patch  is  further  classi¬ 
fied  as  being  cylindrical  if  the  two  limb  segments 
are  parallel,  or  otherwise  conical.  In  the  first  case, 
the  limbs  give  an  estimate  of  the  direction  of  the 
axis  (corollary  P4).  In  the  second  case,  the  intersec¬ 
tion  of  the  lines  supporting  the  limbs  of  a  conical 
patch  gives  the  cone  apex,  which  belongs  to  the  axis 
(also  corollary  P4).  Cross-section  recovery  in  this 
case is simple since all limb correspondence segments
are parallel (property P5).

• If the limb patches are not both straight then
corresponding points between the limb patches are identified
using the limb-based recovery method. This
yields a set of recovered cross-sections (see
Figure 11). Between each pair of such cross-sections
having different scales, the intersection point
of lines of symmetry is determined (Figure 10.c). A
local SHGC patch is hypothesized if the locus of
such points is a straight line (using fitting criteria).
This line is a local estimate of the projection of the
axis (corollary P4). We call this patch a non-linear
SHGC patch.

³ The scale is with respect to the hypothesized (completed)
cross-section (which we will henceforth call the "top" cross-section).
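
The patch classification above can be sketched in a few functions. This is an illustrative sketch only: straight limbs are represented by two end points, the parallelism tolerance and collinearity tolerance are assumed values, and the straight-line test uses a total least squares fit (principal axis of the point scatter) as one possible "fitting criterion".

```python
import math

def line_intersection(p1, p2, q1, q2, eps=1e-9):
    """Intersection of the infinite lines p1p2 and q1q2, or None if (near) parallel."""
    dx1, dy1 = p2[0] - p1[0], p2[1] - p1[1]
    dx2, dy2 = q2[0] - q1[0], q2[1] - q1[1]
    det = dx1 * dy2 - dy1 * dx2
    if abs(det) <= eps * math.hypot(dx1, dy1) * math.hypot(dx2, dy2):
        return None
    t = ((q1[0] - p1[0]) * dy2 - (q1[1] - p1[1]) * dx2) / det
    return (p1[0] + t * dx1, p1[1] + t * dy1)

def classify_straight_limbs(left, right):
    """Classify a pair of straight limb segments ((x0, y0), (x1, y1)).

    Parallel limbs give a cylindrical patch; otherwise the patch is
    conical and the intersection of the supporting lines is the apex.
    """
    apex = line_intersection(left[0], left[1], right[0], right[1])
    if apex is None:
        return ("cylindrical", None)
    return ("conical", apex)

def max_line_residual(points):
    # Largest distance of the points to their total least squares line.
    n = len(points)
    cx = sum(p[0] for p in points) / n
    cy = sum(p[1] for p in points) / n
    sxx = sum((p[0] - cx) ** 2 for p in points)
    syy = sum((p[1] - cy) ** 2 for p in points)
    sxy = sum((p[0] - cx) * (p[1] - cy) for p in points)
    theta = 0.5 * math.atan2(2.0 * sxy, sxx - syy)  # principal direction
    nx, ny = -math.sin(theta), math.cos(theta)      # unit normal to the line
    return max(abs(nx * (p[0] - cx) + ny * (p[1] - cy)) for p in points)

def is_nonlinear_patch(axis_points, tol=0.5):
    # Hypothesize a non-linear SHGC patch if the intersection points of the
    # lines of symmetry lie (within tol) on a straight line.
    return len(axis_points) >= 3 and max_line_residual(axis_points) <= tol
```

In the conical case the returned apex is a point of the projected axis; in the non-linear case the fitted line is the local estimate of the axis projection.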

We believe this method of finding axis points to be more robust
and accurate than the method used by Ponce et al. [Ponce et al.
1989], based on tangent lines (property P3), as the latter is
sensitive to distortions of the limbs. In a sense, property P3
can be thought of as the limit case of corollary P4 where the
two cross-sections get arbitrarily close to each other. The
error in the slope of a line of correspondence is inversely
proportional to the distance between the cross-sections. Thus,
small errors in the tangent line direction greatly affect the
position of the intersection (axis) points. Further, our method
can be applied to $O(n^2)$ pairs of cross-sections (all
combinations), providing more points for the voting process than
the $O(n)$ that property P3 would.
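The idea of collecting axis points from all pairs of recovered cross-sections and testing whether they lie on a line can be sketched as follows. This is an illustrative sketch under stated assumptions: each cross-section is a list of image points in which equal indices denote corresponding points, two lines of symmetry per pair are intersected, and a total-least-squares fit stands in for the paper's unspecified "fitting criteria". All names are hypothetical.

```python
import math
from itertools import combinations

def intersect(p1, p2, q1, q2, eps=1e-9):
    """Intersection of line p1p2 with line q1q2; None if (nearly) parallel."""
    d = (p2[0] - p1[0], p2[1] - p1[1])
    e = (q2[0] - q1[0], q2[1] - q1[1])
    det = d[0] * e[1] - d[1] * e[0]
    if abs(det) < eps:
        return None
    a = ((q1[0] - p1[0]) * e[1] - (q1[1] - p1[1]) * e[0]) / det
    return (p1[0] + a * d[0], p1[1] + a * d[1])

def axis_points(cross_sections):
    """One candidate axis point per pair of cross-sections (O(n^2) pairs)."""
    points = []
    for ci, cj in combinations(cross_sections, 2):
        # Two lines of symmetry: first and last pairs of corresponding points.
        hit = intersect(ci[0], cj[0], ci[-1], cj[-1])
        if hit is not None:  # equal scales give parallel lines, skipped here
            points.append(hit)
    return points

def fit_line(points):
    """Total-least-squares line fit: (centroid, unit direction, RMS residual)."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points)
    syy = sum((p[1] - my) ** 2 for p in points)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points)
    theta = 0.5 * math.atan2(2.0 * sxy, sxx - syy)
    direction = (math.cos(theta), math.sin(theta))
    # Perpendicular distance of each point to the fitted line.
    sq = sum(((p[0] - mx) * direction[1] - (p[1] - my) * direction[0]) ** 2
             for p in points)
    return (mx, my), direction, math.sqrt(sq / n)
```

A patch would then be accepted as a non-linear SHGC patch when the RMS residual falls below some threshold; the threshold value is a design choice not given in the text.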

6.2  Grouping  of  local  SHGC  patches 

The previous step may generate sparse local patches, not all
corresponding to perceived objects, as surface markings and
contours from different objects may produce false hypotheses.
Further, due to breaks and occlusion, several local patches can
be obtained for the same (global) SHGC. In this step, grouping
hypotheses are generated so as to form global patches describing
complete primitive SHGCs. A combination of local geometric and
structural compatibility constraints is used for hypothesis
generation.

Geometric compatibility between two patches is based on
corollary P4 and property P2. Depending on the types of the
patches, several cases can occur:

•  non-linear and non-linear: the axes must be (almost) colinear
(Figure 11.c and d)

•  non-linear and conical: the cone apex must lie on the axis
(up to some error, Figure 11.a)

•  non-linear and cylindrical: the direction of the cylinder
must be (almost) parallel to the axis

•  conical and conical: either the limbs are colinear (same
apex, as in Figure 11.b), or a line is generated (between the
two apexes)*

•  conical and cylindrical: a line is generated (apex and
direction)

•  cylindrical and cylindrical: the limbs must be colinear (for
the same LSHGC), otherwise the directions must be parallel

* The line will be constrained to be colinear to the global SHGC
axis.
Figure 11 shows some examples of geometrically compatible local
SHGC patches.

Figure 11  Examples of geometrically compatible local SHGC
patches (panels a, b, c, d).
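The geometric compatibility cases listed above can be expressed as a small decision function. The sketch below is an assumption-laden illustration: patches are represented as dictionaries with hypothetical keys ("kind", "point", "dir", "apex"), the tolerances are arbitrary, and the limb-colinearity tests for conical and cylindrical pairs are omitted.

```python
import math

def angle_between(d1, d2):
    """Unsigned angle in degrees between two undirected 2-D directions."""
    cross = d1[0] * d2[1] - d1[1] * d2[0]
    dot = d1[0] * d2[0] + d1[1] * d2[1]
    return math.degrees(math.atan2(abs(cross), abs(dot)))

def point_line_distance(pt, origin, direction):
    """Perpendicular distance from pt to the line (origin, direction)."""
    norm = math.hypot(direction[0], direction[1])
    return abs((pt[0] - origin[0]) * direction[1]
               - (pt[1] - origin[1]) * direction[0]) / norm

def compatible(a, b, ang_tol=5.0, dist_tol=3.0):
    """Geometric compatibility of two local patches.

    Hypothetical patch format:
      {"kind": "nonlinear", "point": (x, y), "dir": (dx, dy)}  local axis
      {"kind": "conical", "apex": (x, y)}
      {"kind": "cylindrical", "dir": (dx, dy)}
    """
    ka, kb = a["kind"], b["kind"]
    if ka == "nonlinear" and kb == "nonlinear":
        # Axes must be (almost) colinear: similar direction, small offset.
        return (angle_between(a["dir"], b["dir"]) <= ang_tol and
                point_line_distance(b["point"], a["point"], a["dir"]) <= dist_tol)
    if kb == "nonlinear":
        return compatible(b, a, ang_tol, dist_tol)  # non-linear patch first
    if ka == "nonlinear" and kb == "conical":
        # The cone apex must lie on the axis, up to some error.
        return point_line_distance(b["apex"], a["point"], a["dir"]) <= dist_tol
    if ka == "nonlinear" and kb == "cylindrical":
        return angle_between(a["dir"], b["dir"]) <= ang_tol
    if ka == "cylindrical" and kb == "cylindrical":
        return angle_between(a["dir"], b["dir"]) <= ang_tol
    # conical/conical and conical/cylindrical: a candidate axis line is
    # generated and checked later against the global axis.
    return True
```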


Structural compatibility involves measures of proximity and
continuity between SHGC patches. We distinguish two cases:
continuous connection, where the patches share a limb curve
segment as for the SHGC in Figure 11.c, and discontinuous
connection, where there is no common limb as in Figure 11.d. A
connection hypothesis is generated between an SHGC patch and a
geometrically compatible neighbor if the connection is
continuous, or is discontinuous with the constraint that the
limb extremities are co-curvilinear or form self-occlusion*. The
co-curvilinearity measure uses looser thresholds than at the
curve level since more information about the contours is given
at this more global level.

Note, finally, that a grouping of local SHGC patches implies a
grouping of their limb curves. Therefore, gaps that have not
been bridged at the curve level using co-curvilinearity may be
bridged at this more global level.

6.3  Selection  of  hypotheses 

Because of the highly constrained nature of the compatibility
measures, conflicting hypotheses are rare at this level. When
they do happen, they involve alternative connections at the same
extremity of some local SHGC patch. The only selection done at
this step consists of preferring continuous connections over
discontinuous ones. Among the remaining hypotheses, the one
involving the closest connection is selected.
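The selection rule just described is simple enough to state in a few lines. In this sketch the hypothesis records and their keys ("extremity", "continuous", "gap") are hypothetical, and "closest" is assumed to mean the smallest gap between the connected limb extremities.

```python
def select_connections(hypotheses):
    """Keep one connection hypothesis per patch extremity.

    Each hypothesis is a dict with (hypothetical) keys:
      "extremity":  identifier of the patch end being connected
      "continuous": True if the patches share a limb curve segment
      "gap":        distance between the connected limb extremities
    """
    by_extremity = {}
    for h in hypotheses:
        by_extremity.setdefault(h["extremity"], []).append(h)

    selected = []
    for key in sorted(by_extremity):
        group = by_extremity[key]
        # Prefer continuous connections; fall back to all candidates.
        continuous = [h for h in group if h["continuous"]]
        pool = continuous or group
        # Among the remaining hypotheses, take the closest connection.
        selected.append(min(pool, key=lambda h: h["gap"]))
    return selected
```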

6.4  Verification 

In this step, grouping hypotheses (candidate global SHGCs)
generated in the previous steps are checked for global
consistency. Global consistency is checked by first determining
the global axis of each set of connected patches, then verifying
its compatibility with the axis of each of those patches.

The global axis detection depends on the nature of the local
patches. If all the local patches are cylindrical with mutually
colinear limbs, then they form a global cylindrical LSHGC with
an axis direction determined from the direction of the global
limbs. Similarly, if all the local patches are conical with
colinear limbs, then they form a global conical LSHGC whose apex
is the intersection of the global limbs. Otherwise, the global
axis (line) is detected by combining the recovered
cross-sections of all component local SHGC patches in the same
way discussed for the non-linear patches in section 6.1 (i.e.
using corollary P4) and fitting a line to the obtained axis
points. Figure 12 illustrates this procedure.

* Self-occlusion can be detected by T-junctions and
discontinuities in cross-section scaling between local SHGC
patches.


Verification between the global axis and the component local
ones uses the rules described for geometric compatibility in
section 6.2. Successfully verified groupings form a global SHGC
with a more accurate estimate of the axis.

6.5  Boundary  completion 

Global SHGC patches formed in the previous step consist only of
aggregates of local patches believed to make up a single global
surface. The descriptions of those global surfaces may thus be
discontinuous if they are occluded or simply bounded by broken
contours. This can be seen in the case of the occluded vase of
Figure 9.d, for which the surface boundary does not terminate
due to that occlusion (another example is the case of the
objects in Figure 11.c and d). However, our (human) perception
of a surface is clear there. In fact, we can even guess the
shape of the hidden boundary due to the symmetric nature of the
shape. We show that projective invariants can also be used for
completion of surface descriptions. In this step, gaps in
descriptions of verified global SHGCs are completed. Boundary
completion is done for connections between adjacent local SHGC
patches and for terminations where a local SHGC patch, at an
extremity of the global SHGC, has an incomplete limb
correspondence due, say, to occlusion, as is the case for the
lower part of the occluded vase in Figure 9.d. The method
consists of, first, recovering cross-sections at all points of
the unmatched limb patch, using a method we call axis-based
cross-section recovery, then finding the (missing) corresponding
limb patch using a method we call the limb reconstruction
method. We discuss the two methods separately.

Axis-based  cross-section  recovery 

Given the axis of the global SHGC and a reference cross-section
(of any of the local patches), this method recovers
cross-sections for unmatched limb patches (at connections or
terminations). First, for each point of a given unmatched limb,
its corresponding point on the reference cross-section is found
(they have parallel tangents). See Figure 13. The scale of the
recovered cross-section relative to the reference one is
determined as follows:

In the case of an LSHGC (Figure 13.a), the corresponding point
$P_c$ is simply the intersection of the line from the limb
point, parallel to the limb correspondence line of the reference
cross-section (the line through $R_c$ in the figure*), and the
other straight limb of the LSHGC. The scale is given by the
ratio of the lengths of these two correspondence segments
(property P5).

In the case of a non-linear SHGC (Figure 13.b), consider the
intersection of the axis** with the line connecting the limb
point to its corresponding point on the reference cross-section.
Then, by the property of linear parallel symmetry between the
cross-sections, it can be shown that the scale is given by the
ratio of the distances from that intersection point to the limb
point and to its corresponding reference point.

Figure 13  Axis-based cross-section recovery method: a. for
LSHGCs; b. for non-linear SHGCs.
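For the non-linear case, the recovery amounts to a scaling of the reference cross-section about a point on the axis. The following sketch illustrates this reading under explicit assumptions: all names are hypothetical, the axis is given as an image point and a direction, and the limb point and its reference correspondent are assumed to lie on the same side of the axis intersection point (so the scale is positive).

```python
import math

def axis_intersection(p, r, axis_point, axis_dir, eps=1e-9):
    """Intersection of the line through p and r with the axis line."""
    d = (r[0] - p[0], r[1] - p[1])
    det = d[0] * axis_dir[1] - d[1] * axis_dir[0]
    if abs(det) < eps:
        return None  # correspondence line parallel to the axis
    a = ((axis_point[0] - p[0]) * axis_dir[1]
         - (axis_point[1] - p[1]) * axis_dir[0]) / det
    return (p[0] + a * d[0], p[1] + a * d[1])

def recover_cross_section(limb_pt, ref_pt, reference, axis_point, axis_dir):
    """Scale the reference cross-section so that ref_pt maps onto limb_pt.

    Returns (scale, recovered_points). The scale is the ratio of the
    distances from the axis intersection point to the limb point and to
    its corresponding reference point.
    """
    c = axis_intersection(limb_pt, ref_pt, axis_point, axis_dir)
    if c is None:
        raise ValueError("choose a reference whose correspondence line "
                         "is not parallel to the axis")
    scale = math.dist(limb_pt, c) / math.dist(ref_pt, c)
    recovered = [(c[0] + scale * (q[0] - c[0]), c[1] + scale * (q[1] - c[1]))
                 for q in reference]
    return scale, recovered
```

The `ValueError` branch corresponds to the requirement that the reference cross-section be chosen so that the correspondence line is not parallel to the axis.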


For a termination, the method is applied until the end of the
unmatched limb is reached or there is overlap between the
recovered cross-section and the bottom end cross-section of the
SHGC. In doing so, we obtain a more accurate segmentation of
limbs from cross-section. Figure 14.a shows the cross-sections
so recovered for the connection of the SHGC of Figure 9.b and
the termination of the SHGC of Figure 9.d.


Figure 14  Cross-section recovery and limb reconstruction for
previous SHGCs.


For discontinuous connections where there are no limb patches on
either side of a connection, no cross-sections can be recovered.
This leaves holes in the final description (see third SHGC in
Figure 16.a and b, for which discontinuity is caused by
self-occlusion). In the case of LSHGCs, however, the recovery is
straightforward as the limbs are known on both sides.


* $R_c$ is the point on the top cross-section whose tangent is
parallel to the other straight limb.

** The reference cross-section is chosen so that the
correspondence line is not parallel to the axis.


Limb  reconstruction  method 

Cross-sections recovered by the previous method for a connection
or a termination can be used to infer the missing limb boundary.
The limb reconstruction method finds a point on each of the
recovered cross-sections that is a limb point (in the projection
of an SHGC, limbs and internal cross-sections are tangential to
each other). The method consists of finding the tangential
envelope of the set of recovered cross-sections. Given a
starting point, call it $P_0$ as in Figure 15, taken to be an
extremity at the open side of the connection (or termination),
the method consists of finding a point $P_1$ on the first
recovered cross-section whose tangent line passes through $P_0$.
$P_1$ is marked as a limb point. The process is then repeated
for $P_1$ and the next cross-section until the limb point on the
last cross-section is so determined. Since the axis-based
cross-section recovery produces a dense set of very close
cross-sections, the method essentially treats the infinitesimal
patch between two successive cross-sections as being an LSHGC
patch (where line tangents and limbs are colinear; i.e.
property P2).

Figure 15  Limb reconstruction method
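The iterative tangent search can be sketched for cross-sections approximated by convex polygons. This is only an illustration under assumptions not stated in the text: cross-sections are closed convex polygons, the current point lies strictly outside the next polygon, and a `side` flag (hypothetical) says on which side of the reconstructed limb the cross-sections lie.

```python
def cross(o, a, b):
    """z-component of (a - o) x (b - o); positive if b is left of ray o->a."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def tangent_vertex(p, polygon, side):
    """Vertex v of a convex polygon such that line p-v is tangent to it.

    side=+1: the polygon lies to the left of the ray p->v,
    side=-1: the polygon lies to the right.
    Returns None if no such vertex exists (e.g. p inside the polygon).
    """
    n = len(polygon)
    for i, v in enumerate(polygon):
        s_prev = side * cross(p, v, polygon[i - 1])
        s_next = side * cross(p, v, polygon[(i + 1) % n])
        # Both neighbors on the same (requested) side: the line only touches.
        if s_prev >= 0 and s_next >= 0 and (s_prev > 0 or s_next > 0):
            return v
    return None

def reconstruct_limb(start, cross_sections, side):
    """Chain tangent points through successive recovered cross-sections."""
    limb, p = [], start
    for polygon in cross_sections:
        p = tangent_vertex(p, polygon, side)
        if p is None:
            break
        limb.append(p)
    return limb
```

With densely sampled cross-sections each step moves only a short distance, which is what justifies treating the patch between two successive cross-sections as locally linear.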

For LSHGCs the limb reconstruction is straightforward as it only
consists of extending the known straight limbs for continuous or
discontinuous connections (or terminations). For discontinuous
connections of a non-linear SHGC, the boundary completion is not
performed as no cross-sections could be recovered for the holes
mentioned previously. However, the SHGC can still be used for
3-D shape recovery and recognition (the hole region will be left
unspecified). Figure 14.b shows the limb boundaries so completed
for the SHGCs of Figure 14.a.

Completed SHGCs are further verified for closure. Closure
verification consists of checking required junction properties
for the global SHGCs. Using the terminology used in [Malik
1987], limb curves and the top cross-section generally form
three-tangent junctions*. At the bottom cross-section, they form
curvature-L junctions. Because junctions cannot be expected to
be perfect in real image contours, we use measures based on
proximity and angular variations at junction points. For lack of
space, we omit the details of this process. Hypotheses with
non-closed objects are rejected.

Results of this level are given in Figure 16. Figure 16.a shows
the detected global SHGCs whose original contours have been
given in Figure 9, except the last example whose original
contours are given in Figure 1 (for this latter, notice the
completion of the occluded vase and cone). Recovered
cross-sections and axes are shown. Figure 16.b shows the ruled
surfaces (recovered cross-sections and meridians).

* In the case of edges (non-limb boundaries), arrow and Y
junctions are formed.

Figure 16  Results obtained in level 3. a. Obtained global
SHGCs. b. Corresponding ruled surfaces. The last example is from
the image in Figure 1.

7  Discussion  and  Comparison 

We have tested our method on several images, about ten including
variations of the one in Figure 1, and satisfactory results have
been obtained. Four such examples are shown in Figure 16. Some
of the simple examples have been presented here so as to
illustrate the different steps of the method. The others involve
multiple objects in different arrangements such as the last
example in Figure 16. Also, a number of parameters have been
used in the implementation of our system. We have mentioned them
throughout the description of the method: for example,
connection measures in the parallel symmetry level, and
linearity of limbs and axes and junction measures in the SHGC
patch level. In all the tested images, the values of all
parameters have been constant. Robustness of the system to
changes in those parameters has been tested by changing their
values by 50% of their default ones. Those changes have only
affected the number of hypotheses and consequently the size of
the search space. Looser thresholds produce larger search
spaces. In the case of the contours of Figure 1, for example,
increasing the values of the connection measures of the parallel
symmetry level by 50% of their default ones produced 17
connection hypotheses compared to 4 for the default values.
Also, changes by 50% of the linearity thresholds, in the SHGC
patch level, produced 95 local SHGC patches compared to 94 for
the default values. Most importantly, the same final results
have been obtained by changing the values of the parameters;
i.e. one grouping hypothesis and two terminations accepted
(selecting 4 local SHGC patches from the 94 originally
hypothesized), resulting in the last three SHGCs given in
Figure 16.

By way of comparison, the method of Sato and Binford [Sato &
Binford 1992a and b] is similar to ours in the principle of
using projective invariants to detect SHGCs. It differs,
however, in two ways. First, application of the projective
properties in their method is somewhat restricted to surfaces of
revolution (SORs) and LSHGCs. Their application of the
properties to surface detection, for example, may not give
accurate results for general SHGCs as limb projections are
generally not meridian projections. For example, in their
system, correspondence segments between limbs are assumed to be
parallel, whereas it can be shown that this is the case only for
SORs and LSHGCs. We also believe that our application of the
properties is more robust. For example, their parallel symmetry
detection uses a Hough transform to detect the point of
intersection of correspondence lines between symmetric curves.
For symmetries with a scaling factor close to 1, the error in
the detected point can be large. We use property P5, which is
more robust to errors in correspondences as such errors cause
only small changes in length ratios and directions of
corresponding segments. The main difference between their method
and ours, however, lies in handling occlusion and large gaps.
The authors note that their system does not handle occlusion as
simple connectivity criteria are used for surface detection. Our
SHGC patch level grouping handles breaks and occlusion by
detecting visible local surface patches and grouping compatible
ones into global surface descriptions. Further, our method also
detects non-visible parts of the objects and completes them
adequately, thereby producing complete surface descriptions.

8  3-D  Shape  Recovery

We have applied the method of [Ulupinar & Nevatia 1990a] on our
results. That method produces a viewer-centered description, not
the whole object description. However, it is often desirable to
recover a complete 3-D object-centered description of viewed
objects, also providing volumetric information. For this, we
propose a method that uses constraints from both our 2-D
description and the viewer-centered description mentioned above.
The method assumes that we have a Right SHGC (i.e.
$\alpha = \pi/2$ in equation (1)). This equation becomes:

$$S(t, s) = (r(s)\,u(t),\; r(s)\,v(t),\; s) \qquad (2)$$

Without loss of generality, we assume that $s = 0$ for the top
cross-section and that $r(0) = 1$ (the scaling is relative to
the top cross-section). Note that our 2-D description already
provides the values $r(s_i)$ (the scaling function is an
orthographic invariant). However, it does not provide the values
$s_i$ (call them s-values) necessary for a complete description
of the sweeping function.

Figure 17 gives the configuration of the coordinate systems
relevant to this analysis. We denote $S = (O_S, u, v, s)$ the
SHGC coordinate system, $W = (O_W, x, y, z)$ the viewer
coordinate system, $I = (O_W, x, y)$ the image coordinate
system, and $V$ the viewing direction (assumed, without loss of
generality, to lie in the u-s plane of $S$; orthographic
projection assumed). $V$ makes an angle $\sigma$ with the s-axis
(SHGC axis). Consider $S' = (O_S, u', v', s')$ obtained by
rotating $S$ around $v$ by $\sigma$ so that the new s-axis
($s'$) is aligned with $V$. Let $\theta$ be the angle between
the projection of the SHGC axis and the x-axis. Consider
$W' = (O_W, x', y', z')$ obtained by rotating $W$ by $\theta$
around the z-axis so that the new x-axis ($x'$) is parallel to
the projection of the SHGC axis. Let $I' = (O_W, x', y')$ be
obtained by rotating $I$ by $\theta$. Then, in $I'$ the SHGC
axis is horizontal, and $S'$ and $W'$ differ only by a
translation $O_S - O_W$. The relationships between coordinates
in $S$, $S'$, $W'$ and $I'$ are as follows (from now on we omit
the arguments $t$ and $s$ in the expressions).

[Figure 17: configuration of the coordinate systems (viewing
geometry)]

Letting $(x_s, y_s, z_s)$ be the coordinates of $O_S$ in $W'$, a
point $P$ with coordinates $(ru, rv, s)$ in $S$ has coordinates

$$(\cos\sigma\, ru + s \sin\sigma,\; rv,\; -\sin\sigma\, ru + s \cos\sigma) \quad \text{in } S',$$

$$(\cos\sigma\, ru + s \sin\sigma + x_s,\; rv + y_s,\; -\sin\sigma\, ru + s \cos\sigma + z_s) \quad \text{in } W',$$

and its projection has coordinates

$$(\cos\sigma\, ru + s \sin\sigma + x_s,\; rv + y_s) \quad \text{in } I'.$$

In the remainder of this analysis, we consider image
measurements in $I'$. World coordinates will be expressed
directly in $W'$.
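The chain of coordinate changes from the SHGC frame to the rotated image frame can be written out directly. The sketch below is illustrative only; the function names are hypothetical and the viewing angle is taken to be the angle between the viewing direction and the SHGC axis, as in the relations above.

```python
import math

def s_to_s_prime(u, v, s, sigma):
    """Rotate SHGC coordinates about the v-axis by the viewing angle sigma."""
    return (math.cos(sigma) * u + math.sin(sigma) * s,
            v,
            -math.sin(sigma) * u + math.cos(sigma) * s)

def project_point(r, u, v, s, sigma, origin):
    """World (rotated viewer frame) and image coordinates of a surface point.

    (r*u, r*v, s) are the coordinates of the point in the SHGC frame and
    `origin` = (x_s, y_s, z_s) is the SHGC origin in the viewer frame.
    Orthographic projection simply drops the third coordinate.
    """
    up, vp, sp = s_to_s_prime(r * u, r * v, s, sigma)
    world = (up + origin[0], vp + origin[1], sp + origin[2])
    return world, world[:2]
```

For a viewing angle of 90 degrees the axis lies in the image plane, so a displacement along the axis maps entirely onto the horizontal image coordinate.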

To recover a complete 3-D description of the SHGC, it is
necessary to determine the 3-D top cross-section, the
orientation of the axis, the coordinates $(x_s, y_s, z_s)$ of
its origin (point where the axis pierces the cross-section, not
necessarily at its center) and the values $s_i$
($i = 1 \ldots n$) (s-values) of the cross-sections (parallels)
of interest. The recovery of those parameters is discussed in
appendix A.2.


Application of this method to the descriptions obtained in
Figure 16 is shown in Figure 18, where the objects are displayed
for different values of slant and tilt from their original ones.
(The rightmost object in the figure has been interpreted as an
LSHGC because the scaling function is close to linear, producing
straight limbs over its surface.)

Figure 18  Recovered 3-D descriptions of previous SHGCs shown
from different viewpoints.

9  Conclusion 

We have presented an approach to the figure-ground problem in
real monocular images containing objects that can be described
as SHGCs. It is based on two fundamental aspects: first, the
study and derivation of projective invariant properties of
SHGCs; second, the multi-level perceptual grouping approach to
scene segmentation and shape description. The projective
invariant properties are the basis for detecting local
correspondences, hypothesizing groupings, estimating more
accurate global correspondences, verifying global consistency
and completing missing boundaries. Thus, in a sense, our method
not only filters out irrelevant features but detects where
relevant ones are missing and completes them adequately. Our
method differs from other perceptual grouping methods in that it
uses rigorous constraints for the segmentation process as
opposed to intuitive ones used by those methods. It also differs
from other methods of invariants-based generic interpretation of
image contours in that it uses a hierarchy of grouping levels
handling substantial occlusion. We have also demonstrated the
use of our results for 3-D shape recovery, using a single image
of a scene.

We plan on extending this approach to handling curved axis
primitives (see [Zerroug & Nevatia 1993] for the derivation of
quasi-invariant properties of curved axis GCs), and composite
objects. These latter are made up of simple GC primitives.
Detection of such objects can proceed by first detecting the
component GCs and analyzing their structural and geometric
relationships. Such objects introduce many difficulties that
have not been addressed in this work, including incomplete
cross-sections and non-parallel surface cuts of primitives,
especially at joints. We believe, however, that the constraints
and the methods developed here will be an important part of
handling those objects.
Appendix

A.l  Proofs 

A.1.1  Proof  of  theorem  P4 

Let $P_1 = S(t, s_1)$ and $P_2 = S(t, s_2)$ be two corresponding
points on two different cross-sections. The line $L$ joining
these points can be parameterized as follows:

$$u = [u(t) \sin\alpha\,(r(s_1) - r(s_2))]\, m + u(t) \sin\alpha\; r(s_1) \qquad (3)$$

$$v = [v(t)\,(r(s_1) - r(s_2))]\, m + v(t)\, r(s_1) \qquad (4)$$

$$s = [s_1 - s_2 + u(t) \cos\alpha\,(r(s_1) - r(s_2))]\, m + s_1 + u(t) \cos\alpha\; r(s_1) \qquad (5)$$

Case 1: $r(s_1) \neq r(s_2)$. The intersection point of $L$ with
the SHGC axis (s-axis) is given by setting $u = v = 0$; this
yields $m = -r(s_1) / (r(s_1) - r(s_2))$ and the intersection
point $M$ has coordinates
$(0, 0, (s_2 r(s_1) - s_1 r(s_2)) / (r(s_1) - r(s_2)))$, which
are independent of $t$ (i.e. of the particular points on the two
cross-sections).

Case 2: $r(s_1) = r(s_2)$. In this case the direction of $L$ is
given by the vector $(0, 0, s_1 - s_2)$, which is independent of
$t$ and parallel to the axis. □
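The first case of the proof can be checked numerically. The sketch below is a sanity check rather than part of the proof: it assumes an elliptical cross-section and takes the scaling function as a caller-supplied argument, both hypothetical choices, and evaluates the line through two corresponding points at the parameter value derived above.

```python
import math

def shgc_point(t, s, r, alpha):
    """Point S(t, s) of an SHGC whose cross-section plane makes angle alpha
    with the axis. The elliptical cross-section is an arbitrary example."""
    u, v = math.cos(t), 0.5 * math.sin(t)
    return (r(s) * u * math.sin(alpha),
            r(s) * v,
            s + r(s) * u * math.cos(alpha))

def axis_hit(t, s1, s2, r, alpha):
    """Point of the line through S(t, s1) and S(t, s2) at
    m = -r(s1) / (r(s1) - r(s2)); requires r(s1) != r(s2)."""
    p1 = shgc_point(t, s1, r, alpha)
    p2 = shgc_point(t, s2, r, alpha)
    m = -r(s1) / (r(s1) - r(s2))
    return tuple(a + m * (a - b) for a, b in zip(p1, p2))

def expected_axis_coordinate(s1, s2, r):
    """Third coordinate of the intersection point M from the proof."""
    return (s2 * r(s1) - s1 * r(s2)) / (r(s1) - r(s2))
```

Evaluating `axis_hit` for several values of `t` should give the same point on the axis each time, which is the content of the theorem.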

A.1.2  Proof  of  corollary  P4 

An algebraic proof is not necessary as it is similar to the
previous one, except that the expression of lines of
correspondence is in the image plane. Instead, we give the
following argument.

If the correspondence lines are parallel to the axis, then they
are also parallel in the image (orthographic projection);
otherwise the intersection property also holds in the image
since intersecting lines in 3-D project onto intersecting lines
in the 2-D image (general viewpoint). □

A.1.3 Proof of property P5

$$C_1(u') - C_1(u) = \int_{u}^{u'} T_1(s)\, ds = \int_{u}^{u'} T_2(as + b)\, ds = \frac{1}{a} \int_{v}^{v'} T_2(t)\, dt = \frac{1}{a}\,\bigl(C_2(v') - C_2(v)\bigr) \quad \square$$

A.2  Recovery  of  SHGC  Parameters  (3-D 
Object  Centered  Description) 

A.2.1  Recovery  of  the  3-D  cross-section  curve 

The method of [Ulupinar & Nevatia 1990a] determines the
orientation $N = (N_x, N_y, N_z)$ of the cross-section plane.
The back projection of the top cross-section points
$(x_i, y_i)$ is thus given by $(x_i, y_i, z_i)$, where
$z_i = -(N_x x_i + N_y y_i) / N_z$ (the plane is arbitrarily
fixed to pass through the origin of $W$).

A.2.2 Recovery of the viewing direction

From Figure 17, $\sigma$ is the viewing angle (between the
z-axis and the SHGC axis). The axis of the SHGC has its
direction given by $-N$. Thus $\cos\sigma = -N_z$ and
$\sigma = \arccos(-N_z)$.
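The two relations above translate directly into code. This is a minimal sketch with hypothetical function names; it assumes the plane normal is a unit vector with a non-zero third component, and it does not implement the method that estimates the normal.

```python
import math

def back_project(points, normal):
    """Lift image points onto the plane through the origin with the given
    normal (N_x, N_y, N_z): z = -(N_x x + N_y y) / N_z."""
    nx, ny, nz = normal
    return [(x, y, -(nx * x + ny * y) / nz) for x, y in points]

def viewing_angle(normal):
    """Angle between the z-axis and the SHGC axis, whose direction is -N."""
    return math.acos(-normal[2])
```

Every back-projected point satisfies the plane equation, which gives a quick consistency check on the estimated normal.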

A.2.3 Recovery of LSHGCs

For LSHGCs, $r(s) = (as + 1)$ in equation (2). For a cylinder,
$a = 0$ (constant size cross-section). $N$ is the direction of
sweep. It suffices to recover the s-value for the last
("bottom") cross-section. Let $P = (x, y)$ be a point on that
cross-section and $P_0 = (x_0, y_0)$ its corresponding point on
the "top" one. Then, using the relationship between coordinates
of $C = P - P_0$ in $I'$ and $S'$, we obtain
$s = (x - x_0) / \sin\sigma$.

For a cone ($a \neq 0$), we can determine $a$, the s-value of
the last cross-section and the apex of the cone. Using the
normal $n = (n_x, n_y, n_z)$, determined by the method of
[Ulupinar & Nevatia 1990a], at a point $P$ (as above) and
writing that $n$ and the meridian $C = P - P_0$ are orthogonal
(an LSHGC surface being developable, $C$ is in the tangent plane
at $P$) yields:

$$s = -\frac{\cos\sigma\, n_y (y - y_0) + (x - x_0)(\cos\sigma\, n_x - \sin\sigma\, n_z)}{n_z}$$

Using the value $r$ of the scaling of the last cross-section
(given by our 2-D description), $a$ is given by
$a = (r - 1) / s$. Writing that the apex of the cone, given by
its 2-D coordinates $(x_a, y_a)$, is the intersection point of
all (straight) meridians and using the relationship between its
coordinates in $I'$ and $S'$ yields the coordinates of the SHGC
center: $x_s = x_a - \sin\sigma\,(s / (1 - r))$; $y_s = y_a$;
$z_s = -(N_x x_s + N_y y_s) / N_z$.
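The LSHGC recovery formulas can be collected into a few small functions. This is a sketch under assumptions: the function names are hypothetical, image measurements are taken in the rotated image frame where the projected axis is horizontal, and the surface normal and plane normal are supplied by some other method.

```python
import math

def cylinder_s(p, p0, sigma):
    """s-value of the bottom cross-section of a cylinder (a = 0)."""
    return (p[0] - p0[0]) / math.sin(sigma)

def cone_s_and_slope(p, p0, n, r, sigma):
    """s-value of the last cross-section and slope a of r(s) = a*s + 1.

    p, p0: image points on the last and top cross-sections (corresponding),
    n:     surface normal (n_x, n_y, n_z) at the point imaged at p,
    r:     scaling of the last cross-section relative to the top one.
    """
    dx, dy = p[0] - p0[0], p[1] - p0[1]
    nx, ny, nz = n
    c, sn = math.cos(sigma), math.sin(sigma)
    s = -(c * ny * dy + dx * (c * nx - sn * nz)) / nz
    return s, (r - 1.0) / s

def shgc_origin(apex, s, r, sigma, plane_normal):
    """Coordinates (x_s, y_s, z_s) of the SHGC origin from the image apex."""
    xs = apex[0] - math.sin(sigma) * s / (1.0 - r)
    ys = apex[1]
    zs = -(plane_normal[0] * xs + plane_normal[1] * ys) / plane_normal[2]
    return xs, ys, zs
```

In practice one would presumably evaluate the cone formula at several points and combine the estimates, since both the normal and the correspondences are noisy.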

A.2.4 Recovery of Non-linear SHGCs

For each cross-section (parallel) $C_j$ of interest, it is
necessary to recover the value $s_j$ along the axis. Our 2-D
description already provides the values $r_j$ (the scaling with
respect to the top cross-section). Let $P_j$ be a point on the
surface of the SHGC (on any parallel). Let
$m = (m_x, m_y, m_z)$ be the tangent to the meridian
($(m_x, m_y)$ can be computed in the image) and
$n = (n_x, n_y, n_z)$ the surface normal at $P_j$. Let
$(r_j u, r_j v, s_j)$ be the coordinates of $P_j$ in $S$,
$(x, y, z)$ its coordinates in $W'$ ($(x, y)$ are known from
image measurements) and $(x_0, y_0, z_0)$ the coordinates of
$P_0$, the corresponding point of $P_j$ on the top cross-section
($(x_0, y_0, z_0)$ are determined as discussed in A.2.1).

Writing that $n$ and $m$ are orthogonal (basic differential
geometry) and equating expressions of the tangent to the
meridian in $S'$ and $W'$ yields
$u = v\,((\cos\sigma\, m_x - \sin\sigma\, m_z) / m_y)$, where
$v$ can be computed with good accuracy using the relationship
between coordinates of $P_0$ in $S'$ and $W'$, i.e.
$v = y_0 - y_s$ ($y_s$ is the constant second coordinate of the
projection of the axis in $I'$; recall that in that system the
axis is horizontal). Equating expressions of the correspondence
vector $C = P_j - P_0$ in $I'$ and $S'$ yields
$s_j = (x - x_0 - \cos\sigma\, u\,(r_j - 1)) / \sin\sigma$.

In practice, the above process is performed on points where the
tangents to the meridians are not parallel to the axis or
orthogonal to it. More than one point is used for each
cross-section. The obtained values are averaged to obtain an
estimate of the value of $s_j$.
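The per-point computation for non-linear SHGCs, followed by averaging, can be sketched as follows. The function names and the tuple layout of a measurement are hypothetical; the surface normal and the image tangent to the meridian are assumed to be available from earlier stages.

```python
import math

def recover_s(p, p0, m_xy, n, r_j, y_s, sigma):
    """s-value of the cross-section through the surface point imaged at p.

    p, p0: image coordinates of the point and of its correspondent on
           the top cross-section,
    m_xy:  image components (m_x, m_y) of the tangent to the meridian,
    n:     surface normal (n_x, n_y, n_z) at the point,
    r_j:   scaling of the cross-section relative to the top one,
    y_s:   second image coordinate of the (horizontal) projected axis.
    """
    mx, my = m_xy
    nx, ny, nz = n
    mz = -(nx * mx + ny * my) / nz  # from n . m = 0
    v = p0[1] - y_s
    u = v * (math.cos(sigma) * mx - math.sin(sigma) * mz) / my
    return (p[0] - p0[0] - math.cos(sigma) * u * (r_j - 1.0)) / math.sin(sigma)

def average_s(measurements, r_j, y_s, sigma):
    """Average the estimates from several (p, p0, m_xy, n) measurements."""
    values = [recover_s(p, p0, m_xy, n, r_j, y_s, sigma)
              for p, p0, m_xy, n in measurements]
    return sum(values) / len(values)
```

Points where `m_y` is close to zero (meridian tangent parallel to the projected axis) would have to be filtered out before calling `recover_s`, in line with the practical remark above.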

Let $C_j$ be an arbitrary cross-section, $r_j$ ($r_j \neq 1$)
its scaling with respect to the top cross-section, and $s_j$ its
s-value computed by the method described previously. Let
$(x_a, y_a)$ be the image coordinates of the intersection point
of all lines of symmetry between $C_j$ and the top cross-section
($(x_a, y_a)$ are given by our 2-D description). From (2) and
the analysis in appendix A.1.1, the 3-D coordinates of that
point in $S$ are $(0, 0, s_j / (1 - r_j))$. Equating these with
their expression in $W'$ yields
$x_s = x_a - (\sin\sigma\, s_j) / (1 - r_j)$; $y_s = y_a$; and
$z_s = -(N_x x_s + N_y y_s) / N_z$.
Abstract

This  paper  summarizes  the  underlying  ideas  and  algo¬ 
rithmic  details  of  a  computer  program  that  performs  at 
a  human  level  of  competence  for  a  significant  subset  of 
the  curve  partitioning  task.  It  extends  and  “rounds  out”
the  technique  and  philosophical  approach  originally  pre¬ 
sented  in  a  1986  paper  by  Fischler  and  Bolles.  In  par¬ 
ticular,  it  provides  a  unified  strategy  for  selecting  and 
dealing  with  interactions  between  salient  points,  even 
when  these  points  are  salient  at  “different  scales  of  res¬ 
olution.”  Experimental  results  are  described  involving 
on  the  order  of  1000  real  and  synthetically  generated 
images. 

Index  Terms:  computer  vision,  salient  points,  critical 
points,  curve  partitioning,  curve  segmentation,  curve  de¬ 
scription 

1.  Introduction 

A  critical  problem  in  machine  vision  is  how  to  break  up 
(partition)  the  perceived  world  into  coherent  or  meaning¬ 
ful  parts  prior  to  knowing  the  identity  of  these  parts.  Al¬ 
most  all  current  machine  vision  paradigms  require  some 
form  of  partitioning  as  an  early  simplification  step  to 
avoid  having  to  resolve  a  combinatorially  large  number 
of  alternatives  in  the  subsequent  analysis  process.  Given 
this  critical  role  for  partitioning  as  a  functional  require¬ 
ment  of  a  complete  vision  system,  it  is  a  major  challenge 
to  find  some  significant  subset  of  the  partitioning  prob¬ 
lem  for  which  an  algorithmic  procedure  can  duplicate 
normal  human  performance.  This  paper  (a  compressed 
version  of  a  much  longer  document  which  will  appear 
in  IEEE  PAMI  later  this  year)  summarizes  the  under¬ 
lying  ideas  and  algorithmic  details  of  a  computer  pro¬ 
gram  which  performs  at  a  human  level  of  competence 
for  a  significant  subset  of  the  curve  partitioning  task.  It 
extends  and  “rounds  out”  the  technique  and  philosophi¬ 
cal  approach  originally  presented  in  a  1986  PAMI  paper 
by  Fischler  and  Bolles  [Fischler86].  For  example,  it  pro¬ 
vides  a  unified  strategy  for  resolving  conflicts  in  selecting 

*This work was performed under contracts supported by the
Defense Advanced Research Projects Agency.


among  neighboring  potential  partition  points  that  may 
be  salient  at  different  “scales  of  resolution.” 

While  our  focus  in  this  paper  is  on  curve  partitioning 
in  a  generalized  setting  (the  curves  in  our  experiments 
are  mostly  without  semantic  meaning),  and  where  the 
criterion  for  success  is  duplicating  normal  human  perfor¬ 
mance,  finding  salient  points  on  image  curves  (potential 
partition  points)  plays  a  critical  role  in  both  two  and 
three  dimensional  object  recognition,  in  curve  approxi¬ 
mation,  in  tracking  moving  objects,  and  in  many  other 
tasks  in  machine  vision. 

In  many  approaches  to  2-D  object  recognition,  objects 
are  represented  by  their  boundaries,  and  the  recogni¬ 
tion  techniques  depend  (directly  or  indirectly)  on  locat¬ 
ing  distinguished  points  along  the  boundary;  typically 
these  distinguished  points  are  discontinuities  or  extrema 
of  local  curvature  (sometimes  called  “corner  points”)  and 
inflection  points  [e.g.,  Mokhtarian86].  “Corners”  on  the 
contours  of  imaged  objects  are  often  used  as  features  for 
tracking  the  motion  of  these  objects  and  for  comput¬ 
ing  optical  flow  [e.g.  Mehrotra90].  In  3-D  recognition, 
partitioning  is  typically  one  of  the  first  analysis  steps  - 
especially  when  objects  can  occlude  each  other.  Hoffman 
and  Richards  [Hoffman82]  argue  that  when  3-D  parts  are 
joined to create complex objects, concavities will generally
be observed in their silhouettes, and that segmentation
of image contours at concavities (the maxima of
negative  curvature  along  the  contours)  is  a  good  strat¬ 
egy  to  decompose  (even  unmodeled)  objects  into  their 
“natural  parts.” 

In cartography, computer graphics, and scene analysis,
it  is  often  desirable  to  partition  an  extended  boundary 
or  a  contour  into  a  sequence  of  simply  represented  prim¬ 
itives  (e.g.,  straight  line  segments  or  polynomial  curves 
of  some  higher  degree)  to  simplify  subsequent  analysis 
and  to  minimize  storage  requirements  [e.g.,  Teh89]. 

In  our  own  current  work  concerned  with  delineating 
linear structures in aerial images, the technique presented
in this paper was an essential component of the
system  (briefly  described  in  Appendix  C)  that  produced 
the  results  displayed  in  Figure  6. 
2.  Problem  Statement

In its most general sense, partitioning involves assigning,
to every element of a given “object” set, a label
from a given “label” set. For our purposes in this paper,
the object set is the set of points along a curve (or
contour segment) lying in a prescribed region of a
two-dimensional plane. While we deal with cases where the
points in the object set do not form a continuous digital
curve, in most of our exposition in this paper we
will assume that the curves are continuous¹ and
non-intersecting. Our label set is binary: points will be called
either significant (critical) or non-significant, for some
specified  purpose.  In  Fischler  and  Bolles  [Fischler86], 
it  is  demonstrated  (or  at  least  argued)  that  perceptual 
partitioning  is  not  independent  of  some  assumed  task 
or  purpose.  In  this  paper  we  focus  on  one  of  the  three 
tasks  discussed  in  the  above  reference:  Selecting  a  small 
number of points (called critpts) along a curve segment
which  could  be  used  as  the  basis  for  reconstructing  the 
curve  at  some  future  time.  Figure  1  shows  the  specific 
instructions  and  curves  used  in  one  set  of  relevant  exper¬ 
iments  involving  human  subjects;  this  figure  also  shows 
the  critpts  that  were  selected  by  the  subjects,  and  the 
comparable  results  produced  by  our  algorithm  (called 
the  Saliency  Selection  System,  or  SSS,  and  discussed  in 
Appendix  B). 
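
The continuity assumption stated above (and spelled out in the accompanying footnote) can be checked mechanically. The following is a minimal sketch, not part of the SSS program: the function name and the representation of a curve as an ordered list of integer (x, y) tuples are assumptions made for illustration, and the non-intersection test only detects repeated pixels, not diagonal crossings between pixels.

```python
def is_continuous_digital_curve(points):
    """Check the continuity assumption for an ordered digital curve.

    `points` is an ordered list of integer (x, y) tuples (hypothetical
    representation). Each consecutive pair must be distinct
    8-neighbours, i.e. differ by at most 1 in x and in y, and no
    pixel may be visited twice.
    """
    if len(points) < 2:
        return False
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        # Chebyshev distance of exactly 1 means distinct 8-neighbours.
        if max(abs(x1 - x0), abs(y1 - y0)) != 1:
            return False
    # Simplified non-intersection test: no repeated pixel.
    return len(set(points)) == len(points)
```

For example, `[(0, 0), (1, 1), (2, 1)]` passes, while `[(0, 0), (2, 0)]` fails because of the gap between the two points.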

In  order  to  separate  the  generic  partitioning  criteria 
used  by  human  subjects  from  criteria  based  on  their 
past experience, such as when the subject is able to assign
a name to the curve (e.g., the curve looks like the
letter “s”), we used “random” curve segments for our
experiments;  the  technique  employed  to  generate  the 
segments  is  described  in  Appendix  A.  We  also  wanted 
to  avoid  having  to  deal  with  the  recognition  of  global 
features (e.g., symmetry or repeated structure, or even
straight  lines  and  analytic  curves)  as  a  condition  for  mak¬ 
ing  critpt  selections;  avoiding  this  problem  is  justified  if 
we  are  correct  in  our  belief  that  local  and  global  anal¬ 
ysis  are  accomplished  by  separate  mechanisms.  In  or¬ 
der  to  deal  with  global  features,  the  complexity  of  any 
solution  would  be  expanded  enormously  since  a  whole 
new  vocabulary  of  such  features  and  their  representa¬ 
tions  would  have  to  be  implemented.  The  generation 
and  use  of  random  curves  took  care  of  this  problem  also 
(i.e.,  it  is  highly  unlikely  that  symmetries  or  repeated 
structure would ever be generated by our random process).

3.  Relevance,  Prior  Work,  and  Critical 
Issues 

The  partitioning  problem  has  been  a  subject  of  in¬ 
tense  investigation  since  the  earliest  work  began  in  ma- 

¹Each point of the non-branching, one-pixel-wide curve, with
coordinates (x, y), has one or more neighbors with x-coordinates in
the set (x+1, x, x-1), and y-coordinates in the set (y+1, y, y-1).


chine  vision.  It  has  been  widely  assumed  that  in  order 
to  reduce  the  combinatorics  of  scene  analysis  to  a  man¬ 
ageable  level,  it  is  necessary  to  decompose  images  into 
their  meaningful  component  parts  as  one  of  the  first  steps 
in  the  analysis  process.  The  difficulty  arises  from  the 
need  to  partition  the  image  into  parts  before  we  know 
the  identity  of  those  parts.  The  underlying  assumption 
then  is  that  there  are  generic  criteria,  independent  of  the 
goal  of  the  analysis,  that  if  discovered,  could  be  used  to 
obtain useful (or at least intuitively acceptable) partitioning;
additional problem-dependent criteria could always be
added to produce a more relevant result for some
particular purpose.

The  partitioning  problem  becomes  progressively 
harder  as  we  increase  the  number  of  dimensions  in  which 
we  are  working;  in  this  paper  we  only  address  the  1.5-D 
problem  of  partitioning  planar  curves.  A  specific  crite¬ 
rion  which  can  form  the  basis  of  such  partitioning  was 
originally proposed by Attneave [Attneave54]: points
at which the curve bends most sharply are good partition
points.² This idea has been the starting point for
most of the subsequent efforts in curve partitioning, but
attempts to convert this abstract concept into a computationally
executable procedure that gives intuitively
acceptable results have met with limited success.³ References
[Imai86, Mokhtarian86, Pavlidis74, Rosenfeld73,
Teh89, Wuescher91] are representative of work in this
area.⁴

The  main  problems  we  must  solve  are: 

(a) A way of assigning a measure (or degree) of
saliency/criticality⁵ to each point on a curve.
Most investigators have equated sharp bending of
a curve with the mathematical concept of curvature,
but curvature is not well-defined for a finite
sequence of points (which is how our sensor-acquired
curves are generally represented). Further,
it  is  not  obvious  that  the  mathematical  definition 
of  curvature  is  the  best  computational  approxima¬ 
tion  to  the  human  criteria  for  criticality.  In  Fis¬ 
chler  and  Bolles  [Fischler86],  bending  is  interpreted 

²Hoffman and Richards [Hoffman82] give convincing evidence
that  we  should  distinguish  between  positive  and  negative  curva¬ 
ture  maxima.  That  is,  on  closed  curves,  extreme  points  of  nega¬ 
tive  curvature  -  associated  with  object  concavities  -  have  greater 
utility  as  partition  points  than  positive  curvature  maxima,  but  the 
positive  maxima  (and  inflection  points)  play  an  important  role  in 
describing  the  individual  segments. 

³As noted later, most of the work on the curve partitioning
problem,  especially  recent  work,  has  not  been  concerned  with  du¬ 
plicating  generic  human  performance,  but  rather  with  performing 
specific  visual  tasks  having  different  criteria  for  success. 

⁴The approach taken by Wuescher and Boyer is distinct in that
they first extract contour segments of approximately constant curvature
and then infer the location of partition points as a secondary
operation.

⁵We will use the terms saliency and criticality somewhat interchangeably
in this paper. However, saliency can be considered to
be the generic subset of points that are critical for some partitioning
task.
as  deviation  from  straightness  -  it  is  closely  related 
to  proposed  approximations  to  mathematical  cur¬ 
vature,  as  illustrated  in  Figures  2  and  3,  but  has 
a  number  of  advantages;  it  is  an  easily  measured 
quantity,  even  for  digital  curves  (i.e.,  sequences  of 
coordinate  pairs),  and  as  discussed  in  the  next  sec¬ 
tion,  its  local  extrema  are  in  better  accord  with  hu¬ 
man  preference  (choices  based  on  approximations 
to  the  definition  of  mathematical  curvature  occa¬ 
sionally  include  anomalous  points  as  shown  in  the 
examples  of  Figures  2  and  3). 

(b) A way of adjusting the criticality of a given curve-point
to take into account its interactions with its
neighbors;  i.e.,  local  context.  It  is  obvious  that 
human  subjects  will  often  avoid  assigning  a  critpt 
label  to  both  members  of  a  pair  of  points,  even  when 
both  points  have  high  (independent)  criticality  val¬ 
ues,  if  the  points  are  close  neighbors  along  the  curve. 
The  basic  approach  of  local  non-maximum  suppres¬ 
sion  is  not  sufficient,  in  itself,  to  duplicate  human 
performance. 

(c)  A  way  of  dealing  with  the  interactions  between 
critpts  that  are  significant  at  different  scales  of 
resolution.  If  a  human  subject  looks  through  a 
fixed  sized  window  at  the  same  curve  segment  dis¬ 
played  at  two  different  magnifications,  the  selected 
critpts  will  not  always  be  the  same,  and  the  selection 
at  the  lower  resolution  will  not  always  be  a  subset  of 
those at the higher resolution (e.g., Figure 4). This
is in contrast to the commonly held assumption that
critpt assignment should be independent of “scale of
resolution.”

(d)  A  threshold  of  significance;  a  minimal  level  of 
criticality  below  which  variations  are  considered  to 
be noise and no critpt designations are made. (Some
investigators reject the idea that any user-supplied
parameters or thresholds should be necessary.)
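
Items (b) and (d) can be made concrete with a small sketch of the baseline that the text calls insufficient on its own: local non-maximum suppression combined with a significance threshold. This is an illustration only, not the SSS filtering procedure; the function name, the `radius` and `threshold` parameters, and the use of a plain list of per-point scores are all assumptions.

```python
def select_critpts(scores, radius, threshold):
    """Baseline critpt selection on a list of per-point saliency scores.

    A point is kept when the magnitude of its score is at least
    `threshold` (item d) and is not exceeded by any other score
    within `radius` positions along the curve (simple non-maximum
    suppression, item b). Ties are all kept.
    """
    selected = []
    n = len(scores)
    for i, s in enumerate(scores):
        mag = abs(s)
        if mag < threshold:
            continue  # below the significance threshold: treated as noise
        lo, hi = max(0, i - radius), min(n, i + radius + 1)
        if mag >= max(abs(v) for v in scores[lo:hi]):
            selected.append(i)
    return selected
```

With `scores = [0.1, 0.9, 0.2, 0.05, 0.8, 0.8, 0.1]`, `radius = 1` and `threshold = 0.5`, the sketch keeps indices 1, 4 and 5. The adjacent pair (4, 5) with equal scores survives suppression, which is the kind of close-neighbor conflict that needs more specific criteria than this baseline provides.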

We  have  addressed  the  above  issues  through  the  solu¬ 
tions  to  a  set  of  subproblems: 

1.  Definition  of  an  algorithmic  procedure  (which  is  pa¬ 
rameterized  to  deal  with  noise  and  scale)  for  assign¬ 
ing  criticality  values  to  each  point  on  a  curve  in¬ 
dependent  of  decisions  made  about  the  locations  of 
(other) critpts. The solution to this problem, essentially
the procedure given in Fischler and Bolles
[Fischler86], provides answers at a human level of
performance  for  isolated  critpts  (i.e.,  along  a  sec¬ 
tion  of  a  random  curve,  generated  as  described  in 
Appendix  A,  for  which  human  subjects  select  only 
one  critpt).  Thus,  for  the  domains  we  experimented 
with  (and  especially  the  domain  defined  in  Ap¬ 
pendix  A),  we  were  able  to  assign  fixed  values  to 
scale/resolution  and  noise/significance  parameters 


so  that  our  program  would  make  the  same  selections 
as  human  subjects  when  there  was  near  unanimous 
agreement  among  these  subjects.  This  algorithm  is 
described  in  Appendix  B. 

2.  An  analysis  of  how  geometric  scaling  of  the  in¬ 
put  curve,  and  resolution  specific  operations  on  the 
curve,  can  be  equated,  and  thus  the  development  of 
a  basis  for  normalizing  criticality  scores  across  scale. 

3.  Development  of  a  general  approach  to  the  problem 
of resolving the competition/cooperation interactions
of geometrically related objects based on “local
dominance.” The same machinery used to deal
with  interactions  at  a  given  scale  of  resolution  is 
also  used  to  resolve  conflicts  across  different  scales 
of  resolution. 

In  the  remainder  of  this  paper,  we  describe  our  so¬ 
lutions  to  the  problems  enumerated  above,  and  then 
present  examples  and  experimental  results  to  justify  the 
design  decisions  we  made  and  to  illustrate  the  perfor¬ 
mance  capabilities  of  our  algorithm. 

4.  Evaluation  of  Saliency 

Saliency  is  a  critical  attribute  (for  description  and 
recognition)  assigned  to  perceived  things  in  the  world 
by  the  human  visual  system  (HVS).  While  an  elusive 
concept  in  general,  task  specific  specializations  of  this 
concept are easily found that elicit consistent choices
across human subjects. An acceptable computational
definition of contour/curve saliency must provide:⁶

•  The  specification  of  a  procedure  that  quantifies  the 
abruptness  and  extent  of  the  deviation  of  a  curve 
from  its  straight-line  continuation;  a  sharp  bend  is 
more salient than a shallow one, and the greater
the excursion, the more prominent/salient the “feature.”

•  Agreement  with  human  judgement  in  terms  of  both 
selection,  and  accuracy  of  placement,  of  the  critical 
points  (in  some  well  defined  context). 

4.1  A  Computational  Definition  of 
Saliency 

Conventional  definitions  of  curvature  present  a  num¬ 
ber  of  serious  problems  with  respect  to  their  use  as  a 
saliency  measure  in  computational  vision  (CV).  First, 
the  mathematical  definition  is  based  on  the  properties 
of  a  curve  in  the  infinitesimal  neighborhood  about  the 

⁶In this paper we are primarily concerned with saliency based
on local cues; locations on a curve where there is a transition
from one type of curvature behavior to another, e.g., from perfectly
straight to "wiggly," may also be psychologically salient,
but such forms of global saliency are beyond the scope of our current
investigation.
point at which curvature is being measured. For the
finite-precision quantized curves dealt with in CV, it has
been difficult to find a suitable approximation to the
limiting process originally intended for use on mathematically
continuous curves. Second, it is readily observed
that saliency is not an infinitesimal point property,
but is based on some finite extent of the curve. A
proposed  solution  to  both  problems,  offered  by  Rosenfeld 
and  Johnston  [Rosenfeld73]  was  to  find  an  appropriately 
sized  segment  of  the  curve  about  the  point  in  question, 
and take a “snapshot” of the limiting process at this
single (implied) scale. That is, rather than the rate of
change of tangent angle with respect to curve length,
R/J proposed measuring the angle between two fixed-length
chords, where the lengths correspond to the computed
“natural scale” of the curve about the given point.
We  will  call  this  curvature-analog  the  R/J-Curvature. 
There  are  a  number  of  other  definitions  of  mathemati¬ 
cal  curvature  (e.g.,  the  limiting  radius  of  a  circle  whose 
three  defining  points  converge  at  the  curve-point  in  ques¬ 
tion)  which  have  analogs  that  could  have  been  used  in 
place  of  the  angle  measure  in  R/J-Curvature  but  these 
definitions  are  monotonically  related,  and  do  not  really 
present  distinct  alternatives.  Thus,  R/J-Curvature  is  a 
suitable  representative  for  the  whole  class  of  mathemat¬ 
ical  curvature-measure  analogs. 
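
As a rough illustration of the chord-angle idea behind R/J-Curvature, the sketch below measures the signed turning angle between the chord arriving at a point and the chord leaving it. It is not the Rosenfeld/Johnston algorithm: the function name is hypothetical, chord "length" is taken as a fixed number of samples `k` rather than a geometric length, and the search for a natural scale is omitted.

```python
import math


def chord_angle(points, i, k):
    """Signed turning angle (radians) at points[i] for chord span k.

    The incoming chord runs from points[i - k] to points[i], the
    outgoing chord from points[i] to points[i + k]. The result is 0
    on a straight run and +/- pi/2 at a right-angle corner.
    """
    if not (k >= 1 and k <= i < len(points) - k):
        raise ValueError("index i must leave k samples on both sides")
    (xa, ya), (xb, yb), (xc, yc) = points[i - k], points[i], points[i + k]
    ux, uy = xb - xa, yb - ya  # incoming chord
    vx, vy = xc - xb, yc - yb  # outgoing chord
    cross = ux * vy - uy * vx
    dot = ux * vx + uy * vy
    return math.atan2(cross, dot)
```

Because only the two chord endpoints and the center point enter the computation, a single call gives one "look" at the curve at one scale, which is the property the text contrasts with F/B-Saliency.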

In Fischler and Bolles [Fischler86], our concern was
not  to  find  a  good  digital  analog  for  curvature,  but  rather 
to  find  an  effective  measure  of  saliency.  The  quantity 
defined  in  that  paper  can  be  viewed  as  a  curvature- 
extremum  measure  in  which  the  limiting  process  (in 
scale)  is  replaced  by  a  scanning  process  (in  space)  more 
appropriate  to  digital  curves.  The  scanning  process  is 
parameterized  by  scale,  and  the  resulting  measure  is  a 
signed  quantity  which  we  call  F/B-Saliency  (F/B-S). 
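
The actual F/B-S computation is given in Appendix B and in [Fischler86] and is not reproduced here. Purely to illustrate what "deviation from straightness" measured by a scanning process might look like, the following sketch slides a chord spanning `k` samples along the curve and, for each placement, credits the interior point that lies farthest from the chord with its signed perpendicular distance. The function name, the accumulation rule (keep the largest magnitude seen per point), and the parameter `k` are all assumptions, not the authors' definition.

```python
import math


def chord_deviation_scores(points, k):
    """Illustrative signed deviation-from-straightness scores.

    For every chord joining points[i] and points[i + k], find the
    interior point with the largest perpendicular distance from the
    chord; each point keeps the largest-magnitude signed distance it
    received over all chord placements.
    """
    scores = [0.0] * len(points)
    for i in range(len(points) - k):
        (xa, ya), (xb, yb) = points[i], points[i + k]
        dx, dy = xb - xa, yb - ya
        length = math.hypot(dx, dy)
        if length == 0.0:
            continue  # degenerate chord (endpoints coincide)
        best_j, best_d = None, 0.0
        for j in range(i + 1, i + k):
            xp, yp = points[j]
            # Signed perpendicular distance of points[j] from the chord.
            d = (dx * (yp - ya) - dy * (xp - xa)) / length
            if abs(d) > abs(best_d):
                best_j, best_d = j, d
        if best_j is not None and abs(best_d) > abs(scores[best_j]):
            scores[best_j] = best_d
    return scores
```

On an L-shaped curve the only non-zero score falls at the corner, and several chord placements agree on that location, which is one way to read the remark that the measure integrates information over an extended set of looks at the curve.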

While  the  particular  choice  of  a  curvature  measure 
as  a  component  in  a  complete  system  for  selecting  the 
most  salient  points  (critpts)  on  a  planar  curve  depends 
on  many  factors,  it  is  still  interesting  to  compare  the  raw 
scores  returned  by  curvature-analogs  represented  by  the 
R/J-Curvature  with  the  extreme  points  (ultimately)  se¬ 
lected  by  our  algorithm  (SSS)  as  shown  in  Figures  2  and 
3  for  a  randomly  generated  curve.  In  these  figures  we 
observe  problem  situations  that  highlight  some  of  the 
differences between the two underlying metrics (R/J-Curvature
and F/B-Saliency).⁷

There  are  some  problems  with  any  raw  measure  of  cur¬ 
vature  that  must  be  dealt  with  by  using  procedures  that 

⁷In both of the figures, we used fixed common scale parameters
for both metrics, as noted in the figure captions. It should be
remembered that R/J-Curvature, as we define it in this paper, is
representative of a whole class of curvature-based metrics and is
not intended to duplicate the complete Rosenfeld/Johnston algorithm;
they also incorporate a procedure for finding a preferred
stick length. However, many of the problems with the performance
of the complete algorithm, which are discussed in Davis77 and in
other of the papers we reference, can be observed in the performance
of the R/J-Curvature metric.


invoke  (at  least)  local  context.  For  example,  in  Figure  3 
we  see  a  case  (double  arrow)  where  two  critpts  were  se¬ 
lected  at  almost  adjacent  locations  along  the  curve.  This 
undesirable  behavior  was  not  eliminated  by  the  simple 
“non-maximum suppression” filter that produced good
results  in  most  other  situations.  It  is  necessary  to  use 
more  specific  criteria  in  deciding  when  two  critpts  are 
too  close  together,  and  also,  what  to  do  when  the  ad¬ 
jacent  points  have  equal  saliency  scores  (e.g.,  arbitrar¬ 
ily  eliminate  one  of  them  or  eliminate  both  and  place 
a  new  critpt  between  them).  In  Figure  3  we  see  cases 
(two  single  arrows)  where  almost  invisible  features  were 
chosen  as  critpts  because  they  did  have  locally  extreme 
curvature scores; how do we decide when to reject such
occurrences? In Figure 2 we see a case where a critpt (designated
by an arrow) was inserted at a location displaced
from the position we consider correct; this was due, in
part, to the length of the arms of the angle measuring
“operator” relative to the size of the feature (see Figure
2d); it is not always possible (or practical) to find an
appropriate operator size for every potential feature. In
the  following  sections  (and  appendices)  of  this  paper  we 
describe  and  justify  the  methods  we  employ  to  deal  with 
these  problems.  The  issue  we  are  primarily  concerned 
with in this section is the choice of a basic saliency metric.
We justify our preference for the F/B-S metric on
two grounds:

1.  Unlike  the  fixed  scale  mathematical  (FSM)  curva¬ 
ture  analogs  (e.g.,  R/J-curvature),  F/B-S  rarely 
makes  an  error  in  positioning  a  critpt,  or  in  ignoring 
a  salient  point  that  human  observers  would  select. 
The issue here is robustness: F/B-S integrates information
over an extended set of “looks” at the curve
segment  containing  the  point  whose  saliency  is  be¬ 
ing  measured.  FSM  techniques  take  a  single  look 
at  the  situation.  Thus,  our  main  problem  with  the 
F/B-S  metric  is  selecting  the  most  salient  of  the  se¬ 
lected  critpts  to  be  retained  as  our  final  result  (the 
filtering  operation  generally  involves  the  elimination 
of  less  than  half  of  the  points  originally  selected). 

2. The F/B-S metric is responsive to both the curvature
and the size of a curve “feature.” This provides
a common basis for ranking critpts at a given
scale (so that the larger of two geometrically similar
objects is assigned a higher saliency score) as
well  as  across  scales  by  taking  into  account  the  size 
of  the  operator.  The  FSM-curvature  analogs  are 
insensitive  to  the  size  of  the  feature  -  they  inherit 
the  mathematical  property  that  curvature  is  a  point 
property  and  only  the  smallest  neighborhood  about 
a point that allows us to measure curvature is relevant
(this implies a single “natural scale” at any
point on a curve; a concept we reject, e.g., see Figure 4).
4.2  Comparison  of  the  Saliency  Selection 
System  (SSS)  with  Human  Performance 

The  primary  criterion  for  judging  the  competence  of 
the  overall  saliency  selection  system  (SSS)  we  present 
in  this  paper  is  its  ability  to  match  human  performance 
-  both  in  the  defined  task  and  with  respect  to  generic 
evaluation  of  the  selected  critpts.  We  performed  a  set  of 
informal  experiments  with  11  human  subjects  (also  see 
the  experiments  described  in  Fischler86).  The  instruc¬ 
tions  given  to  the  subjects  and  the  resulting  selections 
are  shown  in  Figure  1.  We  also  show  the  selections  made 
by the SSS algorithm. The results of these (and additional,
but not described) experiments can be summarized
as follows:

•  At  least  9  of  the  11  subjects  selected  the  same  set 
of  six  or  more  critpts  on  each  of  the  four  curves  we 
used  in  the  experiments,  and  the  SSS  chose  the  same 
set  of  critpts.  Every  critpt  selected  by  the  SSS  was 
also  selected  by  at  least  one  human  subject. 

•  In  spite  of  the  high  degree  of  consistency  in  the 
overall  selection  of  salient  points,  the  human  sub¬ 
jects  differed  in  the  order  in  which  they  chose  these 
points.  We  tried  a  number  of  experiments  in  which 
the  only  difference  was  a  very  slight  change  in  the 
wording  of  the  instructions,  and  obtained  different 
orderings  (across  the  same  set  of  selected  points) 
from  our  subjects.  It  is  obvious  that  the  subjects 
used  a  global  strategy  to  match  the  task  (differ¬ 
ent  for  each  subject)  to  choose  the  order  in  which 
the  points  were  selected  -  even  though  the  specific 
points  selected  were  largely  determined  by  local  con¬ 
text. 

In addition to the curves used in the human experiments,
we ran the SSS algorithm on approximately 1000
randomly generated curves and observed no obvious errors.
Figure 5 shows the results of a (typical) sequence of 40
consecutive experiments.

5.  Dealing  with  the  Problems  of  Scale 
and  Resolution 

A  vision  system,  concerned  with  creating  a  descrip¬ 
tion  of  some  object  that  may  be  encountered  again  in 
the  future,  perhaps  when  the  object  is  closer  or  further 
away,  must  take  scale  or  magnification  into  account  when 
deciding  what  shape  elements  to  pay  attention  to.  Un¬ 
der  extreme  changes  in  resolution,  when  salient  features 
might  appear  or  disappear,  it  may  not  be  possible  to 
make  an  informed  judgement  in  the  assignment  of  rela¬ 
tive  saliency  scores;  but  for  a  limited  range  about  a  given 
resolution,  this  should  indeed  be  possible. 

Obviously,  geometric  properties  of  objects  that  are  in¬ 
variant  over  scale  are  especially  valuable  in  describing 
and  recognizing  the  objects,  since  absolute  scale  is  of¬ 
ten  impossible  to  judge  in  an  image,  and  even  relative 


scale  can  be  difficult  to  describe  or  measure  if  the  mea¬ 
surement  must  be  referenced  to  the  global  geometry  of 
the  object.  One  of  the  main  issues  we  address  in  this 
paper is how to define extrema in the “bending” of a
curve  as  a  local  effectively  scale-invariant  property  that 
is  in  agreement  with  the  judgement  of  the  human  visual 
system. 

If  we  define  criticality  of  points  on  a  digitally  rep¬ 
resented  curve  in  terms  of  quantities  that  have  dimen¬ 
sions  that  must  be  measured  by  some  physical  process, 
then there is no direct way of invoking such formally
defined mathematical concepts as the derivative, or curvature,
which require limiting processes of infinite resolution.
Approximations to these concepts are resolution
dependent (e.g., the size of the operator employed) and
measurements made on most objects will not “scale” in
any simple or uniform way. Further, if we examine a
curve  through  a  fixed  size  window  (either  a  fixed  region 
of  a  computer  screen,  or  the  foveal  region  of  the  hu¬ 
man  retina),  and  we  successively  increase  the  resolution 
at  which  the  curve  is  displayed,  some  of  its  parts  will 
eventually  disappear  from  view,  and  some  of  the  smaller 
original  structures,  that  were  not  significant,  will  now 
dominate the visible appearance of the curve (e.g., Figure 4).

If  the  mathematical  definition  of  curvature  were  ap¬ 
plicable  to  digital  imagery,  then  many  (but  not  all)  of 
the  issues  of  scale  could  be  resolved.  There  is  still  the 
problem that a very small “glitch” can have a very high
value of curvature but a very low psychological significance.
Thus the scale or size of a “feature” (e.g., the
glitch) is an issue. The term “feature” does not appear
in  our  problem  definition;  in  fact,  by  focusing  on  local 
curve  properties,  we  had  hoped  to  eliminate  the  need  to 
invoke  this  concept  since  an  appropriate  definition  is  far 
from obvious.⁸ Since scale can’t be ignored (even if
we  had  a  good  approximation  for  curvature  in  the  digi¬ 
tal  domain  that  was  independent  of  scale)  the  following 
questions  arise: 

•  The  distinction,  if  any,  between  resolution  and  scale 

•  How  to  choose  a  range  of  scales  appropriate  to  the 
specified  performance  criteria 

•  How  to  measure  criticality  at  different  scales 

•  How  to  compare  criticality  values  computed  at  dif¬ 
ferent  scales 

•  The  relationship  between  smoothing  and  scale 
change 

⁸Intuitively, there are sections of any given curve that we call
features; these entities provide the psychological basis for the selection
and relative saliency of the associated critpts. Critpts are
markers that define the shape and boundary of features; the extent
of the curve corresponding to a feature will generally subsume
the "region of support" for the curve points comprising the
feature. Features can overlap, and their boundaries are not always
apparent.




•  The  relation  between  operator  size  and  scale  change 

•  How  to  make  cooperation/competition  judgements 
across  scales 

•  How  to  determine  the  features  for  which  we  ex¬ 
pect  consistency  (of  criticality  scores)  to  hold  across 
scales,  and  where  such  consistency  can’t  be  expected 
(if  the  latter  were  never  the  case,  we  could  always  do 
our  analysis  at  one  scale  and  compute  the  criticality 
values  at  other  scales  as  needed). 

While  consistency  at  all  scales  and  for  all  features  is 
not possible, over some range of scales (say 5:1) we expect
there to be a “normalization” factor which allows
us  to  compare  the  saliency  scores  computed  at  one  scale 
with  values  computed  at  other  scales.  We  would  also 
expect  that  relative  locations  of  local  extrema  for  cer¬ 
tain  features  would  remain  fixed  as  a  curve  is  scaled, 
regardless  of  the  size/scale  of  the  operator  that  assigns 
the  criticality  scores. 

Some  of  the  earliest  work  (e.g.,  Rosenfeld  and  John¬ 
ston)  on  finding  salient  points  merged  the  problem  of 
assigning  a  curvature  measure  to  a  point  with  that  of  de¬ 
termining  the  scale  at  which  to  measure  curvature.  The 
key  idea  is  that  each  point  has  a  single  scale  at  which 
its  curvature  should  be  measured  -  this  scale  is  usually 
found  by  a  search  process  over  successively  larger  scales 
until some measured quantity achieves a local extremum.

5.1  Change  of  Scale  Vs.  Change  of  Res¬ 
olution 

If  we  magnify  a  continuous  curve  that  was  originally 
represented  at  infinite  precision,  every  point  of  the  new 
image  corresponds  to  a  point  in  the  original  image,  but 
its x and y coordinate values have been multiplied by
some  real  number  which  we  will  call  the  scale  factor. 
No  information  was  introduced  nor  lost,  but  the  phys¬ 
ical  space  required  to  render  the  curve  has  increased. 
However,  if  the  original  curve  was  represented  at  finite 
resolution  (e.g.,  each  point  as  a  pair  of  integer  coordi¬ 
nates),  then  (say)  doubling  the  scale  leaves  us  with  a 
disconnected  set  of  points.  Filling  in  the  gaps  requires 
introducing  new  information.  Here  we  will  say  that  a 
change  of  resolution  has  occurred  (a  change  in  resolu¬ 
tion  can  also  result  in  the  loss  of  information,  as  in  the 
case  of  demagnification  or  smoothing  at  some  fixed  reso¬ 
lution).  Thus,  the  concept  of  a  scale  change  corresponds 
to  a  reversible  transformation,  while,  in  general,  a  change 
in  resolution  involves  an  irreversible  process  in  which  in¬ 
formation  is  lost  (as  in  smoothing),  or  new  information 
is  introduced  (as  can  occur  in  zooming). 
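The distinction can be illustrated with a minimal Python sketch. It rests on assumptions that are not in the original text: the curve is a short list of 8-connected integer pixels, and gaps are filled by rounded linear interpolation, which is only one of many possible fill rules. The function names are hypothetical.

```python
def scale_points(points, factor):
    """Pure scale change: multiply every coordinate by the scale factor."""
    return [(x * factor, y * factor) for x, y in points]


def is_connected(points):
    """True if consecutive integer points are 8-neighbors (no gaps)."""
    return all(
        max(abs(x1 - x0), abs(y1 - y0)) <= 1
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


def fill_gaps(points):
    """Change of resolution: insert interpolated pixels between successive points.

    The inserted pixels are new information chosen by the fill rule
    (here, rounded linear interpolation), not data from the original curve.
    """
    filled = [points[0]]
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        steps = max(abs(x1 - x0), abs(y1 - y0))
        for k in range(1, steps + 1):
            filled.append((round(x0 + (x1 - x0) * k / steps),
                           round(y0 + (y1 - y0) * k / steps)))
    return filled


curve = [(0, 0), (1, 1), (2, 1), (3, 2)]
doubled = scale_points(curve, 2)   # reversible: halving recovers the curve
zoomed = fill_gaps(doubled)        # adds pixels that were never in the original
```

Doubling alone leaves the point set disconnected but fully recoverable; only the gap-filling step introduces information that the original curve did not contain.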

If  we  compute  the  curvature  for  points  on  a  continuous 
(infinite  resolution)  curve  at  two  different  scales,  we  will 
generally  get  two  distinct  sets  of  values  (e.g.,  a  circle 
with  radius  2  is  a  scaled  version  of  a  circle  with  radius 
1,  but  by  definition,  their  curvatures  are  in  the  ratio 
1:2. On the other hand, the angles of a triangle remain
unaltered under a scale change). It will be the case,
however,  that  for  smooth  curves,  the  local  extrema  will 
be  found  at  corresponding  locations  -  but  even  here,  the 
numerical  values  of  curvature  will  not  scale  in  any  simple 
way  (curvature  is  a  nonlinear  function). 
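The two examples in the parenthetical remark above (circles of radius 1 and 2, and the angles of a triangle) can be checked numerically. The sketch below is illustrative only: it estimates curvature from three sample points as the reciprocal of the circumradius (the Menger curvature), which is an assumption of this example and not the F/B-S or R/J measure discussed in the paper.

```python
import math


def menger_curvature(p, q, r):
    """Curvature (1 / circumradius) of the circle through three points."""
    a = math.dist(p, q)
    b = math.dist(q, r)
    c = math.dist(p, r)
    # Twice the signed triangle area, from the 2-D cross product.
    cross = (q[0] - p[0]) * (r[1] - p[1]) - (q[1] - p[1]) * (r[0] - p[0])
    return 2.0 * abs(cross) / (a * b * c)


def vertex_angle(p, q, r):
    """Interior angle at q, in degrees, of the triangle p, q, r."""
    v1 = (p[0] - q[0], p[1] - q[1])
    v2 = (r[0] - q[0], r[1] - q[1])
    dot = v1[0] * v2[0] + v1[1] * v2[1]
    return math.degrees(math.acos(dot / (math.hypot(*v1) * math.hypot(*v2))))


def circle_points(radius):
    """Three sample points on a circle of the given radius."""
    return [(radius * math.cos(t), radius * math.sin(t)) for t in (0.0, 0.5, 1.0)]


k1 = menger_curvature(*circle_points(1.0))   # circle of radius 1
k2 = menger_curvature(*circle_points(2.0))   # the same circle scaled by 2

triangle = [(4.0, 0.0), (0.0, 0.0), (0.0, 3.0)]
scaled = [(2 * x, 2 * y) for x, y in triangle]
angle_before = vertex_angle(*triangle)
angle_after = vertex_angle(*scaled)
```

Here `k2` comes out as half of `k1`, while `angle_before` and `angle_after` agree, matching the two cases named in the text.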

5.2 SSS Mechanisms for Evaluating Saliency
at Different Scales and Resolutions

In  designing  a  computational  module  to  evaluate 
saliency  subject  to  the  ideas  discussed  above,  we  can 
pursue  at  least  three  distinct  strategies: 

1. Assume that saliency is independent of scale, or that
there  is  a  natural  scale  associated  with  each  location 
on  the  curve  that  must  be  discovered. 

2.  Use  a  fixed  scale  saliency  measure,  but  generate 
multiple  versions  of  the  given  curve  at  some  pre¬ 
determined  set  of  scales. 

3.  Parameterize  the  saliency  measure  to  give  results 
approximating  those  that  would  be  obtained  from 
strategy  (2)  for  the  selected  scales. 

We  previously  argued  against  strategy  (1)  on  the  as¬ 
sumption  that  a  unique  natural  scale  cannot  generally 
be associated with a single curve point (see Figure 4).
We  have  chosen  strategy  (3)  since  strategies  (2)  and  (3) 
are  conceptually  compatible,  but  (3)  could  be  compu¬ 
tationally  more  efficient  if  we  can  find  a  simple  way  to 
use  some  combination  of  operator  scaling  and  score  nor¬ 
malization  so  that  both  approaches  give  (nominally)  the 
same  scores  in  most  situations.  Intuitively,  doubling  the 
stick  length  (in  the  F/B-S  metric)  for  a  simple  convex 
section  of  a  curve  should  result  in  four  times  the  score 
assigned  to  the  corresponding  critpt:  The  stick  is  now 
positioned  twice  the  distance  from  the  critpt  in  most  of 
its "looks" (i.e., placements of the stick which subsume
a  curve  segment  containing  the  critpt),  and  there  are 
twice  as  many  looks.  Thus,  the  procedure  we  employ, 
normalizing  all  scores  by  dividing  by  the  square  of  the 
sticklength,  will  leave  invariant  the  saliency  scores  as¬ 
signed  to  features  which  should  be  scale  invariant,  such 
as  the  angle  formed  by  two  (effectively)  infinite  straight 
lines.  On  the  other  hand,  for  those  features  that  have 
limited  extent  along  the  curve,  comparable  to  the  scales 
we  wish  to  discriminate  among,  the  larger  scaled  versions 
of  the  features  will  be  assigned  higher  scores. 
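The intuition behind dividing by the square of the stick length can be checked on a toy model. Everything in this sketch is an assumption made for illustration and is not the authors' implementation: the feature is an ideal right-angle corner at the origin with arms along the axes, one stick end sits at integer distance `a` from the corner on the first arm, only one traverse is made, and no noise threshold is applied.

```python
import math


def corner_raw_score(stick):
    """Raw score collected at an ideal right-angle corner by one traverse."""
    total = 0.0
    for a in range(1, stick):
        b = math.sqrt(stick ** 2 - a ** 2)   # other stick end, on the second arm
        total += a * b / stick               # distance from the corner to the stick
    return total


def normalized_score(raw_score, stick):
    """Normalize a raw accumulator score by the square of the stick length."""
    return raw_score / stick ** 2


raw10 = corner_raw_score(10)
raw20 = corner_raw_score(20)
ratio = raw20 / raw10                  # close to 4: twice the looks, twice the distance
n10 = normalized_score(raw10, 10)
n20 = normalized_score(raw20, 20)      # close to n10 after normalization
```

In this model the raw score roughly quadruples when the stick length doubles, so the normalized scores for the two sticks nearly coincide, as expected for a scale-invariant feature.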

6. Cooperation/Competition Interactions
Between Critical Points

An  important  contribution  of  this  paper  over  the  work 
presented in Fischler and Bolles [Fischler86] is a major
revision  of  the  approach  to  filtering  the  critpts,  based 
both  on  comparisons  at  a  given  scale  as  well  as  across 
different scales. At a conceptual level, there are two main
differences. 

First, in the earlier work we did not use the information
about the sign (concavity/convexity) of the computed
F/B-Saliency; in our current algorithm, we separate
all the candidate critpts into two sets corresponding
to positive and negative F/B-S. These two sets
are  processed  independently  of  each  other  (by  identical 
procedures)  and  the  resulting  selections  are  combined  by 
logical  union  to  produce  the  final  output.  Our  own  obser¬ 
vations  confirm  those  of  other  researchers  (e.g.,  Hoffman 
and  Richards),  that  positive  and  negative  curvature  ex¬ 
trema  appear  to  be  distinguished  from  each  other  by  the 
HVS,  in  part  because  they  play  different  roles  in  parti¬ 
tioning  and  description  tasks. 

Second, in the earlier work we used a simple "dominance"
criterion for competition of closely spaced critpts
detected at different scales. A critpt detected at some
given scale would suppress all critpts detected at smaller
scales (shorter "sticklength") that were located within a
specified scale-related distance from it. This rule rarely
produced "ugly" errors, but occasionally caused the obviously
correct critpt to be deleted in favor of one slightly
displaced from the preferred location. A significant portion
of the work described in this paper has been focused
on finding a more effective and uniform basis for
establishing "local dominance." In other sections of this
paper  we  provided  a  justification  for  a  normalization  fac¬ 
tor  which  would  permit  us  to  assign  a  saliency  ranking 
to  competing  critpts,  regardless  of  the  scale  at  which 
they  were  originally  detected.  Thus,  competition,  both 
within  and  across  different  scales  is  now  treated  in  a 
uniform  manner.  In  the  following  subsection  we  discuss 
some  of  the  specific  problems  that  must  be  resolved  in 
competition  resolution,  and  the  algorithmic  procedures 
we  invoke  to  deal  with  these  problems. 

6.1  Mechanisms  for  Filtering  Competing 
Critpts 

One of the algorithmic mechanisms we devised to deal
with the above problems (described in greater detail in
Appendix B) is to construct an array with one slot for
each indexed location along the curve (conceptually two
such arrays, one each, respectively, for positive and negative
saliency scores). Each slot is either free or "owned"
by exactly one critpt. A critpt occupies only one of the
slots it owns - this occupied slot corresponds to its actual
location along the curve. A "new" critpt, contending
for a slot, must have a normalized score greater than the
existing value stored in the slot to capture it. If a new
critpt captures a slot occupied by (as opposed to simply
being owned by) a previously dominant critpt, all of
the slots of the now dominated critpt are also captured.
This mechanism provides a way of avoiding the need to
choose a fixed-sized "base of support" for a critpt.

Footnote: For an open curve segment, the assignment of positive vs. negative
is arbitrary; the important consideration is that we use the
information about the direction of deviation of the curve from the
stick to separate detected critpts into the two possible categories
which are then processed separately.

Footnote: All the potential critpts are detected, sorted, and then entered
into the array in increasing order of saliency to avoid sequence
dependent effects.

7.  Algorithm  Performance 

The  algorithm  discussed  in  the  previous  sections  of 
this  paper,  and  described  in  Appendix  B,  has  been  com¬ 
pared  with  human  performance  (Figure  1),  and  has  been 
run  on  hundreds  of  randomly  generated  images  (as  de¬ 
scribed  in  Appendix  A)  without  making  any  obvious  er¬ 
rors.  In  all  these  cases  the  same  set  of  parameters  were 
used  with  no  operator  involvement.  Figure  5  shows  40 
consecutively  generated  random  curves  and  the  critpts 
selected  by  the  algorithm.  Figure  6  in  Appendix  C  shows 
results  of  the  algorithm  run  on  curves  extracted  from  real 
images. 

8.  Discussion 

Curve  partitioning  is  an  active  research  area  which 
not  only  is  of  theoretical  interest  as  a  basic  element  in 
pictorial description (e.g., Attneave, Bengtsson and Eklundh,
Hoffman and Richards), and for providing insight
into the partitioning problem in general (e.g., Fischler
and Bolles), but has many potential applications. Some
of the more immediate ones include: data compression
by using critpts as the basis for regenerating a curve
by straight line or spline interpolation (e.g., Imai and
Iri, Teh and Chin), matching/recognition using critpts
and/or the partitioned curve segments (e.g., Mokhtarian
and Mackworth, Wuescher and Boyer), and as a key
component  of  an  interface  for  man-machine  communi¬ 
cation  about  pictorial  objects  (the  ability  to  point  at 
icons  representing  symbolic  objects  has  revolutionized 
the  computer-user  interface;  to  extend  this  capability, 
one  would  like  to  be  able  to  point  to  a  location  in  an 
image  and  have  the  machine  be  able  to  deduce  the  com¬ 
ponent  being  referred  to  -  image  partitioning  in  gen¬ 
eral,  and  especially  curve  partitioning,  are  critical  to  this 
goal). 

In  this  paper  we  have  focused  on  one  specific  aspect 
of  the  curve  partitioning  problem:  Duplicating  human 
performance  in  the  selection  of  a  small  number  of  points 
(called  critpts)  along  a  curve  segment  which  could  be 
used  as  the  basis  for  reconstructing  the  curve  at  some 
future  time.  While  there  will  generally  be  a  significant 
degree  of  overlap  in  the  points  selected  by  the  tech¬ 
niques  referenced  above  (focused  on  different  applica¬ 
tions),  there  are  also  significant  differences.  There  has 
been  very  little  recent  work  on  the  generic  problem  of 
choosing  psychologically  salient  points  with  which  to  di¬ 
rectly  compare  our  results.  On  the  other  hand,  we  have 
conducted  a  relatively  large  number  of  experiments  with 
uniformly good results (e.g., see Figure 5).

There are two major paradigms underlying the published
work on partitioning planar curves. The first involves
obtaining a mathematically differentiable representation
of the given digital curve by the use of splining
or Gaussian convolution (e.g., Mokhtarian86). This
gives  good  results  for  many  applications,  but  the  salient 
points  on  the  smoothed  curve  are  often  displaced  from 
their  original  locations  (or  eliminated).  This  paradigm 
is  not  suitable  for  our  purposes  in  this  paper. 

The  second  paradigm,  which  includes  the  work  de¬ 
scribed  here,  is  to  first  measure  some  approximation  to 
the  curvature  at  each  point  on  a  curve.  This  usually  in¬ 
volves  choosing,  or  finding,  an  appropriate  scale  at  which 
to  make  the  curvature  measurement.  This  is  typically  ac¬ 
complished  by  making  the  curvature  measurement  over 
increasingly  larger  curve  segments  (centered  on  the  curve 
point being evaluated) until either the computed curvature
at the point, or some related quantity, reaches a local
extremum. Each point is assigned a saliency/criticality
value  (its  estimated  curvature)  and  an  interval  length 
along  the  curve  centered  on  the  point  (called  its  re¬ 
gion  of  support).  The  region  of  support  is  then  used  for 
non-maximum  suppression  -  each  point  suppresses  other 
points  with  lower  criticality  scores  falling  in  its  region  of 
support. 
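For contrast with the approach taken in this paper, the conventional fixed-support suppression step just described can be sketched as follows. This is a generic illustration, not code from any of the cited methods; the function name and the choice to treat the region of support as an interval in index units are assumptions.

```python
def non_maximum_suppression(scores, supports):
    """Return indices of points not suppressed by a higher-scoring neighbor.

    scores[i]   -- criticality value of curve point i
    supports[i] -- length of the region of support centered on point i
    """
    n = len(scores)
    keep = []
    for i in range(n):
        suppressed = any(
            scores[j] > scores[i] and abs(i - j) <= supports[j] / 2
            for j in range(n) if j != i
        )
        if scores[i] > 0 and not suppressed:
            keep.append(i)
    return keep


scores = [0, 1, 5, 2, 0, 0, 0, 3, 0]
supports = [4] * len(scores)
selected = non_maximum_suppression(scores, supports)
```

Because each region of support has a fixed extent, a strong point lying just outside that extent has no influence on the outcome, which is the horizon effect discussed below.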

Major  differences  between  our  approach  and  other 
work under this second paradigm include:

•  A  generic  saliency  measure  which  often  selects 
points  corresponding  to  local  curvature  extrema, 
but  which  in  many  situations  is  in  better  accord 
with  human  selection  preference  and  placement  ac¬ 
curacy. 

•  A  distinct  approach  to  the  problem  of  dealing  with 
curve  features  salient  at  different  scales.  The  con¬ 
ventional  approach  is  to  associate  a  single  scale  with 
each  curve  point  which  in  turn  defines  a  fixed  re¬ 
gion  of  support  to  be  used  for  non-maximum  sup¬ 
pression.  In  our  approach,  we  measure  the  saliency 
of  each  curve  point  at  a  number  of  different  scales, 
and have developed procedures for allowing potential
critpts, found at different scales and spatial locations,
to compete with each other. This competition
is not restricted to any fixed extent of the curve
(which  thus  avoids  anomalous  selections  caused  by 
an  important  event  occurring  just  beyond  the  fixed 
limit  of  search,  i.e.,  the  horizon  effect). 

Footnote: Additional approaches are available for partitioning 1-D
curves; for example, see Fischler and Wolf [Fischler83] or Witkin
[Witkin83]. As noted in Appendix B, the 1-D partitioning technique
in the Fischler83 reference is used as a component of the SSS
algorithm.

Footnote: It is interesting to note that we have not found a use for cooperative
reinforcement - cooperation appears to be a global relation.
Competition is important at the local level (e.g., lateral inhibition).


Our  approach  to  local  saliency  selection  can  be  con¬ 
sidered  a  form  of  automated  preattentive  perception. 
Potential  extensions  could  include  dealing  with  more 
global  curve  features,  such  as  recognizing  the  intersec¬ 
tion  of  extended  straight  line  segments,  or  transition 
points  between  analytic  curves  with  different  parame¬ 
ters,  or  global  symmetries  and  repeated  structure.  Rec¬ 
ognizing  these  more  global  structures,  and  ranking  them 
with  respect  to  human  perceived  saliency,  may  well  fall 
outside  the  competence  of  the  basic  approach  described 
in  this  paper. 

9.  Appendices 

9.1  Appendix  A:  Generation  of  Random 
Curves 

The  following  method  was  used  to  construct  the  ran¬ 
dom  curves  used  in  the  experiments  described  in  the 
body  of  this  paper. 

(1)  Thirty  (x,y)  pairs  are  generated  for  each  curve. 
Each value of x and y is generated by a uniform-distribution
(0-1) random-number generator and then
multiplied by 100 to produce numbers (coordinate
values) uniformly distributed between 0 and 100.

(2) The thirty points are next linked by a minimal
spanning tree (MST).

(3)  A  diameter  path  is  extracted  from  the  MST,  and 
the  ordered  subset  of  the  original  randomly  generated 
points  that  fall  along  this  diameter  path  are  the  input 
sequence  provided  to  a  spline-fitting  routine  [Cline74] 
which  returns  a  continuous  curve  represented  by  a  se¬ 
quence  of  (x,y)  coordinate  pairs.  These  sequences,  typ¬ 
ically  containing  on  the  order  of  150-250  points,  are  the 
random  curves  used  in  our  experiments. 
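Steps (1) to (3) can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the MST is built with Prim's algorithm on Euclidean edge lengths, the "diameter path" is taken to be the longest path in the tree measured by summed edge length, and the spline-fitting routine of [Cline74] is not reproduced, so the sketch stops at the ordered control points that would be passed to it. All function names are hypothetical.

```python
import math
import random


def random_points(n=30, extent=100.0, rng=random):
    """Step (1): n points with coordinates uniform on [0, extent]."""
    return [(rng.random() * extent, rng.random() * extent) for _ in range(n)]


def minimal_spanning_tree(points):
    """Step (2): Prim's algorithm; returns {index: [(neighbor, weight), ...]}."""
    n = len(points)
    adjacency = {i: [] for i in range(n)}
    # best[i] = (distance from i to the tree, tree vertex achieving it)
    best = {i: (math.dist(points[0], points[i]), 0) for i in range(1, n)}
    while best:
        i = min(best, key=lambda k: best[k][0])
        weight, parent = best.pop(i)
        adjacency[i].append((parent, weight))
        adjacency[parent].append((i, weight))
        for k in best:
            d = math.dist(points[i], points[k])
            if d < best[k][0]:
                best[k] = (d, i)
    return adjacency


def farthest(adjacency, start):
    """Return the vertex farthest from start along tree edges, plus a parent map."""
    dist = {start: 0.0}
    parent = {start: None}
    stack = [start]
    while stack:
        u = stack.pop()
        for v, w in adjacency[u]:
            if v not in dist:
                dist[v] = dist[u] + w
                parent[v] = u
                stack.append(v)
    return max(dist, key=dist.get), parent


def diameter_path(adjacency):
    """Step (3): ordered vertex indices along a longest path in the tree."""
    a, _ = farthest(adjacency, 0)
    b, parent = farthest(adjacency, a)
    path = []
    while b is not None:
        path.append(b)
        b = parent[b]
    return path


rng = random.Random(0)
pts = random_points(rng=rng)
tree = minimal_spanning_tree(pts)
path = diameter_path(tree)
control_points = [pts[i] for i in path]   # input sequence for a spline fit
```

Passing `control_points` to any interpolating spline routine and sampling the result would complete step (3); the choice of spline affects the shape of the resulting curves.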

9.2  Appendix  B:  An  Algorithm  For 
Computing  Curve-Point  Criticality 

The  partitioning  algorithm  described  in  Fischler  and 
Bolles [Fischler86] has been modified and extended as
summarized  below. 

The algorithm collects candidates (peaks) for the critical
points of a curve by examining the deviation of the
points of the curve from a chord or "stick" that is
iteratively advanced along the curve. Sticks of different
lengths are used to find critical points that are salient
at different "natural" scales on the given curve. (Except
when explicitly stated otherwise, two sticks were used
for all the experiments discussed in this paper: one of
length 10 pixels and the other of length 20 pixels.) The
algorithm provides the option of using arc-length along
the curve, or the Euclidean length of the stick, to determine
the separation of the endpoints of the stick on
the curve; we used the Euclidean length of the stick for
all of the experiments discussed in this paper. One end
of the stick is advanced along the curve, one pixel at a
time, and the other end is placed at the first (sequential)
position further along the curve for which the Euclidean
distance equals or exceeds the specified stick length.

For  each  placement  of  the  stick,  an  accumulator  asso¬ 
ciated  with  the  curve-point  (in  the  interval  of  the  curve 
between  the  two  endpoints  of  the  stick)  of  maximum 
deviation  from  the  stick  is  incremented  by  the  absolute 
value  of  the  distance  from  the  point  to  the  stick  if  this 
distance  exceeds  a  predefined  noise  threshold.  However, 
for  the  given  stick  placement,  if  there  is  more  than  one 
excursion  (exit  and  return)  outside  the  noise  region,  the 
underlying  model  is  violated  and  the  accumulators  are 
not incremented. (The noise threshold was uniformly
set to 20 percent of stick length; thus a Euclidean deviation
of more than 2 "pixels" from a stick of length 10
was required to cause any modification of the associated
accumulator.)

To  deal  with  direction  dependent  effects,  a  complete 
traverse  is  made  in  both  directions  along  the  curve  sum¬ 
ming  the  results  in  the  same  accumulators.  The  points 
which  have  locally  maximum  scores  in  the  accumulators 
(called  peaks)  for  any  of  a  given  set  of  sticks  are  the 
points  from  which  the  critical  points  will  be  selected. 

The  following  information  is  collected  for  each  peak 
and  used  to  find  the  critical  points: 

•  INDEX: the sequence number along the curve of the
point  at  which  the  peak  was  located. 

•  STICK:  the  length  of  the  stick  (in  pixels)  used  to 
find  the  peak. 

•  DEV: the sign of the deviation of the peak with
respect to the stick.

•  NSCORE: the "normalized" score which is the score
in  the  accumulator  for  the  peak  divided  by  the 
square  of  the  stick  length. 

The  peaks  are  divided  into  two  groups  with  like-signed 
deviation  DEV.  The  critical  points  for  the  two  groups 
are  found  independently  of  each  other  and  their  union  is 
returned  as  the  set  of  critical  points  for  the  curve. 
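The peak-collection stage described so far can be sketched as follows. This is an illustrative reading of the text, not the authors' implementation. Assumptions: the curve is a list of (x, y) pixel coordinates; deviation is the signed perpendicular distance to the line through the stick endpoints; a separate accumulator is kept for each deviation sign; and a local maximum is any positive entry that exceeds its left neighbor and is not exceeded by its right neighbor. Function names are hypothetical.

```python
import math


def scan_with_stick(curve, stick, noise_fraction=0.2):
    """Accumulate deviation scores for one stick length, traversing both ways.

    Returns {+1: [...], -1: [...]}: one accumulator per deviation sign.
    """
    n = len(curve)
    noise = noise_fraction * stick
    acc = {+1: [0.0] * n, -1: [0.0] * n}
    for direction in (+1, -1):
        order = list(range(n))[::direction]   # curve indices in traversal order
        for a in range(n):
            p0 = curve[order[a]]
            # Other end: first later point at Euclidean distance >= stick length.
            b = next((k for k in range(a + 1, n)
                      if math.dist(p0, curve[order[k]]) >= stick), None)
            if b is None:
                continue
            p1 = curve[order[b]]
            vx, vy = p1[0] - p0[0], p1[1] - p0[1]
            chord = math.hypot(vx, vy)
            # Signed distance of each spanned point from the stick; multiplying
            # by direction keeps the sign convention of the forward traverse.
            devs = [direction * (vx * (curve[order[k]][1] - p0[1])
                                 - vy * (curve[order[k]][0] - p0[0])) / chord
                    for k in range(a, b + 1)]
            outside = [abs(d) > noise for d in devs]
            excursions = sum(1 for prev, cur in zip([False] + outside, outside)
                             if cur and not prev)
            if excursions != 1:
                continue   # nothing above the noise, or more than one excursion
            k_max = max(range(len(devs)), key=lambda k: abs(devs[k]))
            sign = 1 if devs[k_max] > 0 else -1
            acc[sign][order[a + k_max]] += abs(devs[k_max])
    return acc


def local_peaks(values):
    """Indices of positive local maxima in an accumulator."""
    n = len(values)
    return [i for i in range(n)
            if values[i] > 0
            and (i == 0 or values[i] > values[i - 1])
            and (i == n - 1 or values[i] >= values[i + 1])]


def collect_peaks(curve, sticks=(10, 20)):
    """Gather INDEX, STICK, DEV and NSCORE for every peak of every stick."""
    records = []
    for stick in sticks:
        for dev, row in scan_with_stick(curve, stick).items():
            for i in local_peaks(row):
                records.append({"INDEX": i, "STICK": stick, "DEV": dev,
                                "NSCORE": row[i] / stick ** 2})
    return records


# An L-shaped test curve with its corner at index 30.
corner = [(x, 0) for x in range(31)] + [(30, y) for y in range(1, 31)]
peaks = collect_peaks(corner)
```

On the L-shaped test curve both sticks report a single peak at the corner, with the same deviation sign, which is the behavior the text describes for a simple convex section.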

In  finding  the  critical  points,  we  stipulate  that  each 
peak’s  score  has  a  region  of  support,  plus  and  minus  half 
its  associated  stick  length,  on  each  side  of  its  position 
along the curve. An array (the support array) of length
equal to the length of the curve is used to store the support
information. The support information for a peak is a list
(NSCORE  INDEX  STICK).  For  each  peak,  the  support 
information  may  be  entered  at  every  index  location  cov¬ 
ered  by  the  region  of  support  depending  on  what  was 
previously  stored  in  the  location. 

For all locations in the support region for the new
peak (in the support array), an entry at J is replaced by
the information for the new peak if there is no previous
entry in the array or if the score for the new peak is
greater than the score in the existing entry in the array. In
addition, if the entry J is being replaced, and J is also
the INDEX for a peak that was entered previously, the
support information for the new peak replaces the support
information of the old peak wherever it occurs in
the support array (i.e., even outside of the new peak’s
original support region).

After  the  above  processing,  the  critical  points  for  the 
curve  are  designated  as  those  points  whose  index  into  the 
support  array  equals  the  index  stored  in  the  information 
list  of  the  array  element. 

It  can  be  seen  that  the  order  in  which  peaks  are  en¬ 
tered  into  the  support  array  can  affect  the  final  selection 
of  the  critical  points  because  a  peak’s  region  of  support 
can  be  altered  by  the  “capture”  process,  and  thus  de¬ 
pends  on  the  state  of  the  support  array  at  the  time  the 
peak is entered. In our implementation of the algorithm
for  running  the  experiments,  we  entered  the  peaks  into 
the  support  array  as  soon  as  they  were  computed  in  or¬ 
der  to  gain  computational  efficiency  and  simplicity,  and 
still  obtained  excellent  results.  In  the  current  version  of 
the  algorithm  we  collect  all  the  peaks  for  all  the  sticks, 
sort  the  peaks  by  their  normalized  scores,  and  then  enter 
them  into  the  support  array  in  order  of  increasing  score. 
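The support-array competition can be sketched as follows for one group of like-signed peaks. This is a minimal reading of the description above, not the authors' code. Assumptions: each peak is a tuple (NSCORE, INDEX, STICK), the region of support is measured in index units and clipped to the ends of the curve, and the handling of sharp angles and near-equal scores mentioned below is omitted. The function name is hypothetical.

```python
def select_critical_points(peaks, curve_length):
    """Return the indices of the peaks that survive the competition."""
    support = [None] * curve_length
    for peak in sorted(peaks):                 # enter in order of increasing NSCORE
        nscore, index, stick = peak
        half = stick // 2
        lo = max(0, index - half)
        hi = min(curve_length - 1, index + half)
        for j in range(lo, hi + 1):
            old = support[j]
            if old is not None and nscore <= old[0]:
                continue                       # existing entry is at least as strong
            if old is not None and old[1] == j:
                # Slot j was occupied by the old peak itself: capture every slot
                # it owns, even outside the new peak's original support region.
                support = [peak if entry == old else entry for entry in support]
            support[j] = peak
    # A critical point is a slot whose index equals the INDEX stored in it.
    return [j for j, entry in enumerate(support)
            if entry is not None and entry[1] == j]


peaks = [(0.30, 30, 20), (0.28, 30, 10), (0.05, 34, 10), (0.20, 70, 10)]
critical = select_critical_points(peaks, curve_length=100)
```

In this example the weak peak at index 34 and the shorter-stick peak at index 30 are both absorbed by the stronger peak at index 30, while the isolated peak at index 70 survives. To follow the text, the function would be called once for each deviation sign and the two results combined by union.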

There  are  some  additional  aspects  of  the  algorithm 
that  are  further  discussed  in  the  more  complete  version 
of this paper, including ways to handle problems associated
with very sharp angles and competing critpts of
approximately equal saliency scores.

9.3  Appendix  C:  Partitioning  Curves 
Extracted  From  Aerial  Imagery 

A  technique  for  detecting  and  delineating  low  resolu¬ 
tion  linear  structures  appearing  in  aerial  imagery,  such  as 
roads  and  rivers,  was  described  by  the  authors  of  this  pa¬ 
per  in  an  earlier  publication  [Fischler83].  The  algorithm 
was  effective  in  finding  such  structure,  but  it  provided  no 
mechanism  for  distinguishing  between  the  semantically 
meaningful  objects  and  the  “accidental”  and  irrelevant 
linear  features  found  in  most  real  images.  In  work  now  in 
progress,  we  use  the  SSS  algorithm  to  “slice  up”  the  in¬ 
dividual  curves  found  by  the  delineation  algorithm.  We 
throw  away  the  very  small  resulting  segments  which  are 
typical  of  accidental  linear  formations,  and  then  further 
filter  the  longer  segments  with  respect  to  a  set  of  seman¬ 
tic  constraints.  Those  segments  that  pass  through  the 
filtering  process  are  then  “glued”  back  together  to  pro¬ 
duce  the  desired  delineation.  This  process  is  illustrated 
in  Figure  6.  Figure  6a  shows  an  aerial  image,  and  6b 
shows  the  linear  segments  extracted  by  use  of  the  orig¬ 
inal  delineation  algorithm.  Figure  6c  shows  those  seg¬ 
ments  that  passed  through  the  filters  mentioned  above, 
and Figure 6d shows the result of a final step to retain
only the more significant roads and trails. The two panes
of Figure 6e show the results of applying the SSS algorithm
to some of the 120 curves highlighted in Figure
6b (they have been isolated and separated into the two
panes to allow clear display of the partition points and
to prevent confusion due to the intersections of distinct
curves). The robustness of the SSS algorithm is essential
in carrying out the filtering operation. Insertion of extraneous
partition points would cause the loss of portions
of  the  road  network;  absence  of  valid  partition  points 
would  allow  meaningless  appendages  to  become  part  of 
the  extracted  network. 
CURVE PARTITIONING: Instructions
For  each  enclosed  curve: 

Assume  that  10  years  from  now  you  will  be  asked  to  reconstruct  the  given  curve. 
A  reasonably  correct  reconstruction  will  be  rewarded  by  a  large  sum  of  money  (say 
$5000).  You  can  record,  for  later  use,  the  locations  of  up  to  nine  points  along  the 
curve  to  help  you  do  the  reconstruction  -  but  it  will  cost  you  $200  for  each  such 
point  (to  be  subtracted  from  your  prize  if  you  receive  the  reward).  Please  mark  your 
selected  points  on  the  curve.  Do  not  select  the  endpoints,  they  will  be  provided  free. 
Do  not  take  more  than  one  minute  per  curve. 


Critical  points  found  by  the  SSS  algorithm 


Points chosen by at least 1 of 11 test subjects


Figure  1:  Comparison  of  human  and  SSS  algorithm  performance  in  the  curve 
partitioning  task.  (Each  of  the  curves  used  in  the  experiments  with  human 
subjects  was  contained  in  a  square  that  was  1.5  inches  on  a  side.) 
(a) Test curve 189
(b) SSS selected critpts
(c) R/J-curvature selected critpts
(d) Anomalous area (magnified)

(e) Plot of R/J-curvature along test curve. Abscissa = sequence number of point on curve.
Ordinate = angle (in degrees) computed at point. (Angle-arms are 10 units each for
R/J-C; standard stick lengths of 10 and 20 units are employed by SSS.)


Figure  2:  Comparison  of  SSS  and  R/J-curvature  metrics  evaluated  on  test  curve  189. 
The  continuous  curve  in  (e)  represents  R/J-curvature  along  the  test  curve  shown  in  (a). 
The  vertical  lines  in  (e)  mark  the  sequentially  numbered  critpts  selected  by  SSS  as  shown 
in  (b).  The  critpts  corresponding  to  the  extreme  values  of  R/J-curvature  shown  in  (c) 
are  marked  as  circles  in  (e).  The  arrow  in  (c),  and  in  the  corresponding  location  in  (e), 
illustrates  an  anomalous  selection  using  R/J-curvature.  (d)  shows  the  computed  values  of 
R/J-curvature,  153°,  at  the  preferred  location  and  122°  at  the  location  of  the  anomalous 
selection. 


(a) Test curve 166
(b) SSS selected critpts
(c) R/J-curvature selected critpts

(d) Plot of R/J-curvature along test curve. Abscissa = sequence number of point on curve.
Ordinate = angle (in degrees) computed at point. (Angle-arms are 10 units each for
R/J-C; stick length is 20 units for F/B-S.)


Figure 3: Comparison of SSS and R/J-curvature metrics evaluated on test curve 166.
The  continuous  curve  in  (d)  represents  R/J-curvature  along  the  test  curve  shown  in  (a). 
The  vertical  lines  in  (d)  mark  the  sequentially  numbered  critpts  selected  by  SSS  as  shown 
in  (b).  The  critpts  corresponding  to  the  extreme  values  of  R/J-curvature  shown  in  (c) 
are  marked  as  circles  in  (d).  The  arrows  in  (c),  and  in  the  corresponding  locations  in 
(d), illustrate anomalous selections using R/J-curvature.


(a)  (b) 

Figure  4:  Curvature  and  saliency  are  functions  of  curve  resolution.  As  illustrated  in  (a) 
above, we can draw more than one visually acceptable tangent to many of the points on this
curve at the given resolution. As resolution increases, tangent 2 would dominate at point x;
as resolution decreases, tangent 1 would dominate at the same point. In (b), the angle at x
can be seen as 45° at one scale and 90° at a larger scale. Thus, curvature and saliency are
not  unique  properties  of  curve  points. 
Figure 5: Critical points found by the SSS algorithm for a set of 40 random curves.
(a) Aerial photograph
(b) Initial extraction of linear structure
(c) Filtered linear structure using SSS algorithm
(d) Delineation of major roads and trails
(e) Partition points found by SSS algorithm on curves from (b)


Figure  6:  Application  of  the  SSS  algorithm  to  the  problem  of  delineating  linear  features  in 
aerial  photographs. 
Abstract

This  paper  presents  an  algorithm  for  detecting  oc¬ 
cluding  edges  in  stereo  pairs.  The  algorithm  de¬ 
scribed  here  does  not  require  dense  correspondence 
estimates  and  hence  may  be  applied  in  a  selective 
manner.  It  extracts  intensity  edges  and  tests  them 
for  occlusion  by  sampling  edgels  along  each  edge. 
The  algorithm  works  by  searching  for  matches,  in 
the  right  image,  for  the  regions  to  the  left  and  right 
of  a  left-image  edgel.  The  method  is  based  on  that 
of  Toh  and  Forrest  [1990],  but  extends  that  work  in 
several  ways.  First,  it  adds  an  algorithm  for  auto¬ 
matically  selecting  an  appropriate  size  for  the  corre¬ 
lation  windows  used  to  detect  the  occlusions.  Sec¬ 
ond,  it  adds  a  simple  technique  for  classifying  an 
entire  edge  by  sampling  only  a  few  pixels  along  the 
edge.  Finally,  it  identifies  two  situations  that  lead  to 
false  positive  and  false  negative  classifications,  and 
describes  solutions  to  these  problems. 

1  Introduction 

Occluding  edges  provide  important  information 
about  one’s  environment.  Knowledge  of  the  loca¬ 
tions  of  occluding  edges  can  be  used  to  increase  the 
robustness  of  object  recognition  programs  [Thomp¬ 
son  and  Whillock,  1988].  Also,  occluding  edges  can 
denote  surface  discontinuities  such  as  holes  or  cliffs. 
Finally,  occluding  edges  identify  the  borders  of  re¬ 
gions  that  cannot  be  imaged  from  the  current  cam¬ 
era  viewpoint,  and  hence  can  be  used  to  guide  a 
camera  to  look  into  these  regions.  For  all  of  these 
uses,  it  seems  sufficient  to  view  occluding  edges  as 
qualitative  phenomena;  it  does  not  seem  necessary 
to  construct  a  dense  depth  map. 

Traditionally, however, occlusions have been detected
as a side-effect of the construction of dense
correspondence maps. The most widely-described
occlusion-detection techniques attempt to detect not
occluding edges, but rather occluded points in a
stereo or motion pair [Mutch and Thompson, 1985,
Geiger et al., 1992, Jones and Malik, 1992, Weng
et al., 1992]. These are points that appear in one
image but do not appear in the other, and hence
cannot be matched from one image to the other.
Such points appear in the vicinity of occluding edges.
Because of the difficulties of determining matching
points, occluded-point detection is typically incorporated
into algorithms for determining dense correspondence
maps. Unfortunately, such algorithms are
time-consuming, and even once the occluded points
have been located, the occluding edges still must be
inferred. Methods that do not involve dense correspondence
and that directly identify occluding edges
would be more desirable.

Footnote: This material is based on work supported by DARPA
Contract MDA972-92-J-1012.

Three  such  methods  have  appeared  in  the  literature. 
The first uses active adjustment of a camera’s
depth-of-field [Toh and Forrest, 1990, Brunnstrom et al.,
1991]. The remaining two techniques involve stereo
pairs.  Little  [1990]  identifies  points  near  an  occlusion 
edge  by  examining  the  distribution  of  match  good¬ 
nesses  as  a  window  around  the  point  is  shifted  hori¬ 
zontally.  If  the  window  overlaps  surfaces  at  different 
depths,  this  distribution  is  bimodal.  Finally,  [Toh 
and  Forrest,  1990]  presents  a  method  that  examines 
two adjacent image patches. Its operation is illustrated
in Figure 1. Given a near-vertical edgel* e in
the left image L, this algorithm operates by selecting
two windows in L. Window Pl runs leftwards from
e, while window Nl runs rightwards from e. The
right image R is then searched, along roughly the
epipolar line, for the regions Pr and Nr that best
match Pl and Nl, respectively. Let Cp and Cn be

* An edgel is simply a pixel at which a significant intensity
gradient exists. This paper describes only the use of near-vertical
edgels because it is effectively impossible to match
horizontal  edgels  in  left  and  right  stereo  pairs  —  usually  a 
match  at  one  disparity  is  indistinguishable  from  a  match  at 
another.  Of  course,  the  same  algorithms  could  be  applied  to 
near-horizontal  edgels  given  a  top  and  bottom  stereo  pair. 
Additional details can be found in [Wixson,
1993].
Figure 1: Visual cue indicating the presence of an
occluding edge. If e is a right-occluding edge, then
the surface to its right spanned by Nl will be visible
in the right image. On the other hand, the surface
to e’s left is blocked from the right image’s view. As
a result, the region spanned by Pl is not visible in
the right image.

the  goodnesses-of-match  between  Pl  and  Pr,  and 
Nl  and  Nr,  respectively.  If  Cp  and  Cn  are  both 
high, and Pr and Nr are next to each other in R,
this means that the regions on both sides of e were
visible in the right image, and therefore that e is an
intensity edge but not an occluding edge. If Cp (Cn)
is small, however, then this means that Pl (Nl) is
not visible in the right image, and that e is a
right-(left-)occluding edge. (A right-(left-)occluding edge
is  an  edge  such  that  the  surface  to  its  right  (left)  is 
closer  to  the  camera.) 

This  paper  describes  experience  with  Toh  and  For¬ 
rest’s  algorithm.  It  contains  solutions  to  two  prob¬ 
lems  that  arise  in  practice,  namely  how  to  select 
Pl  and  Nl  so  that  they  contain  enough  texture  for 
the  matching  algorithm  to  be  able  to  truly  deter¬ 
mine  whether  or  not  matches  exist  in  the  right  im¬ 
age,  and  how  to  use  the  detector  to  rapidly  clas¬ 
sify  edges.  The  paper  also  identifies  two  situations 
that can cause misclassifications. First, if diagnosis
is based only on Cp and Cn, situations exist
in which false negative responses can be obtained,
i.e. occluding edgels can be classified as surface
markings. Second, false positive responses can also
be  obtained  —  a  non-occluding  edge  e  will  be  clas¬ 
sified  as  an  occluding  edge  if  a  true  occluding  edge 
is closer to the camera and passes through e’s Pl
or Nl window. Solutions to these problems are presented.


2  Selecting  the  width  of  the 
matching  windows 

Consider  the  stereo  image  pair  shown  in  Figure  4. 
It  contains  little  surface  texture  near  many  of  its 
intensity  edges.  Yet  many  occluding  edges  can  be 
seen  in  this  image  pair  by  comparing  the  distance 
between  pairs  of  edges.  For  example,  consider  the 
distance  between  the  left  edge  of  the  detergent  box 
and  the  right  edge  of  the  cylinder  to  its  left.  In  the 
left  image,  this  distance  is  larger  than  in  the  right 
image,  and  therefore  it  follows  that  the  left  edge  of 
the box is a right-occluding edge, i.e. an occluding
edge  whose  right  side  is  closer  to  the  camera  than 
its left. Similar inferences can be made from
the relationship between the right side of the cardboard
box and the left edge of the decahedron, and
between the coffee cup and the decahedron. At the same
time,  of  course,  the  distances  between  many  of  the 
surface  markings  on  the  detergent  box  are  the  same 
in  both  images  and  hence  these  are  not  occluding 
edges.  Thus  we  see  that  in  order  to  detect  occluding 
edges  in  untextured  scenes,  the  match  windows  must 
be  wide  enough  to  reach  the  next  adjacent  edge. 
This  suggests  that  a  good  strategy  for  selecting  the 
width  of  the  matching  window  might  simply  be  to 
make  the  window  wide  enough  to  contain  the  closest 
strong  intensity  edge.  This  can  be  done  as  follows: 
To find the right bound x_N of the Nl window to
the right of edgel (x_e, y_e), step rightwards from x_e
until the x-derivative exceeds a threshold δ, and from
there keep stepping rightwards until the x-derivative
falls below a threshold ε. The purpose of stepping
until falling below ε is to ensure that enough of the
adjacent edge is contained within the window for it
to influence the goodness of match.

To  ensure  that  the  window  is  big  enough  to  allow 
robust matching, a minimum distance m between
x_e and x_N is required. If x_N - x_e < m, x_N is reset
to x_e + m. On the other hand, no limit is imposed
on  the  maximum  window  width. 

The left bound x_P of the left match window Pl is
computed similarly. If the algorithm for selecting x_P
or x_N runs off the side of the image before a pixel
is found that meets the criteria, the edgel (x_e, y_e) is
classified as “unmeasurable”.
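
The window-selection rule described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `right_bound`, the use of the absolute value of the derivative, starting the scan one pixel to the right of the edgel, and treating a minimum-width bound that falls outside the image as unmeasurable are all choices made here for the sketch. The default thresholds are the values reported in the experimental section.

```python
def right_bound(dx, x_e, delta=6.0, eps=1.1, m=13):
    """Return the right bound x_N of the N window for the edgel in column x_e.

    dx is one image row of x-derivatives. Returns None when the scan runs
    off the image, i.e. the edgel is "unmeasurable".
    """
    width = len(dx)
    x = x_e + 1  # assumption: begin scanning just right of the edgel
    # Step rightwards until the derivative magnitude exceeds delta.
    while x < width and abs(dx[x]) <= delta:
        x += 1
    if x >= width:
        return None
    # Keep stepping until the magnitude falls below eps, so that enough of
    # the adjacent edge lies inside the window.
    while x < width and abs(dx[x]) >= eps:
        x += 1
    if x >= width:
        return None
    x_n = x
    # Enforce the minimum window width m.
    if x_n - x_e < m:
        x_n = x_e + m
    # Assumption: a bound pushed outside the image is also unmeasurable.
    return x_n if x_n < width else None
```

The left bound can be obtained with the same scan run leftwards from the edgel.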

To  increase  robustness,  both  match  windows  are 
given some vertical extent, i.e. are more than one
row  high.  However,  this  amount  is  fixed  rather  than 
adaptive. 

The  above  method  for  selecting  window  size  is  ad  hoc 
but  effective.  A  more  principled  method  is  described 
in [Kanade and Okutomi, 1990].




3  Matching 


Because  the  sizes  of  the  correlation  windows  vary 
depending  upon  the  surroundings  of  the  edgel  being 
tested,  and  we  would  like  to  have  a  single  threshold 
that  determines  whether  a  match  is  a  good  match, 
the  similarity  measure  must  be  independent  of  win¬ 
dow  size.  Therefore,  match  goodnesses  between  win¬ 
dows  in  the  left  and  right  image  are  computed  using 
normalized  correlation  [Ballard  and  Brown,  1982,  p. 
68]: 


$$C(p,q) = \frac{\overline{pq} - \bar{p}\,\bar{q}}{\sigma(p)\,\sigma(q)} \qquad (1)$$

where p and q are image patches, an overbar denotes the
mean over a patch, and σ(p) and σ(q)
are the standard deviations of the pixel values in the
patches.
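
As an illustration of Equation (1), the following sketch computes the normalized correlation of two equally sized patches given as flat lists of pixel values. The function name and the convention of returning 0 for a textureless patch (zero standard deviation) are assumptions made here, not part of the original method.

```python
from math import sqrt


def normalized_correlation(p, q):
    """Normalized correlation of two equal-length lists of pixel values."""
    if len(p) != len(q) or not p:
        raise ValueError("patches must be non-empty and of equal size")
    n = len(p)
    mean_p = sum(p) / n
    mean_q = sum(q) / n
    mean_pq = sum(a * b for a, b in zip(p, q)) / n
    # Population variances; clamp tiny negative values caused by rounding.
    var_p = max(sum(a * a for a in p) / n - mean_p ** 2, 0.0)
    var_q = max(sum(b * b for b in q) / n - mean_q ** 2, 0.0)
    denom = sqrt(var_p * var_q)
    if denom == 0.0:
        return 0.0  # assumption: a flat patch carries no match evidence
    return (mean_pq - mean_p * mean_q) / denom
```

Because the means are subtracted and the result is divided by both standard deviations, the score does not change under a gain or offset applied to either patch, and it does not grow with window size, which is what allows a single threshold to be used for windows of different widths.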


                     Cn > τ: yes                     Cn > τ: no

Cp > τ: yes          dpN < A: yes -> marking         left-occ
                     dpN < A: no  -> left-occ

Cp > τ: no           right-occ                       not-visible

Table 1: Diagnosis procedure for determining the
nature of an edgel given Cp, Cn, and dpN. Cp and
Cn are goodnesses-of-match between Pl and Pr,
and Nl and Nr, respectively, where Pr and Nr are
the windows in the right image that best match Pl
and Nl. dpN is the distance in pixels between the
right edge of Pr and the left edge of Nr.


4 Analyzing the matching results

Given the windows Pr and Nr in the right image
that best match Pl and Nl, let Cp and Cn be the
corresponding similarity measurements between the
windows, and let dpN be the distance, in pixels, between
the right edge of Pr and the left edge of Nr.
Let τ be the threshold over which a match is considered
to be a good match. Then the diagnosis as
to the nature of the edgel e that Pl and Nl abut is
determined according to Table 1. An edgel can be diagnosed
either as a surface marking, a left-occluding
edgel (which means that the left side of the edgel is
the closer surface), a right-occluding edgel, or as a
point that is not visible in the other image.

Table 1 is for the most part straightforward, except
for the case in its upper-left corner, where both Cp
and Cn indicate good matches. In this case, it is
necessary to examine dpN, because there are two
reasons why there could be good matches for both
Pl and Nl. The first, of course, is that e might only
be a surface marking, not an occluding edge, in which
case dpN should be small. The second possibility
is that e is a left-occluding edge, as illustrated
in Figure 2. In this case, the regions imaged by the
Pl and Nl windows in the left image are both imaged
in the right image, but the right camera will
also image some more of the background surface between
the Pr and Nr windows, so dpN is large. An example of this
phenomenon occurs with the right side of the detergent
box, a left-occluding edge, in Figure 4. In
the left image, much less of the cardboard box behind
the detergent box is visible. The portion of
the cardboard box visible in the left image is also
visible in the right image, but the right image also
contains a region of the cardboard box to the left of
this portion.
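
The decision logic of Table 1 is small enough to write out directly. The sketch below is an illustration only: the function and parameter names are hypothetical, and the default values for the match threshold and the gap threshold are the ones reported in the experimental section.

```python
def diagnose(c_p, c_n, d_pn, tau=0.75, gap=3):
    """Classify one edgel from its two match scores and the gap d_pn.

    c_p, c_n: goodness of match of the left (P) and right (N) windows.
    d_pn: distance in pixels between the matched windows in the right image.
    """
    p_good = c_p > tau
    n_good = c_n > tau
    if p_good and n_good:
        # Both sides were found; only the gap separates a surface marking
        # from a left-occluding edge.
        return "marking" if d_pn < gap else "left-occ"
    if p_good:
        return "left-occ"    # N window has no match in the right image
    if n_good:
        return "right-occ"   # P window has no match in the right image
    return "not-visible"
```

For example, `diagnose(0.9, 0.9, 5)` yields `"left-occ"`, which is the upper-left case of the table with a wide gap.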

5  Classifying  edges  based  on 
sparse  measurements 

One  advantage  of  the  occlusion  detection  method 
presented  above  is  that,  since  it  is  non-iterative  and 
does  not  involve  smoothing,  it  does  not  require  dense 
measurement  of  the  pixel  disparities.  As  a  result,  the 
algorithm  can  be  selectively  applied  only  to  points 
of  interest,  saving  computation  time.  One  obvious 
method  for  selecting  these  points  of  interest  is  to  se¬ 
lect  them  along  previously-extracted  intensity  edges. 
This  approach  has  several  advantages.  If  long  edges 
are  more  likely  to  be  important  than  short  edges,  the 
edges  can  be  tested  in  order  of  decreasing  length, 
thus  creating  a  type  of  anytime  algorithm.  Also, 
very  short  edges  can  be  discarded  to  avoid  wasting 
effort  testing  them  for  occlusion. 

Using  an  intensity  edge  to  select  edgels  to  be  tested 
for  occlusion  has  one  additional  advantage.  Edgels 
that  are  on  the  same  edge  are  likely  to  have  the  same 
properties.  If,  for  instance,  several  edgels  along  the 
same  intensity  edge  are  left-occluding  edgels,  then 
the  other  edgels  along  the  edge  are  highly  likely  to 
be  left-occluding.  This  suggests  that  an  intensity 
edge can be efficiently tested for occlusion by testing
only a subset of its edgels. In our implementation,
we test only every k’th (typically k = 7) edgel
along  an  edge.  Of  these,  we  discard  all  edgels  discov¬ 
ered  to  be  unmeasurable.  If  more  than  20%  of  the 
remaining  tested  edgels  are  determined  to  be  right- 
Figure 2: A left-occluding edge (e) may not be detectable
using only the match goodnesses for the left
image’s P and N windows. Here the regions imaged
in the left P and N are also viewable in the right
image. The clue that e is a left-occluding edge is
that there is a gap between the right image’s match
to P and its match to N. The width of this gap is
denoted by dpN.

(left-)occluding,  then  the  edge  is  classified  as  a  right- 
(left-)occluding  edge.  If  both  the  number  of  left- 
occluding  edgels  and  the  number  of  right-occluding 
edgels  exceed  20%,  the  edge  is  classified  according 
to  whichever  has  the  most  votes. 
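
The sparse voting scheme above can be sketched as follows. This is a hedged illustration: `classify_edge` and `diagnose_edgel` are hypothetical names, the label strings are chosen here for convenience, and two cases the text does not specify are filled in as labeled assumptions (an edge with no measurable samples is reported as unmeasurable, and an exact tie between left and right votes goes to right-occluding).

```python
def classify_edge(edgels, diagnose_edgel, k=7, min_fraction=0.2):
    """Classify a whole edge by testing only every k-th edgel.

    edgels: ordered list of edgels along one intensity edge.
    diagnose_edgel: callable returning "left-occ", "right-occ", "marking",
        "not-visible", or "unmeasurable" for a single edgel.
    """
    labels = [diagnose_edgel(e) for e in edgels[::k]]
    # Unmeasurable samples are discarded before voting.
    measured = [lab for lab in labels if lab != "unmeasurable"]
    if not measured:
        return "unmeasurable"  # assumption: nothing to vote on
    left = measured.count("left-occ")
    right = measured.count("right-occ")
    needed = min_fraction * len(measured)
    left_ok = left > needed
    right_ok = right > needed
    if left_ok and right_ok:
        # Both exceed the 20% bar: the larger vote wins (tie: assumption).
        return "left-occ" if left > right else "right-occ"
    if left_ok:
        return "left-occ"
    if right_ok:
        return "right-occ"
    return "non-occluding"
```

Sorting the edges by decreasing length before calling this function gives the anytime behaviour mentioned earlier, since the longest edges are classified first.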

6  Experimental  results 

The  above  algorithm  was  applied  to  the  stereo  pairs 
in  Figures  4,  5,  and  6.  Vertical  edges  were  extracted 
by chaining together pixels at which the absolute
value of the intensity x-derivative exceeded 6 and
was a local maximum. Chains of length < 8 were
discarded.  The  remaining  edges  are  shown  in  the 
figures’  third  column. 
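
The per-row part of this edge extraction can be sketched as below. The paper does not say which derivative operator was used, so the central difference here is an assumption, as are the function name and the rule that keeps only the rightmost pixel of a flat-topped derivative peak. Chaining edgels across rows and discarding short chains are not shown.

```python
def row_edgels(row, threshold=6.0):
    """Return the columns in one image row that qualify as vertical edgels."""
    n = len(row)
    dx = [0.0] * n
    # Assumption: central-difference estimate of the x-derivative.
    for x in range(1, n - 1):
        dx[x] = (row[x + 1] - row[x - 1]) / 2.0
    edgels = []
    for x in range(1, n - 1):
        mag = abs(dx[x])
        # Keep pixels whose derivative magnitude exceeds the threshold and
        # is a local maximum along the row.
        if mag > threshold and mag >= abs(dx[x - 1]) and mag > abs(dx[x + 1]):
            edgels.append(x)
    return edgels
```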

The  edges  were  classified  using  the  incremental  test 
strategy  described  in  the  previous  section.  This 
strategy  resulted  in  a  significant  reduction  of  com¬ 
putation;  for  each  of  the  240  x  244  pairs  in  Figures 
4-6,  the  number  of  edgels  tested  for  occlusion  was 
between  300  and  500,  which  is  much  less  than  the 
total  number  of  either  pixels  or  edgels  in  the  images. 
The parameter values used to determine the width
of the match windows were δ = 6, ε = 1.1, and m =
13. Using these parameters on images that were
244 columns wide, the median width of the resulting
windows selected in these images was between 15
and 20 pixels, while the average width was 25 pixels.
The match windows were 2v + 1 pixels high, and v
was 3.

The parameter values used to diagnose edgels were
τ = 0.75 and A = 3. The m, τ, and A parameters
are quite robust and are unlikely to require modification
from image to image. The δ and ε parameters,
which are thresholds based on measured derivative
magnitudes, may require modification depending on
the sharpness of the image. These were selected by
examining the measured derivatives.

The  identified  occluding  edges  are  displayed  in  the 
rightmost  column  of  Figures  4-6.  Right-occluding 
edges  are  displayed  as  solid  lines,  while  left-occluding 
edges  are  dotted  lines.  The  figures  show  that  most 
surface  markings  and  edges  that  are  too  distant  to 
have  significant  stereo  disparity  are  filtered  out  by 
the  occlusion  detector,  and  that  it  succeeds  in  de¬ 
tecting  many  of  the  true  occluding  edges,  such  as 
the  right-occluding  edges  created  by  the  left  edges 
of  Figure  4’s  detergent  box,  mug,  and  decahedron, 
and  the  left  edges  of  the  chair  seat  and  chair  back  in 
Figure  5. 

Note  that  in  some  cases,  significant  vertical  occlud¬ 
ing  edges  were  not  labeled  as  occluding  edges.  No¬ 
tably  missing  in  Figure  4  are  the  top  section  of  the 
detergent  box’s  left  edge,  the  detergent  box’s  right 
edge,  both  edges  of  the  cylinder,  the  left  edge  of  the 
tape  dispenser,  and  the  right  edge  of  the  decahedron. 
These  edges  were  unmeasurable  because  on  one  side 
of  them,  no  matching  window  could  be  selected;  the 
border  of  the  image  was  encountered  before  another 
significant  edge  could  be  found  that  could  serve  as 
the  match  window  border. 

7  False  positives 

Inspection  of  the  results  in  Figures  4-6  reveals  some 
falsely  detected  occlusions.  There  are  three  causes  of 
these  false  positives.  The  first  is  a  problem  with  all 
correlation-based  matching  algorithms  —  they  fail 
to  obtain  good  matches  on  curved  surfaces  or  sur¬ 
faces  that  slope  sharply  away  from  the  stereo  base¬ 
line.  Therefore,  a  surface  marking  on  the  mug  in 
Figure  4  and  the  nearest  edge  of  the  chair  back  in 
Figure  5  are  falsely  classified  as  occluding  edges,  and 
the  right  side  of  the  white  box  in  Figure  6  is  misclas- 
sified  as  a  left-occluding  edge. 

The second, relatively minor, cause of some false
positives is failure to detect the proper terminating
edge of the Pl or Nl window. This occurred for the
lower right quadrant of the circle on the detergent
box in Figure 4. The portion of the detergent box’s
right edge that is to the right of these edgels does
not give rise to a significant image gradient. As a
result, their Nl windows are extended to the right
edge of the box behind the detergent box. Since
this distance is not the same in the right image, the
Nl match fails and the edgels are diagnosed as
left-occluding.

The third cause of false positives is fairly important
and is most clearly illustrated by the wall socket that
appears near the chair back in Figure 5. Its left edge
Figure 3: An occluding edge in the foreground may
cause  a  surface  marking  to  be  incorrectly  diagnosed 
as  an  occluding  edge.  Here,  for  surface  marking  e, 
the  area  imaged  in  the  left  image’s  N  window  is 
not  visible  in  the  right  image  due  to  a  closer  sur¬ 
face.  As  a  result,  Cn  is  small  and  the  classification 
procedure  described  in  Table  1  will  diagnose  e  as  a 
left-occluding  edge. 

is  not  an  occluding  edge,  yet  is  detected  as  such.  As 
illustrated  in  Figure  3,  this  occurs  because  no  match 
can  be  found  for  the  Nl  windows  emanating  from 
edgels  on  the  left  edge  of  the  socket.  No  match  can 
be  found  because  that  region  in  the  right  image  is 
occluded  by  the  chair  back.  Thus,  surface  markings 
that  project  to  image  locations  near  the  projections 
of  closer  occluding  edges  may  be  themselves  incor¬ 
rectly  diagnosed  as  occluding  edges.  Other  examples 
of  this  appear  to  the  left  of  the  left  edge  of  the  chair 
seat  in  Figure  5.  A  heuristic  for  eliminating  these 
false  positives  via  a  postprocessing  step  is  described 
in  [Wixson,  1993]. 

8  Conclusion 

This  paper  has  described  an  occlusion  detection  al¬ 
gorithm  that  does  not  require  dense  correspondence 
estimates and hence can be applied selectively. Currently,
its output is rather qualitative; it classifies
intensity  edges  as  either  left-  or  right-occluding,  but 
does  not  estimate  the  size  of  the  depth  discontinu¬ 
ity  across  the  edge,  which  would  provide  a  hint  as 
to  the  size  of  the  occluded  region.  However,  the  al¬ 
gorithm  could  be  augmented  to  estimate  the  depth 
discontinuity  magnitude  by  computing  the  relative 
disparity  of  non-occluded  regions  near  each  side  of 
an  occluding  edge. 
Figure 6: “Desk” stereo pair, extracted edges, and detected occluding edges.
Abstract

Recovering  the  shape  of  an  object  from  two 
views,  or  stereo,  fails  at  occluding  contours 
of  smooth  objects  because  the  extremal  con¬ 
tours  are  view  dependent.  For  three  or  more 
views,  shape  recovery  is  possible,  and  several 
algorithms  have  recently  been  developed  for 
this  purpose.  We  present  a  new  approach  to 
the  multiframe  shape  recovery  problem  which 
does  not  depend  on  differential  measurements 
in  the  image,  which  may  be  noise  sensitive.  In¬ 
stead,  we  use  a  linear  smoother  to  optimally 
combine  all  of  the  measurements  available  at 
the  contours  (and  other  edges)  in  all  of  the 
images.  This  allows  us  to  extract  a  robust 
and  dense  estimate  of  surface  shape,  and  to 
integrate  shape  information  from  both  surface 
markings  and  occluding  contours. 

1  Introduction 

Most  visually-guided  systems  require  representations  of 
surfaces  in  the  environment  in  order  to  integrate  sens¬ 
ing,  planning,  and  action.  The  task  considered  in  this 
paper  is  the  recovery  of  3D  structure  (shape)  of  objects 
with  piecewise-smooth  surfaces  from  a  sequence  of  pro¬ 
files  taken  with  known  camera  motion.  The  profile  (also 
known  as  the  extremal  boundary  or  occluding  contour) 
is  defined  as  the  image  of  the  critical  set  of  the  projec¬ 
tion  map  from  the  surface  to  the  image  plane.  Since 
profiles  are  general  curves  in  the  plane  without  distin¬ 
guished  points,  there  is  no  a  priori  pointwise  correspon¬ 
dence  between  these  curves  in  different  views.  However, 
given  the  motion,  there  is  a  correspondence  based  on  the 
epipolar  constraint.  For  two  images,  i.e.,  classical  stereo, 
this  epipolar  constraint  is  a  set  of  straight  lines.  These 
lines  are  the  intersection  of  the  epipolar  planes  with  the 
image  plane.  The  epipolar  plane  through  a  point  is  de¬ 
termined  by  the  view  direction  at  that  point  and  the 
camera  translation  direction. 

In  the  case  of  contours  that  are  not  view  depen¬ 
dent,  e.g.,  creases  (tangent  discontinuities)  and  sur¬ 
face  markings,  many  techniques  have  been  developed 
for recovering the 3D contour locations from two or
more images under known camera motion [Marr and
Poggio, 1979; Mayhew and Frisby, 1980; Arnold, 1983;
Bolles et al., 1987; Baker and Bolles, 1989; Matthies
et al., 1989]. However, for smooth curved surfaces the
critical set which generates the profile is different for
each view. Thus, the triangulation applied in two-frame
stereo will not be correct along the occluding contour
for smooth surfaces. For the same reason, it is not
possible to determine the camera motion from the images
unless some assumptions are made either about
the surface or the motion [Arborgast and Mohr, 1992;
Giblin et al., 1992]. On the other hand, the fact that the
critical sets sweep out an area means that the connectivity
of the surface points can be determined, i.e. one
obtains a surface patch rather than a set of points.

The  problem  of  reconstructing  a  smooth  surface  from 
its profiles has been explored for known planar motion
by Giblin and Weiss [1987] and subsequently for
more  general  known  motion  by  Vaillant  [Vaillant,  1990; 
Vaillant  and  Faugeras,  1992]  and  Cipolla  and  Blake 
[Blake  and  Cipolla,  1990;  Cipolla  and  Blake,  1990; 
Cipolla and Blake, 1992]. These approaches are based
on  a  differential  formulation  and  analysis  or  use  three 
frames.  Unfortunately,  determining  differential  quan¬ 
tities  reliably  in  real  images  is  difficult.  This  has  led 
Cipolla  and  Blake  to  use  relative  measurements  in  order 
to  cancel  some  of  the  error  due  to  inadvertent  camera  ro¬ 
tation.  Their  approach  used  B-snakes  which  require  ini¬ 
tialization  for  each  contour  that  is  tracked.  In  addition, 
B-snakes  implicitly  smooth  the  contours  in  the  image. 
Since  the  recovery  of  3D  points  is  a  linear  problem,  the 
smoothing  can  be  done  in  3D  on  the  surface  where  more 
context  can  be  used  in  the  detection  of  discontinuities, 
so  that  detailed  structure  can  be  preserved. 

To  overcome  these  limitations,  the  approach  we  de¬ 
velop  in  this  paper  applies  estimation  theory  (Kalman 
filtering  and  smoothing)  to  make  optimal  use  of  each 
measurement  without  computing  differential  quantities. 
First, we derive a linear set of equations between the
unknown  shape  (surface  point  positions  and  radii  of  cur¬ 
vature)  and  the  measurements.  We  then  develop  a  ro¬ 
bust  linear  smoother  ([Gelb,  1974;  Bierman,  1977])  to 
compute  statistically  optimal  current  and  past  estimates 
from  the  set  of  contours.  Smoothing  allows  us  to  com¬ 
bine  measurements  on  both  sides  of  each  surface  point. 
Our technique produces a complete surface description,
i.e., a network of linked 3D surface points, which
provides  us  with  a  much  richer  description  than  just  a 
set  of  3D  curves.  Due  to  self-occlusion  and  occlusion 
by  other  surfaces,  some  parts  of  the  surface  may  never 
appear  on  the  profile.  Since  the  method  presented  here 
also  works  for  arbitrary  surface  markings  and  creases,  a 
larger  part  of  the  surface  can  be  reconstructed  than  from 
occluding  contours  of  the  smooth  pieces  alone.  Our  ap¬ 
proach  also  addresses  the  difficult  problem  of  contours 
that  merge  and  split  in  the  image,  which  must  be  re¬ 
solved  if  an  accurate  and  complete  3D  surface  model  is 
to  be  constructed. 

The  method  we  develop  has  applications  in  many  ar¬ 
eas  of  computer  vision,  computer  aided  design,  and  vi¬ 
sual  communications.  The  most  traditional  application 
of  visually  based  shape  recovery  is  in  the  reconstruction 
of a mobile robot’s environment, which allows it to perform
obstacle avoidance and planning tasks [Curwen et
al., 1992].

Our  paper  is  structured  as  follows.  We  begin  in  Sec¬ 
tion  2  with  a  description  of  our  edge  detection,  contour 
linking,  and  edge  tracking  algorithms.  In  Section  3,  we 
discuss  the  estimation  of  the  epipolar  plane  for  a  se¬ 
quence  of  three  or  more  views.  Section  4  presents  the 
linear  measurement  equations  which  relate  the  edge  po¬ 
sitions  in  each  image  to  the  parameters  of  the  circular  arc 
being  fitted  at  each  surface  point.  Section  5  then  reviews 
robust  least  squares  techniques  for  recovering  the  shape 
parameters  and  discusses  their  statistical  interpretation. 
Section  6  shows  how  to  extend  least  squares  to  a  time- 
evolving  system  using  the  Kalman  filter,  and  develops 
the  requisite  forward  mapping  (surface  point  evolution) 
equations.  Section  7  extends  the  Kalman  filter  to  the  lin¬ 
ear  smoother,  which  optimally  refines  and  updates  pre¬ 
vious  surface  point  estimates  from  new  measurements. 
Section  8  presents  a  series  of  experiments  performed  both 
on  noisy  synthetic  contour  sequences  and  on  real  video 
images.  We  close  with  a  discussion  of  the  performance 
of  our  new  technique  and  a  discussion  of  future  work. 

2  Contour  detection  and  tracking 

The  problem  of  edge  detection  has  been  extensively 
studied  in  computer  vision  [Marr  and  Hildreth,  1980; 
Canny,  1986].  The  choice  of  edge  detector  is  not  cru¬ 
cial  in  our  application,  since  we  are  interested  mostly  in 
detecting  strong  edges  such  as  occluding  contours  and 
visible surface markings.¹ For our system, we have chosen
the steerable filters developed by Freeman and Adelson
[1991], since they provide good angular resolution at
moderate  computation  cost,  and  since  they  can  find  both 
step  and  peak  edges.  An  example  of  our  edge  detector 
operating on the input image in Figure 1a is shown in
Figure 1b.

Once  discrete  edgels  have  been  detected,  we  use  local 
search  to  link  the  edgels  into  contours.  We  find  the  two 


¹ Unlike many edge detection applications, however, our
system does provide us with a quantitative way to measure
the  performance  of  an  edge  detector,  since  we  can  in  many 
cases  measure  the  accuracy  of  our  final  3D  reconstruction. 


neighbors  of  each  edgel  based  on  proximity  and  conti¬ 
nuity  of  orientation.  Note  that  in  contrast  to  some  of 
the previous work in reconstruction from occluding contours
[Cipolla and Blake, 1990; Cipolla and Blake, 1992],
we  do  not  fit  a  smooth  parametric  curve  to  the  contour 
since  we  wish  to  directly  use  all  of  the  edgels  in  the  shape 
reconstruction,  without  losing  detail. 

We  then  use  the  known  epipolar  lines  (Section  3)  to 
find the best matching edgel in the next frame. Our technique
compares all candidate edgels within the epipolar
line search range (defined by the expected minimum and
maximum depths), and selects the one which matches
most closely in orientation (see Figure 1c).
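
The selection step can be illustrated with a short sketch. It assumes the candidate edgels inside the epipolar search range have already been collected, and that orientations are angles in radians compared modulo a full turn; the function names and the tuple layout are hypothetical and not taken from the paper.

```python
import math


def angular_difference(a, b):
    """Smallest absolute difference between two angles, in radians."""
    d = abs(a - b) % (2.0 * math.pi)
    return min(d, 2.0 * math.pi - d)


def best_orientation_match(theta, candidates):
    """Pick the candidate edgel whose orientation is closest to theta.

    candidates: list of (x, y, orientation) tuples lying within the
    epipolar search range. Returns None if the list is empty.
    """
    if not candidates:
        return None
    return min(candidates, key=lambda c: angular_difference(theta, c[2]))
```

If edge orientation is treated as unsigned, so that an edge and its reverse count as the same direction, the wrap-around period would be half a turn instead.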

Since  contours  are  maintained  as  a  list  of  discrete 
points,  it  is  necessary  to  resample  the  edge  points  in 
order  to  enforce  the  epipolar  constraint  on  each  track. 
We  occasionally  start  new  tracks  if  there  is  a  sufficiently 
large  (2  pixel  wide)  gap  between  successive  samples  on 
the  contour.  While  we  do  not  operate  directly  on  the 
spatiotemporal  volume,  our  tracking  and  contour  linking 
processes  form  a  virtual  surface  similar  to  the  weaving 
wall  technique  of  Harlyn  Baker  [1989].  Unlike  Baker’s 
technique,  however,  we  do  not  assume  a  regular  and 
dense  sampling  in  time. 

3  Reconstructing  surface  patches 

The  surface  being  reconstructed  from  a  moving  camera 
can  be  parametrized  in  a  natural  way  by  two  families 
of  curves  [Giblin  and  Weiss,  1987;  Cipolla  and  Blake, 
1990]: one family consists of the critical sets on the
surface;  the  other  is  tangent  to  the  family  of  rays  from 
the  camera  focal  points.  The  latter  curves  are  called 
epipolar  curves.  The  problem  is  that  any  smooth  surface 
reconstruction  algorithm  which  is  more  than  a  first  order 
approximation requires at least three images and that,
in general, the three corresponding tangent rays will not
be coplanar. However, there are many cases when this
will  be  a  good  approximation.  One  such  case  is  when 
the  camera  trajectory  is  almost  linear. 

Cipolla and Blake [1990; 1992] and Vaillant and
Faugeras  [1990;  1992]  noticed  that  to  compute  the  cur¬ 
vature  of  a  planar  curve  from  three  tangent  rays,  one  can 
determine  a  circle  which  is  tangent  to  these  rays.  The 
assumption  that  one  needs  to  make  is  that  the  surface  re¬ 
mains  on  the  same  side  of  the  tangent  rays.  This  is  true 
for  intervals  of  the  curve  which  do  not  have  inflections. 
Note  that  for  the  reconstruction  of  opaque  surfaces,  the 
epipolar  curve  on  the  surface  ends  at  an  inflection  be¬ 
cause  the  critical  set  disappears  from  view.  This  gener¬ 
ally  corresponds  to  a  cusp  of  the  profile.  In  addition,  the 
epipolar  curve  can  end  where  the  normal  to  the  profile 
is  parallel  to  the  instantaneous  axis  of  rotation  or  where 
the  critical  set  is  occluded  as  at  a  T-junction  [Giblin  and 
Weiss,  1993]. 
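
The three-ray construction can be made concrete with a small sketch. It is an illustration of the geometric idea only, not the measurement equations developed in Section 4. The assumptions: each tangent ray is already expressed in the reconstruction plane as a line n . x = d with unit normal n, and the normals are oriented so that the circle lies on the positive side of every line, which is the same-side assumption stated above. Tangency then reads n . c - r = d for centre c and radius r, which is linear in the three unknowns. All names are hypothetical.

```python
def det3(m):
    """Determinant of a 3x3 matrix given as a list of three rows."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)


def circle_tangent_to_lines(lines):
    """Return ((cx, cy), r) for the circle tangent to three lines.

    lines: three ((nx, ny), d) pairs with unit normals pointing towards
    the side on which the circle lies.
    """
    if len(lines) != 3:
        raise ValueError("exactly three lines are required")
    # Each row encodes nx*cx + ny*cy - r = d.
    a = [[nx, ny, -1.0] for (nx, ny), _ in lines]
    b = [d for _, d in lines]
    det = det3(a)
    if abs(det) < 1e-12:
        raise ValueError("degenerate configuration (e.g. parallel lines)")
    solution = []
    for col in range(3):  # Cramer's rule, one unknown per column
        m = [row[:] for row in a]
        for r in range(3):
            m[r][col] = b[r]
        solution.append(det3(m) / det)
    return (solution[0], solution[1]), solution[2]
```

With more than three rays the same linear equations can be stacked and solved in the least-squares sense, which is the direction the paper takes with its linear smoother.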

Given  three  or  more  edgels  tracked  with  our  technique, 
we  would  like  to  compute  the  location  of  the  surface  and 
its  curvature  by  fitting  a  circular  arc  to  the  lines  de¬ 
fined  by  the  view  directions  at  those  edgels.  In  general, 
a  space  curve  will  have  a  unique  circle  which  is  closest 
to  the  curve  at  any  given  point.  This  is  called  the  os¬ 
culating  circle,  and  the  plane  of  this  circle  is  called  the 

Figure  1:  Input  processing:  (a)  sample  input  image  (dodecahedral  puzzle),  (b)  estimated  edgels  and  orientations
(maxima  in  |Gri|^),  (c)  tracked  edgels,  (d)  correspondence  of  points  on  the  occluding  contours  using  the  epipolar 
constraint. 
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Figure  2:  Local  coordinate  axes  and  circle  center  point 
calculation 

osculating  plane.  It  is  easy  to  see  that  the  epipolar  plane 
is  an  estimate  of  the  osculating  plane  [Cipolla  and  Blake, 
1992],  and  the  lines  defined  by  the  view  directions  can 
be  projected  onto  this  plane. 

The  relationship  between  the  curvature  of  a  curve  such 
as  the  epipolar  curve  and  the  curvature  of  the  surface 
is  determined  by  the  angle  between  the  normal  to  the 
curve  in  the  osculating  plane  and  the  normal  to  the  sur¬ 
face.  The  curvature  of  the  curve  scaled  by  the  cosine  of 
this  angle  is  the  normal  curvature.  Meusnier’s  Theorem 
says  that  the  normal  curvature  is  the  same  for  all  curves 
on  the  surface  with  a  given  tangent  direction.  Thus,  if 
one  were  to  project  the  epipolar  curve  onto  the  plane 
spanned  by  the  view  direction  and  the  normal  to  the 
surface,  this  would  give  the  normal  curvature.  Instead, 
we  do  our  fitting  in  the  epipolar  plane;  we  can  always 
recover  the  normal  curvature,  if  desired,  by  Meusnier’s 
Theorem. 
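
As a compact restatement of the relation just described (the symbols $\kappa$, $\kappa_n$ and $\phi$ are introduced here for illustration and are not the paper's notation): let $\kappa$ be the curvature of the epipolar curve, $\phi$ the angle between the curve normal in the osculating plane and the surface normal, and $r = 1/\kappa$ the radius of the fitted circle. Then the normal curvature is

```latex
\kappa_n = \kappa \cos\phi = \frac{\cos\phi}{r}
```

so a radius estimated in the epipolar plane can be converted to a normal curvature once $\phi$ is known.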

4  Measurement  equations 

Once  we  have  selected  the  reconstruction  plane  to  be  the 
epipolar  plane  for  fitting  the  circular  arc,  we  must  com¬ 
pute  the  set  of  lines  in  this  plane  which  should  be  tangent 
to  the  circle.  This  can  be  done  either  by  projecting  the 
3D  lines  corresponding  to  the  linked  edgels  directly  onto 
the  plane,  or  by  intersecting  the  tangent  planes  (defined 
by  the  edgels  and  their  orientations)  with  the  reconstruc¬ 
tion  plane. 

We represent the 3D line corresponding to an edgel
in frame $i$ by a 3D point $\mathbf{q}_i$ (say, where the viewing ray
hits a reference z plane) and a direction
$\hat{\mathbf{t}}_i = \mathcal{N}(\mathbf{q}_i - \mathbf{c}_i)$,
where $\mathbf{c}_i$ is the camera center and $\mathcal{N}()$ normalizes a
vector. We choose one of these lines as the reference
frame $(\hat{\mathbf{n}}_0, \hat{\mathbf{t}}_0)$ centered at $\mathbf{q}_0$
(where $\hat{\mathbf{n}}_0 = \hat{\mathbf{t}}_0 \times \hat{\mathbf{n}}_{epi}$),
e.g., by selecting the middle of $n$ frames for a batch fit,
or the last frame for a Kalman filter. This line lies in
the reconstruction plane defined by $\hat{\mathbf{n}}_{epi}$.

If we parameterize the osculating circle by its center
$\mathbf{c} = (x_c, y_c)$ and radius $r$ (Figure 2), we find that the
tangency condition between line $i$ and the circle can be
written as

$c_i x_c + s_i y_c + r = d_i$  (1)

where $c_i = \hat{\mathbf{t}}_i \cdot \hat{\mathbf{t}}_0$,
$s_i = -\hat{\mathbf{t}}_i \cdot \hat{\mathbf{n}}_0$, and
$d_i = (\mathbf{q}_i - \mathbf{q}_0) \cdot \hat{\mathbf{n}}_i$.
Thus, we have a linear estimation problem in the quantities
$(x_c, y_c, r)$ given the known measurements $(c_i, s_i, d_i)$.
This linearity is central to the further developments in
the paper, including the least squares fitting, Kalman
filter, and linear smoother which we develop in the next
three sections.
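
As a minimal sketch of equation (1), the following Python fragment computes the coefficients $(c_i, s_i, d_i)$ for one tangent line and checks the tangency condition on a synthetic circle. The function names (`measurement_row`, `tangent_line`) and the test circle are illustrative assumptions, not part of the original description.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def measurement_row(q_i, t_i, q_0, t_0, n_epi):
    """Coefficients (c_i, s_i, d_i) of equation (1) for the line (q_i, t_i)."""
    n_0 = cross(t_0, n_epi)      # reference normal, n_0 = t_0 x n_epi
    n_i = cross(t_i, n_epi)      # line normal in the reconstruction plane
    c_i = dot(t_i, t_0)
    s_i = -dot(t_i, n_0)
    d_i = dot(sub(q_i, q_0), n_i)
    return c_i, s_i, d_i

def tangent_line(center, r, phi):
    """Hypothetical test helper: a line in the z = 0 plane tangent to a circle."""
    n_i = (math.cos(phi), math.sin(phi), 0.0)
    t_i = (-math.sin(phi), math.cos(phi), 0.0)   # chosen so that t_i x z = n_i
    # Any point on the tangent line will do; slide 0.7 units along t_i.
    q_i = tuple(c + r * n + 0.7 * t for c, n, t in zip(center, n_i, t_i))
    return q_i, t_i

# Synthetic check: reference frame at the origin, epipolar normal along z.
n_epi = (0.0, 0.0, 1.0)
q_0, t_0 = (0.0, 0.0, 0.0), (0.0, 1.0, 0.0)      # gives n_0 = (1, 0, 0)
x_c, y_c, r = -2.0, 0.5, 2.0                      # circle in local coordinates
residuals = []
for phi in (-0.2, 0.0, 0.3):
    q_i, t_i = tangent_line((x_c, y_c, 0.0), r, phi)
    c_i, s_i, d_i = measurement_row(q_i, t_i, q_0, t_0, n_epi)
    residuals.append(c_i * x_c + s_i * y_c + r - d_i)
```

For exact tangent lines every entry of `residuals` vanishes up to rounding, which is the tangency condition (1).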

5  Least  squares  fitting 

While  in  theory  the  equation  of  the  osculating  circle  can 
be  recovered  given  the  projection  of  three  non-parallel 
tangent  lines  onto  the  epipolar  plane,  a  much  more  re¬ 
liable  estimate  can  be  obtained  by  using  more  views. 
Given the set of equations (1), how can we recover
the best estimate for $(x_c, y_c, r)$? Regression theory [Albert,
1972; Bierman, 1977] tells us that the minimum
least squared error estimate of the system of equations
$A\mathbf{x} = \mathbf{d}$ can be found by minimizing

$e = |A\mathbf{x} - \mathbf{d}|^2 = \sum_i (\mathbf{a}_i \cdot \mathbf{x} - d_i)^2.$  (2)

This minimum can be found by solving the set of normal
equations^1

$(A^T A)\mathbf{x} = A^T \mathbf{d}$  (3)

or

$\left(\sum_i \mathbf{a}_i \mathbf{a}_i^T\right) \mathbf{x} = \sum_i d_i \mathbf{a}_i.$

A  statistical  justification  for  using  least  squares  will  be 
presented  shortly  (Section  5.1). 

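In our circle fitting case, $\mathbf{a}_i = (c_i, s_i, 1)$, $\mathbf{x} = (x_c, y_c, r)$,
and the normal equations are

$$\begin{bmatrix} \sum_i c_i^2 & \sum_i c_i s_i & \sum_i c_i \\ \sum_i c_i s_i & \sum_i s_i^2 & \sum_i s_i \\ \sum_i c_i & \sum_i s_i & \sum_i 1 \end{bmatrix} \begin{bmatrix} x_c \\ y_c \\ r \end{bmatrix} = \begin{bmatrix} \sum_i c_i d_i \\ \sum_i s_i d_i \\ \sum_i d_i \end{bmatrix} \quad (4)$$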

If we solve the above set of equations directly, the estimates
for $x_c$ and $r$ will be very highly correlated and
both will be highly unreliable (assuming the range of
viewpoints is not very large). This can be seen either by
examining Figure 2, where we see that the location of
$\mathbf{c}$ is highly sensitive to the exact values of the $d_i$, or by
computing the covariance matrix $P = (A^T A)^{-1}$ (Section 5.1).
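
The batch fit of equations (2) to (4) can be sketched as follows. This is an illustrative pure-Python version that accumulates the normal equations and solves them by Gaussian elimination; the names `solve` and `fit_center`, and the synthetic noise-free data, are assumptions made for the example. The footnoted alternatives (SVD, Householder transforms) would be preferable numerically in a real implementation.

```python
import math

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [list(row) + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for k in range(col, n + 1):
                M[r][k] -= f * M[col][k]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][k] * x[k] for k in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x

def fit_center(rows):
    """rows: iterable of (c_i, s_i, d_i); returns the estimate (x_c, y_c, r)."""
    A = [[0.0] * 3 for _ in range(3)]
    rhs = [0.0] * 3
    for c_i, s_i, d_i in rows:
        a = (c_i, s_i, 1.0)                  # a_i of equation (4)
        for j in range(3):
            rhs[j] += a[j] * d_i             # accumulates A^T d
            for k in range(3):
                A[j][k] += a[j] * a[k]       # accumulates A^T A
    return solve(A, rhs)

# Synthetic, noise-free tangent lines to a known circle.
truth = (-2.0, 0.5, 2.0)                     # (x_c, y_c, r)
rows = []
for deg in (-20, -10, 0, 10, 20):
    th = math.radians(deg)
    c_i, s_i = math.cos(th), math.sin(th)
    rows.append((c_i, s_i, c_i * truth[0] + s_i * truth[1] + truth[2]))
estimate = fit_center(rows)
```

With exact measurements the fit reproduces `truth`; the sensitivity discussed in the text only shows up once noise is added to the $d_i$.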

We cannot do much to improve the estimate of $r$ short
of using more frames or a larger camera displacement,
but we can greatly increase the reliability of our shape
estimate by directly solving for the surface point $(x_s, y_s)$,
where $x_s = x_c + r$ and $y_s = y_c$.^2 The new set of equations
is thus

$c_i x_s + s_i y_s + (1 - c_i) r = d_i.$  (5)

^1 Alternative techniques for solving the least squares problem
include singular value decomposition [Press et al., 1986]
and Householder transforms [Bierman, 1977].

^2 While the point $(x_s, y_s)$ will not in general lie on the line
$(\mathbf{q}_0, \hat{\mathbf{t}}_0)$, the tangent to the circle at $(x_s, y_s)$ will be parallel
to $\hat{\mathbf{t}}_0$.
While there is still some correlation between $x_s$ and $r$,
the estimate for $x_s$ is much more reliable (Section 5.1).
Once we have estimated $(x_s, y_s, r)$, we can convert this
estimate back to a 3D surface point,

$\mathbf{p}_0 = \mathbf{q}_0 + x_s \hat{\mathbf{n}}_0 + y_s \hat{\mathbf{t}}_0,$  (6)

a 3D center point

$\mathbf{c} = \mathbf{q}_0 + (x_s - r)\hat{\mathbf{n}}_0 + y_s \hat{\mathbf{t}}_0 = \mathbf{p}_0 - r\hat{\mathbf{n}}_0,$  (7)

or a surface point in some other frame $i$

$\mathbf{p}_i = \mathbf{c} + r\hat{\mathbf{n}}_i = \mathbf{p}_0 + r(\hat{\mathbf{n}}_i - \hat{\mathbf{n}}_0),$  (8)

where

$\hat{\mathbf{n}}_i = \hat{\mathbf{t}}_i \times \hat{\mathbf{n}}_{epi}$

is the osculating circle normal direction perpendicular to
line $l_i$ (Figure 2).
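
The conversions (6) to (8) are simple vector sums. The sketch below writes them out with plain tuples; the function names and the numeric values are illustrative assumptions.

```python
import math

def add_scaled(p, *terms):
    """Return p + sum(k * v) over the given (k, v) pairs."""
    out = list(p)
    for k, v in terms:
        out = [o + k * x for o, x in zip(out, v)]
    return tuple(out)

def surface_point(q_0, n_0, t_0, x_s, y_s):
    return add_scaled(q_0, (x_s, n_0), (y_s, t_0))       # equation (6)

def center_point(p_0, n_0, r):
    return add_scaled(p_0, (-r, n_0))                     # equation (7)

def surface_point_in_frame(p_0, n_0, n_i, r):
    return add_scaled(p_0, (r, n_i), (-r, n_0))           # equation (8)

# Example values (arbitrary): local frame axes and a local estimate.
q_0 = (1.0, 2.0, 3.0)
n_0, t_0 = (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)
x_s, y_s, r = 0.1, -0.2, 2.0

p_0 = surface_point(q_0, n_0, t_0, x_s, y_s)
c = center_point(p_0, n_0, r)
phi = 0.3                                   # rotation of frame i within the plane
n_i = (math.cos(phi), math.sin(phi), 0.0)
p_i = surface_point_in_frame(p_0, n_0, n_i, r)
```

Both `p_0` and `p_i` lie at distance `r` from `c`, as expected for points on the osculating circle.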

5.1  Statistical  interpretation 

The  least  squares  estimate  is  also  the  minimum  variance 
and  maximum  likelihood  estimate  (optimal  statistical  es¬ 
timate)  under  the  assumption  that  each  measurement 
is  contaminated  with  additive  Gaussian  noise  [Bierman, 
1977]. If each measurement has a different variance $\sigma_i^2$,
we must weight each term in the squared error measure
(2) by $w_i = \sigma_i^{-2}$ or, equivalently, multiply each equation
$\mathbf{a}_i \cdot \mathbf{x} = d_i$ by $\sigma_i^{-1}$.

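In our application, the variance of $d_i$, $\sigma_i^2$, can be determined
by analyzing the edge detector output and computing
the angle between the edge orientation and the
epipolar line

$\sigma_i^2 = \sigma_e^2 (\hat{\mathbf{l}}_i \cdot \hat{\mathbf{n}}_{epi})^2 = \sigma_e^2 (1 - (\hat{\mathbf{m}}_i \cdot \hat{\mathbf{n}}_{epi})^2),$

where $\hat{\mathbf{l}}_i$ is the edge orientation and $\sigma_e^2$ is the variance of
$\mathbf{q}_i$ along the surface normal $\hat{\mathbf{m}}_i$. This statistical model makes
sense if the measurements $d_i$ are noisy and the other parameters $(c_i, s_i)$ are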
noise-free.  This  is  a  reasonable  assumption  in  our  case, 
since  the  camera  positions  are  known  but  the  edgel  loca¬ 
tions  are  noisy.  The  generalization  to  uncertain  camera 
locations  is  left  to  future  work. 

When using least squares, the covariance matrix of the
estimate can be computed from $P = (A^T A)^{-1}$. We can
perform a simple analysis of the expected covariances
for $n$ measurements spaced $\theta$ apart. Using Taylor series
expansions for $c_i = \cos\theta_i$ and $s_i = \sin\theta_i$, and assuming
that $i \in [-m \ldots m]$, $n = 2m+1$, we obtain, to leading order
in $\theta$, the covariance matrices

$$P_c = \begin{bmatrix} 6\theta^{-4} & 0 & -6\theta^{-4} \\ 0 & \tfrac{1}{2}\theta^{-2} & 0 \\ -6\theta^{-4} & 0 & 6\theta^{-4} \end{bmatrix}, \qquad P_s = \begin{bmatrix} 1 & 0 & -2\theta^{-2} \\ 0 & \tfrac{1}{2}\theta^{-2} & 0 \\ -2\theta^{-2} & 0 & 6\theta^{-4} \end{bmatrix},$$

where $P_c$ is the 3 point covariance for the center-point
formulation, and $P_s$ is the 3 point covariance for the
surface-point formulation. As we can see, the variance of
the surface point local $x$ estimate is four orders of magnitude
smaller than that of the center point. Similar results
hold for the overdetermined case ($n > 3$). Extending
the analysis to the asymmetrical case, $i \in [0 \ldots 2m]$,
we observe that the variance of the $x_s$ and $y_s$ estimates
increases.
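
The difference between the two formulations can be checked numerically. The sketch below evaluates $P = (A^T A)^{-1}$ exactly for three views at angles $(-\theta, 0, \theta)$ with unit measurement variance; the function names and the choice of $\theta = 5°$ are assumptions made for illustration.

```python
import math

def normal_matrix(rows):
    """A^T A for a list of 3-vectors a_i."""
    return [[sum(a[j] * a[k] for a in rows) for k in range(3)]
            for j in range(3)]

def inverse3(M):
    """Inverse of a 3x3 matrix via its adjugate."""
    (a, b, c), (d, e, f), (g, h, i) = M
    det = a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)
    adj = [[e * i - f * h, c * h - b * i, b * f - c * e],
           [f * g - d * i, a * i - c * g, c * d - a * f],
           [d * h - e * g, b * g - a * h, a * e - b * d]]
    return [[v / det for v in row] for row in adj]

def covariances(theta):
    """3-view covariances (center form, surface form) for unit noise."""
    angles = (-theta, 0.0, theta)
    center_rows = [(math.cos(t), math.sin(t), 1.0) for t in angles]
    surface_rows = [(math.cos(t), math.sin(t), 1.0 - math.cos(t))
                    for t in angles]
    return (inverse3(normal_matrix(center_rows)),
            inverse3(normal_matrix(surface_rows)))

theta = math.radians(5.0)
P_c, P_s = covariances(theta)
var_x_center = P_c[0][0]      # grows like 6 / theta**4 for small theta
var_x_surface = P_s[0][0]     # stays at 1 regardless of theta
```

For a 5° spacing the center-point variance in local $x$ is on the order of $10^5$ times the surface-point variance, which is why the fit is carried out in the surface-point parameters.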


5.2  Robustifying  the  estimate 

To further improve the quality and reliability of our estimates,
we can apply robust statistics to reduce the effects
of outliers (grossly erroneous measurements) [Huber,
1981]. Many robust techniques are based on first
computing residuals, $r_i = d_i - \mathbf{a}_i \cdot \mathbf{x}$, and then
reweighting the data by a monotonic function

$(\sigma_i')^{-2} = \sigma_i^{-2} g(|r_i|)$

or throwing out measurements whose $|r_i| \gg \sigma_i$. Alternatively,
least median squares can also be used to compute
a robust estimate, but at an increased complexity.

In  our  application,  outliers  occur  mainly  from  gross 
errors  in  edge  detection  (e.g.,  when  adjacent  edges  inter¬ 
fere)  and  from  errors  in  tracking.  Currently,  we  compute 
residuals  after  each  batch  fit,  and  keep  only  those  mea¬
surements  whose  residuals  fall  below  a  fixed  threshold.
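
The thresholding scheme described above (fit, compute residuals, keep the measurements below a fixed threshold, refit) might look as follows. This is a sketch under assumptions: the names `robust_fit` and `least_squares`, the threshold value, and the fallback when fewer than three measurements survive are all illustrative choices, not taken from the paper.

```python
import math

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [list(row) + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for k in range(col, n + 1):
                M[r][k] -= f * M[col][k]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][k] * x[k] for k in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x

def least_squares(rows):
    """rows: list of (a_i, d_i) with a_i a 3-vector; solves the normal equations."""
    A = [[sum(a[j] * a[k] for a, _ in rows) for k in range(3)] for j in range(3)]
    rhs = [sum(a[j] * d for a, d in rows) for j in range(3)]
    return solve(A, rhs)

def robust_fit(rows, threshold):
    """Fit, drop measurements with |residual| >= threshold, then refit."""
    x = least_squares(rows)
    kept = [(a, d) for a, d in rows
            if abs(d - sum(ai * xi for ai, xi in zip(a, x))) < threshold]
    if len(kept) < 3:
        return x, kept        # assumed fallback: too few inliers to refit
    return least_squares(kept), kept

# Seven exact surface-form measurements (equation (5)), one of them corrupted.
truth = (0.0, 0.5, 2.0)       # (x_s, y_s, r)
rows = []
for deg in range(-15, 16, 5):
    th = math.radians(deg)
    a = (math.cos(th), math.sin(th), 1.0 - math.cos(th))
    rows.append((a, sum(ai * xi for ai, xi in zip(a, truth))))
a_mid, d_mid = rows[3]
rows[3] = (a_mid, d_mid + 0.5)            # gross error in the middle view
estimate, kept = robust_fit(rows, threshold=0.2)
```

In this example the corrupted measurement is the only one whose residual exceeds the threshold, so the refit uses the six remaining views.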

6  Kalman  filter 

The  Kalman  filter  is  a  powerful  technique  for  effi¬
ciently  computing  statistically  optimal  estimates  of  time-
varying  processes  from  series  of  noisy  measurements 
[Gelb,  1974;  Bierman,  1977;  Maybeck,  1979].  In  com¬ 
puter  vision,  the  Kalman  filter  has  been  applied  to  di¬ 
verse  problems  such  as  motion  recovery  [Rives  et  al., 
1986],  multiframe  stereo  [Matthies  et  al.,  1989],  and  pose 
recovery  [Lowe,  1991].  In  this  section,  we  develop  a 
Kalman  filter  for  contour-based  shape  recovery  in  two 
parts:  first,  we  show  how  to  perform  the  batch  fitting  of 
the  previous  section  incrementally;  second,  we  show  how 
surface  point  estimates  can  be  predicted  from  one  frame 
(and  reconstruction  plane)  to  another. 

The  update  part  of  the  Kalman  filter  is  derived  di¬ 
rectly  from  the  measurement  equation  (1)  [Gelb,  1974]. 
It  provides  an  incremental  technique  for  estimating 
quantities  in  a  static  system,  e.g.,  for  refining  a  set  of 
$(x_c, y_c, r)$  measurements  as  more  edgels  are  observed.
For  our  application,  however,  we  need  to  produce  a 
series  of  surface  points  which  can  be  linked  together 
into  a  complete  surface  description.  If  we  were  using 
batch  fitting,  we  would  perform  a  new  batch  fit  centered 
around  each  new  2D  edgel.  Instead,  we  use  the  complete 
Kalman  filter,  since  it  has  a  much  lower  computational 
complexity.  The  Kalman  filter  provides  a  way  to  deal 
with dynamic systems where the state $\mathbf{x}_i$ is evolving over
time. We identify each measurement $\mathbf{x}_i$ with the surface
point $(x_s, y_s, r)$ whose local coordinate frame is given by
$(\hat{\mathbf{n}}_i, \hat{\mathbf{t}}_i, \hat{\mathbf{n}}_{epi})$ centered at $\mathbf{q}_i$ in frame $i$. The equations for
the  prediction  part  of  the  Kalman  filter  are  derived  from 
the  mapping  equations  between  frames  (8)  [Gelb,  1974]. 
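
For the update part, a single scalar measurement $d = \mathbf{a} \cdot \mathbf{x} + \text{noise}$ leads to the standard Kalman measurement update. The sketch below applies it to the surface-point form of the measurement equation; the function name `kalman_update`, the diffuse prior, and the noise level are assumptions for illustration, and the prediction step between frames is not shown.

```python
import math

def kalman_update(x, P, a, d, sigma2):
    """One scalar-measurement update for d = a . x + noise (variance sigma2)."""
    Pa = [sum(P[j][k] * a[k] for k in range(3)) for j in range(3)]
    s = sum(a[j] * Pa[j] for j in range(3)) + sigma2     # innovation variance
    K = [v / s for v in Pa]                              # Kalman gain
    innovation = d - sum(a[j] * x[j] for j in range(3))
    x_new = [x[j] + K[j] * innovation for j in range(3)]
    # P - K a^T P, using the symmetry of P so that a^T P = (P a)^T.
    P_new = [[P[j][k] - K[j] * Pa[k] for k in range(3)] for j in range(3)]
    return x_new, P_new

# Incrementally refine (x_s, y_s, r) from exact measurements of a known circle.
truth = (0.0, 0.5, 2.0)
x = [0.0, 0.0, 0.0]                                      # uninformed start
P = [[1e6 if j == k else 0.0 for k in range(3)] for j in range(3)]
for deg in range(-30, 31, 10):
    th = math.radians(deg)
    a = (math.cos(th), math.sin(th), 1.0 - math.cos(th))
    d = sum(ai * xi for ai, xi in zip(a, truth))
    x, P = kalman_update(x, P, a, d, sigma2=1e-4)
```

After the measurements have been absorbed, `x` agrees with the batch solution up to the small bias introduced by the prior, and `P` carries the corresponding covariance.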

The  overall  sequence  of  processing  steps  is  therefore 
the following. Initially, we perform a batch fit to $n \geq 3$
frames,  using  the  last  frame  as  the  reference  frame.  Next, 
we  convert  the  local  estimate  into  a  global  3D  position 
(6)  and  save  it  as  part  of  our  final  surface  model.  Then, 
we  project  the  3D  surface  point  and  its  radius  onto  the 
next  frame,  i.e.,  into  the  frame  defined  by  the  next  2D 
edgel  found  by  the  tracker.*  Then,  we  update  the  state 

*  For  even  higher  accuracy,  we  could  use  the  2D  projection 
of  our  3D  surface  point  as  the  input  to  our  tracker. 




estimate  using  the  local  line  equation  and  the  Kalman 
filter  updating  equations.  We  repeat  the  above  process 
(except  for  the  batch  fit)  so  long  as  a  reliable  track  is 
maintained  (i.e.,  the  residuals  are  within  an  acceptable 
range).  If  the  track  disappears  or  a  robust  fit  is  not 
possible,  we  terminate  the  recursive  processing  and  wait 
until  enough  new  measurements  are  available  to  start  a 
new  batch  fit. 

7  Linear  smoothing 

The  Kalman  filter  is  most  commonly  used  in  control  sys¬ 
tems  applications,  where  the  current  estimate  is  used 
to  determine  an  optimal  control  strategy  to  achieve  a 
desired  system  behavior  [Gelb,  1974].  In  certain  appli¬ 
cations,  however,  we  may  wish  to  refine  old  estimates 
as new information arrives, or, equivalently, to use “future”
measurements to compute the best current estimate.
Our shape recovery application falls into this latter
category, since we wish to obtain the most accurate
estimate possible for the complete surface, and not just
the 3D curve corresponding to the currently visible occluding
contour.

The  generalization  of  the  Kalman  filter  to  update  pre¬ 
vious  estimates  is  called  the  linear  smoother  [Gelb,  1974]. 
The smoothed estimate of $\mathbf{x}_i$ based on all the measurements
between 0 and $N$ is denoted by $\hat{\mathbf{x}}_{i|N}$. Three kinds
of smoothing are possible [Gelb, 1974]. In fixed-interval
smoothing, the initial and final times 0 and $N$ are fixed,
and the estimate $\hat{\mathbf{x}}_{i|N}$ is sought, where $i$ varies from 0 to
$N$. In fixed-point smoothing, $i$ is fixed and $\hat{\mathbf{x}}_{i|N}$ is sought
as $N$ increases. In fixed-lag smoothing, $\hat{\mathbf{x}}_{N-L|N}$ is sought
as $N$ increases and $L$ is held fixed.

For  surface  shape  recovery,  both  fixed-interval  and 
fixed-lag  smoothing  are  of  interest.  Fixed-interval 
smoothing  is  appropriate  when  shape  recovery  is  per¬ 
formed  off-line  from  a  set  of  predetermined  measure¬ 
ments.  The  results  obtained  with  fixed-interval  smooth¬ 
ing  should  be  identical  to  those  obtained  with  a  series 
of  batch  fits,  but  at  a  much  lower  computational  cost. 
The  fixed-interval  smoother  requires  a  small  amount  of 
overhead  beyond  the  regular  Kalman  filter  in  order  to 
determine  the  optimal  combination  between  the  outputs 
of  a  forward  and  backward  Kalman  filter  [Gelb,  1974; 
Bierman,  1977]. 

For  our  contour-based  shape  recovery  algorithm,  we 
have  developed  a  new  fixed-lag  smoother,  which,  while 
sub-optimal,  fits  in  naturally  with  the  batch  and  Kalman 
filter  approaches  developed  in  the  previous  two  sections. 
Our  fixed-lag  smoother  begins  by  computing  a  centered 
batch fit to $n \geq 3$ frames. The surface point is then pre¬
dicted from frame $i-1$ to frame $i$ as with the Kalman fil¬
ter, and a new measurement from frame $i+L$, $L = \lfloor n/2 \rfloor$,
is  added  to  the  predicted  estimate.  The  addition  of  mea¬ 
surements  ahead  of  the  current  estimate  is  straightfor¬ 
ward  using  the  projection  equations  for  the  least-squared 
(batch)  fitting  algorithm. 
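
The order in which the fixed-lag smoother consumes frames can be made explicit with a small scheduling sketch. It assumes an odd window size so that the initial batch is exactly centered; that restriction, like the function name `fixed_lag_schedule`, is an assumption of this example rather than something stated in the text.

```python
def fixed_lag_schedule(num_frames, n):
    """Return (initial_batch_frames, steps) for an odd window size n.

    Each step is a pair (i, i + L): the estimate is predicted to frame i
    and the measurement from frame i + L is added, with L = n // 2.
    """
    if n < 3 or n % 2 == 0 or num_frames < n:
        raise ValueError("expects an odd n >= 3 and num_frames >= n")
    L = n // 2
    initial_batch = list(range(n))        # frames 0..n-1, centered on frame L
    steps = [(i, i + L) for i in range(L + 1, num_frames - L)]
    return initial_batch, steps

batch, steps = fixed_lag_schedule(num_frames=10, n=5)
```

With 10 frames and a window of 5, the batch fit covers frames 0 to 4 (centered on frame 2), and the last step estimates frame 7 using the measurement from frame 9.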

The  batch  fitting,  Kalman  filter,  and  linear  smoothers 
all  produce  a  series  of  surface  point  estimates,  one  for 
each  input  image.  Because  our  reconstruction  takes 
place  in  object  space,  features  such  as  surface  markings
and  sharp  ridges  are  stationary  in  3D  (and  have  r  =  0). 


For  these  features,  we  would  prefer  to  produce  a  sin¬ 
gle  time-invariant  estimate.  While  the  detection  of  sta¬ 
tionary  features  could  be  incorporated  into  the  Kalman 
filter  or  smoother  itself,  we  currently  defer  this  deci¬
sion  to  a  post-processing  stage,  since  we  expect  the  esti¬
mates  of  position  and  radius  of  curvature  to  be  more
reliable  after  the  whole  sequence  has  been  processed. 
The  post-processing  stage  collapses  successive  estimates 
which  are  near  enough  in  3D  (say,  less  than  the  spacing 
between  neighboring  sample  points  on  the  3D  contour). 
It  adjusts  the  neighbor  (contour)  and  temporal  (previ¬ 
ous/next)  pointers  to  maintain  a  consistent  description 
of  the  surface. 
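
A minimal sketch of the collapsing step is given below. It merges runs of successive estimates that lie closer than a given spacing and represents each run by its centroid; the use of the centroid and the name `collapse_stationary` are assumptions, and the pointer bookkeeping mentioned in the text is omitted.

```python
import math

def collapse_stationary(points, spacing):
    """Merge runs of successive 3D estimates closer than `spacing`."""
    runs = []
    for p in points:
        if runs and math.dist(p, runs[-1][-1]) < spacing:
            runs[-1].append(p)        # same feature as the previous estimate
        else:
            runs.append([p])          # start a new run
    # Replace each run by its centroid (an assumed choice of representative).
    return [tuple(sum(c) / len(run) for c in zip(*run)) for run in runs]

track = [(0.0, 0.0, 0.0), (0.01, 0.0, 0.0), (0.02, 0.0, 0.0), (1.0, 0.0, 0.0)]
merged = collapse_stationary(track, spacing=0.1)
```

Here the first three estimates collapse into one point near x = 0.01, while the fourth, which is far away, is kept as a separate estimate.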

8  Experimental  results 

To  determine  the  performance  of  our  shape  reconstruc¬ 
tion  algorithm,  we  generated  a  synthetic  motion  se¬ 
quence  of  a  truncated  ellipsoid  rotating  about  its  z  axis 
(Figure  3).  The  camera  is  oblique  (rather  than  per¬ 
pendicular)  to  the  rotation  axis,  so  that  the  motion  of 
the  pixels  is  not  one-dimensional,  and  the  reconstruc¬ 
tion  plane  is  continuously  varying  over  time.  We  chose 
to  use  a  truncated  ellipsoid  since  it  is  easy  to  analyt¬ 
ically  compute  its  projections  (which  are  ellipses,  even 
under  perspective),  and  since  its  radius  of  curvature  is 
continuously  varying  (unlike,  say,  a  sphere  or  a  cylinder). 

When  we  run  these  edge  images  through  our  least- 
squares  fitter  or  Kalman  filter/smoother,  we  obtain  a 
series  of  3D  curves.  The  curves  corresponding  to  the 
surface  markings  and  ridges  (where  the  ellipsoid  is  trun¬ 
cated)  should  be  stationary  and  have  0  radius,  while  the 
curve  corresponding  to  the  occluding  contour  should  con¬ 
tinuously  sweep  over  the  surface. 

We  can  observe  this  behavior  using  a  three- 
dimensional  graphics  program  we  have  developed  for  dis¬ 
playing  the  reconstructed  geometry.  This  program  al¬ 
lows  us  to  view  a  series  of  reconstructed  curves  either 
sequentially  (as  an  animation)  or  concurrently  (overlaid
in  different  colors),  and  to  vary  the  3D  viewing  param¬ 
eters  either  interactively  or  as  a  function  of  the  original 
camera  position  for  each  frame.  Figure  4  shows  all  of 
the  3D  curves  overlaid  in  a  single  image.  As  we  can
see,  the  3D  surface  is  reconstructed  quite  well.  The  left 
hand  pair  of  images  shows  an  oblique  and  top  view  of  a 
noise-free  data  set,  using  the  linear  smoother  with  n  =  7 
window size. The right-hand pair shows the same al¬
gorithm with $\sigma_i = 0.1$ pixels noise added to the edge
positions. 

To  obtain  a  quantitative  measure  of  the  reconstruc¬ 
tion  algorithm  performance,  we  can  compute  the  root 
median  square  error  between  the  reconstructed  3D  coor¬ 
dinates  and  the  true  3D  coordinates  (which  are  known 
to  the  synthetic  sequence  generating  program).  Table  1 
shows  the  reconstruction  error  and  percentage  of  surface 
points  reconstructed  as  a  function  of  algorithm  choice 
and  various  parameter  settings.  The  table  compares  the 
performance  of  a  regular  3-point  fit  with  a  7-point  mov¬ 
ing  window  (batch)  fit,  and  a  linear  fixed-lag  smoother 
(labeled  Kalman)  with  n  =  7.  Results  are  given  for 
the noise-free and $\sigma_i = 0.1$ pixels case. The different
columns  show  how  by  being  more  selective  about  which 
Figure 4: Oblique and top view of reconstructed 3D surface (all 3D curves are superimposed). The left pair shows
only the reconstructed profile curves, while the right pair shows the profiles linked by the epipolar curves (only a
portion of the complete meshed surface is shown for clarity). A total of 72 images spaced 5° apart were used.
Table 1: Root median square error and percentage of edges reconstructed for different algorithms, window sizes ($n$),
input image noise $\sigma_i$, and criteria for valid estimates ($n_f$: minimum number of frames in fit, $\sigma_x^2$: covariance in local
$x$ estimate). These errors are for an ellipsoid whose major axes are (0.67, 0.4, 0.8) and for a 128 x 120 image.
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Figure  5:  Sample  real  image  sequences  used  for  experiments:  (a)  dodecahedron  (b)  diet  coke  (c)  coffee  (d)  tea. 
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Figure  6:  Oblique  and  top  views  of  the  3D  reconstructed  curves  from  (a)  dodecahedron,  (b)  diet  coke,  (c)  coffee,  and 
(d)  tea. 
3D  estimates  are  considered  valid  (either  by  requiring
more  frames  to  have  been  successfully  fit,  or  lowering
the  threshold  on  maximum  covariance),  a  more  reliable 
estimate  can  be  obtained  at  the  expense  of  fewer  recov¬ 
ered  points. 
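
The root median square error used as the quantitative measure can be computed as sketched below. The function name and the toy data are illustrative, and the sketch assumes the correspondence between reconstructed and true points is already known.

```python
import math
import statistics

def root_median_square_error(estimated, truth):
    """Square root of the median squared distance between paired 3D points."""
    squared = [sum((e - t) ** 2 for e, t in zip(p, q))
               for p, q in zip(estimated, truth)]
    return math.sqrt(statistics.median(squared))

estimated = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 0.0, 10.0)]
truth = [(0.0, 0.0, 1.0), (1.0, 0.0, 2.0), (0.0, 0.0, 0.0)]
error = root_median_square_error(estimated, truth)
```

The squared errors here are 1, 4 and 100, so the median is 4 and `error` is 2.0; unlike a root mean square, the single gross error does not dominate the result.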

We  have  also  applied  our  algorithm  to  the  four  real  im¬
age  sequences  shown  in  Figure  5.  These  sequences  were
obtained  by  placing  an  object  on  a  rotating  mechanized 
turntable  whose  edge  has  a  Gray  code  strip  used  for  read¬
ing  back  the  rotation  angle  [Szeliski,  1991].  The  camera
motion  parameters  for  these  sequences  were  obtained  by 
first  calibrating  the  camera  intrinsic  parameters  and  ex¬ 
trinsic  parameters  to  the  turntable  top  center,  and  then 
using  the  computed  turntable  rotation. 

Figure  6  shows  two  views  of  each  set  of  reconstructed 
3D  curves.  We  can  see  that  the  overall  shape  of  the 
objects  has  been  reconstructed  quite  well.  We  show  only 
the  profile  curves,  since  the  epipolar  curves  would  make 
the  line  drawing  too  dense  for  viewing  at  this  resolution. 

9  Discussion  and  Conclusion 

This  paper  extends  previous  work  on  both  the  recon¬ 
struction  of  smooth  surfaces  from  profiles  and  on  the 
epipolar  analysis  on  spatiotemporal  surfaces.  The  ulti¬
mate  goal  of  our  work  is  the  construction  of  a  complete  de¬
tailed  geometric  and  topological  model  of  a  surface  from
a  sequence  of  views.  Towards  this  end,  our  observations 
are  connected  by  tracking  edges  over  time  as  well  as  link¬ 
ing  neighboring  edges  into  contours.  The  information 
represented  at  each  point  includes  the  position,  surface 
normal,  and  curvatures  (currently  only  in  the  viewing  di¬ 
rection).  In  addition,  error  estimates  are  also  computed 
for  these  quantities.  Since  the  sensed  data  does  not  pro¬ 
vide  a  complete  picture  of  the  surface,  e.g.,  there  can 
be  self-occlusion  or  parts  may  be  missed  due  to  coarse 
sampling,  it  is  necessary  to  build  partial  models.  In  the 
context  of  active  sensing  and  real-time  reactive  systems, 
the  reconstruction  needs  to  be  incremental  as  well. 

Because  our  equations  for  the  reconstruction  algo¬ 
rithm  are  linear  with  respect  to  the  measurements,  it 
is  possible  to  apply  statistical  linear  smoothing  tech¬ 
niques,  as  we  have  demonstrated.  This  satisfies  the  re¬ 
quirement  for  incremental  modeling,  and  provides  the 
error  estimates  which  are  needed  for  integration  with 
other  sensory  data,  both  visual  and  tactile.  The  appli¬ 
cation  of  statistical  methods  has  the  advantage  of  provid¬ 
ing  a  sound  theoretical  basis  for  sensor  integration  and 
for  the  reconstruction  process  in  general  [Szeliski,  1989; 
Clark  and  Yuille,  1990].

In  future  work,  we  intend  to  develop  a  more  complete 
and  detailed  surface  model  by  combining  our  technique 
with  regularization-based  curve  and  surface  models.  We
also  plan  to  investigate  the  integration  of  our  edge-based 
multiframe  reconstruction  technique  with  other  visual 
and  tactile  techniques  for  shape  recovery. 
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Abstract 

We address the problem of constructing an integrated surface
description of an existing object from multiple range images.
The two main problems that need to be solved in such a
task are integration and description. We propose to use surface
triangulation as our representation for object description
and data integration, since we believe that intermediate
level representations, such as planar surface patches, are
very flexible for shape description and their construction is
local. We start from a triangulated shell and project the shell
onto the range data points. An integrated approximation error
measure is introduced to effectively evaluate triangle
approximation error in the context of data integration. An
iterative subdivision process is then applied to improve the
approximation error of the triangulation to the desired precision.
Test results on real range data are shown.

1  Introduction 

Our  goal  is  to  construct  an  integrated  surface  de¬
scription  of  an  existing  object  from  multiple  unregis¬
tered  range  images.  The  two  main  problems  that  need  to 
be  solved  in  such  a  task  are  integration  and  description. 
Most  of  the  previous  research  deals  with  the  issues  in 
description,  that  is,  the  choice  of  representation,  how  to
achieve  it,  etc.  Less  work  has  been  done  on  sensor  data 
integration.  It  is  our  belief  that  these  problems  should  be 
dealt  with  together,  since  how  we  perform  data  integra¬ 
tion  directly  affects  our  choice  of  representation 
schemes,  and  thus  the  capability  of  the  representation  in 
describing  objects. 

Data  integration  can  be  achieved  at  the  pixel  level 
relatively  easily,  once  we  have  all  the  range  image 
views  registered  [Chen  and  Medioni  1992].  But  the  rep¬ 
resentations  that  can  be  used  (an  image  parameterized  in 
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spherical  coordinates,  for  example)  are  very  restricted, 
so  that  only  compact  (star-shaped)  objects  can  be  repre¬ 
sented.  Integration  can  also  be  done  at  a  high  level,  after 
description  of  each  view  has  been  obtained  [Parvin  and 
Medioni  1991].  The  disadvantage  of  performing  high 
level  integration  is  that  accuracy  may  be  sacrificed  since 
we  lose  information  when  we  go  to  a  higher  level.  In  ad¬ 
dition,  building  descriptions  from  each  individual  view 
is  more  difficult  than  if  complete  data  were  available,
since  single  views  are  inherently  ambiguous  due  to  self¬ 
occlusion  and  noise. 

Therefore  we  believe  that  integration  should  be 
performed  at  a  relatively  low  level  with  a  flexible  repre¬
sentation.  We  have  found  approximation  by  triangles
(we  will  call  it  surface  triangulation,  or  simply  triangu¬ 
lation  hereafter)  to  be  a  good  candidate  for  such  tasks, 
because  it  is  a  relatively  low  level  representation  and  its 
construction  is  local.  A  triangulated  surface  model  can 
represent a variety of solid objects, in theory at
any resolution. We certainly understand the limitations
of a triangulated representation. It is not ideal
for high level vision tasks, such as recognition, because,
first, the representation is still at a low level, and second, it is
sensitive to many parameters, and therefore unstable.
However,  we  think  that  it  is  a  good  intermediate  repre¬ 
sentation  for  integration  and  for  building  high  level  de¬ 
scription  through  surface  interpolation  from  triangula¬ 
tion  [Peters  1990]. 

In this paper we present a new approach to object
description using multiple range images by triangulation,
with emphasis on integration. We start with a triangulated
shell and map it onto the object surface. The triangular
approximation is refined by a triangle subdivision
process. In the rest of this paper, we first review
some previous research on object description using triangulation
(section 2). Then we introduce the mapping
by projection method (section 3), followed by the approximation
error estimation scheme (section 4). We present
some test results on real data in section 6 and discuss
some future extensions to our system in section 7.
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2  Triangulated  Representation  and 
Triangulated  Shell 

In  previous  work,  researchers  have  addressed  the 
problem  of  achieving  triangular  approximations  of  sin¬ 
gle  view  range  images  ([De  Floriani  and  Puppo 
1988],  [Schmitt  and  Chen  1991]).  The  main  idea  is  to
extend  some  existing  2-D  triangulation  techniques  to 
find  3-D  triangulations  that  approximate  arbitrary  3-D 
surfaces. 

Busch  [Busch  1989]  builds  a  triangulated  surface 
of  an  entire  object  from  its  voxel  representation  by  an 
adaptive,  locally  controlled  growing  process,  which  is 
mainly  heuristically  driven. 

Soucy  and  Laurendeau  [Soucy  and  Laurendeau 
1992]  use  triangulation  to  describe  objects  from  multi¬ 
ple  range  images.  In  their  approach,  the  registered  range 
image  views  are  first  triangulated  and  then  integrated  to 
form  the  so-called  canonical  views.  Then  the  canonical 
views  are  further  triangulated  and  the  results  merged  by 
deleting  overlaps  and  closing  gaps.  Their  approach  re¬ 
quires  that  the  range  image  views  be  segmented  along 
depth  discontinuities  before  being  triangulated. 

Building a triangulation of an object can also be
considered as mapping a triangulated mesh onto the
surface of the object, so that the triangles locally approximate
the surface. Instead of building pieces of the
triangulation and then putting them together, we can
also start with a closed surface tessellated with triangles,
or any suitable primitive surface patches in general,
and map it onto (or deform it so as to match) the surface.
This way we can avoid much of the heuristic approach
in dealing with growing the triangulation or
merging the pieces of triangulation. The adaptive shell
proposed by Vasilescu and Terzopoulos [Vasilescu and
Terzopoulos 1992] and the deformable surface by
Delingette et al. [Delingette et al. 1991] are two examples
of such an approach.

There are two issues we need to consider in such a
mapping. The first is the capability of the mapping in
handling objects with complex shape (e.g., non-star-shaped
objects). The second is the computational complexity
and convergence. The two aforementioned
methods both use a dynamic shape model which incorporates
internal surface forces and external data forces. In
principle, these methods are able to describe complex
objects. The difficulty lies in defining a good external
energy function that can be both easy to compute and
accurate in reflecting the fit of the shape model to the
data. There are also problems in selecting an initial surface
and in dealing with local minima. The computational
cost is also very high for such dynamic systems.
In addition, their approach does not address the problem
of integration, which is one of our goals. We also try to
achieve such a mapping without solving large dynamic
systems.

3 Mapping the Triangulated Shell by Projection

Here we present a new method for mapping the triangulated
shell to the object surface. In this method, a
triangulated shell is first initialized around the object.
Then the triangles are mapped onto the object surface
by projecting them in the radial direction from the center
of the projection. The projection is obtained by computing
the intersections of the lines of projection and the
object surface represented by the range images. We first
present the details of the method and then discuss its
advantages and disadvantages.

3.1  Input  Data 

Our input is a set of range image views of an object
acquired with a liquid-crystal range finder [Sato and
Inokuchi 1987]. In a previous paper [Chen and Medioni
1992], we presented a method for registering range images
of an object, which finds the registration transformation
between range images of the same object from
different views. We register the acquired views with this
method and record the global transformation information
for each view. Although registered, the individual views
might not align exactly everywhere, partly reflecting some
defects present in the raw data and partly due to
mis-registration caused by noise and other defects. We assume
that the range data is dense enough for evaluating surface
properties such as surface normals using neighborhood
fitting.

3.2 Initializing and Projecting the Triangulated Shell

Our goal is to map a triangulated shell onto the surface
of an object. We start with an icosahedron and divide
each of its triangular faces into N × N sub-triangles
[Delingette et al. 1991] (Figure 1. (a)), where N is the
order of initial subdivision. The triangulated shell is
then re-initialized into an ellipsoidal shell of approximately
the size of the object (Figure 1. (b)), based on the
distribution information of the sample points from the
range images. This helps in finding the intersections between
the line of projection and the object surface, as
explained later.

The triangulated ellipsoidal shell is projected onto
the object surface by projecting the vertices from the
center of the projection (see later in this section for the
selection of the center). To find the projection for a vertex,
a ray (the line of projection) is constructed from the
center of the projection through the initial position of
the vertex, and the intersection between the ray and the
surface is computed using an iterative algorithm [Chen
and Medioni 1992]. This line-surface intersection algorithm
works directly with range images without an analytical
representation of the surface. At each iteration,
the object surface near the prospective intersection
point is approximated linearly using its tangent plane,
which intersects the line in consideration. This intersection
converges to the intersection of the surface with the
line. The starting point for such an iterative algorithm
must be relatively close to the intersection point, which
also means close to the surface. In our case the starting
point is chosen to be the initial position of the vertex itself.
Thus an initial triangulated shell approximating the object
surface is desirable. An example of the projection is
shown in Figure 2.

(a) An icosahedron with each of its faces subdivided into 25
sub-triangles (N=5)
(b) An initialized ellipsoid (with sample data superimposed)

Figure 1. Initializing the ellipsoidal shell

Figure 2. Result of projecting the ellipsoidal shell
onto the surface of an object. Shown with the
wire frames are sample surface points from the
range image views of the object.
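The tangent-plane iteration described above can be sketched as follows. This is only an illustration under stated assumptions, not the implementation of [Chen and Medioni 1992]: the function names are hypothetical, and a sphere stands in for the range-image lookup that would normally return the surface point and normal near a given 3-D position.

```python
import math

def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def along(center, direction, t):
    """Point at parameter t on the line of projection."""
    return tuple(c + t * d for c, d in zip(center, direction))

def sphere_sample(x, radius=1.0):
    """Hypothetical stand-in for a range-image lookup: returns the surface
    point nearest to x and the unit normal there, for a sphere centred at
    the origin."""
    norm = math.sqrt(dot(x, x))
    n = tuple(xi / norm for xi in x)
    return tuple(radius * ni for ni in n), n

def intersect_ray(center, direction, t0, sample, iters=20, tol=1e-10):
    """Iterate: approximate the surface near the current point by its
    tangent plane, intersect that plane with the line, and repeat."""
    t = t0
    for _ in range(iters):
        x = along(center, direction, t)
        s, n = sample(x)                      # surface point and normal near x
        denom = dot(n, direction)
        if abs(denom) < 1e-12:
            raise ValueError("line is parallel to the tangent plane")
        t_new = dot(n, sub(s, center)) / denom  # line / tangent-plane intersection
        if abs(t_new - t) < tol:
            return t_new
        t = t_new
    return t

# Centre of projection inside the sphere, starting point close to the surface.
t_hit = intersect_ray((0.2, 0.0, 0.0), (0.0, 1.0, 0.0), 0.8, sphere_sample)
```

As in the text, the starting parameter (here 0.8) must already be close to the surface for the linear approximation to be meaningful.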

In summary, the essential elements of the projection are
(a) the direction of the projection (in this case the radial
direction), and (b) an initial point on the line of projection.
As mentioned earlier, the initial ellipsoidal shell is well
suited to (b), since it initializes the vertices close to the
object surface, while (a) requires that a center of projection
be selected. The requirements for such a point are that it
lies inside the object, and that as much of the surface area
as possible is visible from it. We discuss the implications
of these requirements further in section 7.

3.3  Multiple  Intersections 

Since the input data is in the form of multiple range
images from many view points, there are overlaps in the
surface areas that each range image covers. This results
in multiple projection points on the range images when
a vertex is projected using the method described above.
Assuming that there is only one intersection between
the projection ray and the object surface, we need to
combine the set of intersections computed from range
images of different views into one. Simple averaging of
all the points is one solution, but it does not work well,
as it assumes the noise distribution to be Gaussian. This
is clearly not the case when two surfaces are slightly out
of registration. We have adopted a weighted average
method that takes into account the reliability of the input
data: range data from the surface areas facing the sensor
are much less noisy and thus more reliable than data from
areas with large incidence angles with respect to the sensor.
Let p_i ∈ R^3, i = 1 ... m, be the intersection points of the
line of projection with range image view V_i, and n_i and s_i
the surface normal at p_i and the sensor direction for range
image V_i, respectively. The weighted average p of the
intersection points p_i is defined by

    p = \frac{\sum_{i=1}^{m} (n_i \cdot s_i)\, p_i}{\sum_{i=1}^{m} (n_i \cdot s_i)}    (1)

where a · b stands for the inner product of two vectors
a and b. In our experiments, the surface normal vectors
are obtained from the range images by using a neighborhood
surface fitting technique.
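A minimal sketch of the weighted average of equation (1) is given below. It assumes that the normals and sensor directions are unit vectors and that each sensor direction points from the surface toward the sensor, so that the inner product is the cosine of the incidence angle. The function name and the clamping of negative weights to zero are assumptions made for illustration.

```python
def weighted_intersection(points, normals, sensor_dirs):
    """Combine per-view intersection points into one point, weighting each
    by the inner product of its surface normal and sensor direction."""
    total_w = 0.0
    acc = [0.0, 0.0, 0.0]
    for p, n, s in zip(points, normals, sensor_dirs):
        w = sum(a * b for a, b in zip(n, s))
        w = max(w, 0.0)  # assumption: ignore views seeing the back of the surface
        total_w += w
        for k in range(3):
            acc[k] += w * p[k]
    if total_w == 0.0:
        raise ValueError("no view sees this surface point reliably")
    return tuple(a / total_w for a in acc)

# Two views disagree slightly in depth; the oblique view (weight 0.5)
# contributes less than the view facing the sensor (weight 1.0).
p = weighted_intersection(
    points=[(0.0, 0.0, 1.0), (0.0, 0.0, 1.3)],
    normals=[(0.0, 0.0, 1.0), (0.0, 0.0, 1.0)],
    sensor_dirs=[(0.0, 0.0, 1.0), (0.0, 0.8660254037844386, 0.5)],
)
```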

4  Approximation  via  Triangle  Subdivision 

When the triangulated shell is projected onto the
object surface, each triangle may or may not approximate
the covered surface well. In this section, we discuss
how to evaluate the approximation error of the triangles,
especially in the context of data integration, and
how to reduce the approximation error by subdividing
the triangles.

The first problem is to define the data points that
need to be considered in evaluating the approximation
error for a triangle T. Naturally we do not need to consider
those range images on which none of the triangle
vertices have projections. For the range images on
which at least one of the vertices has a projection, we
project (not to be confused with the projection mentioned
earlier) the triangle T into the range image parameter
space to obtain a 2-D triangle T', as though the triangle
had been in place when the range images were
taken (see Figure 3.).

Figure 3. A 3-D triangle projected into range image
parameter space.

In the simplest case of a Cartesian range image,
where the depth value z is indexed by the x and y coordinates
as z=f(x,y), T' is simply the orthographic projection
of T. In general, however, the projection T' of T involves
a perspective transformation from 3-D world coordinates
into sensor coordinates. Such a transformation
is usually available from system calibration of the range
finder [Sato and Inokuchi 1987]. Once the projected 2-D
triangle for each view is obtained, all surface points in
that view that fall inside of the 2-D triangle will be considered
in evaluating the approximation error. We call
those 3-D points the covered points of triangle T.

The second problem in the approximation error
evaluation is to define a measure for the approximation
error. In triangulating single view range images
[Schmitt and Chen 1991] (or the reparameterized, combined
range images as in [Soucy and Laurendeau
1992]), we could just use the maximum Euclidean distance
of the covered points to the plane containing the
triangle in consideration. But this will not work when
there are multiple views. The reason is that although the
range image views are well registered in general [Chen
and Medioni 1992], some views might deviate from the
rest. There can also be regions that contain contaminated
data. A simple maximum distance measure does not
truly reflect the status of the approximation in such
situations. Therefore we have adopted a statistical approach
which allows us to eliminate the bad data points
and bad range image views for each approximating triangle.

Let T'_i be the projected 2-D triangle of T for range
image view V_i, i = 1 ... m, where m is the total number of
range images that have at least one vertex of T projected
on them. Let Q_i = {q_ij} be the set of 3-D points in V_i
covered by T'_i and e_ij be the signed Euclidean distance of
q_ij to the plane containing T. We can compute the mean
and standard deviation of e_ij:

    \bar{e}_i = \frac{1}{k_i} \sum_{q_{ij} \in Q_i} e_{ij}, \qquad \sigma_i = \sqrt{\frac{1}{k_i} \sum_{q_{ij} \in Q_i} (e_{ij} - \bar{e}_i)^2}    (2)


where k_i = ||Q_i|| is the cardinality of the set Q_i. Then we
adopt a majority vote scheme to effectively eliminate
the bad views from being considered in the final approximation
error estimation for that triangle. That is, we
compute a subset G ⊆ {Q_i, i = 1 ... m}, such that the absolute
value of the difference of the mean approximation
error ē_i between any pair in G is below a certain threshold:

    G = \{ Q_i \mid \forall Q_j \in G,\ |\bar{e}_i - \bar{e}_j| < \epsilon \}    (3)

and that the cardinality of G is at least half of the original
set:

    \|G\| \ge 0.5\, \|\{Q_i\}\|    (4)

The value of ε can be set based on the range image
resolution and the statistics of the registration errors
from the registration process [Chen and Medioni 1992].
Once G is selected, we recompute the mean and standard
deviation of the approximation error for all the
points in G for triangle T. Let the results be ē and σ
respectively. In the case when G is an empty set, or G is a
minority group, we simply select the Q_i that has the
least absolute ē_i:

    G' = \{ Q_i \mid |\bar{e}_i| = \min_{Q_k \in \{Q_j\}} |\bar{e}_k| \}    (5)

The overall approximation measure is defined as

    E = \bar{e} + \mathrm{sign}(\bar{e})\, \alpha \sigma    (6)

where sign() returns 1 or -1 depending on whether its argument
is larger or smaller than 0, and α is a constant that
reflects the noise present in the data. The approximation
measure E in equation (6) is used to simulate the
maximum approximation error for a triangle. We have
found α=2 to be a good choice, which can eliminate
most of the outliers in the data.

Figure 4. Triangles are subdivided in groups with
the same type of approximation errors. The
dashed lines define the new triangles.
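The error measure of equations (2) through (6) can be sketched as below. The text does not say how the subset G is searched for; this sketch assumes a sliding window over the sorted per-view means, which finds the largest set of views whose means pairwise differ by less than ε. The function names, the population form of the standard deviation, and the treatment of a zero mean as positive are likewise assumptions.

```python
import math

def mean_std(values):
    """Mean and (population) standard deviation of a non-empty list."""
    k = len(values)
    mean = sum(values) / k
    return mean, math.sqrt(sum((v - mean) ** 2 for v in values) / k)

def majority_group(means, eps):
    """Indices of the largest set of views whose means pairwise differ by
    less than eps (assumed search strategy: window over sorted means)."""
    if eps <= 0:
        raise ValueError("eps must be positive")
    order = sorted(range(len(means)), key=lambda i: means[i])
    best, lo = [], 0
    for hi in range(len(order)):
        while means[order[hi]] - means[order[lo]] >= eps:
            lo += 1
        if hi - lo + 1 > len(best):
            best = order[lo:hi + 1]
    return best

def approximation_measure(errors_by_view, eps, alpha=2.0):
    """E = mean + sign(mean) * alpha * std over the points of the accepted views.
    errors_by_view[i] holds the signed distances e_ij of view i."""
    views = [e for e in errors_by_view if e]
    if not views:
        raise ValueError("triangle has no covered points")
    means = [mean_std(e)[0] for e in views]
    group = majority_group(means, eps)
    if 2 * len(group) < len(views):
        # No majority: fall back to the single view with least |mean|, as in (5).
        group = [min(range(len(views)), key=lambda i: abs(means[i]))]
    pooled = [e for i in group for e in views[i]]
    mean, std = mean_std(pooled)
    sign = 1.0 if mean >= 0 else -1.0  # assumption: a zero mean counts as positive
    return mean + sign * alpha * std

# Two consistent views and one badly registered view, which is voted out.
E = approximation_measure(
    [[1.0, 1.2, 0.8], [1.1, 0.9, 1.0], [5.0, 5.2, 4.8]], eps=0.5)
```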

5  Triangle  Subdivision  and  Reprojection 

Once the approximation error for a projected triangle
is evaluated, those triangles whose approximation
error E does not meet the predefined approximation
threshold must be subdivided and remapped. Our subdivision
algorithm tries to subdivide triangles without altering
the neighborhood adjacency property of the triangulation,
i.e., each triangle must have 3 neighbors that
share an edge with it. This regularity makes higher order
surface reconstruction easier in the future. For
this reason, we group triangles with the same type of
approximation error (i.e., triangles with ē of the same sign
and their |ē| exceeding a certain given threshold) into
connected components. Here, two triangles are connected
when they share an edge. The subdivision method is
illustrated in Figure 4., categorized by the size of the
connected components.
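The grouping step described above can be sketched as a connected-component search over edge-sharing triangles. This is an illustrative sketch only: the function name, the representation of triangles as triples of vertex indices, and the use of a single threshold on the absolute error are assumptions.

```python
from collections import defaultdict

def error_components(triangles, errors, threshold):
    """Group triangles whose error exceeds the threshold into connected
    components; two triangles are connected when they share an edge and
    their errors have the same sign."""
    def kind(e):
        if abs(e) <= threshold:
            return 0              # triangle fits well enough, not grouped
        return 1 if e > 0 else -1

    # Map each undirected edge to the flagged triangles that use it.
    edge_to_tris = defaultdict(list)
    for t, (a, b, c) in enumerate(triangles):
        if kind(errors[t]) == 0:
            continue
        for edge in ((a, b), (b, c), (c, a)):
            edge_to_tris[frozenset(edge)].append(t)

    adjacency = defaultdict(set)
    for tris in edge_to_tris.values():
        for t in tris:
            for u in tris:
                if t != u and kind(errors[t]) == kind(errors[u]):
                    adjacency[t].add(u)

    seen, components = set(), []
    for t in range(len(triangles)):
        if kind(errors[t]) == 0 or t in seen:
            continue
        stack, comp = [t], []
        seen.add(t)
        while stack:              # depth-first traversal of one component
            cur = stack.pop()
            comp.append(cur)
            for nxt in adjacency[cur]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        components.append(sorted(comp))
    return components

# Triangles 0 and 1 share an edge and both err positively; triangle 2 errs
# negatively; triangle 3 is within the threshold.
groups = error_components(
    [(0, 1, 2), (1, 3, 2), (2, 3, 4), (5, 6, 7)],
    [3.0, 2.5, -3.0, 0.5],
    threshold=2.0,
)
```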

The rule here is that, except for the components that
contain only one triangle, triangles with one neighboring
triangle in the connected component will be divided
into two, those with two neighbors into three, and those
with three neighbors into four. To improve the fit of the
triangles to the object surface, the newly created triangles
must be remapped onto the object surface. This is done by
projecting the triangle vertices in the direction of the
surface normals of the object at that location, which can
be approximated by the average normal vector of the
surrounding triangles of the vertex, weighted by the areas
of those triangles. The difference between this projection
and the initial projection is the direction in
which the projection is done.

(a) A free-style wood blob (the "wood")
(b) A plaster model tooth (the "tooth")

Figure 5. Intensity images of the test objects.
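The area-weighted average normal used as the reprojection direction can be sketched as follows. The sketch relies on the fact that the cross product of two edges of a triangle has length equal to twice its area and points along its normal, so summing the cross products weights each normal by area. The function name and the assumption that triangles are consistently oriented (counter-clockwise seen from outside) are illustrative choices.

```python
import math

def sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def vertex_normal(vertices, incident_triangles):
    """Unit normal at a vertex, approximated by the area-weighted average of
    the normals of the triangles surrounding it."""
    total = [0.0, 0.0, 0.0]
    for i, j, k in incident_triangles:
        # Length of this cross product is twice the triangle area.
        c = cross(sub(vertices[j], vertices[i]), sub(vertices[k], vertices[i]))
        for axis in range(3):
            total[axis] += c[axis]
    length = math.sqrt(sum(t * t for t in total))
    if length == 0.0:
        raise ValueError("degenerate triangle fan")
    return tuple(t / length for t in total)

# A larger triangle in the z=0 plane and a smaller one in the x=0 plane,
# both incident to vertex 0; the result leans toward the larger triangle.
verts = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0), (0.0, 2.0, 0.0), (0.0, 0.0, 1.0)]
n = vertex_normal(verts, [(0, 1, 2), (0, 2, 3)])
```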

This  division-and-projection  process  continues  un¬ 
til  all  the  triangles  have  an  approximation  error  within 
the  desired  threshold. 

6  Experimental  Results 

Now we show some test results of the triangle subdivision
and approximation. Figure 5. shows two objects
("wood" and "tooth") in intensity images, which
are approximately 8 and 10cm high respectively.

Range images are acquired for each object from 16
different viewpoints and then registered. Figure 6.
shows four views of the range images of "wood" used in
the test. Figure 7. shows wire frames of the triangle
subdivision process. Figure 7. (a) shows the triangulated
shell projected onto the object surface. Figure 7. (b)
shows the shell after one subdivision iteration with
approximation error threshold set to 2mm (see Section 5
and equation (6)). Figure 7. (c) is the final result after
three iterations with all triangles having approximation
errors under 2mm. As can be seen, most of the subdivisions
take place around the ridge line in the lower part
of the object. Figure 8. and Figure 9. show some rendered
intensity images of the "wood" and the "tooth"
using the derived triangulation models from various
view points. In both cases, the initial icosahedrons are
subdivided with N=5 and the centers of the fitting
ellipsoidal shells are used as the centers of projection. Our
test programs are written in Lisp. On a Sun SparcStation
2, with Sun Common Lisp, the test on the "wood" completed
in 3 minutes and 18 seconds, resulting in a total of
649 triangles, while the "tooth" took 3 minutes and 34
seconds, yielding 872 triangles. Note that the original
range images have a resolution of about 1 to 2mm with
a comparable registration error.

(a) Range image view 1
(b) Range image view 2
(c) Range image view 3
(d) Range image view 4

Figure 6. Sample range image views of the
object "wood" used for model construction,
shown here as shaded images.

(a) Before subdivision
(b) After 1 iteration
(c) After 3 iterations

Figure 7. Iterative triangle subdivision

7  Working  with  Complex  Objects 

As mentioned in section 3.2, in projecting the initial
triangulated shell, we need to select a projection center,
which must be inside of the object. This is because, if
the center is outside of the object, the projected triangles
will not always be able to maintain their proper relationship
among them and surface self intersection may occur,
as shown in Figure 10. (a). Another important issue
is that when the object is non-star-shaped, we will
have multiple intersections when performing the projection,
regardless of where we put the projection center,
as shown in Figure 10. (b). There are two possible
ways to deal with such situations. The first is to choose
the innermost (closest to the center of projection)
intersections (Figure 10. (b)) and the second is to choose the
outermost intersections (Figure 10. (c)). In both
cases, the problems which remain are how to identify
triangles that cut through the object and how to properly
group them together and find a local projection center
for subdivided triangles derived from them.

Figure 8. Two rendered views of the constructed
triangulation of the object "wood"

8  Conclusion 

We  have  presented  a  new  method  for  constructing 
an  integrated  surface  description  from  multiple  range 
images  based  on  surface  triangulation.  Our  system  com¬ 
bines  the  surface  description  and  data  integration 
through  an  effective  evaluation  for  triangle  approxima¬ 
tion  error  using  an  integrated  error  measure.  Projecting 
a  triangulated  shell  onto  the  surface  of  the  object  has 
proven  to  be  very  efficient  and  effective  in  acquiring  tri¬ 
angulated  surface  models  compared  to  systems  that  em¬ 
ploy  global  optimization  techniques.  Preliminary  tests 
show  good  results  on  free-formed  compact  objects.  Our 
future  research  is  concerned  with  extending  the  current 
system to handle more complex objects.

Figure 9. Three rendered views of the constructed
triangulation of the object "tooth"

Abstract

This  paper  is  concerned  with  active  sensing  of  range 
information  from  focus.  It  describes  a  new  type  of 
camera  whose  image  plane  is  not  perpendicular  to 
the  optical  axis  as  is  standard.  This  special  imaging 
geometry eliminates the usual focusing need of image
plane movement. Camera movement, which is
anyway  necessary  to  process  large  visual  fields,  in¬ 
tegrates  panning,  focusing,  and  range  estimation. 
Thus  the  two  standard  mechanical  actions  of  fo¬ 
cusing  and  panning  are  replaced  by  panning  alone. 
Range  estimation  is  done  at  the  speed  of  panning. 
An  implementation  of  the  proposed  camera  design 
is  described  and  experiments  with  range  estimation 
are  reported. 

1  INTRODUCTION 

This  paper  is  concerned  with  active  sensing  of  range 
information  from  focus.  It  describes  a  new  type  of 
camera  which  integrates  the  processes  of  image  ac¬ 
quisition  and  range  estimation.  The  camera  can  be 
viewed  as  a  computational  sensor  which  can  per¬ 
form  high  speed  range  estimation  over  large  scenes. 
Typically,  the  field  of  view  of  a  camera  is  much 
smaller  than  the  entire  visual  field  of  interest.  Con¬ 
sequently,  the  camera  must  pan  to  sequentially  ac¬ 
quire  images  of  the  visual  field,  a  part  at  a  time,  and 
for  each  part  compute  range  estimates  by  acquiring 
and  searching  images  over  many  image  plane  loca¬ 
tions.  Using  the  proposed  approach,  range  can  be 
computed  at  the  speed  of  panning  the  camera. 

At the heart of the proposed design is active control
of imaging geometry to eliminate the standard mechanical
adjustment of image plane location, and
further, integration of the only remaining mechanical
action of camera panning with focusing and
range  estimation.  Thus,  imaging  geometry  and  op¬ 
tics  are  exploited  to  replace  explicit  sequential  com¬ 
putation.  Since  the  camera  implements  a  range 
from  focus  approach,  the  resulting  estimates  have 
the  following  characteristics  as  is  true  for  any  such 
approach  [3,  5].  The  scene  surfaces  of  interest  must 
have  texture  so  image  sharpness  can  be  measured. 
The  confidence  of  the  estimates  improves  with  the 
amount  of  surface  texture  present.  Further,  the 
reliability  of  estimates  is  inherently  a  function  of 
the  range  to  be  estimated.  However,  range  esti¬ 
mation  using  the  proposed  approach  is  much  faster 
than  any  traditional  range  from  focus  approach, 
thus  eliminating  one  of  the  major  drawbacks. 

The  next  section  describes  in  detail  the  pertinence 
of  range  estimation  from  focus,  and  some  prob¬ 
lems  that  characterize  previous  range  from  focus 
approaches  and  serve  as  the  motivation  for  the  work 
reported  in  this  paper.  Then,  Section  3  presents 
the  new,  proposed  imaging  geometry  whose  center- 
piece  is  a  tilting  of  the  image  plane  from  the  stan¬ 
dard  frontoparallel  orientation.  It  shows  how  the 
design  achieves  the  results  of  search  over  focus  with 
high  computational  efficiency.  Section  4  presents  a 
range  from  focus  algorithm  that  uses  the  proposed 
camera.  Section  5  describes  the  results  of  experi¬ 
ments  demonstrating  the  feasibility  of  our  method. 
Section  6  presents  concluding  remarks. 

2 BACKGROUND & MOTIVATION

2.1 Range Estimation From Focus and Its Utility

Focus based methods obtain a depth estimate of a
scene point by varying the focal length (f) and/or
the focus distance (v). Without loss of generality,
we will always assume that the parameter being
controlled is v. This means that the v value
is changed by mechanically relocating the image
plane. When the scene point appears in sharp
focus, the corresponding u and v values satisfy
the standard lens law: 1/u + 1/v = 1/f. The depth
value u for the scene point can then be calculated
by knowing the values of the focal length
and the focus distance [14, 2, 6]. To determine
when a scene is imaged in sharp focus, several
autofocus methods have been proposed in the past
[7, 16, 17, 8, 12, 15, 10, 9, 2, 13].
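As a small worked example of the lens law, the depth u follows from 1/u + 1/v = 1/f as u = f v / (v - f). The function name and the numeric values below are hypothetical and chosen only for illustration; f and v must be in the same units.

```python
def depth_from_focus(f, v):
    """Object distance u at which a point is in sharp focus, given focal
    length f and lens-to-image-plane distance v (thin-lens law)."""
    if v <= f:
        raise ValueError("v must exceed f for a real object distance")
    return f * v / (v - f)

# A 50 mm lens with the image plane 52.5 mm behind it focuses at 1050 mm.
u = depth_from_focus(50.0, 52.5)
```

Note how a small change in v corresponds to a large change in u when v is close to f, which is one reason the reliability of focus-based estimates depends on the range being estimated.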

Like  any  other  visual  cue,  range  estimation  from 
focus  is  reliable  under  some  conditions  and  not  so 
in  some  other  conditions.  Therefore,  to  use  the  cue 
appropriately,  its  shortcomings  and  strengths  must 
be  recognized  and  the  estimation  process  should  be 
suitably  integrated  with  other  processes  using  dif¬ 
ferent  cues,  so  as  to  achieve  superior  estimates  un¬ 
der  broader  conditions  of  interest  [1,  11,  4].  When 
accurate depth information is not needed, e.g., for
obstacle avoidance during navigation, range estimates
from focus or some other cue alone may suffice,
even though they may be less accurate than those
obtained by an integrated analysis of multiple cues.

2.2  Motivation  for  Proposed  Approach 

The usual range from focus algorithms involve two
mechanical actions: panning and, for each
chosen pan angle, finding the best v value. These
steps make the algorithms slow. The purpose of the
first step is to acquire data over the entire visual
field, since cameras typically have a narrower field of
view. This step is therefore essential to construct
a range map of the entire scene. The proposed approach
is motivated primarily by the desire to eliminate
the second step involving mechanical control.
Consider the set of scene points that will be imaged
with sharp focus for some constant value of
focal length and focus distance. Let us call this
set of points the SF surface¹. For the conventional
case where the image is formed on a plane perpendicular
to the optical axis, and assuming that the
lens has no optical aberrations, this SF surface will
be a surface that is approximately planar and normal
to the optical axis. The size of the SF surface will
be a scaled version of the size of the image plane,
while its shape will be the same as that of the image
plane. Figure 1(a) shows the SF surface for a
rectangular image plane.

As the image plane distance from the lens, v, is
changed, the SF surface moves away from, or toward, the
camera. As the entire range of v values is traversed,
the SF surface sweeps out a cone shaped volume
in three-dimensional space, henceforth called the
SF cone. The vertex angle of the cone represents
the magnification or scaling achieved and is proportional
to the f value. Figure 1(b) shows a frustum
of the cone.

¹Actually, the depth-of-field effect will cause the SF surface
to be a 3-D volume. We ignore this for the moment, as
the arguments being made hold irrespective of whether we
have a SF surface, or a SF volume.

Only those points of the scene within the SF cone
are ever imaged sharply. To increase the size of
the imaged scene, the f value used must be increased.
Since in practice there is a limit on the
usable range of f value, it is not possible to image
the entire scene in one viewing. The camera must
be panned to repeatedly image different parts of
the scene. If the solid angle of the cone is ω, then
to image an entire hemisphere one must clearly use
at least 2π/ω viewing directions. This is a crude lower
bound since it does not take into account the constraints
imposed by the packing and tessellability of
the hemisphere surface by the shape of the camera
visual field.

If  specialized  hardware  which  can  quickly  identify 
focused  regions  in  the  image  is  used,  then  the  time 
required  to  obtain  the  depth  estimates  is  bounded 
by  that  required  to  make  all  pan  angle  changes  and 
to  process  the  data  acquired  for  each  pan  angle. 

The  goal  of  the  approach  proposed  in  this  paper 
is  to  select  the  appropriate  v  value  for  each  scene 
point  without  conducting  a  dedicated  mechanical 
search  over  all  v  values.  The  next  section  describes 
how  this  is  accomplished  by  slightly  changing  the 
camera  geometry  and  exploiting  this  in  conjunc¬ 
tion  with  the  pan  motion  to  accomplish  the  same 
result  as  traditionally  provided  by  the  two  mechan¬ 
ical  motions. 

3 A NON-FRONTAL IMAGING CAMERA

The following observations underlie the proposed
approach. In a normal camera, all points on the
image plane lie at a fixed distance (v) from the
lens. So all scene points are always imaged with a
fixed value of v, regardless of where on the image
plane they are imaged, i.e., regardless of the camera
pan angle. If we instead have an image surface
such that the different image surface points are at
different distances from the lens, then depending
upon where on the imaging surface the image of a
scene point is formed (i.e., depending on the pan
angle), the imaging parameter v will assume different
values. This means that by controlling only
the pan angle, we could achieve both goals of the
traditional mechanical movements, namely, that of
changing v values as well as that of scanning the
visual field, in an integrated way.

(a) SF surface
(b) SF cone

Figure 1: (a) Sharp Focus object surface for the standard
planar imaging surface orthogonal to the optical
axis. Object points that lie on the SF surface are imaged
with the least blur. The location of the SF surface
is a function of the camera parameters. (b) A frustum
of the cone swept by the SF surface as the value of v is
changed. Only those points that lie inside the SF cone
can be imaged sharply, and therefore, range-from-focus
algorithms can only calculate the range of these points.

In the rest of this paper, we will consider the simplest
case of a nonstandard image surface, namely
a plane which has been tilted relative to the standard
frontoparallel orientation. Consider the tilted
image plane geometry shown in Figure 2. For different
angles θ, the distance from the lens center
to the image plane is different. Consider a point
object at an angle θ. The following relation follows
from the geometry:

    |OC| = \frac{d \cos\alpha}{\cos(\theta - \alpha)}


Since for a tilted image plane, v varies linearly with
position, it follows from the lens law that the corresponding
SF surface is a plane whose u value mirrors
the v variation. The SF surface is shown in
Figure 3(a). The volume swept by the SF surface
as the camera is rotated is shown in Figure 3(b).

If the camera turns about the lens center O by an
angle φ, then the object will now appear at an angle
θ+φ. The new image distance (for the point object)
will now be given by the following equation.

    |OC| = \frac{d \cos\alpha}{\cos(\theta + \phi - \alpha)}


As the angle φ changes, the image distance also
changes. At some particular angle, the image will
appear perfectly focused and as the angle keeps
changing, the image will again go out of focus. By
identifying the angle φ at which any surface appears
in sharp focus, we can calculate the image
distance, and then from the lens law, the object
surface distance.
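The chain from pan angle to range can be sketched as below: the image distance follows from the tilted-plane relation, and the range then follows from the lens law. The function names are hypothetical. The sketch takes d to be the distance from the lens center to the image plane along the optical axis and α to be the tilt of the image plane, which is an interpretation of the geometry and not something stated explicitly in the text; angles are in radians.

```python
import math

def image_distance(d, alpha, theta, phi=0.0):
    """Lens-to-image-plane distance |OC| for a point seen at angle theta,
    after the camera has panned by phi, with the image plane tilted by alpha."""
    return d * math.cos(alpha) / math.cos(theta + phi - alpha)

def range_at_sharp_focus(f, d, alpha, theta, phi):
    """Object distance u from the lens law, given the pan angle phi at which
    the point appeared in sharp focus."""
    v = image_distance(d, alpha, theta, phi)
    if v <= f:
        raise ValueError("image distance must exceed the focal length")
    return f * v / (v - f)

# With no tilt and no pan, the on-axis image distance is simply d.
v0 = image_distance(52.5, 0.0, 0.0)
u0 = range_at_sharp_focus(50.0, 52.5, 0.0, 0.0, 0.0)

# With a tilted plane, panning changes the image distance, and with it
# the range that is in sharp focus.
v_a = image_distance(52.5, math.radians(10), 0.0, math.radians(-5))
v_b = image_distance(52.5, math.radians(10), 0.0, math.radians(5))
```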

As the camera rotates about the lens center, new
parts of the scene enter the image at the left edge²
and some previously imaged parts are discarded at
the right edge. The entire scene can be imaged and
ranged by completely rotating the camera once.


4 RANGE ESTIMATION ALGORITHM

Let the image plane have N x N pixels and let the range map be a large array of size N x sN, where s >= 1 is a number that depends on how wide a scene is to be imaged. The $j$-th image frame is represented by $I_j$ and the cumulative, environment-centered range map with origin at the camera center is represented by $R$. Every element in the range array is a structure that contains the focus criterion values for different image indices, i.e., for different pan angles. When the stored criterion value shows a maximum, then the index corresponding to the maximum* can be used to determine the range for that scene point.

* Or the right edge, depending upon the direction of the rotation.

* Knowing the index value, we can find out the amount of camera rotation that was needed before the scene point was sharply focused. Using the row and column indices for the range point, and the image index, we can then find out the exact distance from the lens to the image plane (v). We can then use the lens law to calculate the range.
Figure 2: Tilted Image Surface. A point object, initially at an angle of $\theta$, is imaged at point C. The focus distance OC varies as a function of $\theta$. When the camera undergoes a pan motion, $\theta$ changes and so does the focus distance.


Let  the  camera  start  from  one  side  of  the  scene  and 
pan  to  the  other  side.  Figures  4(a)  and  4(b)  illus¬ 
trate  the  geometrical  relationships  between  succes¬ 
sive  pan  angles,  pixels  of  the  images  obtained,  and 
the  range  array  elements. 

4.1  Algorithm 

Let $j = 0$ and $\phi = 0$. Initialize all the arrays and then execute the following steps.

• Capture the $j$-th image $I_j$.

• Pass the image through a focus criterion filter to yield an array $C_j$ of criterion values.

• For the angle $\phi$ (which is the angle that the camera has turned from its starting position), calculate the offset into the range map required to align image $I_j$ with the previous images. For example, pixel $I_j[50][75]$ might correspond to the same object as pixels $I_{j+1}[50][125]$ and $I_{j+2}[50][175]$.

• Check to see if the criterion function for any scene point has crossed the maximum. If so, compute the range for that scene point using the pan angle (and hence $v$ value) for the image with maximum criterion value.

• Rotate the camera by a small amount. Update $\phi$ and $j$.

• Repeat the above steps until the entire scene is imaged.
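The steps above can be sketched in batch form as follows. This is a simplified illustration, not the authors' implementation: it assumes all criterion arrays are already available, that successive frames are displaced by a constant whole number of columns (`shift`), that a scene point's image column decreases by `shift` from one frame to the next, and that a caller-supplied function `column_to_range` (hypothetical) converts the sharpest column into a range through the image distance and the lens law.

```python
def range_from_pan(criterion_frames, shift, column_to_range):
    """Build an environment-centered range map from per-frame criterion arrays.

    criterion_frames[j][r][c] is the focus criterion of pixel (r, c) in frame j.
    Range-map column m corresponds to image column c = m - j * shift in frame j.
    """
    rows = len(criterion_frames[0])
    cols = len(criterion_frames[0][0])
    width = cols + shift * (len(criterion_frames) - 1)

    best_value = [[float("-inf")] * width for _ in range(rows)]
    best_column = [[None] * width for _ in range(rows)]

    for j, frame in enumerate(criterion_frames):
        offset = j * shift  # aligns frame j with the range map
        for r in range(rows):
            for c in range(cols):
                m = offset + c
                if frame[r][c] > best_value[r][m]:
                    # Remember the image column at which this point was sharpest.
                    best_value[r][m] = frame[r][c]
                    best_column[r][m] = c

    return [[None if c is None else column_to_range(c) for c in row]
            for row in best_column]
```

The online version described in the text would instead report a range as soon as the criterion values for a scene point pass their maximum; the batch form only keeps the bookkeeping short.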


(a)  SF  surface 


(b)  SF  cone 


Figure 3: (a) The SF surface for the proposed camera with a tilted image plane. The SF surface is not parallel to the lens and the optical axis is not perpendicular to the SF surface. (b) The 3D volume swept by the proposed SF surface as the non-frontal imaging camera is rotated. For the same rotation, a frontal imaging camera would sweep out an SF cone having a smaller depth.


5  EXPERIMENTAL  RESULTS 

In these experiments, we attempt to determine the range of scene points. The scene in experiment 1 consists of, from left to right, a planar surface (range = 73 in), part of the background curtain (range = 132 in), a planar surface (range = 54 in) and a planar surface (range = 38 in). The scene in experiment 2 consists of, from left to right, a planar surface (range = 70 in), a planar surface (range = 50 in), and a face of a box (at a depth of 35 in). The camera is turned in small steps of 50 units (of the stepper motor), which corresponds to a shift of 15 pixels (in pixel columns) between images. A scene point will thus be present in a maximum of thirty-four* images. In each image, for the same scene point, the effective distance from the image plane to the lens is different. There is a 1-to-1 relationship between the image column number and the distance from lens to image, and therefore, by the lens law, a 1-to-1 relationship between the image column number and the range of the scene point. The column number at which a scene point is imaged with greatest sharpness is therefore also a measure of the range.


Results Among the focus criterion functions that were tried, the Tenengrad function [17] seemed to have the best performance/speed characteristics. In addition to problems like depth of field, lack of detail, selection of window size etc., that are present in most range-from-focus algorithms, the range map has two problems as described below.

• Consider a scene point, A, that is imaged on pixels $I_1[230][470]$, $I_2[230][455]$, $I_3[230][440]$ ... Consider also a neighboring scene point B, that is imaged on pixels $I_1[230][471]$, $I_2[230][456]$, $I_3[230][441]$ ... The focus criterion values for point A will peak at a column number that is $470 - n \times 15$ (where $0 < n$). If point B is also at the same range as A, then the focus criterion values for point B will peak at a column number that is $471 - n \times 15$, for the same $n$ as that for point A. The peak column number for point A will therefore be 1 less than that of point B. If we have a patch of points that are all at the same distance from the camera, then the peak column numbers obtained will be numbers that change by 1 for neighboring points*. The resulting range map therefore shows a local ramping behavior.

•  As  we  mentioned  before,  a  scene  point  is  imaged 
about  34  times,  at  different  levels  of  sharpness 
(or  blur).  It  is  very  likely  that  the  least  blurred 
image  would  have  been  obtained  for  some  camera 


Figure 4: (a) Panning camera, environment fixed range array, and the images obtained at successive pan angles. Each range array element is associated with multiple criterion function values which are computed from different overlapping views. The maximum of the values in any radial direction is the one finally selected for the corresponding range array element, to compute the depth value in that direction. (b) Steps involved in obtaining range from focus.


"Roughly  ^ 

* Neighbours along vertical columns will not have this problem.
parameter that corresponds to a value between two input frames.

To reduce these problems, we fit a Gaussian to the three focus criterion values around the peak to determine the location of the real maximum. For brevity, we have not included some sample images from the experiments. Figure 5 shows two views of the range disparity values for experiments 1 and 2. Parts of the scene where we cannot determine the range disparity values are shown blank.
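A three-point Gaussian fit of this kind has a closed form, because the logarithm of a Gaussian is a parabola. The sketch below is an illustration only: it assumes the three criterion values are positive and equally spaced (one frame apart), and the function name is hypothetical.

```python
import math


def gaussian_peak_offset(y_prev, y_peak, y_next):
    """Sub-sample offset of the Gaussian maximum relative to the middle sample.

    The three samples are taken at positions -1, 0 and +1. A parabola through
    their logarithms has its vertex at the returned offset.
    """
    lp, l0, ln_ = math.log(y_prev), math.log(y_peak), math.log(y_next)
    curvature = lp - 2.0 * l0 + ln_
    if curvature >= 0.0:
        raise ValueError("samples do not bracket a maximum")
    return 0.5 * (lp - ln_) / curvature
```

Adding the returned offset to the index of the sharpest frame gives a fractional frame index, which addresses both the ramping effect and the case where the true maximum falls between two input frames.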

6  SUMMARY  AND 
CONCLUSIONS 

In  this  paper  we  have  shown  that  using  a  camera 
whose  image  plane  is  not  perpendicular  to  the  opti¬ 
cal  axis,  allows  us  to  determine  estimates  of  range 
values  of  object  points.  We  showed  that  the  SF 
surface,  which  appears  in  sharp  focus  when  imaged 
by  our  non-frontal  imaging  camera,  is  an  inclined 
plane.  When  the  camera’s  pan  angle  direction 
changes,  by  turning  about  the  lens  center,  an  SF 
volume  is  swept  out  by  the  SF  surface.  The  points 
within  this  volume  comprise  those  for  which  range 
can  be  estimated  correctly.  We  have  described  an 
algorithm  that  determines  the  range  of  scene  points 
that  lie  within  the  SF  volume.  We  point  out  some 
of  the  shortcomings  that  are  unique  to  our  method. 
We  have  also  described  the  results  of  some  experi¬ 
ments  that  were  conducted  to  prove  the  feasibility 
of  our  method. 
Abstract

This  paper  studies  the  problem  of  obtain¬ 
ing  depth  information  from  focusing  and 
defocusing,  which  have  long  been  noticed 
as  important  sources  of  depth  information 
for  human  and  machine  vision.  In  depth 
from  focusing,  we  try  to  eliminate  the  local 
maxima  problem  which  is  the  main  source 
of  inaccuracy  in  focusing;  in  depth  from 
defocusing,  a  new  computational  model  is 
proposed  to  achieve  higher  accuracy. 

The  major  contributions  of  this  paper  are: 

(1)  In  depth  from  focusing,  instead  of  the 
popular  Fibonacci  search  which  is  often 
trapped  in  local  maxima,  we  propose  the 
combination  of  Fibonacci  search  and  curve 
fitting,  which  leads  to  an  unprecedentedly 
accurate  result;  (2)  New  model  of  the  blur¬ 
ring  effect  which  takes  the  geometric  blur¬ 
ring  as  well  as  the  imaging  blurring  into 
consideration,  and  the  calibration  of  the 
blurring  model;  (3)  In  spectrogram-based 
depth  from  defocusing,  a  maximal  resem¬ 
blance  estimation  method  is  proposed  to 
decrease  or  eliminate  the  window  effect. 

This paper reports focus ranging with less than 1/1000 error and defocus ranging with less than 1/200 error. With this precision, depth from focus ranging is becoming competitive with stereo vision for reconstructing 3D depth information.

1  Introduction 

Obtaining depth information by actively controlling camera parameters is becoming more and more important in machine vision, because it is passive and monocular. Compared with the popular stereo method for depth recovery, this focus method doesn't have the correspondence problem; therefore it is a valuable alternative to the stereo method for depth recovery.

There are two distinct scenarios for using focus information for depth recovery:

•  Depth  From  Focus:  We  try  to  determine  dis¬ 
tance  to  one  point  by  taking  many  images  in 
better  and  better  focus.  Also  called  “autofo¬ 
cus”  or  “software  focus”.  Best  reported  result 
is  1/200  depth  error  at  about  1  meter  distance 
[Subbarao,  1992]. 

•  Depth  From  Defocus:  By  taking  small  number 
of  images  under  different  lens  parameters,  we 
can  determine  depth  at  all  points  in  the  scene. 
This  is  a  possible  range  image  sensor,  compet¬ 
ing  with  laser  range  scanner  or  stereo  vision. 
Best  reported  result  is  1.3%  RMS  error  in  terms 
of  distance  from  the  camera  when  the  target  is 
about  0.9  m  away  [Ens  and  Lawrence,  1991]. 

Both methods have been limited in the past by low precision hardware and imprecise mathematical models. In this paper, we will improve both:

• Depth From Focus: We propose a stronger search algorithm with its implementation on a high precision camera motor system.

• Depth From Defocus: We propose a new estimation method and a more realistic calibration model for the blurring effect.

With these new results, focus is becoming viable as a technique for machine vision applications such as terrain mapping and object recognition.

2 Depth From Focusing

Focusing has long been considered one of the major depth sources for human and machine vision. In this section, we will concentrate on the precision problem of focusing. We will approach high precision from both software and hardware directions, namely, stronger algorithms and a more precise camera system.

Most previous research on depth from focusing concentrated on the development and evaluation of different focus measures, such as [Tenenbaum, 1970, Krotkov, 1987, Nayar and Nakagawa, 1990, Subbarao et al., 1992]. As described by all these researchers, an ideal focus measure should be unimodal, monotonic, and should reach the maximum only when the image is focused. But the focus measure profile has many local maxima due to noise and/or the side-lobe effect ([Subbarao et al., 1992]) even after magnification compensation ([Willson and Shafer, 1991]). This essentially requires a more complicated peak detection method compared with the Fibonacci search, which is optimal under the unimodal assumption as in [Beveridge and Schechter, 1970, Krotkov, 1987]. In this paper, we use a recognized focus measure from the literature, which is the Tenengrad with zero threshold in [Krotkov, 1987] or the $M_1$ method in [Subbarao et al., 1992]. Our major concern is to discover to what extent the precision of focus ranging can scale up with more precise camera systems and more sophisticated search algorithms. We propose the combination of Fibonacci search and curve fitting to detect the peak of the focus measure profile precisely and quickly.

To evaluate the results from peak detection, an error analysis method is presented to analyze the uncertainty of the peak detection in the motor count space, and to convert the uncertainty in the motor count space into uncertainty of depth. We compute the variance of motor positions resulting from peak detections over equal depth targets. The Rayleigh criterion of resolution is applied to the distribution of motor positions to calculate the minimal differentiable motor displacement. With the assumption of local linearity of the mapping from the motor count space to focus depth, the minimal differentiable motor displacement can be converted to the minimal differentiable depth.

The lack of high precision equipment has been a limiting factor in previous implementations of various focus ranging methods. Many implemented systems, such as SPARCS, have fairly low motor resolution, which actually prohibits more precise results. We will give a brief description of the motor-driven camera system in the Calibrated Imaging Lab later, and further details can be found in [Willson and Shafer, 1992].

2.1  Fibonacci  Search  and  Curve 
Fitting 

When  the  focus  motor  resolution  is  high,  we  usually 
have  a  very  large  parameter  space  which  prevents 
us  from  exhaustively  searching  all  motor  positions. 
Based  on  the  unimodal  assumption  of  focus  measure 
profile,  Fibonacci  search  was  employed  to  narrow  the 
parameter  space  down  to  the  peak  [Beveridge  and 
Schechter,  1970]. 

Assume the initial interval is $[x, y]$, and we know the focus measure profile is unimodal in this interval. If $x < x_1 < x_2 < y$ and $F(x_1) < F(x_2)$, where $F$ is the focus measure function, then the peak cannot be within the interval $[x, x_1)$; otherwise the unimodal assumption will be violated. Therefore, if we can properly choose $x_1$ and $x_2$, the peak can be found optimally. Fibonacci search is the optimal search under the unimodal assumption.
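A minimal sketch of Fibonacci search over integer motor positions follows. It is an illustration under the unimodal assumption, not the authors' code: the interval is padded to a Fibonacci length, positions outside `[lo, hi]` are treated as infinitely bad, and the names `fibonacci_search_max` and `focus` are hypothetical.

```python
def fibonacci_search_max(focus, lo, hi):
    """Return (position, evaluations) for the maximum of a unimodal focus
    measure over the integers lo..hi."""
    if lo > hi:
        raise ValueError("empty interval")

    # Smallest Fibonacci number that covers the open interval (lo - 1, hi + 1).
    fib = [1, 1]
    while fib[-1] < hi - lo + 2:
        fib.append(fib[-1] + fib[-2])
    k = len(fib) - 1

    cache = {}

    def g(x):
        if x < lo or x > hi:
            return float("-inf")  # padding outside the real motor range
        if x not in cache:
            cache[x] = focus(x)
        return cache[x]

    a = lo - 1  # exclusive lower bound; the upper bound is a + fib[k]
    while k > 2:
        x1 = a + fib[k - 2]
        x2 = a + fib[k - 1]
        if g(x1) < g(x2):
            a = x1  # the peak cannot lie at or below x1
        # otherwise the upper bound shrinks to x2, which is a + fib[k - 1]
        k -= 1
    return a + 1, len(cache)
```

Each step reuses one of the two previous probe points, so the number of focus measure evaluations grows only logarithmically with the size of the motor range.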

Figure  1  shows  the  target  used  for  testing  the  focus 
measure,  and  Figure  2  is  the  focus  measure  profile 
of  the  target. 


Figure  1:  Step  Edge  Image  as  Target 

It is clear from Figure 2 that Fibonacci search will fail to detect the peak precisely because of the jagged profile. Fortunately, those local maxima are small in size, and therefore can be regarded as disturbances. From the previous paragraphs, we know that the Fibonacci search only evaluates at two points within the interval, which gives rise to the hope that when the interval is large, Fibonacci search is still applicable because it will overlook those small ripples.

As the search goes on, the interval becomes smaller and smaller. Consequently, Fibonacci search must be aborted at some point when the search might be misleading. We can experimentally set up a threshold; when the length of the interval is less than the threshold, Fibonacci search is replaced by an exhaustive search. After the exhaustive search, a curve is fitted to the part of the profile resulting from the exhaustive search.

In our experiments, we set the threshold to be 5 motor counts. So when the Fibonacci search narrows down the whole motor space to $[a, b]$, where $b - a < 5$, an exhaustive search is fired on the interval $[a - c, b + c]$, where $c$ is a positive constant. A Gaussian function is fitted to the profile in the interval $[a - c, b + c]$ using the least square method described in [Press et al., 1988].
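The exhaustive search and curve fitting stage can be sketched as below. This is an assumption-laden illustration: instead of the least square routine from [Press et al., 1988], it fits a parabola to the logarithm of the focus measure (which is exact for a noise-free Gaussian with positive values), and the names `fit_gaussian_peak` and `refine_peak` are hypothetical.

```python
import math


def _det3(m):
    """Determinant of a 3x3 matrix given as nested lists."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


def fit_gaussian_peak(positions, values):
    """Least-squares fit of ln(value) = a*x^2 + b*x + c; returns the peak."""
    centre = sum(positions) / len(positions)  # centering improves conditioning
    xs = [p - centre for p in positions]
    ys = [math.log(v) for v in values]
    s = [sum(x ** k for x in xs) for k in range(5)]            # sums of x^k
    t = [sum((x ** k) * y for x, y in zip(xs, ys)) for k in range(3)]
    normal = [[s[4], s[3], s[2]], [s[3], s[2], s[1]], [s[2], s[1], s[0]]]
    rhs = [t[2], t[1], t[0]]
    det = _det3(normal)
    # Cramer's rule for the quadratic and linear coefficients.
    a = _det3([[rhs[i] if j == 0 else normal[i][j] for j in range(3)]
               for i in range(3)]) / det
    b = _det3([[rhs[i] if j == 1 else normal[i][j] for j in range(3)]
               for i in range(3)]) / det
    if a >= 0.0:
        raise ValueError("fitted curve has no maximum")
    return centre - b / (2.0 * a)


def refine_peak(focus, a, b, c):
    """Evaluate focus at every motor count in [a - c, b + c] and fit a Gaussian."""
    positions = list(range(a - c, b + c + 1))
    values = [focus(m) for m in positions]
    return fit_gaussian_peak(positions, values)
```

Because the fit uses every sample in the window, small ripples in the profile shift the estimated peak far less than they shift the location of the largest single sample.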

Figure  3  shows  the  result  when  Fibonacci  search 
alone  is  applied  to  the  focus  measure  profile.  Ap¬ 
parently,  the  search  is  trapped  in  a  local  maximum. 
Figure  4  shows  the  result  from  Gaussian  function 
fitting.  Both  graphs  show  only  a  part  of  the  whole 
motor  space. 

2.2  Error  Analysis 

Because  of  the  depth  accuracy  we  expected,  a  di¬ 
rect  measurement  of  absolute  depth  is  impossible. 
Instead,  we  prefer  to  use  the  minimal  differentiable 
depth  as  an  indication  of  the  depth  accuracy.  If 
we  assume  the  peak  motor  positions  resulting  from 
Figure 2: Focus Measure Profile


Figure 3: Fibonacci Search


Figure 4: Curve Fitting

the same repeated experiments have a Gaussian distribution, we can define the minimal differentiable motor displacement as the minimal difference of two motor counts which have a pre-defined probability of representing different peaks. For example, in Figure 5, the Gaussian distribution is artificially cut at the 90% line, so we can say that if we do one focus ranging experiment on a target at the depth corresponding to the peak $a$, there is a 90% probability that the motor count will be within the interval A.

Different pre-defined probabilities can be used for the definition of the minimal differentiable motor displacement. We define the minimal differentiable motor displacement based on the Rayleigh criterion for resolution [Born and Wolf, 1964], which specifies the saddle-to-peak ratio as $8/\pi^2$. In the case of the Gaussian distribution, the cut-off line corresponding to the Rayleigh criterion is about $0.9\sigma$.

There is definitely a mapping from a motor count to an absolute depth value. Assume $d = f(m)$, where $d$ is the depth, $m$ the motor count and $f$ the mapping; we have

$$ \frac{\Delta d}{\Delta m} = f'(m), \tag{1} $$

where $f'(m)$ is the first order derivative with respect to $m$. Because what we really want to know is the minimal differentiable depth or depth resolution $\Delta d$, and we already have the minimal differentiable motor displacement $\Delta m$, the only thing that needs to be calibrated is $f'(m)$. If we assume $f'(m)$ is a constant in the vicinity of $d = D$, and the motor count distribution has its center at $m = M$, then when the target is moved $\Delta D$, the distribution center moves $\Delta M$, and we will have the minimal differentiable depth

$$ \Delta d = \frac{\Delta m}{\Delta M} \Delta D, \tag{2} $$

where $\Delta m$ is the minimal differentiable motor displacement.


Figure 5: Minimal Differentiable Motor Displacement


2.3  Implementation  and  Result 
2.3.1  Hardware 

We implemented this focus ranging algorithm in the Calibrated Imaging Laboratory, using the Fujinon/Photometric camera system [Willson and Shafer, 1992]. The focal length can change between 10 mm and 130 mm with 11100 motor steps, the focus distance can change from approximately 1 meter to infinity with 5100 motor steps, and the aperture can change from F1.7 to completely closed with 2700 motor steps. The SNR of the camera can be as low as 400/1 because of the pixel by pixel digitization and the -40°C temperature of the sensor.

2.3.2  Experiments  and  Results 

We put the target of Figure 1 at about 1.2 meters away from the front lens element of the camera. Maximal focal length and maximal aperture are employed to achieve the minimal depth of field. The evaluation window is 40x40, while the gradient operator is a 3x3 Sobel operator.
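For reference, a Tenengrad-style focus measure with a 3x3 Sobel operator can be sketched as follows. This is an illustrative version only: the image is a plain list of rows, border pixels are skipped, the threshold is applied to the gradient magnitude, and the function name is hypothetical.

```python
def tenengrad(image, threshold=0.0):
    """Sum of squared Sobel gradient magnitudes over the interior of a window.

    Only pixels whose gradient magnitude exceeds `threshold` contribute;
    threshold=0.0 corresponds to the zero-threshold variant.
    """
    rows, cols = len(image), len(image[0])
    total = 0.0
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            # Horizontal Sobel response (right column minus left column).
            gx = (image[r - 1][c + 1] + 2 * image[r][c + 1] + image[r + 1][c + 1]
                  - image[r - 1][c - 1] - 2 * image[r][c - 1] - image[r + 1][c - 1])
            # Vertical Sobel response (bottom row minus top row).
            gy = (image[r + 1][c - 1] + 2 * image[r + 1][c] + image[r + 1][c + 1]
                  - image[r - 1][c - 1] - 2 * image[r - 1][c] - image[r - 1][c + 1])
            squared = gx * gx + gy * gy
            if squared > threshold * threshold:
                total += squared
    return total
```

A sharp step edge yields a larger value than a blurred ramp with the same end levels, which is the property the search over motor positions relies on.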

The distribution of motor positions resulting from an experiment repeated 40 times is sketched in Figure 6. With the mean as the center of a Gaussian, and the standard deviation as $\sigma$ of the Gaussian, we have the minimal differentiable motor displacement as $2 \times 0.9\sigma = 4.5$ motor counts.


Then the target is moved toward the camera by 1 centimeter, and we repeated the above experiments. The center of the motor count distribution moves 38 counts. Therefore, by Eq. 2, we have the minimal differentiable depth:

$$ \Delta d = \frac{\Delta m}{\Delta M} \times \Delta D = \frac{4.5}{38} \times 1\,\text{cm} = 0.118\,\text{cm}. \tag{3} $$

And the relative depth error is about 0.118 / 120 = 0.098%.
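The arithmetic of this result can be reproduced with a few lines. The value $\sigma = 2.5$ motor counts is not stated in the text; it is inferred here from $2 \times 0.9\sigma = 4.5$, and the function name is hypothetical.

```python
def minimal_differentiable_depth(delta_m, delta_M, delta_D):
    """Depth resolution from the locally linear motor-to-depth mapping:
    delta_d = (delta_m / delta_M) * delta_D."""
    return delta_m / delta_M * delta_D


sigma = 2.5                    # motor counts, inferred from 2 * 0.9 * sigma = 4.5
delta_m = 2 * 0.9 * sigma      # minimal differentiable motor displacement
delta_d = minimal_differentiable_depth(delta_m, 38.0, 1.0)  # centimeters
relative_error = delta_d / 120.0                            # target at 1.2 m
```

With these inputs `delta_d` is about 0.118 cm and `relative_error` is just under 0.1%, in line with the figures quoted above.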

3 Depth From Defocusing

The depth from defocusing method uses the direct relationships among the depth, camera parameters and the amount of blurring in images to derive the depth from parameters which can be directly measured. In this part of the paper, we propose the maximal resemblance estimation method to estimate the amount of defocusing accurately, and a calibration-based blurring model.

Because the blurring in an image can be caused by either the imaging process or the scene itself, it generally requires at least two images taken under different camera configurations to eliminate this ambiguity. Pentland solved this problem by taking one picture by a pin-hole camera, which can be regarded as the orthographic projection of the scene with zero imaging blurring, and another one by a wide aperture camera [Pentland, 1987]. In this paper, we intend to employ two images which are defocused to different extents.

Window effects have largely been ignored in the literature of this field, except [Ens and Lawrence, 1991, Ens, 1990], where the author derived a function of RMS depth error in terms of the size of the window. For example, when the window is 4 pixels by 4 pixels, the RMS error from the window effect can be as large as 65.8%! The maximal resemblance estimation method we propose is capable of eliminating the window effect. It is also noticed that the size of the window is the decisive factor that limits the resolution of depth maps if we try to obtain a dense depth map. Therefore, if we can use a smaller window without reducing the quality of the results, the resolution of dense depth maps can be much higher.

Previous work has employed oversimplified camera models to derive the relationship between blurring functions and camera configurations. In [Pentland, 1987, Subbarao, 1988, Bove, 1989], the radius of the blurring circle is derived from the ideal thin lens model. In this paper, we will propose a more sophisticated function which directly relates the blurring function with camera motors. Experimental results are very consistent with this model, as will be shown later.

3.1 Background

The depth from defocus method is based on the idea that the amount of blurring change is directly related to the depth and camera parameters. Since the camera parameters can be calibrated, the depth can be expressed by the amount of blurring change correspondingly. To estimate the amount of blurring change, we need a model of the optical blurring. Traditionally, the blurring effect is modeled as convolution with a Gaussian in the computer vision literature, partly due to its mathematical tractability. Here we will still assume a Gaussian model, i.e. (considering the 1D case):

$$ I(x) = I_0(x) * g_{\sigma(d,c)}(x) \tag{4} $$

$$ g_\sigma(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{x^2}{2\sigma^2}} \tag{5} $$


The basic idea of the depth-from-defocus method is that, in Eq. 4, since $I(x)$, which is the image, and $c$, which results from camera calibration, are known, and $d$ and $I_0(x)$ are unknown, we can take two images under different camera settings $c_1$ and $c_2$; then at least theoretically $d$ can be computed. But because Eq. 4 is not a linear equation with respect to the unknown $d$, directly solving for $d$ is either impossible or numerically unstable. Pentland proposed a method to solve for $d$ by Fourier transforms [Pentland, 1987]:

If

$$ I_1(x) = I_0(x) * g_{\sigma_1}(x) \tag{6} $$

$$ I_2(x) = I_0(x) * g_{\sigma_2}(x) \tag{7} $$

then,

$$ \ln\frac{\mathcal{F}[I_1(x)]}{\mathcal{F}[I_2(x)]} = \ln\frac{\mathcal{I}_1(f)}{\mathcal{I}_2(f)} = \ln\frac{\mathcal{I}_0(f)\,\mathcal{G}_{\sigma_1}(f)}{\mathcal{I}_0(f)\,\mathcal{G}_{\sigma_2}(f)} = \ln\frac{\mathcal{G}_{\sigma_1}(f)}{\mathcal{G}_{\sigma_2}(f)} = -\frac{1}{2} f^2 (\sigma_1^2 - \sigma_2^2) \tag{8} $$

$$ \sigma_1 = \sigma(d, c_1) \tag{9} $$

$$ \sigma_2 = \sigma(d, c_2) \tag{10} $$

Substituting Eq. 9 and Eq. 10 into Eq. 8, we get:

$$ \ln\frac{\mathcal{I}_1(f)}{\mathcal{I}_2(f)} = -\frac{1}{2} f^2 \left(\sigma^2(d, c_1) - \sigma^2(d, c_2)\right) \tag{11} $$

where the function $\sigma$ can be calibrated.


Obviously,  in  Eq.  11,  the  only  unknown  is  d,  there¬ 
fore,  depth  recovery  from  two  images  is  straightfor¬ 
ward. 
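The estimation step behind Eq. 8 can be illustrated with a small least-squares sketch. It assumes noise-free, positive spectral magnitudes sampled at known frequencies and a Gaussian blur whose transform is $e^{-f^2\sigma^2/2}$; the function names are hypothetical, and the final mapping from the blur difference to $d$ (the calibrated function $\sigma$) is not modeled here.

```python
import math


def gaussian_transfer(f, sigma):
    """Fourier transform of a unit-area Gaussian blur of width sigma."""
    return math.exp(-0.5 * (f * sigma) ** 2)


def estimate_blur_difference(freqs, spectrum1, spectrum2):
    """Least-squares estimate of sigma1^2 - sigma2^2 from
    ln(I1(f) / I2(f)) = -(f^2 / 2) * (sigma1^2 - sigma2^2),
    using a straight-line fit through the origin in the variable -f^2 / 2."""
    numerator = 0.0
    denominator = 0.0
    for f, s1, s2 in zip(freqs, spectrum1, spectrum2):
        x = -0.5 * f * f
        y = math.log(s1 / s2)
        numerator += x * y
        denominator += x * x
    return numerator / denominator
```

Using all available frequencies in the fit, rather than a single one, is what makes the estimate usable when some frequencies carry little image energy.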


3.2  Gabor  Transform  and  Window 
Effect 

The method explained above is based on $\mathcal{F}[I(x)]$, which is the Fourier transform of the entire image. Thus, only one $d$ can be calculated from the entire image. If our goal is to obtain a dense depth map $d(x, y)$, we are forced to use the STFT (Short Time Fourier Transform) to preserve the depth locality. To eliminate the spurious high frequency components generated by the discontinuity at the window boundary, people usually multiply the window by a window function. Unfortunately, the elegant cancellation in Eq. 8 doesn't hold any more if we introduce the window function $W(x)$:

$$ \ln\frac{\mathcal{F}[I_1(x) W(x)]}{\mathcal{F}[I_2(x) W(x)]} = \ln\frac{\mathcal{I}_1(f) * \mathcal{W}(f)}{\mathcal{I}_2(f) * \mathcal{W}(f)} = \ln\frac{(\mathcal{I}_0(f)\,\mathcal{G}_{\sigma_1}(f)) * \mathcal{W}(f)}{(\mathcal{I}_0(f)\,\mathcal{G}_{\sigma_2}(f)) * \mathcal{W}(f)} \tag{12} $$

The convolutions in Eq. 12 introduce blurring in both the time (space) and frequency domains. The Gabor transform [Rioul and Vetterli, 1991], which uses a Gaussian as the window function, can minimize the product of spatial uncertainty and spectral uncertainty. Assuming a Gaussian function is used as the window $W(x)$, with Fourier transform $\mathcal{W}(f)$, we have:

$$ \Delta x^2 = \frac{\int x^2\, |W(x)|^2\, dx}{\int |W(x)|^2\, dx} $$

$$ \Delta f^2 = \frac{\int f^2\, |\mathcal{W}(f)|^2\, df}{\int |\mathcal{W}(f)|^2\, df} $$

The above equations state that, in the frequency domain, two frequency components $\Delta f$ away can't be discriminated, and in the space domain, two impulses $\Delta x$ away can't be discriminated either. Apparently, one interpretation of why Eq. 8 doesn't hold is that $\Delta f$ is not zero.

3.3  Maximal  Resemblance 
Estimation 

From observation, we know that when $\sigma_1$ approaches $\sigma_2$, Eq. 12 also approaches zero; in other words, when $\sigma_1$ almost equals $\sigma_2$, Eq. 8 can be a good approximation in terms of absolute error. This observation suggests an iterative method in which the blurring difference $\Delta$ is refined by blurring one image to resemble the other in the vicinity of one pixel. In symbols (assuming $\Delta_{(k)}$ is the $k$th estimate of $\sigma_1^2 - \sigma_2^2$):

1. $I_1^{(0)} = I_1$, $I_2^{(0)} = I_2$ and $\Delta = 0.0$, $k = 0$;

2. $\mathcal{I}_i^{(k)} = \mathcal{F}[I_i^{(k)} W]$, $i = 1, 2$;

3. Fit a curve to $\ln\frac{\mathcal{I}_1^{(k)}}{\mathcal{I}_2^{(k)}} = -f^2 \Delta_{(k)} / 2$. (Refer to Eq. 8)

4. $\Delta = \sum_{i=0}^{k} \Delta_{(i)}$.

5. If $\Delta > 0$, then
   $I_1^{(k+1)} = I_1$,
   $I_2^{(k+1)} = I_2 * g_{\sqrt{\Delta}}$;
   else,
   $I_1^{(k+1)} = I_1 * g_{\sqrt{-\Delta}}$,
   $I_2^{(k+1)} = I_2$;
   Note all these convolutions are done very locally because of the window function multiplication in step 2.

6. If the termination criteria are satisfied, exit.

7. $k = k + 1$, go to step 2.

All the above operations involve only local pixels, and don't require taking new pictures. Therefore, the computation can be done in parallel for all pixels to obtain a dense depth map.

Let's trace the above iterations. At the first cycle, we have (assuming $\sigma_1 > \sigma_2$):

$$ \Delta_{(0)} = \sigma_1^2 - \sigma_2^2 + E_{(0)} $$

where $E_{(0)}$ is the error of estimating $\sigma_1^2 - \sigma_2^2$. Then

$$ I_1^{(1)} = I_1 = I_0 * g_{\sigma_1} \tag{13} $$

$$ I_2^{(1)} = I_2 * g_{\sqrt{\Delta_{(0)}}} = I_0 * g_{\sqrt{\sigma_1^2 + E_{(0)}}} \tag{14} $$

We can see that, after the first iteration, we have actually switched to estimating $E_{(0)}$. So after $k$ iterations, we have

$$ \Delta = \sum_{i=0}^{k} \Delta_{(i)} = \sigma_1^2 - \sigma_2^2 + E_{(k)} \tag{15} $$

Now the problem is whether the sequence $E_{(k)}$ ($k = 0, 1, 2, \ldots$) converges to zero. Unfortunately, there is no way to prove this mathematically because it depends on the fitting method used in step 3. Notice that if in step 3 we get an estimate of $\sigma_1^2 - \sigma_2^2$ based only on one particular frequency component, $E_{(k)}$ may diverge. Previous depth from defocus methods usually counted on a pre-selected frequency band, such as in [Pentland, 1987]; sometimes this may cause a very large error if there is not enough energy of the image content within that frequency band.

3.4  Fitting  Algorithm 

Common to any frequency analysis, we need a robust algorithm to extract $\sigma_1^2 - \sigma_2^2$ in Eq. 8 in a noisy environment. Ignoring the phase information resulting from the Gabor transform, Eq. 8 becomes:

$$ \ln\frac{|\mathcal{I}_1(f)|^2}{|\mathcal{I}_2(f)|^2} = -f^2 (\sigma_1^2 - \sigma_2^2) \tag{16} $$

Assuming an additive white noise model, we have:

$$ \ln\frac{|\mathcal{I}_1(f)|^2 + n_1}{|\mathcal{I}_2(f)|^2 + n_2} = \ln(|\mathcal{I}_1(f)|^2 + n_1) - \ln(|\mathcal{I}_2(f)|^2 + n_2), \tag{17} $$

where $n_1$ and $n_2$ are the energies of the noise. Because $\ln(x + dx) \approx \ln x + \frac{1}{x} dx$, if we assume $|\mathcal{I}_1(f)|^2 \gg n_1$ and $|\mathcal{I}_2(f)|^2 \gg n_2$, Eq. 17 can be approximated as:

$$ \ln\frac{|\mathcal{I}_1(f)|^2}{|\mathcal{I}_2(f)|^2} + \left(\frac{n_1}{|\mathcal{I}_1(f)|^2} - \frac{n_2}{|\mathcal{I}_2(f)|^2}\right) \tag{18} $$

Therefore, at each frequency, the left hand side of Eq. 16 can be approximated by dividing the corresponding spectral energies of the two images at that frequency, provided that the energy at that frequency is much larger than the energy of the noise. The deviation of this approximation can be expressed as:

(19)

where $c_n$ is a constant related to the noise energy of the camera.

Certainly, Eq. 19 is an approximation that models the error distribution as a Gaussian. As an intuition, when $|\mathcal{I}_1(f)|^2$ or $|\mathcal{I}_2(f)|^2$ is large, i.e. the energy within the frequency is high, the deviation is small, and vice versa.

From Eq. 16, we know this estimation problem is a typical linear regression problem. With the uncertainty measurement approximated in Eq. 19, $\sigma_1^2 - \sigma_2^2$ can be estimated robustly. More details about linear regression methods can be found in [Press et al., 1988].

There is one more problem that needs to be addressed. Since we obtain images under different camera configurations, the total energy within one image is different from that within the other. Usually, a brightness normalization is performed on every image before the Gabor transforms, as in [Subbarao and Wei, 1992]. But since this normalization will have different effects on the noise in the two images, which will complicate the uncertainty analysis, we prefer not to normalize the brightness in the two images. Instead we assume:

$$ I_1(x) = c_1 I_0(x) * g_{\sigma_1}(x) \tag{20} $$

$$ I_2(x) = c_2 I_0(x) * g_{\sigma_2}(x) \tag{21} $$

where $c_1$ and $c_2$ are two unknown constants. Substituting the two equations into Eq. 16, we have:

$$ \ln\frac{|\mathcal{I}_1(f)|^2}{|\mathcal{I}_2(f)|^2} = 2\ln\frac{c_1}{c_2} - f^2 (\sigma_1^2 - \sigma_2^2) \tag{22} $$

which is still a linear problem, while the uncertainty analysis still holds.

3.5 Blurring Model


Since  the  defocus  ranging  method  derives  the  depth 
instead  of  searching  for  the  depth,  it  requires  a  direct 
modeling  of  defocusing  in  terms  of  camera  param¬ 
eters  and  depth.  Previous  researchers  usually  de¬ 
rived  the  relation  among  lens  parameters,  the  depth 
and  the  blurring  radius,  such  as  in  [Pentland,  1987, 
Subbarao,  1988].  For  example,  in  [Pentland,  1987], 
by  simple  geometric  optics,  Pentland  derived  the  for¬ 
mula: 


$$D = \frac{F v_0}{v_0 - F - \sigma k f} \qquad (23)$$

where $D$ is the depth, $F$ the focal length, $f$ the f-number of the lens, $v_0$ the distance between the lens and the image plane, $\sigma$ the blurring circle radius, and $k$ a constant.


The basic limitation of this approach is that those parameters are based on the ideal thin lens model and, in fact, can never be measured precisely on any camera. We desire a function in terms of motor counts, which are measurable and controllable. For instance, if we use $m_z$ for the zoom motor count, $m_f$ for the focus motor count, and $m_a$ for the aperture motor count, we wish to get a function of the form $D = F(m_z, m_f, m_a, \sigma)$ or $\sigma = F(m_z, m_f, m_a, D)$. Due to the depth ambiguity in the former form ([Pentland, 1987], Appendix), we prefer to express the blurring radius $\sigma$ as a function of the motor counts and the depth.

From Eq. 23, we can express $\sigma$ as:

$$\sigma = \frac{v_0 - F}{k f} - \frac{F v_0}{k f} \cdot \frac{1}{D} \qquad (24)$$


Since all the lens parameters can be thought of as derived from the motor counts, we can rewrite Eq. 24 as:

$$\sigma = k_1(m_z, m_f, m_a) + \frac{k_2(m_z, m_f, m_a)}{D + k_3(m_z, m_f, m_a)} \qquad (25)$$

Notice that there is another term, $k_3$, added. Because $D$ is the distance between the lens and the object, and the position of the lens changes as the camera parameters are changed, we intend to use a fixed plane perpendicular to the optical axis as the depth reference plane, at $z = k_3$. From now on, we always refer to depth as the distance between the target and the depth reference plane.
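
As a quick consistency check on Eqs. 23 to 25, the sketch below evaluates Pentland's formula and its inverse, and shows that the inverse has the lumped form $\sigma = k_1 + k_2 / (D + k_3)$ with $k_3 = 0$. The function names and the numeric lens values are hypothetical and chosen only for illustration.

```python
def depth_from_blur(F, v0, sigma, k, f_number):
    """Eq. 23: depth D from blur radius sigma (thin-lens model)."""
    return F * v0 / (v0 - F - sigma * k * f_number)

def blur_from_depth(F, v0, D, k, f_number):
    """Eq. 23 solved for sigma."""
    return (v0 - F) / (k * f_number) - F * v0 / (k * f_number * D)

def lumped_blur(D, k1, k2, k3):
    """Lumped form of Eq. 25: sigma = k1 + k2 / (D + k3)."""
    return k1 + k2 / (D + k3)

# Hypothetical lens values, lengths in millimetres.
F, v0, k, f_number = 130.0, 138.0, 1.0, 4.7
D = 2540.0
sigma = blur_from_depth(F, v0, D, k, f_number)

# Thin-lens quantities folded into lumped coefficients.
k1 = (v0 - F) / (k * f_number)
k2 = -F * v0 / (k * f_number)

print(sigma)
print(depth_from_blur(F, v0, sigma, k, f_number))
print(lumped_blur(D, k1, k2, 0.0))
```

In practice the lumped coefficients would be calibrated directly against motor counts rather than computed from lens parameters, which is the point of the reformulation.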


Eq. 25 says that, at some point, $\sigma$ will drop to zero, which is the best focused point. But we know that we can never get a real step edge, because there is always high-frequency loss in the imaging process, which can be attributed to pixel quantization, diffraction, etc. Similarly, we can model this as a convolution with a Gaussian, independent of the geometric blurring which we already modeled. We use $k_4$ to model its width.


Since two consecutive convolutions with Gaussians are equivalent to one Gaussian convolution:

$$G_{\sigma_a}(x) * G_{\sigma_b}(x) = G_{\sqrt{\sigma_a^2 + \sigma_b^2}}(x) \qquad (26)$$

we have our final blurring model expressed as:

$$\sigma^2 = \left( k_1(m_z, m_f, m_a) + \frac{k_2(m_z, m_f, m_a)}{D + k_3(m_z, m_f, m_a)} \right)^2 + k_4^2(m_z, m_f, m_a) \qquad (27)$$

3.6 Implementation and Results
3.6.1 Simulation

Our first simulation examines how precise the estimate of $\sigma_1^2 - \sigma_2^2$ can be. We use a step function as $I_0$ and convolve it with two different Gaussians, $G_{\sigma_1}$ and $G_{\sigma_2}$. The window function is also a Gaussian, with $\sigma$ equal to three pixel widths.* From Eq. 14, the locality of the window function is about 2 pixel widths.

The result of the iterative method is illustrated in Fig. 7. We can see how poor the first estimate can be when the window function is narrow. As the iteration goes on, the estimated value converges quickly to the true value. Experimentally, the final error is lower bounded by the discretization of the functions; that is, when the $\sigma$ of the Gaussian function is too small with respect to the pixel width, the discrete Gaussian is no longer a good approximation of the real Gaussian function, and the results begin to degenerate. Generally, in the absence of noise, the estimation errors are less than 1/1000 of the true values.

* In this paper, all $\sigma$ values are in pixel widths.

Figure 8: Variation Measure and Threshold


3.6.2  Variation  Measure  and 
Thresholding 

Certainly, if there is little or no texture within the image, we cannot expect to obtain an accurate estimate of depth. Therefore, we need a measure which can discriminate image patches with enough texture from those without. Another reason why we need a variation measure is the so-called edge bleeding [Nair and Stewart, 1992].

Assuming the image patch is $f(x, y)$, we have the variation measure expressed as in Eq. 28.

9oA*,y)dxdy  (28) 

To better illustrate the relation between the variation measure and the estimate of $\sigma_1^2 - \sigma_2^2$, we demonstrate in Figure 8 the selection of the threshold used to exclude the effects of low variation content and edge bleeding.

3.6.3 Calibration of Blurring Function and Blurring Model

First, we tried to confirm our assumption that the blurring function can be approximated by a Gaussian function. Ideally, if we have a point light source, the image of this light source should be the blurring function, because

$$\delta(x) * F(x) = F(x). \qquad (29)$$

Due to the technical difficulty of obtaining an ideal point light source, a step edge image is used instead, as shown in Figure 1. Assuming the step function is $u(x)$, the image of the step edge should be

$$G_\sigma(x) * u(x) = \int_{-\infty}^{+\infty} \frac{1}{\sqrt{2\pi}\,\sigma} e^{-t^2 / (2\sigma^2)}\, u(x - t)\, dt = c_1 + c_2\, \mathrm{Erf}\!\left( \frac{x}{\sqrt{2}\,\sigma} \right) \qquad (30)$$

where Erf is the error function, and $c_1$ and $c_2$ are two constants.


Figure 9 illustrates the least-squares fitting results for a blurred step edge.
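
A fit of this kind can be sketched as follows, assuming the edge profile $c_1 + c_2\,\mathrm{Erf}(x / (\sqrt{2}\sigma))$ with the edge located at $x = 0$. For each candidate $\sigma$ the constants $c_1$ and $c_2$ enter linearly and are solved in closed form, and a golden-section search is used over $\sigma$. The search strategy, the function names, and the synthetic data are assumptions made for illustration; the source does not say how its least-squares fit was carried out.

```python
import math

def edge_profile(x, c1, c2, sigma):
    """Blurred step edge: c1 + c2 * erf(x / (sqrt(2) * sigma))."""
    return c1 + c2 * math.erf(x / (math.sqrt(2.0) * sigma))

def _linear_fit(xs, ys, sigma):
    """For a fixed sigma, least-squares solve for c1 and c2.

    Returns (sum of squared errors, c1, c2).
    """
    basis = [math.erf(x / (math.sqrt(2.0) * sigma)) for x in xs]
    n = len(xs)
    sb = sum(basis)
    sbb = sum(v * v for v in basis)
    sy = sum(ys)
    sby = sum(v * y for v, y in zip(basis, ys))
    det = n * sbb - sb * sb
    c2 = (n * sby - sb * sy) / det
    c1 = (sy - c2 * sb) / n
    sse = sum((y - c1 - c2 * v) ** 2 for v, y in zip(basis, ys))
    return sse, c1, c2

def fit_edge(xs, ys, lo=0.2, hi=20.0, iters=80):
    """Golden-section search over sigma; returns (c1, c2, sigma)."""
    g = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    c = b - g * (b - a)
    d = a + g * (b - a)
    fc = _linear_fit(xs, ys, c)[0]
    fd = _linear_fit(xs, ys, d)[0]
    for _ in range(iters):
        if fc < fd:
            # Minimum lies in [a, d]; reuse c as the new upper probe.
            b, d, fd = d, c, fc
            c = b - g * (b - a)
            fc = _linear_fit(xs, ys, c)[0]
        else:
            # Minimum lies in [c, b]; reuse d as the new lower probe.
            a, c, fc = c, d, fd
            d = a + g * (b - a)
            fd = _linear_fit(xs, ys, d)[0]
    sigma = 0.5 * (a + b)
    _, c1, c2 = _linear_fit(xs, ys, sigma)
    return c1, c2, sigma

# Synthetic edge (hypothetical values): sigma in pixel widths.
xs = [float(x) for x in range(-15, 16)]
ys = [edge_profile(x, 100.0, 50.0, 3.0) for x in xs]
c1_fit, c2_fit, sigma_fit = fit_edge(xs, ys)
print(c1_fit, c2_fit, sigma_fit)
```

The search assumes the residual has a single minimum in $\sigma$ over the bracket, which is reasonable for a clean edge but should be checked on noisy data.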


The coefficients $k_1$, $k_2$, $k_3$, $k_4$ in Eq. 27 are constants when the motors are fixed. We can move the calibration target to four different places and assume that, at the first place, the depth of the target is zero (note that the depth is with respect to the reference plane); we then have four non-linear equations with four unknowns. To suppress noise, we can measure at more than four places and fit the blurring model to the results.

Using the rail table in CIL ([Willson and Shafer, 1992]), the whole calibration process can be controlled by the computer. The target moves from about 1 meter to about 3 meters from the camera, and the blurred edges are fed to the least-squares fitting. The resulting $\sigma$'s are, in turn, fitted against the model expressed in Eq. 27.

Experiments have shown results very consistent with the model, as illustrated in Figure 10. The target is moved from far to near; at the furthest distance from the camera, the rail motor position is zero. As the target moves through the whole range of the rail, the blurring circle first becomes smaller and smaller; then, after a point, it becomes larger and larger. It is very clear that this effect can be well modeled by Eq. 27.

Figure 10: Blurring Model

Figure 13: $\sigma^2$-Map Recovery Without Maximal Resemblance Estimation


3.6.4 $\sigma^2$-Map and Shape Recovery

The first step toward a dense depth map is to compute $\sigma_1^2 - \sigma_2^2$ (without loss of generality we assume $\sigma_1 > \sigma_2$) for every pixel, using the maximal resemblance estimation. In Figure 11, we bent a sheet of paper in different directions by about 1.0 inch and took images. The target is about 100 inches away from the camera. The focal length is 130 mm; the f-number is f/4.7 for (a) and (c), and f/8.1 for (b) and (d).

Then we recover the $\sigma^2$-map for those two objects. The rectangle in Figure 11 (a) is the area for the $\sigma^2$-map. The $\sigma$ for the Gabor transform is 5.0 pixel widths. Figure 12 shows the $\sigma^2$-map recovery based on the images in Figure 11. The holes within the $\sigma^2$-maps are those patches without enough texture.

Compared with the $\sigma^2$-map recovery without iterative maximal resemblance estimation, shown in Figure 13, we can see that the results without iteration are much noisier.

Figure 11: Pictures of Different Objects. (a) Concave Object Image No. 1; (b) Concave Object Image No. 2; (c) Convex Object Image No. 1; (d) Convex Object Image No. 2

Figure 12: $\sigma^2$-Map Recovery. (a) Concave Object; (b) Convex Object

Figure 14: Shape Recovery for the Convex Object

With the $\sigma^2$-map recovered and the coefficients in Eq. 27 calibrated with respect to the two camera configurations, depth map recovery is straightforward: Brent's method [Press et al., 1988] is used to numerically solve the nonlinear equation. Figure 14 shows the depth map (in inches) of the convex object in Figure 11 (c) and (d), with respect to the depth reference plane, which is behind the object. A conservative estimate of the relative depth error is 1/200 when the target is 100 inches away.
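
The depth recovery step can be sketched as a one-dimensional root-finding problem. The sketch assumes the blurring model has the form $\sigma^2(D) = (k_1 + k_2 / (D + k_3))^2 + k_4^2$ for each camera configuration, and it substitutes plain bisection for Brent's method to stay dependency-free. The coefficient values, function names, and bracket are hypothetical.

```python
def blur_squared(D, coeffs):
    """Assumed blurring model: sigma^2 = (k1 + k2/(D + k3))^2 + k4^2."""
    k1, k2, k3, k4 = coeffs
    return (k1 + k2 / (D + k3)) ** 2 + k4 ** 2

def depth_from_sigma_diff(diff, coeffs_a, coeffs_b, lo, hi, tol=1e-9):
    """Solve blur_squared(D, a) - blur_squared(D, b) = diff by bisection.

    The root must be bracketed by [lo, hi]; bisection stands in for
    Brent's method here.
    """
    def g(D):
        return blur_squared(D, coeffs_a) - blur_squared(D, coeffs_b) - diff

    g_lo, g_hi = g(lo), g(hi)
    if g_lo * g_hi > 0.0:
        raise ValueError("root is not bracketed by [lo, hi]")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        g_mid = g(mid)
        if g_lo * g_mid <= 0.0:
            hi = mid                 # sign change in [lo, mid]
        else:
            lo, g_lo = mid, g_mid    # sign change in [mid, hi]
    return 0.5 * (lo + hi)

# Hypothetical calibrated coefficients for two camera configurations.
COEFFS_A = (4.0, -200.0, 50.0, 1.0)
COEFFS_B = (2.0, -100.0, 50.0, 1.0)

true_depth = 40.0
measured = blur_squared(true_depth, COEFFS_A) - blur_squared(true_depth, COEFFS_B)
recovered = depth_from_sigma_diff(measured, COEFFS_A, COEFFS_B, lo=1.0, hi=100.0)
print(recovered)
```

The bracket must be chosen on one side of the best-focused point, since the difference of squared blur radii is not monotonic across it.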

4  Summary 

In  summary,  we  have  described  two  sources  of  depth 
information — depth  from  focusing  and  depth  from 
defocusing — separately.  In  depth  from  focusing,  we 
pursued  high  accuracy  from  both  the  software  and 
hardware  directions,  and  experiments  proved  that  a 
great  improvement  was  obtained.  In  depth  from  de¬ 
focusing,  we  re-examined  the  whole  underlying  the¬ 
ory,  from  signal  processing  to  camera  calibration, 
and  established  a  new  computational  model,  which 
has  been  successfully  demonstrated  on  real  images. 

The significance of this work is two-fold. First, there are few previous reports on shape from defocusing or focusing, and the main reason for the inefficacy of shape from defocusing or focusing is its low precision. As demonstrated in this paper, the improvements in the precision of focus ranging and defocus ranging can lead to efficient shape recovery methods. Second, it has been shown that the maximal resemblance method proposed in this paper is capable of preserving depth locality, which is also essential for obtaining dense depth maps.

Abstract: We have built a range-image sensor that acquires a complete 28 × 32 range frame in as little as one millisecond. Using VLSI, sensing and processing are combined into a unique sensing element that measures range in a fully-parallel fashion. The accuracy and repeatability of the sensed data is 0.1% or better. In this paper, we review the cell-parallel method used, describe our VLSI implementation, outline procedures for calibrating the cell-parallel sensor and present some experimental results. We conclude by describing a second-generation range sensor integrated circuit which is now being tested.

I.  INTRODUCTION 

A cell-parallel implementation greatly improves the performance of a light-stripe range-imaging sensor [1, 2, 3]. Though
equivalent  to  conventional  light-striping  from  optical  and  ge¬ 
ometrical  standpoints,  cell-parallel  light-stripe  sensors  incor¬ 
porate  a  fundamental  improvement  in  the  range  measurement 
process.  As  a  result,  the  acquired  range  data  is  more  robust  and 
more  accurate.  Furthermore,  range  image  acquisition  time  is 
made  independent  of  the  number  of  data  points  in  each  frame. 
By  fully  exploiting  the  capability  of  VLSI  to  both  sense  and 
process  information,  we  have  built  a  smart  sensor  that  acquires 
a  complete  frame  of  10-bit  range  image  data  in  a  millisecond. 

II.  A  CELL-PARALLEL  APPROACH  TO  LIGHT-STRIPE 
RANGE  IMAGING 

Range information is crucial to many robotic applications. A range image is a 2-D array of pixels, each of which represents the distance to a point in the imaged scene. Many techniques for the direct measurement of range images have been developed [4]. Of these, the light-stripe methods have proven to be among the most robust and practical.

Fig. 1 illustrates the principle on which a light-stripe sensor is based. The scene to be imaged is lit by a stripe, that is, a plane of light formed by fanning a collimated source in one dimension. The stripe is projected in a known direction using a precisely controlled mirror. When viewed by an imaging sensor, it appears as a contour which follows the profile of objects. The shape of this contour encodes range information. In particular, if projector and imaging sensor geometry are known, the distance to every point lit by the stripe can be determined via triangulation.

A  conventional  light-stripe  range  sensor  builds  a  range  image 
using  a  “step-and-repeat”  procedure.  A  stripe  is  projected  onto 
a  scene,  as  described  above,  and  one  column  of  range  image 
data  is  measured.  The  stripe  is  stepped  to  a  new  position  and 
the  process  is  repeated  until  the  entire  scene  has  been  scanned. 

Unfortunately,  step-and-repeat  implementations  are  slow.  In 
order  to  build  a  complete  range  image  using  data  from  N  stripe 
positions,  N  intensity  images  are  required.  The  total  time 

The cell-parallel technique is an elegant modification of the basic light-stripe algorithm. The technique is a dynamic one, with time an important aspect of the range measurement process [6].

Consider the geometry of a three-pixel, single-row cell-parallel range sensor, seen from above in Fig. 2. In the figure, the stripe plane is perpendicular to the page. The stripe is quickly swept across the scene from right to left, briefly illuminating object features. A sensing element, say S2, monitors the light intensity I2 returned to it along a fixed line-of-sight ray R2. When the position of the stripe is such that it intersects R2 at a point on the surface of an object, a "flash" will be observed by the sensing element.

Range to the object is measured by recording the time $t_2$ at which the flash is seen. The location of the stripe as a function of time is known because its projection angle $\theta_L(t)$ is controlled by the system. The "time-stamp" $t_2$ acquired by the sensing element measures the position of the stripe when its light is reflected back to the sensor. The three-dimensional coordinates of one object point are uniquely determined at the intersection of the line-of-sight ray R2 with the stripe plane at $\theta_L(t_2)$ on the surface of the object.

A  sensor  which  collects  a  dense  range  image  is  formed  by 
arranging  identical  sensing  elements  into  a  two-dimensional 
array.  The  cells  of  the  array  work  in  parallel,  gathering  a 
range  image  during  a  single  pass  of  the  light  stripe.  The  time 


The  frame  time  of  a  cell-parallel  sensor  is  set  by  the 
bandwidth  of  the  photo-receptor  used  in  its  sensing  elements. 
Very  high  frame  rates  ( 1  can  be  achieved.  The  photodi¬ 

odes  used  in  our  cell  design  have  bandwidth  into  the  megahertz. 
They  can  detect  a  stripe  moving  at  angular  velocities  in  excess 
of  6,000 rpm. 

B.  Cell-Parallel  System  Geometry 

Cell-parallel system geometry can be described using homogeneous coordinate transformations [7, 8]. Referring to Fig. 3, the origin of the frame $O_S$ is placed at the optical center of the imager. The stripe is a half-plane which radiates out from an axis of rotation aligned with the y-axis of the frame and passing through the point

$$\mathbf{x}_L = [\, b \;\; 0 \;\; 0 \;\; 1 \,]. \qquad (3)$$

Stripe rotation $\theta_L$ is measured counter-clockwise about its axis when viewed from the positive y direction and is defined to be zero when the stripe lies in the yz-plane. In a homogeneous representation, a plane is described in terms of a column vector $P$ that satisfies the scalar product $\mathbf{x} P = 0$, where $\mathbf{x}$ is a homogeneous point that lies in $P$. In the sensor coordinate frame defined above, the stripe plane is modeled in terms of $b$ and $\theta_L$ as

$$P_L = [\, -\cos\theta_L \;\; 0 \;\; \sin\theta_L \;\; b\cos\theta_L \,]^T \qquad (4)$$

Fig. 4. Basic sensing element block diagram.


The position $\mathbf{x}_S = (x_S, y_S, z_S)$ of a sensing element on the sensor image plane defines the line-of-sight ray $R_S$. The parametric equation for a line in three dimensions is used to represent $R_S$ as

$$\mathbf{x} = \frac{r}{r_S} (\mathbf{x}_S - O_S) + O_S \qquad (5)$$

where $r_S = \|\mathbf{x}_S\| = \sqrt{x_S^2 + y_S^2 + z_S^2}$. Because the parameter $r$ is normalized by $r_S$, it is simply the distance along $R_S$ measured from $O_S$ heading toward the object.

The point of intersection $\mathbf{x}_O$ between the stripe and the line-of-sight is found by solving $\mathbf{x} P_L = 0$ for $r$:

$$r = \frac{b\, r_S}{x_S - z_S \tan\theta_L} \qquad (6)$$

In the coordinate frame of the sensor, this point is

$$\mathbf{x}_O = \left[\, \frac{r}{r_S} x_S \;\; \frac{r}{r_S} y_S \;\; \frac{r}{r_S} z_S \;\; 1 \,\right]. \qquad (7)$$

Thus, the 3-D position $\mathbf{x}_O$ of imaged object points can be recovered from the scalar distance measurement $r$.
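
The triangulation above can be sketched in a few lines. The sketch assumes the intersection formula $r = b\, r_S / (x_S - z_S \tan\theta_L)$, which follows from the stripe plane of Eq. 4 and a line-of-sight ray through the origin, and it assumes a stripe swept at a constant angular rate so that a time-stamp maps linearly to $\theta_L$. The function names, the sweep model, and all numeric values are hypothetical.

```python
import math

def stripe_angle(t, theta_start, omega):
    """Assumed constant-rate sweep: stripe angle at time-stamp t."""
    return theta_start + omega * t

def intersect(xs, ys, zs, b, theta_l):
    """Intersect the line-of-sight through (xs, ys, zs) with the stripe.

    Returns (r, point), where r is the range along the ray from the
    optical center and point is the 3-D object position.
    """
    rs = math.sqrt(xs * xs + ys * ys + zs * zs)
    r = b * rs / (xs - zs * math.tan(theta_l))
    scale = r / rs
    return r, (scale * xs, scale * ys, scale * zs)

# Hypothetical check: place an object point, find the stripe angle that
# lights it, then recover the point from one cell's line of sight.
b = 0.3                                   # baseline in metres
target = (0.1, 0.05, 1.0)                 # object point in the sensor frame
theta = math.atan2(target[0] - b, target[2])
cell = tuple(0.0125 * c for c in target)  # a point on the same ray
r, point = intersect(cell[0], cell[1], cell[2], b, theta)
print(r, point)
```

Only the time-stamp varies per measurement; the cell position and baseline are fixed by calibration, which is why a single scalar per cell is enough to produce a range image.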


III. VLSI RANGE SENSOR

A practical implementation of the cell-parallel range imaging algorithm requires a smart sensor, one in which optical sensing is local to the required processing. Silicon VLSI technology provided the means for building such a sensor.

Fig. 4 summarizes the operation of elements in the smart cell-parallel sensor array. Functionally, each must convert light energy into an analog voltage, determine the time at which the voltage peaks, and remember the time at which the peak occurred.

A. A 28 × 32 Cell-Parallel Sensor Chip

The multi-pixel cell-parallel range sensor we have developed is shown in Fig. 5. This chip consists of 896 sensing elements arranged in a 28 × 32 array. It was fabricated using a 2 µm p-well CMOS, double-metal, double-poly process and measures 9.2 mm × 7.9 mm (width × height). Of the total 73 mm² chip area, the sensing element array takes up 59 mm², read-out column-select circuitry 0.37 mm², and the output integrator 0.06 mm². The remaining 14 mm² is used for power bussing, signal wiring, and die pad sites.

Fig. 6. Sensing element circuitry.



B.  Sensing  Element  Design 

The architecture chosen for the range sensing elements is shown in Fig. 6. Areas of interest in the diagram include the photo-receptor (PDiode), the photo-current transimpedance amplifier (PhotoAmp), threshold comparison stage (n2Comp), stripe event memory (RS Flop), time-stamp track-and-hold circuitry (PGatel/CCell) and cell read-out logic (PGateO/TokenCell).

In operation, sensing elements cycle between two phases: acquisition and read out.

Fig. 7. Non-linear transimpedance amplifier.

Fig. 8. The cell-parallel range-finding system.

During  the  acquisition  phase,  each  sensing  element  imple¬ 
ments  the  cell-parallel  procedure  of  Fig.  4.  The  photodiode 
within  a  cell  monitors  light  energy  reflected  back  from  the 
scene.  Photocurrent  output  is  amplified  and  continuously  com¬ 
pared  to  an  external  threshold  voltage  Vth.  When  photorecep¬ 
tor  output  exceeds  this  threshold,  the  “stripe-detected”  latch  in 
the  cell  is  tripped.  The  value  of  the  time-stamp  voltage  at  that 
instant  is  held  on  the  capacitor  CCell,  recording  the  time  of 
the  stripe  detection. 
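
The acquisition behaviour of a sensing element can be mimicked in software. The following is a minimal sketch, assuming discrete time samples, an ideal comparator, and a linear 0-5 V time-stamp ramp shared by all cells; the function name and the sample data are hypothetical and do not model the analog circuitry.

```python
def acquire_timestamps(photo_signals, sawtooth, v_th):
    """Simulate one acquisition phase for an array of cells.

    photo_signals : per-cell lists of amplified photo-receptor samples
    sawtooth      : time-stamp ramp voltage, one value per sample
    v_th          : stripe-detection threshold voltage
    Returns the held time-stamp voltage per cell, or None when the
    stripe was never detected by that cell.
    """
    held = []
    for signal in photo_signals:
        value = None
        for v_photo, v_time in zip(signal, sawtooth):
            if v_photo > v_th:
                value = v_time   # latch trips once; ramp value is held
                break
        held.append(value)
    return held

# Hypothetical scan: 101 samples, ramp from 0 V to 5 V.
N = 101
sawtooth = [5.0 * i / (N - 1) for i in range(N)]
cell_a = [0.1] * N
cell_a[40] = 1.0                 # stripe flash seen at sample 40
cell_b = [0.1] * N               # this cell never sees the stripe
held = acquire_timestamps([cell_a, cell_b], sawtooth, v_th=0.5)
print(held)
```

Each cell keeps only the first threshold crossing, which mirrors the latch-and-hold behaviour described above and lets every cell record its time-stamp independently during a single sweep.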

The acquisition phase is synchronized with stripe motion and ends when the stripe completes its scan. At that time, the array sensing elements have recorded a range image in the form of held time-stamp values. This raw range data must now be read from the chip.

A time-multiplexed read-out scheme offloads range image data in raster order through a single chip pin. One bit of token state is passed through the sensing element array, selecting cells for output. Dual n/p-transistor pass gate structures are used throughout the time-stamp data path. They permit the use of rail-to-rail time-stamp voltages, maximizing the dynamic range of the analog time-stamp data.

C.  Stripe  Detection 

One  of  the  more  challenging  aspects  of  the  cell  design  in¬ 
volved  the  circuitry  which  detected  the  stripe. 

A  photodiode  forms  the  light  sensitive  area  within  each  cell. 
This  diode  is  a  vertical  structure,  built  using  the  n-substrate 
as  the  cathode  and  the  p-well  of  the  CMOS  process  as  the 
anode.  An  additional  implant,  driven  into  the  well,  reduces 
the  surface  resistivity  of  the  anode  and  increases  the  device 
bandwidth. 

The  non-linear  transimpedance  amplifier  of  Fig.  7  was  a  key 
element  of  the  sensor  cell  design.  Reflected  light  from  the 
swept  stripe  source  generates  nano-amp  photo-current  pulses 
and  thus  a  very  high-gain  amplifier  is  required  to  convert  this 
current  into  a  usable  voltage.  In  addition,  very  little  die  area 


could  be  devoted  to  photo-current  amplification  if  cell  area 
was  to  be  kept  small.  The  three  transistor  amplifier  design 
of  Fig.  7  satisfies  both  requirements.  Its  logarithmic  transfer 
characteristic  provides  freedom  from  output  saturation  even 
when  input  light  levels  vary  over  several  orders  of  magnitude. 
The  output  rise-time  of  photodiode/amplifier  test  structures  in 
response  to  a  stripe  was  measured  to  be  a  few  microseconds. 

D.  Analog  Signal  Processing 

Analog signal processing techniques played an important role in the design of this smart sensor. As shown in Fig. 6, sensing elements use analog circuitry to amplify the photo-current, to detect the stripe and to record the per-cell time-stamp information. Stripe timing is represented in analog form as a 0-5 V sawtooth broadcast to all cells of the array. This allowed the time-stamp value to be stored as charge on the 1 pF capacitor within each cell. The digital equivalent of latching a count into a multi-bit register would be significantly larger in area and would require that the digital time-stamp counters run during the acquisition phase. Thus, analog processing kept cell area small and minimized digital switching noise during photo-current measurements in the acquisition phase.

IV.  PROTOTYPE  RANGE  IMAGE  SENSOR 

The  28  X  32  element  VLSI  sensor  prototype  described  in  the 
previous  section  was  incorporated  into  the  light-stripe  range 
system  shown  in  Fig.  8.  System  components  visible  in  the  pho¬ 
tograph  include  (from  the  left)  the  stripe  generation  assembly, 
the  VLSI  sensor  chip  and  its  interface  electronics,  a  calibration 
target  and  the  3-DOF  positioning  system.  Table  I  provides 
details  of  the  configuration  shown. 

V.  CELL-PARALLEL  SENSOR  CALIBRATION 

Calibration provides the complete specification of system geometry necessary for converting cell time-stamp data into range images. Two sets of calibration parameters must be measured. First, 3-D sensor chip geometry and optical parameters must be measured (the imager model). Next, a mapping between time-stamp values $\theta_S$ and distance $r$ for all sensing elements is developed (the stripe model).

TABLE I
Cell-Parallel Sensor System Summary

Baseline              300 mm
Laser Source          Laser Diode (Collimated)
  Wavelength          780 nm
  Output Power        30 mW
  Stripe Width        1 mm
  Stripe Spread       40° (3 dB)
Sweep Assembly        Rotating Mirror
  Sweep Angle         40°
Sensor Optics         1/2"-Format CCD Zoom Lens
  Focal Length        12.5 to 75 mm
  f-number            f/1.8
A/D Precision         12 bits

A.  Imager  Model  Calibration 

This method measures component model geometry using reference objects, manipulated in the sensor's field of view with an accurate 3-DOF (degree of freedom) positioning device. The following two-step procedure is used (Fig. 3):

•  the line-of-sight rays $R_S$ for a few cells are measured, and

•  a pinhole-camera model is fit to the measured line-of-sight rays in order to approximate the lines of sight for all sensing elements.

Fig. 10. Cell (13,15) measured line of sight.

A planar target out of which a triangular hole has been cut, as shown in Fig. 9, is used to map out sensing element line-of-sight rays. The target is mounted on the positioner so that its surface is parallel to the world-xy plane.

A single 3-D point on the line-of-sight of a particular sensing element is found as follows. The target is moved to some z-position in world coordinates and held. The bottom edge of the triangular hole is located by moving the target around in x and y as indicated in Fig. 9. When a small motion in either x or y causes a large change in the time-stamp value reported by the cell, occlusion of the line-of-sight at an edge of the triangular cut is indicated.

Once many points along the bottom edge are located, a line, known to lie in the plane of the target, is fit. The location of the top edge is found in a similar fashion. The intersection of the top and bottom edge lines defines one 3-D point that lies on the cell's line-of-sight. A number of these points are located by moving the target in z and repeating the process. The line-of-sight for a single cell can then be identified by fitting a 3-D line to these points. Experimental data from the calibration of one sensing element's line-of-sight is shown in Fig. 10.
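
The final line fit can be sketched as two ordinary least-squares regressions. The sketch assumes the z positions set by the positioner are exact and that measurement error lies only in x and y, so the line is parameterized as $x = a_x z + b_x$, $y = a_y z + b_y$. This is a simplification chosen for illustration; the source does not state which 3-D line fit was used, and the function name and data are hypothetical.

```python
def fit_line_of_sight(points):
    """Fit x = ax*z + bx and y = ay*z + by to 3-D points (x, y, z).

    Returns ((ax, bx), (ay, by)). Requires at least two distinct z values.
    """
    n = len(points)
    sz = sum(p[2] for p in points)
    szz = sum(p[2] * p[2] for p in points)
    det = n * szz - sz * sz
    coeffs = []
    for axis in (0, 1):
        sv = sum(p[axis] for p in points)
        szv = sum(p[2] * p[axis] for p in points)
        slope = (n * szv - sz * sv) / det
        offset = (sv - slope * sz) / n
        coeffs.append((slope, offset))
    return coeffs[0], coeffs[1]

# Hypothetical calibration points taken at four target z positions (metres).
zs = [0.5, 1.0, 1.5, 2.0]
points = [(0.1 * z + 0.02, -0.05 * z + 0.01, z) for z in zs]
(x_slope, x_offset), (y_slope, y_offset) = fit_line_of_sight(points)
print(x_slope, x_offset, y_slope, y_offset)
```

With noisy measurements, more z positions improve the fit; the slopes give the ray direction and the offsets give where the ray crosses the plane z = 0.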

Mapping the line-of-sight rays for all 896 sensing elements in this manner is too time consuming. In practice, line-of-sight information is measured for 25 cells, evenly spaced in a 5 × 5 grid. The geometry of the remaining cells is approximated using a pinhole-camera model.

The pinhole-camera model [11] constrains all sensing element line-of-sight rays to pass through a single point, the focus of expansion at the optical center of the camera. Fig. 11 graphically illustrates the process. Sensing element locations are assumed to lie in some sensor plane, at locations evenly spaced in a 2-D grid on the plane. Eleven model parameters must be determined that identify the transformation matrix $T_{SW}$ and the geometry of the sensor plane. A least-squares procedure is used to fit pinhole-model parameters to the line-of-sight information measured in the first calibration step. Imager model geometry is now fully calibrated.

B.  Advanced  Imager  Model  Calibration 

Unfortunately,  calibration  of  the  imager  model  via  line-of- 
sight  measurement  is  not  suitable  for  use  outside  of  the  labo¬ 
ratory  environment.  “One-at-a-time”  measurement  of  sensing 
element  geometry,  as  outlined  above,  is  slow  and  cumbersome. 

We  are  developing  a  faster,  more  precise  method  for  imager 
model  calibration.  In  this  new  calibration  method,  the  3-DOF 
positioning  system  is  replaced  with  a  liquid  crystal  display 
(LCD)  mask  that  need  only  be  accurately  positioned  along  one 
degree  of  freedom.  The  LCD  mask  is  used  to  define  precise 
black-and-white  images  that  are  “seen”  by  the  range  sensor. 
The  method  relies  on  intensity  image  information,  measuring 
geometry  through  analysis  of  reference  object  images[9]. 

The LCD mask is placed between a diffuse planar target and the sensor chip at a known position and is backlit by shining the system stripe source on the planar target. The pattern displayed on the LCD forms a black-and-white image on the sensor. Only illuminated sensing elements will latch the stripe-detected condition (Section III-B). A single-bit intensity image is derived by identifying the time-stamp output of illuminated sensing elements.

Sensing  element  line-of-sight  geometry  is  found  by  varying 
the  LCD  mask  pattern  in  a  controlled  fashion.  For  example,  a 
circular  pattern,  whose  3-D  center  is  known,  can  be  projected. 
A calibration point is found by measuring the 2-D location of
this circle’s center in the intensity image returned by the sensor.
Additional calibration data is measured by varying the position
of the circle on the LCD mask and the position of the LCD
along zs. Also, by measuring the center for different radii of the
circle at a fixed position, we can compensate for the low spatial
resolution of the current sensor. The new sensor chip design,
discussed in Section VII, returns multi-bit intensity image data
which further assists imager geometry calibration.

Fig. 12. Time-stamp calibration.
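
As an illustration of how a calibration point might be extracted
from the single-bit intensity image, the following Python sketch
computes the centroid of the illuminated sensing elements and
averages the estimate over several circle radii. It is a minimal
sketch under stated assumptions, not the procedure used with the
sensor: the function names, the NumPy dependency, and the
synthetic 28 x 32 test images are all hypothetical.

```python
import numpy as np

def circle_center(binary_image):
    """Centroid (row, col) of the illuminated cells in a single-bit image."""
    rows, cols = np.nonzero(binary_image)
    if rows.size == 0:
        raise ValueError("no illuminated sensing elements")
    return float(rows.mean()), float(cols.mean())

def synthetic_disk(center, radius, shape=(28, 32)):
    """Hypothetical stand-in for the image of one LCD circle pattern."""
    rr, cc = np.mgrid[0:shape[0], 0:shape[1]]
    return (rr - center[0]) ** 2 + (cc - center[1]) ** 2 <= radius ** 2

def averaged_center(images):
    """Average the centroid over images of circles with different radii."""
    centers = np.array([circle_center(img) for img in images])
    return tuple(centers.mean(axis=0))

# A disk centered exactly on a cell gives an exact centroid.
exact = circle_center(synthetic_disk((13.0, 15.0), 5.0))

# A disk centered between cells: average over several radii.
images = [synthetic_disk((13.3, 15.6), r) for r in (4.0, 5.0, 6.0, 7.0, 8.0)]
estimate = averaged_center(images)
```

Averaging over radii is one plausible way to reduce the
quantization error of a low-resolution array; other estimators
(for example, fitting a circle to the boundary cells) would serve
the same purpose.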

Use  of  the  LCD  mask  significantly  reduces  the  time  required 
to  perform  imager-model  calibration.  In  the  previous  method, 
two  edges  of  a  triangular  hole  had  to  be  mapped  out,  via  accurate 
back-and-forth  movement,  in  order  to  yield  a  single  calibration 
point.  In  the  new  method,  one  calibration  point  is  measured 
from  a  single  LCD-generated  pattern  without  mechanical  X-Y 
movement.  Precise  calibration  of  the  low-spatial  resolution 
range  sensor  is  possible  because  high-precision  patterns  are 
generated  by  the  LCD  mask. 

The  use  of  an  LCD  mask  to  project  precise  2-D  patterns 
has  application  beyond  the  calibration  of  our  light-stripe  range 
sensor.  For  example,  this  technique  could  be  used  to  assist 
more  traditional  camera  calibration  procedures  or  to  present 
training  data  to  image-based  neural  net  systems.  LCD  displays 
have  several  advantages  over  CRT  displays  for  applications 
like  these  —  they  are  fast,  they  are  static  (not  refreshed),  and 
they  form  images  which  are  stable  and  well  defined. 

C.  Stripe  Model  Calibration 

The second part of the calibration procedure determines the
mapping between time-stamp data and range along all sensing
element line-of-sight rays. As shown in Fig. 12, a planar target
with no hole replaces the target used in step one. The new
target is held at a known world-z position, parallel to the xy
plane, and time-stamp readings θs from all sensors are recorded.
This process is repeated for many z positions. Using this
information, the function which maps cell time-stamp values
θs into line-of-sight distance r for each sensing element is
approximated by fitting a parabola to each. Experimental data,
showing the fitted r versus θs functions for several sensing
elements, is shown in Fig. 13. Calibration of the cell-parallel
range sensor is now complete.

Fig. 13. Time-stamp calibration result.

Fig. 14. Cell (13,15) range-data histograms.
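
The per-element parabola fit described above can be sketched in
Python as follows. This is an illustrative sketch only: the use of
NumPy's least-squares polynomial fit, the function names, and the
synthetic calibration data are assumptions and are not taken from
the sensor software.

```python
import numpy as np

def fit_stripe_model(timestamps, ranges):
    """Least-squares parabola r = a*t**2 + b*t + c for one sensing element."""
    return np.polyfit(timestamps, ranges, deg=2)

def timestamp_to_range(coeffs, timestamp):
    """Evaluate the fitted parabola at a new time-stamp reading."""
    return float(np.polyval(coeffs, timestamp))

# Hypothetical calibration data for a single cell: time-stamp readings
# recorded with the planar target held at a series of known positions.
true_coeffs = np.array([2.0e-5, -0.12, 520.0])   # made-up values
t_cal = np.linspace(100.0, 1000.0, 15)
r_cal = np.polyval(true_coeffs, t_cal)

coeffs = fit_stripe_model(t_cal, r_cal)
r_est = timestamp_to_range(coeffs, 550.0)
r_ref = float(np.polyval(true_coeffs, 550.0))
```

In a full calibration one such coefficient triple would be stored
for every cell of the array and applied to each time-stamp read
out during ranging.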


VI.  SYSTEM  PERFORMANCE 

A. Range Accuracy and Repeatability

The quality of the range data produced by the cell-parallel
range sensor was measured by holding a planar target at a
known world-z position with the 3-DOF positioning device. In
the experimental setup, the world-z axis heads almost directly
toward the sensor with the zw = 0 point roughly 500 mm away.
Analog time-stamp values from the sensor array were digitized,
using a 12-bit analog-to-digital converter (A/D), and recorded
for 1,000 trials. Light-stripe sweep (acquisition phase) time
for each scan was 3 msec.

A  histogram  of  the  range  data  reported  by  one  cell  is  plotted 
in  Fig.  14.  The  horizontal  axis  represents  the  digitized  time- 
stamp  value,  converted  to  world-z  distance  via  the  calibration 
model.  Data  for  six  world-z  positions  are  combined  in  this 
plot. The vertical axis shows the number of times (plotted
logarithmically), out of the 1,000 trials, that the sensing element
reported that world-z distance. The sharpness of each
peak  is  an  indication  of  the  stability  (repeatability)  of  the  range 
measurements. 

Averaged  statistical  data  for  25  evenly-spaced  sensing  ele¬ 
ments  is  plotted  in  Fig.  15.  In  order  to  measure  accuracy  and 
repeatability,  the  position  of  the  target,  as  reported  by  the  cell- 
parallel  sensor,  is  compared  to  the  actual  target  z  position.  The 
“boxed” points in the plot represent the mean absolute error,
expressed as a fraction of the world-z position and averaged
for the 25 elements at zw. One standard deviation of “spread”,
also normalized with zw, is shown above and below each
box.

The experiments show the mean measured range value to be
within 0.5 mm at the maximum 500 mm z, an accuracy of
0.1%. The aggregate distance discrepancy between world and
measured range values remains less than 0.5 mm over the entire
360 mm to 500 mm z range. The cell-parallel sensor repeatability
is found by computing the standard deviation of the distance
measurements. The measured repeatability of histogram data
is less than 0.5 mm, or 0.1% at the maximum 500 mm positioner
translation. The 0.5 mm repeatability decreases with
the distance to the sensor, essentially with the slope of the
time-stamp to distance mapping function (Fig. 13).

Fig. 15. Range data accuracy and repeatability.
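
The two statistics used above can be computed from repeated
readings of one sensing element as in the following Python
sketch. The data are synthetic and the function name is
hypothetical; the sketch assumes that accuracy is taken as the
absolute difference between the mean reading and the true
position, and repeatability as the sample standard deviation.

```python
import numpy as np

def accuracy_and_repeatability(measured_z, true_z):
    """Return (abs. error of the mean, std. deviation, error as fraction of z)."""
    measured_z = np.asarray(measured_z, dtype=float)
    abs_error = abs(measured_z.mean() - true_z)
    repeatability = measured_z.std(ddof=1)
    return abs_error, repeatability, abs_error / true_z

# Synthetic stand-in for 1,000 trials at a 500 mm target position,
# with an assumed 0.2 mm reading noise (not measured data).
rng = np.random.default_rng(0)
readings = 500.0 + rng.normal(0.0, 0.2, size=1000)

abs_error, repeatability, fractional_error = accuracy_and_repeatability(readings, 500.0)
```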


B.  Range  Image  Acquisition 

Fig.  16  shows  a  wire-frame  representation  of  one  28  x  32 
range  image  produced  by  the  sensor.  The  imaged  object  is  the 
cup  shown  in  the  figure,  approximately  80  mm  in  diameter  at  its 
opening  and  80  mm  high.  The  range  sensor  is  looking  directly  at 
the  object  from  a  distance  of 500  mm.  The  viewpoint  of  the  plot 
is  at  a  point  directly  above  the  optical  center  of  the  sensor.  The 
complete  range  image  was  acquired  during  a  3  msec  stripe  scan. 
The  intersection  points  of  the  wire-frame  plot  are  positioned  on 
cell  line-of-sight  rays  at  the  measured  distance  along  the  ray 
and  the  focus  of  expansion  is  located  in  front  of  the  cup.  Thus, 
the  smaller  “squares”  represent  object  surface  patches  closer 
to  the  sensor.  This  is  opposite  the  manner  in  which  straight 
perspective  would  make  an  object  with  a  grid  painted  on  it 
appear,  and  at  first  glance  gives  the  false  impression  that  the 
“mold”  used  to  make  the  cup  has  been  imaged. 

The  curved  smooth  front  surface  of  the  object  is  clearly 
visible  in  the  range  data.  The  20  mm  handle  of  the  cup  is 
readily distinguished, as is the planar background behind the
cup.  The  curved  surface  of  the  object  halfway  down  the  cup 
directly  across  from  the  bottom  of  its  handle  includes  a  slight 
shift  of  the  wire-frame.  The  imaged  cup  is  slightly  narrower  at 
its  base  by  about  2  mm.  The  cell-parallel  sensor  is  measuring 
this  small  3-D  feature  at  the  500  mm  object  distance. 


C.  Sensor  Performance  Summary 

A  summary  of  the  cell-parallel  sensor  system  performance 
is  given  in  Table  II. 


Fig. 16. Range data wire frame.


TABLE II

Cell-Parallel Sensor Performance Summary

Spatial Resolution    28 x 32
Frame Time            Up to 1 msec
Operating Distance    350 to 500 mm
Accuracy              < 0.5 mm
Repeatability         < 0.5 mm


Fig. 17. Second-generation range sensor integrated circuit


Fig. 18. Second-generation sensing element layout


VII. A SECOND GENERATION SENSING ELEMENT

A  second-generation  implementation  of  the  light-stripe  sen¬ 
sor  array  has  been  fabricated.  This  new  chip,  seen  in  Fig.  17, 
incorporates  several  advantages  over  the  first  design.  The  die 
area of the new cell, shown in Fig. 18, is 216 µm x 216 µm,
40%  smaller  than  that  of  the  cells  of  the  first-generation  sensor 
(photoreceptor  area  has  been  kept  constant).  Stripe  detection  is 
done  in  a  more  robust  manner  and  range  data  read-out  circuitry 
has  been  simplified.  In  addition,  the  new  cell  provides  a  means 
to  record  and  read  out  the  value  of  the  peak  intensity  seen  when 
it  acquires  a  range  data  sample.  The  peak  intensity  informa¬ 
tion  provides  a  direct  measure  of  scene  reflectance  because 
stripe  output  power  is  known  and  distance  to  the  object  point  is 
measured.  In  addition,  the  availability  of  intensity  information 
allows  for  efficient  sensor  calibration  (Section  V-B). 

Peak detection is done using the circuit of Fig. 19. Operation
of the circuit is straightforward. The source-following transistor
Qp enables capacitor Cp to track the rising intensity input
voltage transitions. No path is provided for Cp to discharge
when photoreceptor output transitions downward. At the end
of a scan, the largest intensity reading observed will be held.
Stripe detection is easily accomplished by comparing the peak-
intensity value Vf with the amplified photodiode output Vj.
When Vj falls below Vf, the output from the comparator
is used to record a time-stamp value.

Fig. 19. Second-generation sensing element circuitry.

Using SPICE [10], operation of the second-generation sensing
element design was simulated. The simulation results are
plotted in Fig. 20. The output from the peak-following circuit
XLSCELL.30 acts as a dynamic threshold for each cell,
replacing the externally applied global threshold of the first-
generation design (Section III-B). Comparator input offset
mismatch  made  setting  a  global  threshold  level,  valid  for  all 
cells  in  the  array,  difficult.  Thus,  stripe  detection  is  made  more 
robust  by  this  modification.  In  addition,  the  “true”  peak  de¬ 
tection  of  the  new  design  provides  better  quality  range  data 
because  the  new  stripe  detection  scheme  identifies  the  location 
of  the  peak  in  time  more  accurately  than  simple  thresholding. 
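
The behavior of the peak-following stripe detector can be
illustrated with a discrete-time Python sketch. This is a
behavioral model under simplifying assumptions, not a description
of the circuit itself: the sampled input, the function name, and
the optional `margin` parameter (a hypothetical guard against
noise) are not part of the original design.

```python
def detect_stripe(samples, margin=0.0):
    """Return (time-stamp index, held peak) for a sampled intensity signal.

    The held peak tracks rising input only, mimicking a capacitor with
    no discharge path. The time-stamp is the first sample at which the
    input falls below the held peak by more than `margin`.
    """
    peak = float("-inf")
    for k, value in enumerate(samples):
        if value > peak:
            peak = value            # held value follows the rising input
        elif value < peak - margin:
            return k, peak          # comparator fires: record time-stamp
    return None, peak               # stripe never passed over this cell

# Hypothetical photoreceptor output as the stripe sweeps across a cell.
signal = [0.1, 0.1, 0.2, 0.6, 1.0, 0.7, 0.3, 0.1]
timestamp, peak_intensity = detect_stripe(signal)
```

Because the threshold is the cell's own held peak, no global
threshold level has to be chosen, which is the property the
second-generation design exploits.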

The  peak-intensity  value  held  within  the  second-generation 
cell  is  an  important  artifact  of  the  ranging  process  and,  in  the 
new  design,  is  provided  as  an  additional  sensing  element  out¬ 
put.  The  illumination  source  in  the  system,  the  stripe,  is  of 
known  power.  Intensity  reduction  from  1/r-type  losses  can  be 
accounted  for  because  range  to  the  object  is  measured.  The 
intensity  value  therefore  provides  a  direct  measure  of  scene 
reflectance  properties  at  the  stripe  wavelength.  It  is  an  image 
aligned  perfectly  with  range  readings  from  the  cell  array. 

The area in each cell dedicated to time-stamp read out is
much smaller in the new design. Direct addressing of the cell
to be read, using row and column selects, eliminates the token
state necessary in the first-generation design. The N x M array
is read using N row select lines and M column select lines.
A given cell is enabled for read out by asserting the row and
column select lines that correspond to the location of the cell
in the array. The two-level bus hierarchy has been maintained,
however, to keep bus loading at a minimum. The area savings
of the new read selection method has made cell area of the
second-generation design smaller despite the additional peak
detection circuitry.

Fig. 20. Second-generation sensing element simulation result.

VIII. CONCLUSION

We  have  presented  the  design  and  construction  of  a  very 
high-performance  range-imaging  sensor.  This  sensor  acquires 
a  complete  28  x  32  range-data  frame  in  a  few  milliseconds.  Its 
range  accuracy  and  repeatability  were  measured  to  be  less  than 
0.5  mm  on  average  at  half-meter  distances.  The  success  of  this 
implementation  can  be  attributed  to  the  use  of  a  VLSI  smart 
sensor  methodology  that  allowed  a  practical  implementation  of 
the  cell-parallel  technique. 

While the advantages of processing at the point of sensing have
been  advocated  by  many,  few  practical  smart-sensor  imple¬ 
mentations  have  been  demonstrated.  The  cell-parallel  range 
imager  presented  here  bridges  the  gap  between  smart  sensor 
theory  and  practice,  demonstrating  the  impact  that  the  smart 
sensor  methodology  can  have  on  robotic  perception  systems, 
like  automated  inspection  and  assembly  tasks. 

Smart VLSI-based sensors, like the high-speed range image
sensor presented here, will be key components in future
industrial applications of sensor-based robotics.


Abstract

We present a novel robust methodology for corresponding
a dense set of points on an object surface for 3-D stereo
computation of depth. The methodology utilizes multiple
stereo pairs of images, each stereo pair taken of exactly
the same scene but under different illumination. With just
2 stereo pairs of images taken respectively for 2 different
illumination conditions, a stereo pair of ratio images can
be produced; one for the ratio of left images, and one for
the ratio of right images. We demonstrate how the
photometric ratios composing these images can be used for
accurate correspondence of object points. Object points
having the same photometric ratio with respect to 2 different
illumination conditions comprise a well-defined equivalence
class of physical constraints defined by local surface
orientation relative to illumination conditions. We show
how for diffuse reflection the photometric ratio is invariant
to varying camera characteristics and viewpoint and
that therefore the same photometric ratio in both images
of a stereo pair implies the same equivalence class of physical
constraints. Corresponding photometric ratios along
epipolar lines in a stereo pair of images under different
illumination conditions is therefore a robust correspondence
of equivalent physical constraints, and determination
of depth from stereo can be performed without knowing
what these physical constraints being corresponded
actually are. This implies a very practical shape-from-
stereo methodology applicable to perspective views and
not requiring any knowledge whatsoever of illumination
conditions. This is particularly practical for determination
of 3-D shape on smooth featureless surfaces which
has previously been hard to perform using stereo. We
demonstrate experimental 3-D shape determination from
a dense set of points using our stereo technique on smooth
objects of known ground truth shape that can be accurate
to well within ±1% relative depth.

1  INTRODUCTION 

There has been extensive work on computational stereo
vision including [11], [10], [4], [14], and [1]. A large
collection of articles on stereo is contained in the recent
book by Mayhew and Frisby [12]. Much of the work computing
depth from stereo vision involves the correspondence
of image features such as intensity discontinuities
or zero crossings determining image edges. These are features
that can be computed directly from an image without
any knowledge of the image formation process. A
possible disadvantage of depth determination from 3-D
stereo using edges is that this data can be sparse and
a number of methods have been developed to interpolate
smooth surfaces to sparse depth data from stereo [4],
[17]. There are considerable problems with shape determination
of smooth featureless objects using feature-point
based stereo vision algorithms.

Grimson [5] was the first to consider utilizing the reflectance
properties of surfaces to augment shape determination
from stereo vision. Utilizing diffuse shading information
from two camera views, Grimson determined surface
orientation at zero crossings, using this in addition
to depth information at these points to more accurately
interpolate a surface. Diffuse reflection was assumed to
be Lambertian and a Phong [13] specular component was
also assumed to exist. Local surface orientation was determined
by a modified photometric stereo technique [20]
adapted to binocular stereo.

Smith [16] considered the correspondence of points in
a stereo pair of images of a smooth featureless Lambertian
reflecting surface utilizing a mathematical formulation
he termed the Stereo Integral Equation. What is
unique about this work is that except for knowing a priori
the correspondence of endpoints along an epipolar line, all
points can be properly corresponded between these endpoints
purely from photometric values. This in turn provides
for a very dense depth map from stereo vision.

There has been work by Blake, Brelstaff, Zisserman
and others [2], [3], [21] that has exploited the geometry of
specular reflection viewed from a stereo pair of cameras
to derive constraints on surface shape. This work however
depends upon the correspondence of segmented specularities
rather than the correspondence of actual photometric
values, which is the primary concern of this paper.

The major advantage of being able to accurately correspond
photometric values between a stereo pair of images,
besides being able to determine the shape of smooth
featureless surfaces, is that this would provide for a very
dense depth map. There are a number of practical issues
concerning stereo vision which make this very difficult.
First and foremost is that stereo vision requires 2 cameras,
and no two cameras, even if they are exactly the
same model number, record image intensities exactly the
same. Even if a surface were perfectly Lambertian reflecting
so that reflected radiance is completely independent
of viewpoint, because of camera problems and the details
of image formation described in the Problem Background
section, the same reflected radiance from an object point
will be recorded with an unpredictably different gray value
for each camera. In fact, diffuse reflection is not even
Lambertian as such reflection is actually dependent upon
viewing angle. Specular reflection is observed at different
object points from different views which further complicates
the situation.

We present a novel practical methodology for reliably
corresponding photometric values in a stereo pair of images
that overcomes most of the problems inherent to using
a stereo pair of cameras without having to perform
a large amount of camera calibration. We need the additional
requirement of at least 2 illumination conditions
but never need to know anything about these illumination
conditions. Our methodology corresponds the left
and right ratio images of the same scene under arbitrarily
different illumination conditions. We prove that the
photometric ratios arising from diffuse reflection are invariant
to most characteristics varying between a stereo
pair of cameras, as well as invariant to viewpoint and diffuse
surface albedo, that make pixel gray values by themselves
unreliable for stereo correspondence. We also show
that corresponding photometric ratios is equivalent to
corresponding classes of well-defined physical constraints on
object points. Furthermore, correspondence of photometric
ratios can be done to subpixel accuracy using interpolation.

A technique that is somewhat related to our stereo
methodology is “dual photometric stereo” pioneered by
Ikeuchi [8]. There are however major conceptual and
implementation differences. The idea of dual photometric
stereo is to apply photometric stereo [20] to a smooth
object surface from two camera views. Surface orientation
estimates produced from photometric stereo are
segmented according to where they fall on a tessellated
Gaussian sphere. Segmented orientation classes are
corresponded between the stereo pair of images using area,
mean surface orientation, and epipolar constraints. A
coarse depth map is computed, and then refined using an
iterative scheme that forces the gradient of the depth map
to be consistent with surface orientation. While Ikeuchi
corresponds surface orientation estimates produced from
multiple illumination in a stereo pair of images, our
methodology corresponds equivalence classes of physical
constraints between a stereo pair of images without ever
having to compute these physical constraints explicitly.
Ikeuchi does have the extra information provided by surface
orientation to refine his depth map, but this is restricted
to nearly orthographic views and known incident
orientation of at least 3 distant light sources. The 3-D
stereo method using multiple illumination that we are
proposing never requires knowledge of any of the multiple
illumination conditions and is applicable to perspective
views.

2  Problem  Background 

We describe the problematic issues of comparing image
intensities between a stereo pair of cameras. To do this
we need to understand the image formation process
and the nature of reflection from objects.

We describe the formation of image intensity values
beginning with the familiar relation from Horn and Sjoberg

$$E = L_r \, (\pi/4) \, (D/i)^2 \cos^4\alpha , \qquad (1)$$

relating image irradiance, E, to reflected radiance, L_r.
The lens diameter, D, image distance, i, and light angle,
α, incident on the camera lens are depicted in Figure 1.
Equation 1 assumes ideal pinhole optics. The effective
diameter, D, of a lens can be controlled with an aperture
iris the size of which is measured on an F-stop scale. As
F-stop is the ratio of focal length to effective lens diameter,
image irradiance is inversely proportional to the square of
the F-stop value on a camera lens. Image irradiance is
therefore very sensitive to F-stop. While a stereo pair of
cameras can use identical lens models at exactly the same
F-stop setting, the effective lens diameters can still be
slightly different. The focal lengths as well can be slightly
different and recalling the classical thin lens law

$$1/f = 1/o + 1/i ,$$

(with f the focal length and o the object distance)
this will influence the image distance, i, in turn affecting
the image irradiance. Even in the ideal case where focal
lengths are precisely equal, the image distance, i, can be
slightly different for a stereo pair of images even though
the images “appear” equivalently in focus when in fact
they are not precisely in focus. On top of all this is the
dependence of image irradiance in perspective images on
pixel location relative to the optical center of the image
plane. The farther a pixel is radially away from the optical
center, the larger is the light incident angle, α, which
strongly affects image irradiance. Image irradiances arising
from the same object point appear in different parts of
a stereo pair of images, making them difficult to compare.


FIGURE  1 


Equation 1 only takes into account the optics involved
in image formation. Image irradiance is converted into
pixel gray value using electronics. In general the conversion
of image irradiance, E, into pixel gray value, I, can
be described by the expression

$$I = g E^{1/\gamma} + d , \qquad (2)$$

where g is termed the gain, d is the dark reference, and γ
controls the non-linearity of gray level contrast. It is
typically easy to set γ = 1.0, producing a linear response, and
easy to take a dark reference image with the lens cap on,
then subtracting d out from captured images. However,
we have observed that not only can the gain, g, be variable
between identical model cameras but this can change
over time, especially for relatively small changes in
temperature. Unless g is calibrated frequently, comparing
pixel gray values for identical image irradiances between
a stereo pair of cameras can be difficult.

A widely used assumption about diffuse reflection from
materials is that they are Lambertian [9], meaning that
light radiance, L, incident through solid angle, dω, at angle
of incidence, ψ, produces reflected radiance:

$$L \rho \cos\psi \, d\omega$$

independent of viewing angle. The independence of diffuse
reflected radiance with respect to viewing angle makes
it theoretically feasible to associate radiance values with
object points in a stereo pair of images. However, even
for ideal Lambertian diffuse reflectance the above discussion
outlines the practical difficulties in achieving an accurate
correspondence of pixel gray values produced from
reflected radiance in a stereo pair of cameras.

The physical reality of diffuse reflection makes it even
more practically difficult to associate diffuse reflected
radiance with object points across a stereo pair of images.
A recently proposed diffuse reflectance model for smooth
dielectric surfaces [18], [[19] these proceedings], empirically
verified to be more accurate than Lambert’s Law,
expresses the dependence of diffuse reflected radiance on
both angle of incidence, ψ, and viewer angle, φ (see
Figure 2), as

$$L \rho \, [1 - F(\psi, n)] \cos\psi \, [1 - F(\sin^{-1}(\sin\phi / n), 1/n)] \, d\omega \qquad (3)$$

where the functions, F(), are the Fresnel reflection
coefficients [15], n is the index of refraction of the dielectric
surface, and ρ is the diffuse albedo. Figure 3 shows
the significant dependence of diffuse reflection upon viewer
angle. Diffuse reflected radiance from an object point as
seen from the two different viewpoints of a stereo pair of
cameras will almost always not be equal.

The dependence of specular reflection upon viewpoint
is even more severe due to its highly directional nature
and the geometry of angle of incidence equals angle of
reflection.

3 Using Photometric Ratios For 3-D Stereo

The  discussion  and  analysis  of  the  previous  section  has 
shown  that  pixel  gray  values  by  themselves  are  unreliable 
in  being  associated  with  object  points  in  a  well-defined 
way  for  correspondence  between  a  stereo  pair  of  images. 
We  show  that  the  ratio  image  produced  from  2  images  of 
diffuse  reflection  from  the  same  scene  respective  to  2  dif¬ 
ferent  (but  not  necessarily  known)  illumination  conditions 
is  invariant  to  the  differences  in  physical  characteristics  of 
cameras  discussed  in  the  previous  section,  as  well  as  view¬ 
point  and  diffuse  surface  albedo.  Furthermore,  these  pho¬ 
tometric  ratios  can  be  associated  with  well-defined  physi¬ 
cal  constraints  on  object  points  making  them  suitable  for 
robust  correspondence  in  a  stereo  pair  of  images.  This  is 
particularly  useful  for  recovering  the  3-D  shape  of  smooth 
featureless  surfaces  from  stereo  which  has  previously  been 
very difficult to perform.

3.1 Photometric Ratios As An Invariant

Combining equations 1, 2 and 3 gives us an expression
which precisely relates pixel gray value, I, to diffuse
reflection as a function of imaging geometry and camera
parameters. For incident radiance L through a small solid
angle, dω, at an angle of incidence, ψ, the gray value formed
from viewing angle, φ (assuming we subtract out the dark
reference, d) is:

$$I = g \left[ (\pi/4)(D/i)^2 \cos^4\alpha \; L \rho \, [1 - F(\psi, n)] \cos\psi \, [1 - F(\sin^{-1}(\sin\phi / n), 1/n)] \, d\omega \right]^{1/\gamma}$$

For a general incident radiance distribution, L(ψ), on
an object point the gray value will be:

$$I = g \left[ (\pi/4)(D/i)^2 \cos^4\alpha \int L(\psi) \rho \, [1 - F(\psi, n)] \cos\psi \, [1 - F(\sin^{-1}(\sin\phi / n), 1/n)] \, d\omega \right]^{1/\gamma}$$

$$= g \left[ (\pi/4)(D/i)^2 \cos^4\alpha \; \rho \, [1 - F(\sin^{-1}(\sin\phi / n), 1/n)] \right]^{1/\gamma} \times \left[ \int L(\psi) [1 - F(\psi, n)] \cos\psi \, d\omega \right]^{1/\gamma} \qquad (4)$$

Consider this object point first illuminated with an
incident radiance distribution, L_1(ψ), and then illuminated
with an incident radiance distribution, L_2(ψ). From
equation 4 the photometric ratio of gray values is:

$$\frac{I_1}{I_2} = \frac{\left[ \int L_1(\psi) [1 - F(\psi, n)] \cos\psi \, d\omega \right]^{1/\gamma}}{\left[ \int L_2(\psi) [1 - F(\psi, n)] \cos\psi \, d\omega \right]^{1/\gamma}} \qquad (5)$$

This photometric ratio is invariant to all camera parameters
(except γ), viewing angle, φ, and diffuse surface
albedo, ρ. Note that expression 3 for diffuse reflection
is a separable function with respect to both variables, ψ
and φ, and that is why the viewing angle cancels out in
the photometric ratio.
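
The cancellation can be checked numerically. The Python sketch
below models the gray value of equation 4 as a gain times a
camera- and view-dependent factor times an illumination term, and
compares the ratio seen by two cameras. All numerical values and
names are hypothetical, chosen only to illustrate the invariance;
the illumination terms stand in for the integrals over L_1 and
L_2 and are not computed from a reflectance model.

```python
import math

def gray_value(gain, optics, albedo, view_term, illum_term, gamma=1.0):
    """Gray value in the form of equation 4 (dark reference subtracted).

    optics     ~ (pi/4)(D/i)^2 cos^4(alpha)
    view_term  ~ the viewer-angle Fresnel factor
    illum_term ~ the integral over the incident radiance distribution
    """
    camera_part = (optics * albedo * view_term) ** (1.0 / gamma)
    return gain * camera_part * illum_term ** (1.0 / gamma)

# Illumination terms for the two lighting conditions (made-up values).
s1, s2 = 0.80, 0.50

# Two cameras viewing the same object point with different gain,
# optics and viewing angle (made-up values).
left = dict(gain=1.00, optics=0.031, albedo=0.6, view_term=0.95)
right = dict(gain=1.17, optics=0.028, albedo=0.6, view_term=0.88)

ratio_left = gray_value(illum_term=s1, **left) / gray_value(illum_term=s2, **left)
ratio_right = gray_value(illum_term=s1, **right) / gray_value(illum_term=s2, **right)
```

With linear response both ratios reduce to s1/s2, even though the
individual gray values recorded by the two cameras differ.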

3.2  Corresponding  Photometric  Ratios 

Most cameras have a default setting of linear response
(i.e., γ = 1.0). If not, the intensity values can be linearized
by inverse γ-correction. If gray values are linear, or are
linearized, for a stereo pair of cameras, then the invariance
of equation 5 guarantees that diffuse reflection from
an object point will have the same ratio I_1/I_2 with respect
to both cameras. Specular reflection observed from either
of the stereo pair of cameras at the object point will perturb
this invariance, but fortunately only diffuse reflection
exists at most object points on dielectric surfaces. The
numerical value of I_1/I_2 can be corresponded along epipolar
lines in a stereo pair of cameras to subpixel accuracy using
interpolation. This is done completely independent
of knowledge of the illumination distributions L_1(ψ) and
L_2(ψ). For a smooth surface this produces a very dense
depth map from stereo correspondence. It is possible that
multiple points along an epipolar line can have the same
associated photometric ratio and correspondence can be
aided by estimates of stereo disparity and disparity gradient
[1], [14].
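
One simple way to correspond a photometric ratio to subpixel
accuracy along an epipolar line is linear interpolation between
the two neighboring pixels that bracket the target value. The
Python sketch below illustrates this under stated assumptions: it
returns only the first bracketing location (it does not resolve
the ambiguity of repeated ratios noted above), and the function
name and sample values are hypothetical.

```python
def match_ratio(right_ratios, target):
    """Subpixel column on the right epipolar line where the ratio equals target.

    Scans adjacent pixel pairs and linearly interpolates inside the first
    pair whose values bracket the target. Returns None if no pair does.
    """
    for x in range(len(right_ratios) - 1):
        lo, hi = right_ratios[x], right_ratios[x + 1]
        if lo == target:
            return float(x)
        if (lo - target) * (hi - target) < 0:
            return x + (target - lo) / (hi - lo)
    if right_ratios and right_ratios[-1] == target:
        return float(len(right_ratios) - 1)
    return None

# Hypothetical ratio values along one epipolar line of the right ratio image.
right_line = [0.5, 0.7, 0.9, 1.1, 1.3]

# A left-image pixel at column 4 with photometric ratio 1.0.
left_column, left_ratio = 4, 1.0
right_column = match_ratio(right_line, left_ratio)
disparity = left_column - right_column
```

The resulting disparity is then converted to depth by the usual
stereo triangulation for the camera geometry in use.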

With 3 different illumination conditions, 2 unique ratio images can be generated, and 2 photometric ratios will then be invariant for object points viewed between a stereo pair of cameras; these can be used to further disambiguate object points. More than 3 different illumination conditions may provide redundant information. We intend to study the use of more than 2 illumination conditions for stereo correspondence.
3.3  Isoratio Curves and Physical Constraints

The ratio of equation 5 expresses a physical constraint consisting of the inter-relationship between the local surface orientation at an object point and the two illumination distributions, $L_1(\psi)$ and $L_2(\psi)$. Object points having the same photometric ratio form equivalence classes that we term in this paper isoratio curves. Unlike isophotes, which are curves of equal gray value in an image, an object point belongs to an isoratio curve based on the geometric relationship of its surface normal with respect to illumination, independent of diffuse surface albedo. Corresponding photometric ratios along epipolar lines between a stereo pair of images is identical to corresponding points that are at the intersection of equivalent isoratio curves and the epipolar lines.

We analyse the physical constraints governing isoratio curves for distant point light source illumination. To simplify the analysis somewhat we consider a Lambertian reflecting object. Consider point light source illumination at incident orientations $(p_1, q_1)$ and $(p_2, q_2)$ in gradient space coordinates. These produce the following reflectance maps in gradient space, respectively [6]:

$$R_1(p, q) = \frac{1 + p_1 p + q_1 q}{\sqrt{1 + p^2 + q^2}\,\sqrt{1 + p_1^2 + q_1^2}},$$

$$R_2(p, q) = \frac{1 + p_2 p + q_2 q}{\sqrt{1 + p^2 + q^2}\,\sqrt{1 + p_2^2 + q_2^2}}.$$

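The surface orientation of object points with photometric ratio $I_1/I_2$ is constrained in gradient space by the expression

$$\frac{I_1}{I_2} = \frac{R_1(p, q)}{R_2(p, q)} = \frac{1 + p_1 p + q_1 q}{1 + p_2 p + q_2 q} \cdot \frac{\sqrt{1 + p_2^2 + q_2^2}}{\sqrt{1 + p_1^2 + q_1^2}},$$

resulting in the following linear constraint in $p$ and $q$:

$$\left(p_1 - k\frac{I_1}{I_2}\,p_2\right)p + \left(q_1 - k\frac{I_1}{I_2}\,q_2\right)q + 1 - k\frac{I_1}{I_2} = 0, \qquad (6)$$

where

$$k = \frac{\sqrt{1 + p_1^2 + q_1^2}}{\sqrt{1 + p_2^2 + q_2^2}}.$$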

Therefore object points lying on a particular isoratio curve produced from two distant point light sources all have local surface orientation that lies somewhere along this line in gradient space. Diffuse reflection resulting from expression 3 introduces a slight nonlinearity in this physical constraint. Finite light source distance also introduces nonlinearity.
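As a numerical check of the linear constraint, the sketch below evaluates the two Lambertian reflectance maps at an arbitrary surface gradient, forms their ratio, and confirms that the gradient satisfies the line of equation 6. The light source gradients and the surface gradient are made-up illustration values, and the helper names are hypothetical.

```python
import math


def reflectance(p, q, ps, qs):
    """Lambertian reflectance map at surface gradient (p, q) for a distant
    point source whose incident orientation is the gradient (ps, qs)."""
    numerator = 1 + ps * p + qs * q
    return numerator / (math.sqrt(1 + p * p + q * q) * math.sqrt(1 + ps * ps + qs * qs))


def isoratio_line(ratio, p1, q1, p2, q2):
    """Coefficients (a, b, c) of the line a*p + b*q + c = 0 of equation 6."""
    k = math.sqrt(1 + p1 * p1 + q1 * q1) / math.sqrt(1 + p2 * p2 + q2 * q2)
    kr = k * ratio
    return (p1 - kr * p2, q1 - kr * q2, 1 - kr)


# Illustrative values only (not from the experiments in the paper).
p1, q1 = 0.5, 0.1      # first light source
p2, q2 = -0.4, 0.2     # second light source
p, q = 0.3, -0.2       # surface orientation of some object point

ratio = reflectance(p, q, p1, q1) / reflectance(p, q, p2, q2)
a, b, c = isoratio_line(ratio, p1, q1, p2, q2)
residual = a * p + b * q + c   # should vanish up to rounding error
```

Any other gradient on the same line yields the same ratio, which is why a single photometric ratio constrains surface orientation to a line rather than a point in gradient space.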

While distant light source illumination is a simple case, analysis such as this provides important intuition about the correspondence of photometric ratios. For best correspondence we would like isoratio curves to intersect epipolar lines as perpendicularly as possible. We observe from equation 6 that incident light source orientations lying along $q = mp$, the line through the origin with slope $m$ in gradient space, produce photometric ratios with isoratio curve linear constraints that are parallel to the line $q = -(1/m)p$. These linear constraints are perpendicular to the line in gradient space along which the incident light source orientations lie. Hence, assuming nonverged cameras, there is the best chance that isoratio curves will be most perpendicular to epipolar lines if the incident directions of the light sources intersect the baseline for the stereo pair of cameras, or at least if the separation between the light sources is parallel to this baseline.


4  Experimental  Results 

We tested the accuracy of a dense depth map determined from correspondence of photometric ratios between a stereo pair of images on a cylinder and a sphere of known ground truth. A pair of Sony XC-77 cameras with 25 mm lenses was used with a stereo baseline of 3 inches. The cameras were not verged, so that the epipolar lines were the scanlines themselves. The radius of the smooth plastic cylinder shown in Figure 4 is precisely 1 3/8 inches, and the smooth sphere shown in Figure 8 is a billiard cue ball with a radius of precisely 1 3/16 inches. The most frontal portions of these objects were placed 20 inches from the stereo baseline, which is far relative to the sizes of the objects themselves. Each illumination condition was produced from one of 2 point light sources incident at approximately ±25° separated along the baseline. According to the analysis of the previous section, separating point light sources parallel to the baseline increases the perpendicularity of isoratio lines with respect to epipolar lines. Again, it is not necessary at all to know how these light sources are positioned.

Figure 5 shows the photometric ratios that were used to correspond pixels for the cylinder across one scanline of the left and right stereo images. Note that there are a couple of photometric ratios on the left hand side of the left image scanline that are higher than any of the photometric ratios along the same scanline in the right image. This is because the left camera sees a small portion of the cylinder which is not in the view of the right camera. The photometric ratio of a pixel in the left image is corresponded to subpixel accuracy in the right image using linear interpolation of photometric ratios between pixels. Figure 6 shows a color bitmap of isoratio curves across the cylinder for different photometric ratios in the left and right images. Figure 7a shows the height map of the cylinder from the ground truth depth map of the cylinder and Figure 7b shows the height map of the cylinder from the derived depth map from stereo correspondence of photometric ratios. At a distance of 20 inches the average depth error across the cylinder, before smoothing, was ±0.17 inches, which is ±0.85%.


FIGURE  4  FIGURE  8 


Figure 9 shows the isoratio curves for the sphere for different photometric ratios in the left and right images. Figure 10a shows the height map of the sphere from the ground truth depth map of the sphere and Figure 10b shows the height map from the derived depth map from stereo correspondence of photometric ratios. The average depth error across the sphere, before smoothing, was ±0.09 inches, which is ±0.45% depth variation at 20 inches.

Clearly this methodology using photometric ratios for stereo correspondence can be very accurate. While this methodology does not require any knowledge of each illumination condition, there are illumination conditions that produce better measurement accuracy than others.
5  Conclusion and Future Work

We have proposed and demonstrated a new practical methodology for achieving accurate correspondence of photometric values between a stereo pair of images. The advantage of using photometric values for stereo correspondence is the ability to generate dense depth maps, as well as to determine the shape of smooth featureless objects. Because of characteristics that vary from camera to camera and the dependence of diffuse reflection on viewpoint, pixel gray values by themselves are unreliable for stereo correspondence of object points. We have introduced the notion of using photometric ratios, produced from different illuminations of a scene, for stereo correspondence; these are reliable because of their invariance to various camera characteristics and viewpoint. Using only 2 illumination conditions we have demonstrated the high accuracy of depth determination from the stereo correspondence of photometric ratios on objects of precisely known ground truth. Because this methodology does not require knowledge of illumination conditions, and works well in perspective views, it can be applied both in machine vision and in the less controlled environments of computer vision. For instance, depth in an outdoor scene can probably be accurately determined from a stereo pair of images taken at different times with illumination varying due to position of the sun and/or variation in cloud cover. Future work will entail using this stereo methodology in these types of settings.

In addition to proposing photometric ratios as a reliable way of corresponding a stereo pair of images, we introduced the notion of the isoratio curve, which unlike an isophote is invariant to diffuse surface albedo. Therefore isoratio curves are more directly related to the actual geometry of the surface itself and can give a large amount of information to object recognition with minimal knowledge of illumination. We are currently studying the use of isoratio curves for improving the performance of object recognition in robotic environments.
Abstract

We present a novel stereo matching algorithm which integrates learning, feature selection, and surface reconstruction. First, an instance based learning (IBL) algorithm is used to generate an approximation to the optimal feature set for matching. Second, we develop an adaptive method for refining the feature set. This adaptive method analyzes the feature error to locate the sources of mismatches. Then the search through feature space for maximizing the class separation function is guided by eliminating the sources of mismatches. Third, we introduce a method for determining when a priori knowledge is necessary for discriminating between the correct match and the sources of mismatches. If the a priori knowledge is necessary then we use a surface reconstruction model to discriminate between match possibilities. We performed a comprehensive comparison of our algorithm and a traditional pyramid algorithm on a wide set of real images. Finally, extensive empirical results of our algorithm based on real images are presented.

1  Introduction 

Stereo matching is an important problem in computer vision. It is necessary for passive range finding. It greatly simplifies navigation and automated modeling of objects and terrains. In human body measurement, we could automatically create models of human bodies for manufacturing apparel or create ergonomic equipment for office or home use. In flight simulation and the new area of virtual reality, we could automatically create models of terrain and other natural environments.

In the final report of the NSF Workshop on "Challenges in Computer Vision Research; Future Research Directions," two of the major recommendations on research topics and issues were [S. Negahdaripour and A. Jain 1992]:

-  More experimental rigor in vision research

-  Researchers should address the integration of isolated modules at each visual processing level

This paper addresses these two issues by performing extensive empirical testing of our algorithm on a wide range of real images, and integrating the modules of learning, feature selection, and surface reconstruction.

We  will  use  the  following  assumptions  and  definitions: 

We are given two intensity images, L and R.
(xl, yl) denotes the image axes of L
(xr, yr) denotes the image axes of R
(xp, yp) denotes a specific point in (xl, yl)
(xc, yc) denotes a specific point in (xr, yr)
Correspondence denotes a list of two elements, (xp, yp) and (xc, yc).
We are given n feature classes (e.g. intensity, Laplacian, etc.).
Feature set refers to a set whose members are feature classes
Feature vector refers to a vector of size n composed of the values of every feature class at a specific point

In stereo matching, our goal is to find correspondences between two intensity images of roughly the same content. Given knowledge of the camera calibration and the correspondence (xp, yp) to (xc, yc) we can then reconstruct the 3-D coordinates of the object in the world.

This  paper  presents  a  new  multi-feature  stereo 
matching  algorithm.  We  use  the  following  features  in 
our  algorithm: 

-  image intensity

-  x derivative of intensity

-  y derivative of intensity

-  gradient magnitude

-  gradient orientation

-  Laplacian of intensity

-  curvature of 2D edges

In the sense of using "Landmark" as something which is unique, we call our algorithm the "Landmark Stereo Matching Algorithm" because the central idea of the method is to find a feature set for (xp, yp) that will make the point unique.

There are essentially three steps in our method, which is shown in Figure 1. The first step produces an approximation toward the optimal feature set. The second step refines the feature set toward the optimal feature set in the sense of making the selected point (xp, yp) unique. The third step treats the case where the feature set is ambiguous.


Landmark Stereo Matching Algorithm

(1) Find an initial feature set
    Method: apply a concept learning algorithm
    Input: values of the features at (xp, yp)
    Output: a subset of the available features, e.g. {intensity, gradient magnitude}

(2) Improve the feature set
    Method: find the feature set which maximizes matching accuracy
    Input: a feature set, e.g. {intensity, gradient magnitude}
    Output: a feature set, e.g. {gradient magnitude, Laplacian of intensity}

(3) If the feature set is unambiguous then use the feature set for matching.
    Else apply a priori knowledge to resolve the matching ambiguity.
    Method: a priori knowledge is to fit to a thin metal surface model.
    Input: a feature set
    Output: a correspondence, (xp, yp) -> (xc, yc)

Figure 1. The Landmark stereo matching algorithm.

The history of stereo research has provided a rich and extensive background for the ideas in our research. Moravec [1980] found interesting points in the left image and used the binary-search method of an image pyramid to find the correspondences. Hannah made improvements to his method but kept the unidirectional coarse-to-fine search method [M. J. Hannah 1988]. Marr, Poggio, and Grimson used zero-crossings of the Laplacian of Gaussian at different spacings as matching primitives. They found matches at a particular initial level, enforced continuity of zero-crossings, and then approximated the match down the pyramid from coarse-to-fine [D. Marr and T. Poggio 1980; W. E. L. Grimson 1981]. Lim and Binford [1987] used a hierarchical structure based on different scale features, specifically, bodies, surfaces, junctions, curves, and edges. Hoff and Ahuja [1985] used zero-crossings to integrate surface modeling and stereo matching at a particular initial level and strictly approximated toward the finest level. Surfaces were modeled as planar and quadratic patches. Barnard [1987] used an annealing approach for finding global optima from matching all points simultaneously. The match error was an energy function combining intensity difference and local changes in disparity. Cohen, Vinet, and Sander [1989] used an edge hierarchy to integrate segmentation with stereo matching.

The progression of stereo research appears to be toward using more features of varying levels of abstraction. The most recent work includes using a variety of waveforms as primitives [McKeown and Hsieh 1992]. Marapane and Trivedi [1992] used multiple primitives in a hierarchy. In addition, Weng, Ahuja, and Huang [1992] used edgeness, and positive and negative cornerness in a hierarchical based matcher.

This paper is organized as follows. Section 2 describes the instance based learning algorithm, which is essentially step 1 of our algorithm. Section 3 describes the feature set improvement, which corresponds to step 2 of our algorithm. Section 4 describes the actual stereo matching once the feature set has been found. It also explains the method of resolving ambiguous matches, which is step 3. Section 5 presents the results of using the algorithm upon the set of real images. Section 6 summarizes the conclusions and contributions.

2  Instance  Based  Learning 

This  section  explores  the  problem  of  finding  an  initial 
feature  set  in  step  1  of  Figure  1.  Given  that  the 
combinatorial  explosion  from  searching  for  an  optimal 
feature  set  may  be  prohibitive,  we  explore  a  method  of 
finding  an  initial  point  from  which  to  begin  the  search 
through  feature  space.  Computational  expense  can  be 
saved  by  generating  a  first  approximation  of  the 
optimal  feature  set  using  a  concept  learning  algorithm 
such  as  a  neural  net  or  an  instance-based  learning 
algorithm  [Aha  and  Kibler  1989]. 

Some  preliminary  terminology  for  instance  based 
learning  algorithms  is  reviewed  below: 

Exemplar refers to a list of two elements, where the first element is a feature vector, and the second element is a feature set. The classification of the feature vector is assumed to be the associated feature set.

Exemplar list refers to a list of exemplars.

There  are  two  stages  in  the  instance  based  learning 
paradigm.  First,  a  training  set  which  has  the  form  of 
an  exemplar  list  is  used  to  build  a  concept,  C.  This  is 
called  the  learning  stage.  Second,  after  the  training  set 
has  been  fully  processed,  new  input  feature  vectors  are 
classified  using  C.  A  simplistic  IBL  algorithm  would 
simply  copy  the  training  set  to  C  for  the  learning  stage. 
A  simplistic  IBL  algorithm  classifies  input  feature 
vectors  as  follows: 

(1) Find the Euclidean distance between the input feature vector and the feature vector of each exemplar in C.

(2) Classify the input vector as the feature set of the vector in C which has the minimum distance.

Advantageous characteristics of instance-based learning algorithms [Aha and Kibler 1989] are (1) simple representations for concept descriptions, (2) low incremental learning costs, (3) small storage requirements, (4) the ability to produce concept exemplars on demand, (5) the ability to learn continuous functions, and (6) the ability to learn non-linearly separable categories.

Consider an example where the input feature vector for (xp, yp) is [100, 9, 20]. The feature vector of the exemplar on line (2) of Figure 2 is closest to the input feature vector. We would therefore classify the input feature vector as [intensity].


Line    Concept C

(1)     ([100, 0, 20], [intensity])
(2)     ([100, 10, 20], [intensity])
(3)     ([100, 30, 20], [orientation])
(4)     ([100, 40, 20], [orientation])

Figure  2.  An  example  of  a  concept  containing  4 
exemplars.  The  feature  classes  of  the  feature  vector  are 
[intensity,  magnitude,  orientation].  Intensity  and 
orientation  are  the  only  possible  classifications. 
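The simplistic nearest-neighbour classification can be sketched with the four exemplars of Figure 2. This is only an illustration of the two steps listed earlier; the names `CONCEPT` and `classify_nearest` are hypothetical, and class labels are written as plain strings rather than feature sets.

```python
import math

# The concept C of Figure 2: (feature vector, classification) pairs, where the
# feature classes are [intensity, magnitude, orientation].
CONCEPT = [
    ([100, 0, 20], "intensity"),
    ([100, 10, 20], "intensity"),
    ([100, 30, 20], "orientation"),
    ([100, 40, 20], "orientation"),
]


def classify_nearest(feature_vector, concept):
    """Return the classification of the exemplar whose feature vector has the
    minimum Euclidean distance to the input feature vector."""
    nearest = min(concept, key=lambda exemplar: math.dist(feature_vector, exemplar[0]))
    return nearest[1]


label = classify_nearest([100, 9, 20], CONCEPT)
```

For the input [100, 9, 20] the distances to the four exemplars are 9, 1, 21, and 31, so the exemplar on line (2) wins and the result is "intensity", in agreement with the worked example in the text.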

Our approach toward noise tolerance is to classify the new element using the entire concept C instead of the nearest neighbor. Specifically, we accumulate the support from each exemplar in C toward a particular classification. The classification with the highest support is then chosen. The Gaussian weighted support function is chosen as

$$S(k) = \sum_{i \in C_k} e^{-\delta(f_i, f_p)^2}$$

where $\delta(f_i, f_p)$ is the metric between the exemplars in C which are in class k, and the new instance, $f_p$.

Then the classification for $f_p$ would satisfy

$$\max_k S(k), \quad k = 1..c$$

where  c  is  the  number  of  classes.  Intuitively,  this  will 
result  in  a  few  incorrect  points  being  suppressed  by  the 
vote  of  the  many  correct  points,  as  the  Gaussian 
weighting  will  give  greater  support  to  nearby 
exemplars,  and  less  support  from  farther  exemplars  in 
feature  space. 
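The voting behaviour described above can be sketched as follows. This is a minimal illustration, assuming Euclidean distance as the metric and a Gaussian weight with a width parameter `sigma`; the width, the function name, and the sample exemplars are assumptions made for the example and do not come from the paper.

```python
import math
from collections import defaultdict


def gaussian_support_classify(feature_vector, concept, sigma=10.0):
    """Accumulate Gaussian-weighted support per class over every exemplar in
    the concept and return (winning class, support per class)."""
    support = defaultdict(float)
    for vector, label in concept:
        distance = math.dist(feature_vector, vector)
        # Nearby exemplars contribute close to 1, distant ones close to 0.
        support[label] += math.exp(-(distance * distance) / (2.0 * sigma * sigma))
    winner = max(support, key=support.get)
    return winner, dict(support)


# Hypothetical concept with one mislabeled (noisy) exemplar sitting exactly
# on the query point.
noisy_concept = [
    ([100, 8, 20], "intensity"),
    ([100, 11, 20], "intensity"),
    ([100, 12, 20], "intensity"),
    ([100, 10, 20], "orientation"),   # noise
]
winner, support = gaussian_support_classify([100, 10, 20], noisy_concept)
```

A pure nearest-neighbour rule would return "orientation" here because the noisy exemplar is at distance zero, whereas the summed support of the three nearby "intensity" exemplars outvotes it.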

Another issue in designing instance based learning algorithms is minimizing the size of the concept C. Note that in Figure 2, the classification for any new input feature vector will not change if we eliminate the exemplars on lines 1 and 4. In general, we can reduce the size of C by grouping exemplars, which are close in feature space, into a single exemplar. Furthermore, if the feature vector of an exemplar in C has a different Gaussian support classification than its associated classification, then we can delete the exemplar from C.

The instance-based learning algorithm which is used for the Landmark algorithm is based upon Growth NT [Aha and Kibler 1989] with some modifications toward improving training order independence (creating the discard list), and concept size minimization. Given that T denotes the training set, NT2 is shown in Figure 3.


Instance Based Learning Algorithm

(1) Initialize C to the set containing the first exemplar in T.
(2) For all subsequent training exemplars t in T:
(3)     k = Gaussian Support Classification of t by C.
(4)     If (k equals the associated classification of t)
        THEN add t to the discard list
        ELSE add t to C and check the discard list for incorrectly classified instances.
(5) Delete redundant and noisy exemplars from C.

Figure 3. NT2, the instance based learning algorithm. This algorithm shows the first stage of instance based learning algorithms, where the concept is created.
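The learning stage of Figure 3 might be sketched as below. Several details are assumptions rather than the authors' implementation: the Gaussian width `sigma`, the reading of "check the discard list" as moving any discarded exemplar that the grown concept now misclassifies back into C, and a step 5 that removes only noisy exemplars (those whose Gaussian support classification disagrees with their own label) while leaving the grouping of redundant exemplars out.

```python
import math
from collections import defaultdict


def classify(vector, concept, sigma=10.0):
    """Gaussian support classification of a feature vector by concept C."""
    support = defaultdict(float)
    for exemplar_vector, label in concept:
        d = math.dist(vector, exemplar_vector)
        support[label] += math.exp(-(d * d) / (2.0 * sigma * sigma))
    return max(support, key=support.get)


def learn_concept(training_set, sigma=10.0):
    concept = [training_set[0]]                        # step 1
    discard = []
    for exemplar in training_set[1:]:                  # step 2
        vector, label = exemplar
        if classify(vector, concept, sigma) == label:  # steps 3 and 4
            discard.append(exemplar)
            continue
        concept.append(exemplar)
        # Assumed meaning of "check the discard list": any discarded exemplar
        # that is now misclassified is added to the concept.
        still_discarded = []
        for old_vector, old_label in discard:
            if classify(old_vector, concept, sigma) == old_label:
                still_discarded.append((old_vector, old_label))
            else:
                concept.append((old_vector, old_label))
        discard = still_discarded
    # Step 5, noisy exemplars only (redundancy grouping is omitted here).
    return [ex for ex in concept if classify(ex[0], concept, sigma) == ex[1]]


training = [
    ([100, 0, 20], "intensity"),
    ([100, 10, 20], "intensity"),
    ([100, 30, 20], "orientation"),
    ([100, 40, 20], "orientation"),
]
concept = learn_concept(training)
```

With this training order only one exemplar per class is retained, since the second exemplar of each class is already classified correctly by the concept built so far.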

3  Feature  Set  Improvement 

This section explores the problem of improving the current feature set in step 2 of Figure 1. In order to optimize the feature set, we need a function to maximize. We shall use the concept of the class separation distance in formulating the optimality criterion. Consider that each feature set will result in a specific error function with respect to (xp, yp). The optimal feature set should have the quality that its minimum is at the correct (xc, yc). The ideal error function would have one minimum at the correct (xc, yc), whereas the worst case error function would be a flat plane, which would be the most ambiguous situation.

By  defining  the  class  separation  distance  in  terms  of 
the  error  function,  we  only  need  to  consider  the 
minima  instead  of  every  point  in  the  image.  This 
allows  us  to  significantly  reduce  computational 
expense.  Let  us  consider  the  case  of  multiple  minima 
in  the  error  function.  Each  minimum  is  associated 
with  a  different  stereo  correspondence,  where  only  one 
correspondence  is  correct.  The  other  minima  are  then 
called  sources  of  mismatches  since  they  lead  to 
incorrect  correspondences. 

With  respect  to  stereo  matching,  we  want  to  maximize 
the  distance  between  the  value  of  the  error  function  at 
the  correct  correspondence  and  the  value  of  the  error 
function  at  all  the  sources  of  mismatches. 


where x0 refers to the global minimum of the error function, Ny refers to the total number of minima, and M represents all x such that x is a local minimum but not x0. Thus the class separation, J, varies between 0 and 1, the minimum and maximum class separation distances, respectively.


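In image space, sufficient conditions for a local minimum are

$$(w \cdot F(x))_x = 0 \quad \text{and} \quad (w \cdot F(x))_y = 0$$

and

$$\begin{vmatrix} (w \cdot F(x))_{xx} & (w \cdot F(x))_{xy} \\ (w \cdot F(x))_{xy} & (w \cdot F(x))_{yy} \end{vmatrix} > 0$$

Now, the goal is to find w such that J is maximized, or

$$J(X) = \max_{w} J(w \cdot F)$$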


Thus, we define a measure which will give a higher class separation distance to greater differences between minima in the error function, that is, when the difference between the first and second minima is large, the class separation distance is also large. But, we would also like the class separation distance to gracefully diminish with additional minima which are close to the global minimum. The Gaussian function was chosen because of these properties.


With consideration of all possible feature sets, we have reached the discrimination limit of the feature classes and metrics. For suboptimal search we could stop searching at a sufficiently high class separation. Henceforth, this will be called a distinct feature set as opposed to an optimal feature set. Nevertheless, it is possible that there is no w which results in a single distinct or optimal minimum. This case is explored in section 4.


Let n be the total number of available features, then the error function for a feature set is

$$\text{Error} = w \cdot F$$

where

$$w = [w_1 \; w_2 \; \ldots \; w_n]$$

with

$$\sum_{i=1}^{n} w_i = 1$$

and

$$F = [f_1 \; f_2 \; \ldots \; f_n]$$

where fn is the error due to the nth feature with respect to (xp, yp). Note that f1 refers to f1(xr, yr), which is the error from using the first feature class between (xp, yp) and every point in (xr, yr). Similarly, f2 refers to f2(xr, yr), which is the error from using the second feature class.
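The weighted error defined above, together with the search for its minima, can be sketched in one dimension (for example along a single epipolar line). The per-feature error values below are invented for illustration, and both helper names are hypothetical.

```python
def combined_error(weights, feature_errors):
    """Error = w . F evaluated at every candidate position.

    weights:        n weights that sum to 1
    feature_errors: n lists, feature_errors[i][x] is the error of feature
                    class i between (xp, yp) and candidate position x
    """
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    length = len(feature_errors[0])
    return [
        sum(w * errors[x] for w, errors in zip(weights, feature_errors))
        for x in range(length)
    ]


def local_minima(errors):
    """Indices of strict interior local minima of a 1-D error function."""
    return [
        x for x in range(1, len(errors) - 1)
        if errors[x] < errors[x - 1] and errors[x] < errors[x + 1]
    ]


# Made-up errors for two feature classes at five candidate positions.
f1 = [4, 1, 3, 1, 4]   # ambiguous on its own: two equal minima
f2 = [5, 0, 4, 6, 7]   # large error at the second minimum of f1

minima_f1_only = local_minima(combined_error([1.0, 0.0], [f1, f2]))
minima_both = local_minima(combined_error([0.5, 0.5], [f1, f2]))
```

Using the first feature alone leaves two candidate matches, while giving weight to the second feature removes the spurious minimum. This is the effect that the choice of w is meant to achieve.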

Then  if  we  choose  Gaussian  weighting,  the  class 
separation  distance  becomes 


A straightforward method of feature selection is to maximize J(.) between (xp, yp) and R. Intuitively, this will result in obtaining the most distinct error function from the set of feature classes. If we were to use this approach, we would also use one of the traditional feature selection search methods: Branch and Bound [Narendra and Fukunaga 1977] for the optimal feature set, or one of many feature selection algorithms [Whitney 1971; Kittler 1978; Marill and Green 1963] for a less computationally expensive but suboptimal set. We have another interesting possibility.

Instead of maximizing J(.) between (xp, yp) and R, we could maximize J(.) between (xp, yp) and L. This possibility has the following significant advantages: (1) we would know which minimum in the error function corresponds to (xp, yp) and which minima correspond to sources of mismatches; (2) if we compute the error only at the minima for the features not in the feature set, then we could guide the addition of features to the feature set by adding the feature which has the greatest total error at the minima.

The  disadvantage  is  that  once  the  feature  set  is 
selected,  we  will  have  to  search  through  R  for  the 
global  minimum,  which  in  the  straightforward  method 
would  already  have  been  performed. 

Both methods share the important advantage of the ability to determine when the feature set is insufficient for matching (xp, yp). This situation occurs when the class separation distance is lower than a threshold, Jt. Although many stereo matchers will reject a correspondence if the final feature error is too high, it is rare for a stereo matching algorithm to be able to determine if there are too many points with low feature errors (i.e. when J < Jt).

Let Fc = list of n feature classes. The algorithm is shown in Figure 4.

Feature Set Improvement Algorithm

(1)   Fi = feature set approximation from the IBL algorithm with respect to (xp, yp)
(1.1) Jold = J(Fi) = initial class separation
(2)   If the feature set is distinct (J(Fi) > Jt) then go to Stereo Matching Algorithm (in Section 4 or step 3 of Figure 1)
(3)   Ml = minima in Fi applied to L and (xp, yp)
(4)   Apply Ml to Fc.
(5)   Let f be the element in Fc which has the maximum total error over Ml
(5.1) If there are no features left (f = NULL) then go to Stereo Matching Algorithm
(5.2) Add f to Fi
(5.3) Delete f from Fc
(5.4) If the current feature set is distinct (J(Fi) > Jt) then go to Stereo Matching Algorithm
(5.5) If J(Fi) > Jold then Jold = J(Fi); go to 5
(5.6) Delete f from Fi. Go to 5

Figure 4. The feature set improvement algorithm.

In  summary,  the  central  idea  of  the  guided  or  adaptive 
method  of  finding  distinct  feature  sets  is  to  record  the 
points  of  minima  of  the  current  feature  set  with  respect 
to  (Xp,yp).  Then,  features  which  have  the  largest  total 
error  at  the  minima  are  added  to  the  feature  set  if  J 
increases.  Thus,  we  can  perform  an  informed  addition 
of  features  to  the  feature  set.  This  information 
constrains  the  searching  in  an  intuitively  pleasing 
manner. 
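The guided addition of features summarized above can be sketched as a greedy loop. This is an illustration under stated assumptions, not the authors' code: the class separation J is passed in as a caller-supplied function `separation`, the total error of each candidate feature at the recorded minima is assumed to be precomputed in `total_error_at_minima`, and the toy values of J in the usage example are invented.

```python
def improve_feature_set(initial, candidates, total_error_at_minima,
                        separation, threshold):
    """Greedily add features until the set is distinct (J > threshold) or no
    candidate features remain."""
    current = list(initial)
    remaining = list(candidates)
    best_j = separation(current)
    if best_j > threshold:                 # already distinct
        return current
    while remaining:
        # Candidate with the greatest total error at the recorded minima.
        feature = max(remaining, key=lambda name: total_error_at_minima[name])
        remaining.remove(feature)
        current.append(feature)
        j = separation(current)
        if j > threshold:                  # distinct: stop searching
            return current
        if j > best_j:                     # keep the feature, J improved
            best_j = j
        else:                              # no improvement: drop it again
            current.remove(feature)
    return current


# Toy class separation values, for illustration only.
toy_j = {
    frozenset({"intensity"}): 0.2,
    frozenset({"intensity", "orientation"}): 0.1,
    frozenset({"intensity", "laplacian"}): 0.5,
    frozenset({"intensity", "laplacian", "magnitude"}): 0.9,
}
errors_at_minima = {"laplacian": 5.0, "orientation": 9.0, "magnitude": 2.0}

result = improve_feature_set(
    ["intensity"],
    ["laplacian", "orientation", "magnitude"],
    errors_at_minima,
    lambda features: toy_j[frozenset(features)],
    threshold=0.8,
)
```

In this toy run "orientation" is tried first because it has the largest error at the minima, but it lowers J and is dropped; "laplacian" and then "magnitude" are kept, at which point J exceeds the threshold.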

4  Stereo Matching Algorithm

In step 3 of Figure 1, we have two possibilities:

(1) The feature set, Fi, is distinct

(2) The feature set, Fi, is not distinct

If the first is true, then the feature set appears to be able to discriminate between the correct match and the sources of mismatches. Consequently, we apply the feature set to the corresponding epipolar line of R, and determine the correspondence as the point of minimum error.

In the second case, the discriminatory ability of our feature set is insufficient to properly distinguish between the possible matches in R. There are two options in this situation. We could reject the point, or apply a heuristic to select one of the minima. Thus, the solution will depend upon the particular application to which the feature selection is being applied.

For stereo matching we chose to decide between match possibilities by fitting the previous matches and the current match to the quadratic variation, E [Grimson 1981; Terzopoulos 1983, 1984, 1988].

The surface reconstruction method of Harris [1987] was chosen because it could potentially be implemented in hardware, and it can incorporate slope information. The quadratic variation including slope information is shown below

If we consider (xp, yp) and the sources of mismatches as a set of points in L, and then map them to the minima in R, the interpolated depth can be used to compute the quadratic variation. Thus, the correspondence which satisfies

min [ w · F + Quadratic Variation ]

over (xp, yp) and the sources of mismatches in L and the minima in R is taken as the correct correspondence of (xp, yp). Furthermore, if we have access to a dense surface map generated from previous correspondences between L and R, then we can also use the surface map to fit (xp, yp) directly.

The stereo matching algorithm which incorporates the two cases is shown in Figure 5.


Stereo  Matching  Algorithm 

(1)  If the feature set, Fj, is distinct, then go to 2
     Else go to 3

(2)  Apply Fj to R and (xp,yp) to find the
     correspondence, (xc,yc)

(2.1)  Update surface reconstruction map z=f(x,y)

(2.2)  Update the learning algorithm concept, C.

(2.3)  STOP

(3)  If the reconstructed surface map is dense then
     set (xc,yc) to the minimum which agrees
     closest with the surface reconstruction map.

     If the map is not dense then select the
     correspondence which minimizes the
     normalized sum of the feature error and the
     quadratic variation, E, over (xp,yp) and the
     sources of mismatches.

(3.1)  Update surface reconstruction map.

(3.2)  STOP

Figure 5. The stereo matching algorithm.
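The three branches of the figure can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the dictionary keys, the function name, and the weight w are hypothetical, and the feature error and quadratic variation are assumed to be already normalized before they are summed.

```python
def choose_correspondence(candidates, is_distinct, surface_depth=None, w=1.0):
    """Pick one candidate minimum in R, following the branches of the figure.

    candidates: list of dicts with hypothetical keys
        'pos'      -- (x, y) position of a minimum in R
        'feat_err' -- feature error at that minimum
        'depth'    -- depth implied by accepting that minimum
        'quad_var' -- quadratic variation if that match is accepted
    surface_depth: depth predicted by a dense surface map, or None if the
        map is not dense.
    """
    if is_distinct:
        # Step 2: the feature set discriminates, so take the minimum error.
        best = min(candidates, key=lambda c: c["feat_err"])
    elif surface_depth is not None:
        # Step 3, dense map: take the minimum closest to the surface map.
        best = min(candidates, key=lambda c: abs(c["depth"] - surface_depth))
    else:
        # Step 3, sparse map: weighted feature error plus quadratic variation.
        best = min(candidates, key=lambda c: w * c["feat_err"] + c["quad_var"])
    return best["pos"]


candidates = [
    {"pos": (10, 5), "feat_err": 0.2, "depth": 1.0, "quad_var": 0.9},
    {"pos": (14, 5), "feat_err": 0.3, "depth": 2.0, "quad_var": 0.1},
]
```

The updates of the surface map and of the learning concept (steps 2.1, 2.2 and 3.1) are left out of the sketch.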

5  Results 

In  this  section  we  test  the  Landmark  algorithm  and  a 
conventional  pyramidal  algorithm  upon  the  test  set  of 
real  images.  We  present  the  comparative  matching 
accuracies,  and  then  we  show  the  test  images,  and 
other  visual  representations  of  the  matches  found  from 
the  Landmark  algorithm. 

The algorithms were written in the C programming
language, and were ported from a UNIX-based
Hewlett-Packard workstation to a 50-MHz Zeos
486/AT.

The  Landmark  Algorithm  and  a  pyramidal  algorithm 
were  tested  on  stereo  pairs  of  real  images.  The  results 
are  shown  in  Figure  6,  and  graphed  in  Figure  7.  The 
pyramidal  algorithm  is  presented  as  a  benchmark.  The 
data  structure  for  the  pyramidal  algorithm  is  a 
Gaussian  intensity  pyramid.  Linear  search  was 
performed  at  the  starting  level,  and  then  hill  climbing 
through  the  pyramid  image  structure  was  used  to 
implement the refinement to the finest resolution. The
matching feature was intensity and the metric was the
normalized correlation coefficient.
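For reference, the normalized correlation coefficient between two intensity patches can be computed as below. This is a minimal sketch of the standard definition, not the authors' C code; the function name and the flat-patch convention (return 0.0) are assumptions.

```python
import math


def normalized_correlation(a, b):
    """Normalized correlation coefficient of two equal-length intensity lists."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    da = [x - mean_a for x in a]
    db = [y - mean_b for y in b]
    num = sum(x * y for x, y in zip(da, db))
    den = math.sqrt(sum(x * x for x in da) * sum(y * y for y in db))
    # A patch with no intensity variation has no defined correlation;
    # returning 0.0 in that case is a convention chosen for this sketch.
    return num / den if den > 0 else 0.0


patch = [10, 20, 30, 40]
brighter = [2 * v + 5 for v in patch]   # same pattern, different gain and offset
```

Because the means are subtracted and the result is divided by the patch energies, the metric is insensitive to a gain and offset change between the two images.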

The 256x256 images include a poster of a volleyball
player, a person standing in front of a plain
background, an outdoor scene of a street, a rock
wall image from the Stuttgart standardized image set
[Forstner 1986; Gulch 1988], robot arms
against a complex background, and a face.

Unfortunately, due to space limitations we cannot
include all of the visual results in this paper.

The volleyball poster in Figure 8 was chosen because it
demonstrates the basic matching accuracy in object
space. Since all of the points lie on a plane it is trivial
to check whether a match is correct. The greatest
deviation in the Z axis for the reconstructed surface
from the average was 0.8 cm. Figure 9 shows the 8216
matched points. Figure 10 shows the reconstructed
surface from the matched points using the Landmark
algorithm.

For  the  following  images,  the  matches  were  checked 
visually.  Thus  the  expected  accuracy  is  approximately 
one  pixel. 

The rock wall stereo pair in Figure 11 depicts a rock
wall which changes rapidly in depth. This image was
selected because it shows the potential for a stereo
matcher to perform automated terrain mapping, and
for benchmark reasons. It was in the "difficult"
category of the Stuttgart standardized image set
[Forstner 1986, Gulch 1988]. Figure 12 shows the
2281 matched points. The reconstructed surface in
Figure 13 shows the mismatched points as sharp
jagged peaks.

The  robots  stereo  pair  in  Figure  14  shows  three 
industrial robot arms. This stereo pair was chosen for
the similarity to industrial manufacturing
environments. Note that points along the background,
the  robots,  and  the  curved  white  tubing  were  matched. 
The  outline  of  the  two  robots  on  the  left  can  be  seen 
against  the  door.  The  outline  of  the  robot  on  the  right 
side  can  be  seen  against  the  wall.  Figure  15  shows  the 
1275  matched  points.  The  reconstructed  surface  is 
shown  in  Figure  16. 

Figure 17 shows the face stereo pair. This stereo pair
was  chosen  to  show  the  potential  for  human  body 
measurement  using  stereo  matching.  Note  that  the 
resolution  was  sufficient  to  show  the  eyes  and  nose,  but 
not  the  lips.  The  matched  points  occur  in  curves, 
because  these  represent  equal  intensity  areas  on  the 
face.  Figure  18  shows  the  726  matched  points.  The 
reconstructed  surface  in  Figure  19  shows  that  the  eyes 
are  slightly  too  sunken.  This  is  due  to  the  limitations 
of  the  resolution  of  the  image  to  resolve  depth 
sufficiently  accurately. 

Finally, we briefly consider the effect of the learning
module. The learning module is used to reduce the
time required to search for a distinct feature set. The
time used by the learning module depends upon the
size of the exemplar list as well as the size of the
training instances. When we switched off the learning
module, the average time was 1.03 sec/point. When we
used an exemplar list of maximum length 100 with the
seven features described in section 1, the average time
was 0.44 sec/point.

Thus, the learning module more than doubled the
matching speed.
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78 

71 
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69 

84 

89 
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64x64 

83  84  68  74  79  95

Landmark

97  96
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97 
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Figure  6.  A  table  of  the  comparative  matching 
accuracies  in  percentage  of  the  Landmark  algorithm 
versus  the  pyramid  algorithm.  Note  that  the  pyramid 
algorithm  required  a  starting  level  at  which  the  initial 
matches  were  found  by  linear  search.  The  numbers 
such  as  16x16, 32x32,  or  64x64  refer  to  the  resolution 
of  the  starting  level. 


Figure  7.  The  percentage  of  correctly  matched  points 
over  a  variety  of  images  between  a  pyramidal  matcher 
with  starting  resolutions  at  16x16, 32x32,  and  64x64, 
and  the  Landmark  algorithm. 

6  Conclusions 

For practical applications, a robust stereo matching
algorithm should (1) be sufficiently general to analyze
the image content and select the appropriate features;
(2) use features which minimize the possibility of a
wrong match; (3) be able to realize its own limitations,
specifically, know when it is unable to find a reliable
and accurate match; (4) not require excessive
computational power or resources; and (5) be
sufficiently noise tolerant to match real world scenes as
opposed to artificial or laboratory scenes.


The Landmark stereo matching algorithm satisfies all
of the above properties by (1) using a noise tolerant
instance based learning algorithm to generate an
approximation to the optimal feature set; (2) using an
active method to maximize the distinctness of the
selected point (xp,yp); (3) implementing the complete
algorithm on a personal computer; (4) extensively
testing the Landmark algorithm upon a wide range of
real images.

Furthermore, from the final report of the NSF
Workshop on "Challenges in Computer Vision
Research; Future Research Directions," two of the
major recommendations on research topics and issues
were [S. Negahdaripour and A. Jain 1992]:

-  More  experimental  rigor  in  vision  research 

-  Researchers  should  address  the  integration 
of  isolated  modules  at  each  visual 
processing  level 

The  Landmark  algorithm  addresses  both  of  these 
recommendations. 

The test images were real images of complex
scenes which would be found in practical applications
such as terrain mapping, human body measurement,
and industrial manufacturing. For the purposes of
benchmarking, the Landmark algorithm was
compared to a single feature pyramid matching
algorithm. The matching accuracy of the Landmark
algorithm ranged from 91% to 99% while the
matching accuracy for the pyramid algorithm ranged
from only 52% to 95%.

The main contributions of this paper are

(1)  Integrating the modules of learning, feature
     selection, and surface reconstruction.

(2)  Extensive empirical testing upon real images.

(3)  A method to determine when the feature set is
     insufficient to discriminate between match
     possibilities.

(4)  A method of guided maximization of the class
     separation criterion for selecting the best
     feature set.

(5)  A noise tolerant instance based learning
     algorithm.

Future  research  will  focus  on  recognition  applications, 
in  particular,  facial  feature  recognition. 

The software for this algorithm, and a stereo
workbench complete with automated testing routines,
stereo test images, manually found correspondences,
and graphics input and output routines, is available
from the authors for the purposes of benchmarking
and comparison with other methods. The PC
executable requires a 386 PC running Microsoft
Windows version 3.1, and 8 MB of RAM.
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Figure  8.  Volleyball  poster  stereo  pair 


Figure  9.  Matched  points  of  the  volleyball  stereo  pair 


Figure 10. Reconstructed surface of the volleyball poster stereo pair at different viewing angles

Figure 11. Rock wall stereo pair


Figure  12.  The  Matched  Points  of  the  Rock  Wall  Stereo  Pair 


Figure  13.  Reconstructed  surface  of  the  rock  wall  stereo  pair  at  different  viewing  angles.  Note  that  the 
rock  wall  is  a  slanted  wall  which  slants  away  from  the  viewer.  The  spike  in  the  left  surface  plot  is  due  to 
mismatch. 


Figure  14.  Robots  stereo  pair 


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Figure  16.  Reconstructed  surface  of  the  robots  stereo  pair  at  different  viewing  angles 
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Figure  17.  The  face  stereo  pair 


Figure  19.  Reconstructed  surface  of  the  face  stereo  pair  at  different  viewing  angles 
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Implementation  and  Performance  of 
Fast  Parallel  Multi-Baseline  Stereo  Vision 


Jon  A.  Webb 

School  of  Computer  Science 
Carnegie  Mellon  University 
Pittsburgh, PA 15213-3890


A fast implementation of multi-baseline stereo
vision on a parallel computer is described. For
three 240x256 images, the algorithm runs in 64
ms on 64 iWarp processors, exceeding 15 Hz
frame rate. This is a speedup of 51 over an
implementation on a SPARC II and represents
the fastest correlation-based stereo vision system
reported. Implementing this algorithm this
efficiently required careful trade-offs in algorithm
design and, particularly, in the implementation
of the basic communication operations. A building
block approach is described for achieving best
efficiency in communication; the basic operations
that the parallel computer can do at maximum
speed are identified, and then these primitives are
used to construct the communications functions
needed by the algorithm.

1.  Introduction 

We have achieved the highest reported performance
ever of a stereo vision system based on correlation:
64 ms for 240x256 imagery, exceeding 15 Hz on a
64-processor iWarp computer. Achieving this was not
simply a matter of increasing processor performance
but involved many careful trade-offs that apply to
vision systems and parallel machines in general. This
paper closely analyzes the performance of this one
system in order to derive general lessons concerning
the performance of parallel systems on computer
vision problems.

This research was supported by the National Science
Foundation under Grant MIPS 8920420 and by the
Defense Advanced Research Projects Agency and
monitored by U. S. Army Training and Doctrine in
Fort Huachuca, Arizona under Contract
DABT63-91-C-0035.

The Government has certain rights in this material.
Any views, opinions, findings, and conclusions or
recommendations expressed in this material are
those of the author and do not necessarily reflect the
views of the National Science Foundation. They
should also not be interpreted as representing the
official policies, either expressed or implied, of the
U.S. Government.

2.  Stereo  vision  algorithm 


The  stereo  vision  algorithm  used  here  is 
Kanade-Okutomi  multi-baseline  stereo  [2]. 
Multi-baseline  stereo  uses  multiple  cameras 
along  a  single  baseline  to  provide  redundant 
information.  This  allows  a  simple  matching 
algorithm  to  give  very  good  results. 

The  principle  of  multi-baseline  stereo  is 
illustrated  in  Figure  1.  In  normal  stereo,  there 
are  two  images,  a  reference  and  a  match  image. 
For  each  pixel  in  the  reference  image,  and  for 
each  possible  disparity,  the  error  is  calculated 
between  the  reference  and  match  images. 


[Figure 1 panels: Ordinary Stereo (reference image, match image; error versus disparity shows multiple minima) and Multi-baseline Stereo (reference image, match images; error versus disparity shows a single minimum).]

Figure 1. Ordinary and multi-baseline stereo.


Ordinary stereo can give multiple false
matches, as illustrated. Multi-baseline stereo
reduces this problem by taking advantage of
redundant information. As illustrated, we have
arranged the cameras so that a disparity of d in
the first match image corresponds to 2d in the
second. It is unlikely that a pixel in the reference
image will match pixels at corresponding
disparities in both images unless the match is
real.

The error is calculated by summing the error
between the images over a small window of
size hxw. With a naive algorithm, this calculation
over an image of size rxc will take rchw add
operations. By maintaining row and column
sums, we can reduce this to 4rc operations plus
some overhead. However, this also reduces the
available parallelism, because the row and column
sums must be updated sequentially.
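The operation counts above can be illustrated with a short Python sketch. It is not the Adapt or assembly code of the paper: the function names are hypothetical, only windows that fit entirely inside the image are produced, and the window is anchored at its top-left pixel. The running version updates each column sum with one add and one subtract per pixel, and each window sum with one add and one subtract per pixel, which gives the roughly 4rc operations quoted in the text.

```python
def window_sums_naive(img, h, w):
    """Sum every h x w window directly: about r*c*h*w additions."""
    rows, cols = len(img), len(img[0])
    return [[sum(img[r + i][c + j] for i in range(h) for j in range(w))
             for c in range(cols - w + 1)]
            for r in range(rows - h + 1)]


def window_sums_running(img, h, w):
    """Same result using running column sums and running row sums."""
    rows, cols = len(img), len(img[0])
    # col[c] is the sum of the h image rows currently covered, in column c.
    col = [sum(img[i][c] for i in range(h)) for c in range(cols)]
    out = []
    for r in range(rows - h + 1):
        if r > 0:
            # Slide the column sums down one row: add the new row,
            # subtract the row that left the window.
            for c in range(cols):
                col[c] += img[r + h - 1][c] - img[r - 1][c]
        s = sum(col[:w])
        row_out = [s]
        for c in range(1, cols - w + 1):
            # Slide the window right one column.
            s += col[c + w - 1] - col[c - 1]
            row_out.append(s)
        out.append(row_out)
    return out


img = [[(3 * r + 5 * c) % 7 for c in range(8)] for r in range(6)]
```

The sequential dependence is visible in the code: each column sum depends on its value for the previous row, which is why this form offers less parallelism than the naive one.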

3.  Parallel  design  issues 

Adapt was used to implement the stereo
vision algorithm. Adapt is a specialized language
for image processing on parallel computers
that has been extensively described
elsewhere [3]. The Adapt compiler automatically
parallelizes image processing programs,
but requires the programmer to write them in
terms of a series of image to image operations,
each of which is described as a program to be
applied at every pixel or every row of the
images in parallel.

Multi-baseline stereo vision divides naturally
into two steps, which are repeated for each
disparity d:

1. Difference. Calculate the difference image,
   which is the pixel-by-pixel error between
   the reference image and the match images.
   This is done by forming the sum of squared
   differences between the reference pixel
   and the match images shifted within rows
   by appropriate multiples of d.

2. Minimize. At each pixel, sum the difference
   image over the hxw window and
   compare this with the best error so far. If
   less, replace the error value by the new
   value and the previous disparity by d.
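The two steps can be sketched sequentially in Python as follows. This is an illustration under assumptions that are not taken from the paper: match image k is shifted by k*d (k = 1, 2, ...), the window is anchored at its top-left pixel, pixels whose shifted position falls off the image get an infinite error, and the function name is hypothetical.

```python
def multibaseline_disparity(ref, matches, max_d, h, w):
    """Return, per window origin, the disparity with the smallest window error."""
    rows, cols = len(ref), len(ref[0])
    inf = float("inf")
    best_err = [[inf] * cols for _ in range(rows)]
    best_d = [[0] * cols for _ in range(rows)]
    for d in range(max_d + 1):
        # Difference step: sum of squared differences between each reference
        # pixel and match image k shifted within the row by k*d.
        diff = [[0.0] * cols for _ in range(rows)]
        for r in range(rows):
            for c in range(cols):
                for k, m in enumerate(matches, start=1):
                    cc = c + k * d
                    if cc >= cols:      # shifted pixel falls off the image
                        diff[r][c] = inf
                        break
                    diff[r][c] += (ref[r][c] - m[r][cc]) ** 2
        # Minimize step: sum the difference image over the h x w window and
        # keep the disparity with the smallest error seen so far.
        for r in range(rows - h + 1):
            for c in range(cols - w + 1):
                err = sum(diff[r + i][c + j]
                          for i in range(h) for j in range(w))
                if err < best_err[r][c]:
                    best_err[r][c] = err
                    best_d[r][c] = d
    return best_d


# Synthetic check: two match images displaced by d and 2d, with d = 2.
ref = [[(7 * c * c + 3 * r) % 31 for c in range(16)] for r in range(4)]
m1 = [[ref[r][c - 2] if c >= 2 else 0 for c in range(16)] for r in range(4)]
m2 = [[ref[r][c - 4] if c >= 4 else 0 for c in range(16)] for r in range(4)]
disp = multibaseline_disparity(ref, [m1, m2], max_d=3, h=2, w=3)
```

The window sum is written naively here for clarity; the running-sum form described earlier in the text would replace the inner summation.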


The difference step can be implemented easily
and efficiently in Adapt. However, if the
minimize step is implemented in the obvious
way processor utilization is poor. This is
because the number of rows h in the window is
generally large compared to the number of
rows per processor; for example, the image size
may be 240x256 and there are 64 processors,
while the window size is 13x13. When the window
sum is formed, each processor will be
repeating the work of other processors with
overlapping rows. As a result, there will be little
or no speedup as the number of processors is
increased beyond the point at which the number
of rows per processor equals the window
height, or about 18 processors in this case.
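The figures in this paragraph follow from simple arithmetic, worked below in Python. The variable names are illustrative, and the count of rows touched per processor is an assumption about what "the obvious way" costs, not a number given in the paper.

```python
import math

image_rows, processors, window_h = 240, 64, 13

# Rows of the image owned by each processor (240 / 64 = 3.75, rounded up).
rows_per_processor = math.ceil(image_rows / processors)

# Rows a processor must sum in the obvious implementation: its own rows
# plus window_h - 1 overlapping rows that neighbours also sum.
rows_touched = rows_per_processor + window_h - 1

# Processor count at which rows per processor equals the window height.
break_even_processors = image_rows // window_h

print(rows_per_processor, rows_touched, break_even_processors)
```

Under these assumptions each of the 64 processors owns 4 rows but sums 16, and the break-even point is 18 processors, matching the figure in the text.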

The solution to this is to pre-calculate the
partial sum of the difference image in the difference
step. Each processor can simply add its
four rows together, forming a sum image,
which is distributed to other processors before
the minimize step. The minimize step then
forms its initial row sum by adding in appropriate
rows of the sum image, together with whatever
image rows are needed to make the
window height exactly h rows.
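One way to read this scheme is sketched below in Python. The function names, the block size parameter b (four rows per processor in the example from the text), and the row-alignment rule are assumptions made for illustration; the paper does not give this code.

```python
def block_sum_image(diff, b):
    """Row p is the column-wise sum of rows p*b .. p*b+b-1 of diff."""
    rows, cols = len(diff), len(diff[0])
    return [[sum(diff[r][c] for r in range(p * b, min((p + 1) * b, rows)))
             for c in range(cols)]
            for p in range((rows + b - 1) // b)]


def initial_row_sum(diff, sum_img, b, r0, h):
    """Column-wise sum of rows r0 .. r0+h-1, using whole blocks where possible."""
    cols = len(diff[0])
    total = [0] * cols
    r, end = r0, r0 + h
    while r < end:
        if r % b == 0 and r + b <= end:
            # A whole block lies inside the window: one sum-image row.
            src, step = sum_img[r // b], b
        else:
            # Otherwise add a single difference-image row.
            src, step = diff[r], 1
        for c in range(cols):
            total[c] += src[c]
        r += step
    return total


diff = [[(r * 5 + c * 3) % 11 for c in range(3)] for r in range(16)]
sum_img = block_sum_image(diff, 4)
direct = [sum(diff[r][c] for r in range(2, 15)) for c in range(3)]
```

For a 13-row window starting at row 2, the sketch adds two sum-image rows and five individual rows instead of thirteen individual rows.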

4.  Computation  Issues 

Table 1 summarizes the performance of the
computation in the stereo program and compares
the performance of the assembly-language
routines used with C-generated code.


Language      Assembly                 C
Difference    22.0 ms (237 MFLOPS)     36.0 ms (157 MFLOPS)
Minimize      17.3 ms (369 MFLOPS)     40.7 ms (156 MFLOPS)
Total         39.3 ms                  76.7 ms

Table 1. Stereo Computation Times
for three 240x256 images and 16 disparity
levels, on 64 processors.


There are two reasons why the assembly
code is significantly faster than the C code: (1)
The compiler optimizes code primarily within a
basic block. In assembly coding, there is no
such restriction. (2) Bugs in the C-step implementation
of the iWarp component impact the
performance of the compute-and-access
instruction, which allows floating point add and
multiply operations to proceed in parallel with
memory access operations. Since the C compiler
must generate code for the most general
case, it cannot generate the most efficient code.
However, within Adapt these bugs either cannot
occur or can easily be avoided.

5.  iWarp  Communications  Structure 

Before discussing communications issues in
the stereo program, we first discuss the communication
capabilities of the iWarp computer;
more detail is available elsewhere [1]. It is a
two-dimensional torus of processors, each of which
is connected to its nearest neighbors by input and
output ports, each of which can transfer data at 40
MB/s. Cell (0,0) of the processor array is connected
to the SIB, a special iWarp processor
that can communicate with I/O devices across a
VME bus; all data processed by the iWarp array

iWarp supports connections which define
communications routes among processors.
Within connections, messages can be used to
allow more than two processors to communicate
on a single connection. Within messages,
data can be sent word by word (called systolic)
or with a DMA-like operation called spooling
which transfers the data in the background.

6.  Communications  Issues 

The communications operations in Adapt
and their theoretical maximum physical bandwidths
are:

•  Broadcast: duplicate data presently in the
   SIB on all processors. 40 MB/s.

•  Distribute: distribute an image presently in
   the SIB to other processors, by row
   swaths. 80 MB/s.

•  Collect: collect an image distributed on the
   processors by row swaths and transfer to
   the SIB. 80 MB/s.

•  Create working copy: obtain the grace
   bands needed for processing a swath of the
   image from adjacent processors. 160 MB/s.

These bandwidths are achieved (to within
about 10%) in our implementation of Adapt on
iWarp, even for relatively small amounts of
data. To do this, we define primitive maximum
bandwidth operations and then implement all
communications operations in terms of them,
ensuring maximum performance for communications.
The primitive I/O operations for iWarp are:

1. Receive. Data is taken from the pathway
   and stored in memory.

2. Send. Data is taken from memory and sent
   to the pathway.

3. Pass. Data is passed from one pathway to
   another.

4. Receive and pass. Data is passed from
   one pathway to another and simultaneously
   copied into memory.

5. Receive and pass twice. Data is copied
   into memory and also sent to two other
   pathways.

6.1  Broadcast 

Broadcast can be implemented in a variety
of ways. The SIB is connected to the processor
array at two points, so data can be sent either
entirely from one connection or over both connections
simultaneously. Once the data is inside the
array it can be transferred from cell to cell in a
pipeline using receive and pass or in a tree fashion
by also using receive and pass twice. We
will consider these three techniques: using one
SIB connection and pipelined transfer within
the array (method (a)); using both SIB connections
and pipelined transfer (method (b)); using
one SIB connection and tree transfer within the
array (method (c)).


All of these methods achieve the same maximum
bandwidth. The only difference between
them is in latency. These differences in latency
lead to a significant difference in performance,
as illustrated in Figure 3.


Figure  3.  Performance  of  different 
implementations  of  broadcast. 


Currently, Adapt uses method (a) for broadcast
because it was the easiest to implement.

6.2  Distribute 

There are only two fundamentally different
methods of implementing distribute. In the first
method, data is sent systolically from the SIB
over one pathway; each cell executes a receive
followed by a pass. In the second method data
is sent from the SIB over two pathways, one
that goes in a forward direction through the
processors and the other that goes in a reverse
direction. Processors in the forward pathway
act as before, while processors in the reverse
pathway execute a pass followed by a receive.

Since the second method has a maximum
physical bandwidth of 80 MB/s versus 40 MB/s
for the first method, it is to be preferred if it
does not impose too high a startup overhead. We
therefore examine the transfer rate for images
of various sizes in Figure 4. From this graph we
observe that even for small images of 8K bytes,
the transfer rate exceeds the maximum that can
be expected from using just one pathway. We
therefore adopt the two pathway method.


[Plot: transfer rate versus total image size (bytes), 32K to 128K.]

Figure 4. Performance of distribute.

6.3 Collect

Collect can be implemented using one or
two pathways, just as with distribute. The performance
of the two pathway method is shown
in Figure 5. As with distribute, the graph rapidly
exceeds the maximum bandwidth with a
single pathway (at 12K bytes the bandwidth is
42 MB/s), so this method is chosen. However,
the maximum bandwidth achievable in collect
seems to be 60 MB/s, which is less than the theoretical
maximum of 80 MB/s. This is due to a
bug in the iWarp hardware that limits the maximum
physical bandwidth while transferring
into the SIB to 30 MB/s over any one physical
pathway.


6.4  Create  working  copy 

In create working copy each cell initially has
several rows of data, and will obtain some rows
of data above and below its rows from adjacent
cells. The behavior is shown in Figure 6. From
this figure, it is clear that create working copy
can be implemented as two shift operations, in
which processor i sends its rows of data to the
next or previous processor in the array, depending
on whether the shift is in a forward or
reverse direction. Create working copy consists
of one shift in the forward direction and one
shift in reverse.


[Figure 6 panels: Before, After.]

Figure 6. Create working copy.


Each  shift  operation  can  be  implemented 
with  send  and  receive  operations  in  the  obvious 
way.  Using  spooling,  each  processor  can  be 
sending  and  receiving  data  for  both  shifts 
simultaneously,  giving  a  maximum  bandwidth 
of  160  MB/s. 
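The two shifts can be simulated sequentially as in the Python sketch below. It is only an illustration: the function name and data layout are hypothetical, the grace band is assumed to be no taller than a processor's own swath (so one hop per shift is enough), and the first and last processors receive nothing across the image border.

```python
def create_working_copy(swaths, grace):
    """swaths[i] lists the rows held by processor i, top to bottom.

    Returns, for each processor, its own rows extended by up to `grace`
    rows from the previous processor and `grace` rows from the next one.
    """
    n = len(swaths)
    # Forward shift: processor i sends its last `grace` rows to processor i+1.
    from_prev = [[]] + [swaths[i][len(swaths[i]) - grace:] for i in range(n - 1)]
    # Reverse shift: processor i sends its first `grace` rows to processor i-1.
    from_next = [swaths[i][:grace] for i in range(1, n)] + [[]]
    return [from_prev[i] + swaths[i] + from_next[i] for i in range(n)]


# Rows are labelled by their index so the result is easy to read.
swaths = [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11]]
copies = create_working_copy(swaths, grace=2)
```

In the real system the two shifts proceed at the same time through spooling; the sketch only shows which rows end up where.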

[Plot: transfer rate versus data transferred (bytes).]

Figure 7. Performance of create working
copy.


The performance of the I/O in create working
copy is shown in Figure 7. The transfer rate
seems to peak at 150 MB/s, instead of the
expected 160 MB/s; some of this difference
must be due to the same hardware limitation
that led to the maximum 60 MB/s transfer rate
to the SIB in collect.


6.5 Stereo I/O times

Table 2 summarizes the stereo I/O times.
The times for distribute and create working
copy were measured directly, and the times for
broadcast and collect were estimated from the
work earlier in this section (these times could
not be estimated directly in a running system
since they are overlapped with other computation;
thus, the times reported here are overestimates
in terms of the impact of these times on
total execution time).


Broadcast                                  1.92 ms
Distribute                                 3.00 ms
Create working copy   difference image     8.76 ms
                      sum image            4.90 ms
Collect                                    4.16 ms
Total                                      22.7 ms

Table 2. Stereo I/O times for three 240x256
images and 16 disparity levels

7.  Scaling  with  number  of  processors 

Figure  8  gives  the  scaling  of  the  stereo 
vision  algorithm  (for  three  240x256  images 
with  a  13x13  summation  window)  from  16  to 
64  processors.  The  execution  time  does  not 
decrease  smoothly  with  increasing  numbers  of 
processors  because  the  program  is  sensitive  to 
the  match  between  the  number  of  rows  in  the 
image  and  the  number  of  processors.  This  leads 
to  a  significant  loss  of  efficiency  with  processor 
numbers  of 40, 48,  and  56. 


Figure  8.  Scaling  of  performance  with 
number  of  processors. 
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8.  Conclusions  and  Future  Work 

The stereo vision system described here is
the fastest system for obtaining depth data presently
available; it is even faster than laser
ranging systems, for a comparable number of
depth measurements. Achieving this high performance
was not simply a matter of reimplementing
an existing algorithm on a parallel
computer, but rather required rethinking of the
algorithm in one step, and particular attention
to I/O within the algorithm:

•  The introduction of the sum image in the
   difference step made it possible to scale
   the algorithm beyond 18 processors.

•  Thinking of the I/O in iWarp in terms of a
   collection of primitive operators that could
   be assembled to get the Adapt I/O functions
   was the key to efficiency. Previous
   implementations of the Adapt I/O functions
   made apparently reasonable choices
   (e.g., using a binary tree and message
   passing to implement broadcast) but
   because there was no systematic approach
   to getting maximum efficiency, performance
   was poor, and, more importantly, it
   was impossible to determine whether the
   implementation was as good as possible.
   In the work reported here we are reasonably
   certain that the Adapt I/O functions
   are as efficient as possible.

•  Most parallel computers support only message
   passing, and have no systolic communication
   facilities. This limits the
   primitive operators to receive and send.
   Using just these operators, it is possible to
   build the Adapt I/O functions, but with
   great loss of efficiency, especially in
   broadcast. Perhaps by analysis of the
   primitive I/O functions the machine is
   capable of as it is being constructed,
   increased flexibility can be introduced,
   resulting in greater efficiency for common
   program communications functions.

Future work will concentrate on integrating
the computation described here with the camera
and display interfaces to demonstrate high-speed
stereo vision in the real world. The fastest
VME-based framebuffer known to us is the
Ironics IV-FCFG, which supports a 5.12 MB/s
transfer rate, but requires transferring 4 bytes
for every pixel (in RGB0 format), which quadruples
the data that must be transferred for display
(the three input images can be transferred
as one RGB image). Using this framebuffer, we
have been able to achieve a cycle time for the
stereo algorithm of 97 ms, including digitization,
input, and output, for an image processing
rate of 10 Hz. We are investigating other solutions
that will allow us to achieve the 15 Hz
processing time possible on iWarp; several
solutions look promising.
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Abstract 

One  of  the  key  tools  in  applying  physics- 
based  models  to  machine  vision  has  been 
the  analysis  of  color  histograms.  We 
present  here  an  algorithm  for  analyzing 
color  histograms  that  yields  estimates  of 
surface  roughness,  phase  angle  between 
the  camera  and  light  source,  and  illumina¬ 
tion  intensity.  In  addition,  an  understand¬ 
ing  of  the  effect  of  these  parameters  upon 
highlight  appearance  allows  us  to  make 
better  estimates  of  illumination  color  and 
object  color. 

We  test  our  algorithm  on  simulated  data 
and  the  results  compare  well  with  the 
known  simulation  parameters.  We  also  test 
our  method  on  real  images  and  the  results 
are  reasonably  close  to  the  actual  parame¬ 
ters  estimated  by  other  means.  Our  method 
for  estimating  scene  properties  is  very  fast, 
and  requires  only  a  single  color  image. 

1.  Introduction 

Color  histograms  have  long  been  used  by  the 
machine  vision  community  in  image  understand¬ 
ing.  Color  is  usually  thought  of  as  an  important 
property  of  objects,  and  is  often  used  for  segmenta¬ 
tion  and  classification.  Unfortunately  color  is  not 
uniform  for  all  objects  of  a  given  class,  nor  even 
across  a  single  object.  Therefore  color  variation 
must  be  modeled  in  some  way. 

The  earliest  uses  of  color  histograms  modeled  the 
histogram  as  a  Gaussian  cluster  in  color  space  [3]. 
In  1984  Shafer  showed  that  for  dielectric  materials 
with  highlights,  the  color  histogram  associated 
with a single object forms a plane [10]. This plane
is defined by two color vectors: a body reflection
vector  and  a  surface  reflection  vector.  Every  pixel’s 
color  is  a  linear  combination  of  these  two  colors.  In 
1987  Klinker  and  Gershon  independently  observed 
that  the  color  histogram  forms  a  T-shape  or  dog-leg 
in  color  space  [6],  [2].  These  histograms  are  com¬ 
posed  of  two  linear  clusters,  one  corresponding  to 
pixels  that  exhibit  mostly  body  reflection  and  one 
corresponding  to  pixels  exhibiting  surface  reflec¬ 
tion.  This  T-shape  made  it  possible  to  identify  char¬ 
acteristic  body  reflection  and  illumination  colors. 

However,  there  is  more  to  be  said  about  the  shape 
of  the  color  histogram.  In  previous  work,  we 
showed  that  color  histograms  have  identifiable  fea¬ 
tures  that  depend  in  a  precise  mathematical  way 
upon  such  non-color  scene  properties  as  surface 
roughness and imaging geometry [9]. In this paper
we show that three scene properties (the illumination
intensity, the roughness of the surface, and the
phase angle between camera and light source)
may be recovered from three measurements of the
histogram shape.

The  functions  that  relate  the  scene  properties  to  the 
histogram  measurements  are  interdependent  and 
highly  non-linear.  Since  an  exact  solution  is  not 
feasible,  we  have  developed  a  method  that  interpo¬ 
lates  between  data  from  a  lookup  table.  Our  work 
has  also  shown  how  the  colors  observed  in  a  high¬ 
light  depend  upon  more  than  just  the  scene  colors, 
but  also  upon  the  surface  roughness  and  the  imag¬ 
ing  geometry  [9].  Thus,  our  estimates  of  these 
scene  parameters  allow  us  to  improve  initial  esti¬ 
mates  of  the  illumination  color.  This  means  we  are 
better  able  to  discount  the  effect  of  the  illuminant 
to  recover  estimates  of  the  object’s  reflectance. 

We begin in section 2 with a brief explanation of
the relationship between the color histogram features
and the various scene parameters. Section 3
presents an algorithm to compute estimates of
these parameters from the histogram. Section 4
shows how our algorithm has been applied to real
images.

2.  Understanding  Color  Histograms 

When  we  talk  about  the  color  histogram,  we  mean 
a  distribution  of  colors  in  the  three-dimensional 
RGB  space.  For  a  typical  imaging  system  with  8 
bits for each color band, there are 256^3 “bins” into
which  a  pixel  may  fall.  In  this  work,  we  only  con¬ 
sider  whether  a  bin  is  full  or  empty.  We  do  not  use 
a  fourth  dimension  to  record  the  number  of  pixels 
which  have  a  particular  RGB  value.  A  fourth 
dimension  would  be  difficult  to  visualize,  but  more 
significantly  would  also  be  dependent  on  such 
things  as  object  size  and  shape. 
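The occupancy-only histogram described above can be sketched in a few lines of Python. This is only an illustration: the function name `occupied_bins` and the sample pixel values are assumptions made for the example, not part of the original implementation.

```python
def occupied_bins(pixels):
    """Return the set of RGB bins hit by at least one pixel.

    Each pixel is an (r, g, b) tuple of 8-bit values. Only occupancy
    is kept; the number of pixels per bin is deliberately discarded,
    so the result does not depend on object size.
    """
    bins = set()
    for r, g, b in pixels:
        for value in (r, g, b):
            if not 0 <= value <= 255:
                raise ValueError("channel values must be in 0..255")
        bins.add((r, g, b))
    return bins


# Hypothetical pixels: the repeated color fills its bin only once.
sample = [(10, 20, 30), (10, 20, 30), (200, 180, 90)]
print(len(occupied_bins(sample)), "of", 256 ** 3, "bins are full")
```

Storing only the occupied bins as a set avoids allocating all 256^3 cells when the object covers a small part of color space.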

In  our  work  we  use  the  “Dichromatic  Reflection 
Model”,  where  the  light  L  reflected  from  a  dielec¬ 
tric  object  is  the  sum  of  two  components:  a  body 
component  and  a  surface  component  [10]: 

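$$L(\lambda, \theta_i, \theta_r, \theta_p) = L_b(\lambda, \theta_i, \theta_r, \theta_p) + L_s(\lambda, \theta_i, \theta_r, \theta_p)$$

Each of the two components is a function of the
wavelength of light ($\lambda$) and the angles of incidence
($\theta_i$), reflection ($\theta_r$), and phase ($\theta_p$). The
Dichromatic Reflection Model further states that
each component in turn may be separated into a
color term $c$ that depends only on $\lambda$, and a magnitude
term $m$ that depends only upon $\theta_i$, $\theta_r$, and $\theta_p$:

$$L(\lambda, \theta_i, \theta_r, \theta_p) = c_b(\lambda)\, m_b(\theta_i, \theta_r, \theta_p) + c_s(\lambda)\, m_s(\theta_i, \theta_r, \theta_p)$$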
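As a small illustration of the separated form of the model, the sketch below builds RGB pixel values as a weighted sum of a body color and a surface color. The function name `dichromatic_pixel` and all numeric values are hypothetical, chosen only to show that adding surface reflection displaces a pixel along the surface color.

```python
def dichromatic_pixel(c_b, m_b, c_s, m_s):
    """Combine body and surface reflection: p = m_b * c_b + m_s * c_s."""
    return tuple(m_b * b + m_s * s for b, s in zip(c_b, c_s))


body_color = (0.8, 0.2, 0.1)      # hypothetical reddish body color
surface_color = (1.0, 1.0, 1.0)   # hypothetical white illumination

matte = dichromatic_pixel(body_color, 0.5, surface_color, 0.0)
highlight = dichromatic_pixel(body_color, 0.5, surface_color, 0.3)

# The highlight pixel differs from the matte pixel only by a multiple
# of the surface color.
print([h - m for h, m in zip(highlight, matte)])
```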

Figure  1  shows  a  sketch  of  a  typical  color  histo¬ 
gram  for  a  dielectric  surface  illuminated  by  a  single 
light  source.  As  labeled,  the  histogram  has  two  lin¬ 
ear  clusters  of  pixels:  the  body  reflection  cluster 
and  the  highlight  cluster.  The  first  of  these  clusters 
extends from the black corner of the cube (point a)
to  the  point  of  maximum  body  reflection  (point  b). 
The  other  cluster  starts  somewhere  along  the  body 
reflection  cluster  (point  c)  and  extends  to  the  high¬ 
light  maximum  (point  d). 

2.1.  The  Body  Reflection  Cluster 

Figure 1: Histogram of an object

The linear cluster that we call the body reflection
cluster corresponds to pixels that exhibit mostly
body reflection with very little surface reflection. If
there is no ambient illumination in the scene, this
cluster begins at the black point of the color cube
(point a), corresponding to points on the surface
with normals 90 degrees or more away from the
direction of the illumination. The point at the other
extreme of the body reflection cluster (point b)
corresponds to the largest amount of body reflection
seen anywhere on the object. Assuming that
body reflection is Lambertian, the magnitude term
will obey the relation

$$m_b = \gamma B_b \cos(\theta_i) \qquad (1)$$

where $\theta_i$ is the angle of illumination incidence.
The gain in converting photons measured by the
imaging array into pixel values is represented by $\gamma$.
The brightness of the body reflection is represented
by the term $B_b$. This factors in both the reflectance
of the object (albedo) and the intensity of the light.

If  the  object  exhibits  all  possible  surface  normals, 
the  body  reflection  cluster  will  be  full  length  and 
densely  filled.  If  the  object  is  composed  of  a  small 
number  of  flat  surfaces,  there  will  be  gaps  in  the 
body  reflection  cluster.  For  this  paper  we  will 
assume  that  objects  we  are  looking  at  have  a  broad, 
continuous  distribution  of  surface  normals. 

A  vector  fitted  to  the  body  reflection  cluster  (from 
point  a  to  b)  will  point  in  the  direction  of  the  body 
reflection  color,  which  is  the  product  of  the  object 
color  and  the  illumination  color.  If  the  illumination 
color  has  been  determined  from  analysis  of  the 
highlight,  the  object  color  may  be  calculated  by 
dividing  out  the  illumination  color,  as  shown  in 
some color constancy methods [1], [4], [5], [11].

If we assume that there is some point on the object
which has a surface normal pointing directly at the
light source (and which is visible to the camera),
then at that point $\cos(\theta_i) = 1$. This means that the
length of the fitted vector (the magnitude $|ab|$) corresponds
to the gain times the object’s apparent
brightness $B_b$. If the intensity of the illumination is
also recovered from highlight analysis, then the
object albedo can be separated from the object’s
apparent brightness (assuming that the gain of the
camera has been calibrated). This makes it possible
to tell a bright light shining on a dark surface
from a dim light shining on a bright surface.


2.2.  The  Highlight  Cluster 

The  cluster  of  pixels  we  call  the  highlight  cluster 
corresponds  to  pixels  that  show  a  non-negligible 
amount  of  surface  reflection.  These  pixels  come 
from  the  areas  of  the  image  that  form  the  highlight. 
In  the  histogram,  the  highlight  cluster  starts  where 
it  intersects  with  the  body  reflection  cluster  (point  c 
in  Figure  1)  and  extends  upwards  from  there  to  the 
brightest  point  of  the  highlight  (point  d). 

In this presentation we use the Torrance-Sparrow
model of scattering [12]. This models a surface as a
collection of tiny facets, with $\sigma$ describing the
standard deviation of facet angles with respect to
the global surface normal. Smooth surfaces will
have a small standard deviation while rougher surfaces
will have a larger standard deviation. The
equation for scattering gives the amount of surface
reflection as

$$m_s = \frac{\gamma B_s F G(\theta_i, \theta_r, \theta_p)\, \alpha}{\sigma \cos(\theta_r)} \exp\left(-\frac{\theta_s^2}{2\sigma^2}\right) \qquad (2)$$

where $\theta_s$ is the off-specular angle and $\theta_r$ is the
angle of reflectance. $B_s$ is the intensity of the illumination,
$\gamma$ is the camera gain, and $\alpha$ is a constant
that includes the facet size (a variable in the original
Torrance-Sparrow model). $G$ is the attenuation
factor and is a complicated function of imaging
geometry (see [12] for details).


$F$ is the Fresnel coefficient that describes the percentage
of the light that is reflected at the interface;
it is a function of geometry, wavelength, polarization
state, and index of refraction of the material in
question. However it is very weakly dependent on
incident angle and wavelength (over the visible
range), so we will follow the Neutral Interface
Reflection model and assume that it is constant for
a given material [7]. Furthermore, for a wide range
of plastics and paints, the indices of refraction are
very nearly identical. For this paper we will
assume that materials have an index of refraction
of 1.5, corresponding to 4.0% Fresnel reflectance.
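The 4.0% figure is consistent with the standard Fresnel reflectance at normal incidence for a material of index 1.5 in air. The sketch below checks this; the choice of normal incidence and a surrounding index of 1.0 are assumptions of the example, and the function name is illustrative.

```python
def fresnel_normal_incidence(n, n_outside=1.0):
    """Fraction of unpolarized light reflected at normal incidence."""
    return ((n - n_outside) / (n + n_outside)) ** 2


reflectance = fresnel_normal_incidence(1.5)
print(f"{100.0 * reflectance:.1f}% reflected")
```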

2.2.1.  Length  of  Highlight  Cluster 

When  looking  at  highlights  on  a  variety  of  sur¬ 
faces,  we  quickly  observe  that  highlights  are 
brighter  and  sharper  on  some  surfaces,  while  they 
are  dimmer  and  more  diffused  on  other  surfaces. 
Very  shiny  surfaces  exhibit  only  a  tiny  amount  of 
scattering  of  the  surface  reflection,  whereas  very 
matte  surfaces  have  a  great  deal  of  scattering. 

We see from equation (2) that the sharpness of the
peak is determined by the standard deviation $\sigma$,
and that the height of the peak is inversely proportional
to $\sigma$. Intuitively this makes sense, since surface
reflection scattered over a smaller area will be
more “concentrated.” A smooth object will have a
small standard deviation of facet slopes ($\sigma$), resulting
in a long highlight cluster. A rough object will
have a large $\sigma$, and so will exhibit a shorter cluster.
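The inverse dependence of peak height on roughness can be checked numerically. The sketch below assumes a Gaussian facet distribution, so that surface reflection falls off as exp(-θ_s² / (2σ²)) and is scaled by 1 / (σ cos θ_r); it also sets the attenuation factor G and the constant α to 1. These simplifications, the function name, and the default values are assumptions of the example rather than the paper's simulation.

```python
import math


def surface_reflection(theta_s_deg, sigma_deg, theta_r_deg=0.0,
                       gain=1.0, b_s=1.0, fresnel=0.04, g=1.0, alpha=1.0):
    """Simplified surface reflection magnitude with G and alpha fixed."""
    sigma = math.radians(sigma_deg)
    theta_s = math.radians(theta_s_deg)
    theta_r = math.radians(theta_r_deg)
    falloff = math.exp(-theta_s ** 2 / (2.0 * sigma ** 2))
    scale = gain * b_s * fresnel * g * alpha / (sigma * math.cos(theta_r))
    return scale * falloff


smooth_peak = surface_reflection(0.0, sigma_deg=1.0)
rough_peak = surface_reflection(0.0, sigma_deg=15.0)

# With everything else held fixed, the peak ratio equals the inverse
# ratio of the two roughness values.
print(smooth_peak / rough_peak)
```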

Equation (2) indicates that the intensity of the light
source $B_s$ also affects the magnitude of the surface
reflection, and thus the length of the highlight cluster.
It is obvious from equation (2) that the length is
directly proportional to this brightness.

Equation (2) also predicts that imaging geometry
will have an effect upon highlight magnitude, as
indicated by the $\cos(\theta_r)$ term in the denominator
and the attenuation term $G$ in the numerator. The
effect on the length of the highlight cluster is small
but noticeable, causing the length to increase as the
phase angle is increased.

2.2.2.  Width  of  Highlight  Cluster 

Another difference between histograms for smooth
and rough surfaces is the width of the highlight
cluster where it meets the body reflection cluster
(the distance from point $c_1$ to point $c_2$ in Figure 1).
The highlight cluster will be wider for rougher surfaces,
and narrower for smoother surfaces. This is
because rougher objects will scatter surface reflection
more widely, over a larger number of reflectance
angles.

In the color histogram, any amount of surface
reflection results in pixels that are displaced from
the body cluster in the direction of the illumination
color.  If  we  take  any  highlight  pixel  and  project 
along  the  surface  color  vector  onto  the  body  reflec¬ 
tion  vector,  we  can  tell  how  much  body  reflection 
is  present  in  that  pixel.  If  we  consider  all  the  pixels 
in  the  highlight  area  of  the  image  and  look  at  how 
much  body  reflection  is  in  each  of  them,  we  obtain 
a  range  of  body  reflection  magnitudes.  The  rougher 
the  surface,  the  larger  the  range  of  body  reflection 
magnitudes,  since  the  highlight  is  spread  over  a 
larger  number  of  surface  normals.  This  property  is 
independent  of  the  object’s  size  and  shape. 

The brightness of the illumination, $B_s$, will also
have  an  effect  on  the  width  of  the  highlight  cluster. 
As  the  light  intensity  is  increased,  points  on  the 
surface  that  had  amounts  of  surface  reflection  too 
small  to  be  noticed  may  become  bright  enough  to 
be  included  with  the  highlight  pixels.  Clearly  the 
width  will  grow  as  the  light  intensity  grows.  How¬ 
ever  the  growth  is  very  slow,  much  slower  than  the 
linear  growth  of  the  length  with  increasing  illumi¬ 
nation  intensity.  This  means  that  the  ratio  of  the 
cluster  length  to  the  width  grows  as  the  illumina¬ 
tion  is  increased,  making  it  possible  to  distinguish 
a  bright  source  illuminating  a  rough  object  from  a 
dim  source  illuminating  a  shiny  one. 

Although  the  width  of  the  highlight  cluster  does 
not  depend  upon  the  object’s  size  and  shape,  it  does 
depend  upon  the  phase  angle.  To  see  why,  consider 
a  highlight  that  spreads  15  degrees  in  every  direc¬ 
tion  from  its  maximum.  If  the  camera  and  light 
source  are  separated  by  30  degrees,  the  perfect 
specular  angle  will  be  at  15  degrees  with  respect  to 
the  illumination  direction.  The  highlight  will 
spread  over  points  with  surface  normals  ranging 
from 0 degrees to 30 degrees. (For this explanation,
we will ignore the $1/\cos(\theta_r)$ term in equation
(2).) The amount of body reflection at these points
will  vary  from  cos  (0)  =  1.0  to  cos  (30)  =  0.87,  a 
width  of  0.13.  On  the  other  hand,  if  the  camera  and 
light  source  are  separated  by  90  degrees,  the  per¬ 
fect  specular  angle  will  be  at  45  degrees,  with  the 
highlight  spreading  from  30  degrees  to  60  degrees. 
Now  the  amount  of  body  reflection  will  vary  from 
cos  (30)  =  0.87  to  cos  (60)  =  0.50,  giving  a  much 
larger  width  of  0.37. 
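The two cases above can be reproduced with a short calculation. This sketch makes the same simplifications as the text (Lambertian body reflection, the 1/cos term ignored) and additionally assumes the surface normals of interest lie in the plane containing the camera and the light source; the function name is illustrative.

```python
import math


def body_reflection_width(phase_deg, spread_deg):
    """Range of cos(incidence) covered by a highlight of given spread."""
    specular = phase_deg / 2.0                      # perfect specular angle
    nearest = max(specular - spread_deg, 0.0)       # normal closest to light
    farthest = specular + spread_deg                # normal farthest from light
    brightest = math.cos(math.radians(nearest))
    dimmest = math.cos(math.radians(farthest))
    return brightest - dimmest


for phase in (30.0, 90.0):
    print(phase, round(body_reflection_width(phase, 15.0), 2))
```

The same 15 degree spread covers a much larger range of body reflection at the larger phase angle because the cosine changes faster away from zero.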

2.2.3.  Intersection  of  Clusters 

When we introduced the diagram in Figure 1, we
described the highlight cluster as beginning “somewhere”
along the body reflection cluster. Now we
will show how to pinpoint the location. The distance
along the body reflection cluster where the
two clusters intersect (the length of ac divided by
the length of ab in Figure 1) shows the amount of
body reflectance at those points on the surface that
are in the highlight. Assuming that body reflection
is Lambertian, the amount of body reflection is
proportional to the cosine of the incidence angle. If
the two clusters intersect at the maximum point on
the body reflection cluster, it means the highlight
occurs at those points that have the maximum
amount of body reflection, where surface normals
point directly at the light source. If the two clusters
meet halfway along the body reflection cluster, the
highlight must occur at points with surface normals
pointing acos(1/2), or 60 degrees, away from the
illumination direction.

If  the  body  reflection  is  Lambertian,  it  does  not 
depend  in  any  way  upon  the  angle  from  which  it  is 
viewed.  Thus  the  body  reflection  does  not  tell  us 
anything  about  the  camera  direction.  However,  the 
surface reflection is dependent upon both the illumination
and camera directions. If we ignore for a
moment the $1/\cos(\theta_r)$ term in equation (2), we
see that the maximum amount of surface reflection
will  occur  at  those  points  on  the  surface  where  the 
angle  of  incidence  equals  the  angle  of  reflection. 
Thus  if  the  highlight  occurs  at  a  point  where  the 
surface  normal  faces  10  degrees  away  from  the 
light  source  direction,  the  light  source  and  camera 
must  be  20  degrees  apart  with  respect  to  that  point 
on  the  surface. 
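Under these simplifications, the intersection ratio gives a first estimate of the phase angle directly. The sketch below assumes Lambertian body reflection and a highlight centered exactly at the perfect specular angle (no off-specular shift); the function name is illustrative.

```python
import math


def phase_angle_from_intersection(ratio):
    """Estimate the phase angle in degrees from the intersection ratio.

    ratio is the distance along the body reflection cluster at which
    the highlight cluster begins, as a fraction of the cluster length.
    """
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must lie between 0 and 1")
    incidence_deg = math.degrees(math.acos(ratio))
    return 2.0 * incidence_deg   # incidence equals reflection at the peak


# A highlight on normals 10 degrees from the light implies a camera and
# light source about 20 degrees apart.
print(phase_angle_from_intersection(math.cos(math.radians(10.0))))
```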

The $1/\cos(\theta_r)$ term in equation (2) means that the
maximum amount of surface reflection will not
always occur precisely at the perfect specular
angle. This is particularly true of rougher surfaces
where the highlight is spread over a wide range of
reflectance angles, causing significant variations in
$1/\cos(\theta_r)$. This causes the “off-specular peaks”
described in [13]. The result is that the intersection
is slightly dependent upon the surface roughness.

2.2.4.  Direction  of  Highlight  Cluster 

The highlight cluster is usually long and narrow in
shape and a vector can be fitted to it (from point c
to d in Figure 1). Klinker argued that this vector
will usually correspond closely to the surface
reflection color [5]. This is true for smooth objects
where  the  highlight  has  a  small  area,  and  for  imag¬ 
ing  geometries  where  the  body  reflection  changes 
slowly  over  that  area.  In  this  case,  the  amount  of 
body  reflection  at  the  base  of  the  highlight  cluster 
and  the  amount  at  the  tip  varies  by  a  small  amount. 

On  the  other  hand,  if  the  object  is  rough  and  the 
highlight  occurs  on  a  part  of  the  object  where  the 
cosine  of  the  incidence  angle  changes  more  rapidly, 
then  the  amount  of  body  reflection  at  the  base  of 
the  highlight  cluster  may  vary  significantly  from 
the  amount  at  the  tip.  This  has  the  effect  of  skewing 
the  highlight  cluster  away  from  the  direction  of  the 
illumination  color.  The  estimate  of  the  illumination 
color  made  from  fitting  a  vector  to  this  cluster  will 
be  somewhat  inaccurate. 

An  inaccurate  estimate  of  illumination  color  will, 
in  turn,  bias  estimates  of  the  object  color.  The  vec¬ 
tor  fitted  to  the  highlight  cluster  is  a  reasonable  first 
estimate  of  illumination  color,  but  we  now  know 
that  it  may  be  skewed.  If  we  know  the  surface 
roughness  and  imaging  geometry,  we  can  calculate 
the  amount  of  skewing  and  compensate  for  it. 

In  sections  2.2.1  and  2.2.2  we  showed  that  the 
roughness  of  the  object  affects  the  length  and  width 
of  the  highlight  cluster,  but  that  there  is  some 
dependence  on  imaging  geometry.  Section  2.2.3 
showed  that  the  imaging  geometry  determines  the 
intersection  of  the  two  clusters,  but  that  there  is 
some  dependence  upon  roughness.  Furthermore, 
the  intensity  of  the  illumination  affects  the  length 
and  width  of  the  highlight  cluster  as  well,  although 
in  different  ways.  The  degree  of  dependence  of 
each  histogram  measurement  upon  the  scene 
parameters is characterized in Table 1.

               Roughness   Phase Angle   Illumination Intensity
Length         strong      weak          strong
Width          strong      strong        weak
Intersection   weak        strong        none

Table 1: Dependence of histogram features
upon scene parameters


In this section we have shown that the direction of
the highlight cluster may be skewed away from the
direction of the illumination color, depending upon
the roughness and the phase angle. The highlight
cluster length, width and intersection are defined
with respect to the highlight cluster direction, so
measurements of these features depend upon the
estimate of the illumination color. If that direction
is incorrect, the other histogram measurements will
be affected. Obviously these factors are interdependent.
Therefore we solve for the roughness, phase
angle, and illumination intensity simultaneously.

3.  Analyzing  Color  Histograms 

The  previous  section  described  the  relationship 
between  scene  parameters  and  histogram  features. 
Understanding  the  relationship  is  the  first  step  in 
analyzing  color  histograms.  The  next  step  is  figur¬ 
ing  out  how  to  actually  exploit  the  histogram  to 
recover  quantitative  measures  of  scene  properties. 

The  known  image  parameters  which  can  be  mea¬ 
sured  from  the  color  histogram  are  the  body  reflec¬ 
tion  cluster’s  length  and  direction;  the  highlight 
cluster’s  length,  width,  and  direction;  and  the  inter¬ 
section  point  of  the  two  clusters.  This  gives  four 
scalar  values  and  two  vector  quantities.  They  will 
be  referred  to  by  the  following  variables: 

$l$ - length of highlight cluster
$w$ - width of highlight cluster
$i$ - intersection of two clusters
$b$ - length of body reflection cluster
$d_s$ - direction of highlight cluster
$d_b$ - direction of body reflection cluster

The  unknown  scene  parameters  which  we  would 
like  to  recover  from  the  histogram  can  also  be 
divided  into  scalar  values  and  vector  quantities. 
These  variables  are: 

$\sigma$ - optical roughness
$\theta_p$ - phase angle
$B_s$ - illumination intensity
$B_o$ - object albedo
$c_s$ - chromaticity of light source
$c_o$ - object chromaticity under “white” light

For convenience, we have separated the illumination
and object reflectance into intensity components
and chromatic components. As we
mentioned in section 2.1, the object’s chromaticity
$c_o$ may be recovered in a straightforward way from
the direction of the body reflection cluster $d_b$ and
the color of the light source $c_s$. The red component
of the body reflection vector is divided by the red
component of the light source color, the green
component divided by the green component, and
the blue component divided by the blue component.
The result is normalized to length 1.

The albedo of the object is also recovered in a
straightforward manner from the length of the body
reflection cluster and the illumination intensity.
The length $b$ is equal to the maximum amount of
body reflection, where $\cos(\theta_i) = 1$. The object
albedo is simply the cluster length $b$ divided by
the illumination intensity $B_s$ and by the gain $\gamma$.
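The two straightforward recoveries described above can be sketched as follows. The function names and the numeric inputs are hypothetical; the illumination color, illumination intensity, and camera gain are assumed to be already known from highlight analysis and calibration.

```python
import math


def object_chromaticity(body_direction, light_color):
    """Divide out the light color band by band, then normalize to length 1."""
    ratio = [b / s for b, s in zip(body_direction, light_color)]
    norm = math.sqrt(sum(v * v for v in ratio))
    return tuple(v / norm for v in ratio)


def object_albedo(body_length, light_intensity, gain):
    """Body cluster length divided by illumination intensity and gain."""
    return body_length / (light_intensity * gain)


# Hypothetical measurements: a body cluster direction seen under a
# slightly yellowish light, and a body cluster 120 pixel values long.
chroma = object_chromaticity((0.72, 0.54, 0.20), (1.0, 0.9, 0.5))
albedo = object_albedo(120.0, light_intensity=0.8, gain=200.0)
print(chroma, albedo)
```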

For the remaining unknowns the situation is not so
simple. From section 2 we know that the length $l$ is
related to surface roughness and illumination intensity,
but is also dependent upon imaging geometry.
The remaining knowns ($l$, $w$, $i$, $d_s$) and unknowns
($\sigma$, $\theta_p$, $B_s$, $c_s$) will be examined in detail in the
next few sections.


3.1. Exact Solutions

Equations (1) and (2) describe the amounts of body
and surface reflection, $m_b$ and $m_s$, as a function of
imaging geometry and light intensity. Equation (2)
also shows how the amount of surface reflection
varies with the roughness $\sigma$. Unfortunately it is not
possible to directly solve for these scene parameters
from the histogram measurements. For example,
the length of the highlight cluster indicates the
maximum amount of surface reflection seen anywhere.
For given values of $\sigma$, $\theta_p$, and $B_s$, the
length $l$ may be calculated as

$$l = \max\left[\frac{\gamma B_s F G(\theta_i, \theta_r, \theta_p)\, \alpha}{\sigma \cos(\theta_r)} \exp\left(-\frac{\theta_s^2}{2\sigma^2}\right)\right] \qquad (3)$$

over all values of $\theta_i$, $\theta_r$, and $\theta_s$. However, since $G$
contains several trigonometric functions and the
roughness term $\sigma$ occurs both inside and outside
the exponential, there is no analytic solution for $\sigma$
from the length $l$.

Although equation (3) has no analytic solution, it
might be possible to solve it iteratively, through
some sort of search (for example by gradient
descent). However that assumes the length $l$ can be
accurately measured from the histogram. In
section 2.2.1, the length is measured from the tip of
the highlight cluster to its base along the direction
of the highlight color. Unfortunately the highlight
color is not typically known in advance. As
described in section 2.2.4, the highlight color may
be estimated by fitting a vector to the pixels that
form the highlight cluster. However that cluster
may be skewed.

This is shown in Figure 2, which is the histogram
of a simulated rough object illuminated with white
light. The dotted “measured” line shows the direction
calculated for the best fit vector to the highlight
cluster. (In this case the best fit line to the
cluster will not pass through the brightest point in
the highlight cluster.) The position of the dotted
line in Figure 2 shows the projection of the brightest
highlight pixel onto the body reflection vector
along the best fit vector, since the length is calculated
from the brightest pixel. The “ideal” line indicates
the direction of the actual illumination color.
The skewing causes the measured length to be
longer than it would have been if the correct illumination
color had been known.

Figure 2: Skewed length measurement

This  contrasts  the  “ideal”  length  that  we  would  like 
to  obtain  from  the  histogram,  with  the  “measured” 
length  that  can  actually  be  recovered  without  a  pri¬ 
ori  knowledge  of  the  illumination  color.  When  the 
highlight  is  skewed,  the  values  measured  from  the 
histogram  will  be  different  from  the  ideal  values 
that  would  be  calculated  from  analyzing  equation 
(2).  In  the  absence  of  a  priori  information,  we  can 
only  measure  what  is  available  in  the  histogram. 
How  do  we  derive  the  ideal  values  from  the  ones 
that  are  measured?  Once  that  is  done  how  do  we 
recover  the  image  parameters  in  which  we  are 
interested? 

3.2.  Approximate  Solution 

Our approach is to recover scene parameters by an
approximate method, directly from the initial histogram
measurements. Therefore, we do not need to
recover “ideal” histogram values from the “measured”
ones. The ideal values of the histogram are
a useful abstraction since their relationship to scene
parameters is easy to explain. However they cannot
be obtained from the histogram without knowledge
of the illumination color.

Figure  3  shows  the  variation  in  the  measured 
length  as  roughness  and  phase  angle  are  changed. 
Each  value  of  length  describes  a  contour  within  the 
space  of  roughness  and  phase  angles.  Given  a 
length  measurement  from  a  histogram,  the  associ¬ 
ated  scene  parameters  must  lie  somewhere  on  that 
contour.  When  illumination  intensity  is  considered 
along  with  roughness  and  phase  angle,  these  three 
scene  parameters  form  a  three  dimensional  param¬ 
eter  space.  A  length  measurement  would  then 
describe  a  two  dimensional  surface  within  that 
space,  showing  the  possible  roughness,  phase 
angle,  and  light  intensity  values  that  could  give  rise 
to  a  histogram  with  that  length  highlight  cluster. 


Figure 3: Variation in measured length with
changes in roughness and phase angle

The  intersection  and  width  measurements  will  also 
describe  surfaces  within  the  parameter  space.  The 
hope  is  that  the  surfaces  for  each  of  the  histogram 
measurements  will  intersect  at  a  single  point  in  the 
parameter  space,  making  it  possible  to  recover 
unique  values  for  surface  roughness,  phase  angle, 
and illumination intensity. So the obvious questions
are: how can we generate these contours of equal-
length, equal-width, and equal-intersection; and do
these contours intersect to give unique solutions?

Section 3.1 pointed out that there is no analytic
solution to generate the contours. However,
Figure 3 shows how the highlight cluster length
varies with roughness and phase angle at discrete
points. These values come from simulating an
object with those parameters and then measuring
the length, width and intersection of the highlight
cluster in the resulting color histogram. By simulating
a large range of roughness values, phase
angles, and illumination intensities, we create
lookup tables of length, width, and intersection
measurements. We then search through the lookup
table to find the scene parameters that correspond
to a given set of histogram measurements.

The  other  question  is  whether  a  unique  solution 
exists  for  a  given  triple  (/,  w,  /) .  If  some  triple  has 
more  than  one  solution,  that  means  that  different 
combinations  of  scene  parameters  can  give  rise  to 
identical  histogram  measurements.  It  also  means 
that  a  search  through  the  contours  in  parameter 
space  cannot  be  guaranteed  to  converge.  We  have 
explored the distribution of possible (l, w, i) triples,
and found that each set of measurements is
associated  with  at  most  one  set  of  scene  parame¬ 
ters.  (Of  course,  many  points  within  the  measure¬ 
ment  space  will  not  correspond  to  any  set  of  scene 
parameters.)  The  only  remaining  problem  is  to 
determine  which  set  of  scene  parameters  is  associ¬ 
ated  with  a  given  measurement  triple. 

3.3.  Generating  Lookup  Tables 

The  range  of  roughness  values,  phase  angles,  and 
illumination  intensities  used  to  create  the  lookup 
table  is  shown  in  Table  2.  The  roughness  value  is 
the  standard  deviation  of  facet  angles  with  respect 
to  the  global  surface  normal.  The  phase  angle  is  the 
angle  between  the  camera  and  light  source  with 
respect  to  the  object.  The  light  intensity  is  a  per¬ 
centage  of  a  hypothetical  light’s  maximum  output. 


              Min    Max    Increment   Total Used
Roughness     1°     15°    2°          8
Phase Angle   0°     90°    10°         10
Intensity     50%    100%   10%         6
Overall                                 480

Table 2: Range of parameters
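
The parameter grid in Table 2 can be enumerated directly. This short sketch (variable names are our own) simply confirms that the three ranges multiply out to the 480 simulations quoted in the table.

```python
import itertools

roughness_deg = list(range(1, 16, 2))      # 1, 3, ..., 15 degrees -> 8 values
phase_angle_deg = list(range(0, 91, 10))   # 0, 10, ..., 90 degrees -> 10 values
intensity_pct = list(range(50, 101, 10))   # 50, 60, ..., 100 percent -> 6 values

# Every combination corresponds to one simulated object in the lookup table.
parameter_grid = list(itertools.product(roughness_deg, phase_angle_deg, intensity_pct))
```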


For  each  set  of  roughness,  phase  angle,  and  light 
intensity  values,  a  simulated  object  is  generated. 
The histogram associated with the object is automatically
separated into body reflection and highlight
clusters. The technique is similar to that
described in [5]. Vectors are fitted to each of the
clusters.




Once the direction of the highlight cluster has been
measured, the vector $d_s$ is used to project all highlight
pixels onto the body reflection vector $d_b$.
These projections determine the relative contributions
of the vectors in each pixel. Each color pixel
$p$ in the histogram can then be defined as

$p = m_s d_s + m_b d_b$

This is essentially the dichromatic equation,
although the highlight cluster direction $d_s$ may differ
from the actual surface color. The histogram
measurements are then defined simply as

$l = \max(m_s)$ over all $p$   (4)

$b = \max(m_b)$ over all $p$   (5)

$i = m_b / b$ for that $p$ with maximum $m_s$   (6)

$w = [\max(m_b) - \min(m_b)] / b$ over all $p$ for which $m_s > T$   (7)


The  threshold  T  is  set  according  to  the  noise  level 
of  the  camera. 
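
The measurements in equations (4) to (7) can be sketched as follows. This is only an illustration under stated assumptions: the pixel coefficients $m_s$ and $m_b$ are obtained here by an ordinary least-squares decomposition onto the two fitted vectors, which may differ from the projection actually used, and the function and argument names are hypothetical.

```python
import numpy as np

def histogram_measurements(pixels, d_s, d_b, threshold):
    """Return (l, b, i, w) for color pixels decomposed as p = m_s*d_s + m_b*d_b."""
    pixels = np.asarray(pixels, dtype=float)          # shape (K, 3)
    basis = np.column_stack([d_s, d_b])               # shape (3, 2)
    # Least-squares solve basis @ [m_s, m_b] = p for every pixel at once.
    coeffs, *_ = np.linalg.lstsq(basis, pixels.T, rcond=None)
    m_s, m_b = coeffs
    length = m_s.max()                                # equation (4)
    b = m_b.max()                                     # equation (5)
    intersection = m_b[np.argmax(m_s)] / b            # equation (6)
    in_highlight = m_s > threshold                    # pixels above the noise threshold T
    width = (m_b[in_highlight].max() - m_b[in_highlight].min()) / b   # equation (7)
    return length, b, intersection, width
```

Dividing the intersection and width by b makes those two measurements relative to the extent of the body reflection cluster.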

3.4.  Calculating  Roughness,  Phase  Angle, 
and  Illumination  Intensity 

Once  the  length,  width  and  intersection  have  been 
measured,  the  problem  is  to  determine  which  scene 
parameters  will  give  rise  to  that  shape  histogram. 
Our  work  uses  a  polynomial  approximation  to  the 
lookup  table  data.  We  assume  that  the  roughness 
can  be  approximated  as  a  polynomial  function  of 
the  length,  width,  and  intersection  measurements 
of  the  histogram. 

$\sigma \approx f_n(l, w, i) = A + Bl + Cw + Di + El^2 + Fw^2 + Gi^2 + Hlw + Ili + Jiw + \dots$   (8)

The  lookup  table  provides  the  means  for  calculat¬ 
ing  the  coefficients  of  the  polynomial.  It  provides 
almost  500  sets  of  histogram  measurements  and  the 
associated  roughness  values.  Least  squares  estima¬ 
tion  is  used  to  calculate  the  best  fit  nth  degree  poly¬ 
nomial  to  the  data.  A  fourth  degree  polynomial  is 
used  in  our  experiments. 
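
A least-squares fit of a polynomial in three variables can be sketched as below. The helper names are hypothetical and the monomial ordering is arbitrary; the sketch only illustrates the general technique of building a design matrix of all monomials up to a given total degree and solving for the coefficients.

```python
import itertools
import numpy as np

def design_matrix(l, w, i, degree):
    """Columns are all monomials l^a * w^b * i^c with a + b + c <= degree."""
    l, w, i = (np.asarray(v, dtype=float) for v in (l, w, i))
    columns = []
    for a, b, c in itertools.product(range(degree + 1), repeat=3):
        if a + b + c <= degree:
            columns.append(l**a * w**b * i**c)
    return np.column_stack(columns)

def fit_polynomial(l, w, i, target, degree=4):
    """Least-squares coefficients approximating `target` from (l, w, i)."""
    matrix = design_matrix(l, w, i, degree)
    coeffs, *_ = np.linalg.lstsq(matrix, np.asarray(target, dtype=float), rcond=None)
    return coeffs

def evaluate_polynomial(coeffs, l, w, i, degree=4):
    """Apply fitted coefficients to new measurements."""
    return design_matrix(l, w, i, degree) @ coeffs
```

A fourth-degree polynomial in three variables has 35 coefficients, so the roughly 500 table entries leave the fit well overdetermined.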

Similarly, the phase angle and illumination intensity
are also approximated as polynomial functions
of the histogram length, width, and intersection.

$\theta_p = g_n(l, w, i)$   (9)

$B_s = h_n(l, w, i)$   (10)

Generating  the  lookup  table  is  obviously  very  time 
consuming  (about  8  hours  on  a  SPARC  II)  since  it 
involves  calculating  almost  500  graphics  simula¬ 
tions.  However,  the  table  generation  and  coefficient 
calculation  only  need  to  be  done  once  and  can  be 
done  ahead  of  time.  At  run-time  our  system  takes  a 
histogram  from  an  image  with  unknown  parame¬ 
ters  and  automatically  separates  it  into  two  clusters 
and  measures  their  dimensions.  The  polynomial 
equations  are  then  applied  to  quickly  estimate  the 
roughness,  phase  angle,  and  illumination  intensity. 
The  run-time  portion  is  very  quick,  taking  less  than 
3  seconds  on  a  histogram  containing  about  3600 
pixels.  If  the  histogram  has  already  been  split  into 
clusters  in  the  process  of  segmenting  the  image  [6], 
the  time  to  calculate  the  scene  parameters  is  less 
than 1 second.

To  test  the  polynomial  approximations,  one  hun¬ 
dred  test  images  were  simulated  and  then  analyzed 
by  our  method.  The  surface  roughness,  phase 
angle,  and  illumination  intensity  values  used  in  the 
test  images  were  chosen  by  a  pseudo-random  num¬ 
ber  generator.  The  test  values  were  constrained  to 
lie  within  the  ranges  used  in  the  lookup  table. 

The calculated values of $\sigma$, $\theta_p$, and $B_s$ were compared
with the original values used to generate the
image.  In  almost  all  cases  the  calculated  values 
were  close  to  the  original  ones.  However,  for  2%  of 
the  cases,  the  values  were  very  obviously  wrong. 
For  example,  a  negative  value  of  roughness  or  illu¬ 
mination  intensity  is  clearly  unreasonable.  Fortu¬ 
nately,  bad  values  can  be  detected  automatically, 
by  checking  to  see  if  recovered  values  are  within 
the  allowable  range.  Recovered  values  that  fall  out¬ 
side  that  range  indicate  that  a  different  method 
should  be  used  to  recover  the  scene  parameters.  We 
have  found  that  the  problem  disappears  if  a  third 
degree  polynomial  is  used  instead,  although  the 
overall  error  on  all  cases  is  slightly  higher. 
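
The automatic range check can be sketched as follows. Two assumptions are made here that the text does not state: the allowable range is taken to be the lookup-table limits of Table 2, and the third-degree estimate is used as the fallback whenever the fourth-degree estimate is out of range. All names are hypothetical.

```python
# Assumed allowable ranges, taken from the lookup-table limits in Table 2.
ALLOWED_RANGE = {
    "roughness_deg": (1.0, 15.0),
    "phase_angle_deg": (0.0, 90.0),
    "intensity_pct": (50.0, 100.0),
}

def is_plausible(estimate):
    """True if every recovered parameter lies inside its allowed range."""
    return all(low <= estimate[name] <= high
               for name, (low, high) in ALLOWED_RANGE.items())

def select_estimate(fourth_degree, third_degree):
    """Prefer the fourth-degree result; fall back if it is out of range."""
    return fourth_degree if is_plausible(fourth_degree) else third_degree
```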

Table  3  shows  the  results  for  the  remaining  98 
cases  where  the  fourth  degree  polynomial  pro¬ 
duced  reasonable  estimates.  It  shows  the  average 
error in recovering the parameter, and also reiterates
the step sizes used in the table. The errors are
lower  than  the  table  resolution,  showing  the  inter¬ 
polation  method  is  fairly  effective. 

Parameter          Average Error   Table Resolution
Roughness          1.20°           2°
Phase Angle        4.40°           10°
Intensity          8.18%           10%
Cases Considered   98/100

Table 3: Results on simulated data

The results for calculating roughness and phase
angle are very good. They show that these non-color
parameters may be calculated with reasonably
high accuracy just by considering the shape of
the color histogram. The error in calculating
illumination intensity is a bit higher, although it still
provides a useful estimate.

3.5.  Calculating  Illumination  Chromaticity 

As  we  pointed  out  in  section  2.2.4,  the  highlight 
cluster  may  be  skewed  from  the  direction  of  the 
illumination  color.  The  skew  is  particularly  pro¬ 
nounced  for  large  phase  angles  and  rough  surfaces. 
These  two  factors  determine  how  much  the  body 
reflection  changes  over  the  area  of  the  highlight. 

Therefore,  if  we  know  or  can  calculate  the 
surface  roughness  and  the  imaging  geometry,  we 
can  in  turn  calculate  the  amount  of  highlight  skew¬ 
ing.  Once  the  skew  is  known,  its  effect  can  be  sub¬ 
tracted  from  the  direction  of  the  highlight  cluster  to 
give  the  true  color  of  the  illumination. 

Section  3.4  showed  how  to  estimate  the  roughness 
and  phase  angle  from  the  color  histogram.  These 
estimates  are  now  used  to  estimate  the  skew. 
Again,  a  lookup  table  approach  is  used.  When  the 
simulations  are  performed  to  fill  the  lookup  tables 
with  measurements  of  length,  width,  and  intersec¬ 
tion,  the  skewing  of  the  highlight  is  also  calculated 
and  stored  in  the  table.  Then  a  polynomial  function 
is  used  to  calculate  the  skew  angle  as  a  function  of 
roughness, phase angle, and illumination intensity:

$\mathrm{Skew} = k_n(\sigma, \theta_p, B_s)$   (11)

A  third  degree  polynomial  is  used  in  our  experi¬ 
ments. 

Once the skew has been calculated, the illumination
color $c_s$ may be calculated from the measured
direction $d_s$ and the calculated skew angle.
Obviously, if the polynomial functions described in
section  3.4  produce  bogus  estimates  of  the  rough¬ 
ness,  phase  angle,  or  illumination  intensity,  there  is 
little  point  in  plugging  them  into  the  equation  for 


calculating  Skew.  In  those  2%  of  the  100  test  cases, 
the  program  did  not  attempt  to  calculate  the  illumi¬ 
nation  color.  For  the  remaining  98  test  cases,  the 
skew  angle  was  used  to  calculate  the  illumination 
color.  The  results  are  shown  in  Table  4.  The  error 
in  estimating  skew  is  the  difference  between  the 
correct  skew  angle  and  the  skew  angle  calculated 
by  our  method.  The  table  shows  the  average  error 
over  the  98  cases  considered.  It  also  shows  the 
minimum,  maximum  and  average  of  the  actual 
skew  values.  For  69  of  the  test  images,  the  scene 
parameters  were  such  that  the  highlight  was 
skewed  by  more  than  1°. 


Average error            1.73°
Average skew             8.63°
Minimum value            0.01°
Maximum value            27.5°
Number of skews > 1°     69
Cases considered         98/100

Table 4: Results in calculating skew


We  will  describe  our  algorithm’s  performance  on 
one  of  the  test  simulations.  The  test  image  is  a  red 
cylinder  under  white  light.  The  simulation  parame¬ 
ters  are  given  in  the  first  column  of  Table  5.  The 
histogram  associated  with  this  image  is  shown  in 
Figure  4.  The  program  automatically  divided  the 
histogram  into  body  reflection  and  highlight  clus¬ 
ters.  The  measurements  made  of  the  histogram  are 
shown  in  the  second  column  of  Table  5. 


Simulated Image              Histogram Measurements       Recovered Parameters
$c_s$ = [0.58, 0.58, 0.58]   $d_s$ = [0.81, 0.45, 0.37]   $c_s$ = [0.59, 0.57, 0.57]
$\sigma$ = 12.06°            $l$ = 74.1                   $\sigma$ = 11.99°
$\theta_p$ = 63.30°          $w$ = 0.47                   $\theta_p$ = 67.05°
$B_s$ = 90%                  $i$ = 0.51                   $B_s$ = 89%

Table 5: Example results

The  direction  fitted  to  the  highlight  cluster  is  sig¬ 
nificantly  skewed  away  from  the  direction  of  the 
actual  illumination  color.  It  represents  a  much  red¬ 
der  color  than  the  white  illumination  color,  and  so 
would  be  a  poor  estimate  of  the  illumination  color. 
It  would  also  yield  an  inaccurate  estimate  of  the 
object  color  when  the  illumination  color  is  divided 
out  of  the  body  reflection  color. 
Figure 4: Histogram of simulated image


Applying polynomial equations (8), (9) and (10) to
the length, width and intersection measurements,
the program estimated the scene parameters shown
in the third column of Table 5. We then applied
equation (11) to these estimates of $\sigma$, $\theta_p$, and $B_s$, and
estimated the skew between the highlight cluster
direction and the actual illumination color to be
18.75°. Applying this skew to the cluster direction
$d_s$ produced an estimate of the light source
chromaticity that is very close to the original white
color. The complete algorithm is diagrammed in
Figure 5.
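
As a rough consistency check on the vectors listed in Table 5, the angle between the fitted highlight direction and the true illumination color can be computed directly. This sketch is our own illustration (the function name is hypothetical); it measures the actual skew in the example, which comes out at about 19° and can be set against the 18.75° estimated by the polynomial.

```python
import math

def angle_between_deg(u, v):
    """Angle in degrees between two color vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    cos_angle = max(-1.0, min(1.0, dot / (norm_u * norm_v)))
    return math.degrees(math.acos(cos_angle))

d_s = (0.81, 0.45, 0.37)             # fitted highlight cluster direction (Table 5)
c_s_true = (0.58, 0.58, 0.58)        # illumination color used in the simulation
c_s_recovered = (0.59, 0.57, 0.57)   # illumination color recovered by the method

actual_skew = angle_between_deg(d_s, c_s_true)
residual_error = angle_between_deg(c_s_recovered, c_s_true)
```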


Figure 5: Algorithm for calculating scene
parameters (off-line and run-time stages)


4.  Applying  the  Algorithm  to  Real  Images 

Real  images  present  many  challenges  for  vision 
researchers.  These  include  camera  noise,  clipping, 
chromatic  aberration,  blooming,  color  imbalance, 
etc.  We  have  modified  our  algorithm  in  systematic 
ways to adapt to these conditions. These modifications
are described in [8]. We will describe here
some  experiments  we  have  performed  to  test  our 
algorithm  on  real  images. 

4.1.  Estimating  Phase  Angle 

An  experiment  was  set  up  in  the  Calibrated  Imag¬ 
ing  Laboratory  (CIL)  at  Carnegie  Mellon  Univer¬ 
sity  to  test  our  algorithm’s  ability  to  estimate  phase 
angle  from  real  images.  A  series  of  images  was 
taken  with  the  camera  and  light  source  separated 
by  an  increasing  phase  angle.  The  angle  was  esti¬ 
mated  with  a  large  protractor  and  strings  to  indicate 
the  direction  of  the  camera  and  light  source.  The 
angles  measured  by  this  method  were  estimated  to 
be  accurate  to  within  5  degrees. 

The  first  image  in  the  sequence  was  taken  when  the 
camera  and  light  source  were  approximately  10 
degrees  apart.  The  phase  angle  was  then  increased 
by  10  degrees  between  each  picture.  The  last  image 
in  the  sequence  was  taken  when  the  phase  angle 
between  the  camera  and  light  source  was  90 
degrees. 

The  program  automatically  split  the  color  histo¬ 
gram  of  the  object  into  two  clusters,  fit  vectors  to 
those  clusters,  and  calculated  the  values  of  length, 
width,  and  intersection.  This  was  repeated  for  each 
image  in  the  sequence.  The  polynomial  approxima¬ 
tion  described  in  section  3.4  was  used  to  calculate 
the  phase  angle  from  the  length,  width  and  inter¬ 
section  measurements.  The  results  are  shown  in 
Figure  6.  The  dotted  line  shows  the  correct  answer, 
using  the  phase  angle  measured  by  the  protractor  as 
ground  truth.  The  average  error  in  estimating  angle 
is  9.96°. 

Overall  the  method  developed  for  estimating  phase 
angle  from  analyzing  color  histograms  works  fairly 
well,  especially  considering  that  the  ground  truth 
measurement  of  the  phase  angle  is  fairly  crude. 
Also, the lookup tables were calculated without
calibrating the simulated images to the conditions in
the CIL. In particular, the noise of the camera was
not measured precisely and the light source used in
the experiments was not a point source as was used
in the simulations.

Figure 6: Results for calculating phase angle from
real images

4.2.  Estimating  Illumination  Intensity 

A  second  experiment  was  performed  in  the  CIL  to 
test  the  performance  of  the  algorithm  at  estimating 
illumination  intensity.  The  light  was  plugged  into  a 
variable  voltage  supply  with  a  manually  operated 
dial.  A  sequence  of  images  was  taken  under 
increasing  levels  of  illumination,  while  the  imag¬ 
ing  geometry  and  target  object  were  kept  constant. 
Altogether  six  images  were  taken.  The  illumination 
level  was  measured  with  a  spot  meter  aimed  at  a 
white  card.  The  spot  measurements  were  estimated 
to  be  repeatable  to  within  5%. 

Again,  the  program  analyzed  the  histograms  to 
produce measurements of length, width and intersection
for each image. The polynomial equation to
calculate  illumination  intensity  was  then  applied  to 
these  measurements.  The  results  are  shown  in 
Figure  7.  The  horizontal  axis  shows  the  values  pro¬ 
duced  by  the  spot  meter,  while  the  vertical  axis 
shows  the  intensity  estimated  by  the  histogram 
analysis.  The  gain  of  the  camera  has  not  been  cali¬ 
brated,  so  the  program  gives  a  relative  estimate  of 
intensity.  The  dotted  line  shows  the  best  linear  fit  to 
the  data.  If  the  slope  of  that  line  is  considered  to  be 
the  gain  of  the  camera,  then  the  average  error  in 
estimating illumination intensity is 5.07%.
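
The gain estimate can be sketched as a straight-line fit of the program's intensity estimates against the spot-meter readings. This is a minimal illustration under assumptions: the fit includes an offset term, and the "average error" is taken to be the mean absolute deviation from the fitted line, neither of which is spelled out in the text. The function name is hypothetical.

```python
import numpy as np

def gain_and_average_error(meter_readings, estimates):
    """Fit estimates = gain * meter + offset; return gain and mean absolute residual."""
    x = np.asarray(meter_readings, dtype=float)
    y = np.asarray(estimates, dtype=float)
    gain, offset = np.polyfit(x, y, 1)        # best linear fit (degree 1)
    residuals = y - (gain * x + offset)
    return float(gain), float(np.mean(np.abs(residuals)))
```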

4.3.  Estimating  Roughness 

A  third  experiment  was  performed  to  show  how  the 
system  estimates  surface  roughness  from  color  his¬ 
tograms.  Figure  8  shows  a  composite  of  five 
images  of  different  objects.  The  upper  left  shows  a 
green  plastic  toy  in  the  shape  of  an  alligator;  the 
upper right contains an orange plastic pumpkin for
trick-or-treating; in the lower left is a terra-cotta
ball; in the lower right a red plastic beach ball. In
the center is a red plastic pail.

Figure 7: Results for calculating illumination
intensity from real images

Table  6  shows  the  roughness  calculated  by  the  sys¬ 
tem  for  each  of  these  objects.  The  objects  are  listed 
in  order  of  decreasing  roughness,  as  estimated  by 
human  observation.  The  calculated  roughness 
number  is  the  standard  deviation  of  facet  angles. 


Object             Calculated Roughness
Alligator          10.07°
Pumpkin            8.93°
Terra-cotta ball   3.61°
Red ball           0.40°
Red pail           0.10°

Table 6: Results for estimating roughness


There  is  no  error  measure  for  these  results,  since 
there  is  no  ground  truth  data  for  the  actual  rough¬ 
ness  values.  Nevertheless,  the  roughness  ranking 
from  the  program  agrees  with  that  produced  by  a 
human observer. The pumpkin presents a particularly
interesting case, since it shows roughness at
more than one scale. At the large scale where
roughness is judged by touch, it is clearly the
roughest object. This type of roughness is large
enough to be considered “texture”. At the smaller
scale of optical roughness, it is considered to be
more shiny than the alligator.

Figure 8: Five objects with different roughness
values

5.  Conclusions 

The  color  histogram  of  an  image  is  a  rich  source  of 
information,  but  it  has  not  been  fully  exploited  in 
the  past.  We  have  shown  that  the  color  histogram 
of a dielectric object may be characterized by a
small  number  of  measurements,  which  relate 
directly  to  many  scene  properties.  We  have  shown 
how  these  histogram  measurements  may  be  used  to 
recover  estimates  of  surface  roughness,  imaging 
geometry,  illumination  intensity,  and  illumination 
color.  These  estimates  may  in  turn  be  used  to  cal¬ 
culate  object  color  and  albedo. 

The  resulting  algorithm  is  applied  to  real  images, 
and  produces  reasonable  estimates  of  phase  angle, 
illumination  intensity,  and  surface  roughness.  The 
method  is  independent  of  the  shape  of  the  object, 
and  works  on  shapes  ranging  from  a  pumpkin  to  an 
alligator.  The  model  used  to  develop  the  lookup 
tables  is  fairly  general,  and  was  not  calibrated  to 
match  the  actual  imaging  conditions  such  as  light 
source  extent,  camera  noise  characteristics,  etc. 
This  kind  of  analysis  may  be  applied  to  such  varied 
tasks  as  surface  inspection  and  object  recognition. 
Abstract

One of the most common assumptions for recovering
object features in computer vision and rendering objects
in computer graphics is that the radiance distribution
of diffuse reflection from materials is Lambertian. We
propose a reflectance model for diffuse reflection from
smooth inhomogeneous dielectric surfaces that is empirically
shown to be significantly more accurate than the
Lambertian model. The resulting reflected diffuse radiance
distribution has a simple mathematical form. The
proposed model for diffuse reflection utilizes results of
radiative transfer theory for subsurface multiple scattering.
For an optically smooth surface boundary this subsurface
intensity distribution becomes altered by Fresnel attenuation
and Snell refraction, making it significantly
non-Lambertian. The reflectance model derived in this
paper accurately predicts the dependence of diffuse reflection
from smooth dielectric surfaces on viewing angle,
always falling off to zero as viewing approaches grazing.
This model also accurately shows that diffuse reflection
falls off faster than predicted by Lambert’s law as a function
of angle of incidence, particularly as angle of incidence
approaches 90°. We present diffuse reflection effects
near occluding contours of dielectric objects that are
strikingly deviant from Lambertian behavior, and yet are
precisely explained by our diffuse reflection model. Our
proposed diffuse reflection model has the added feature
that it explains the physical origin of diffuse albedo, which
is typically an ad hoc scaling coefficient. This can be used
to explain the relative strengths of the specular and diffuse
reflection components from inhomogeneous dielectric
surfaces purely in terms of the physical parameters of the
surface itself.

1  INTRODUCTION 

Perhaps the most widely used assumption about reflectance
from materials in computer vision and in computer
graphics is Lambert’s Law for diffuse reflection [7].
Lambert predicted that diffuse reflection from a material
contributed by light incident from a specified direction is
proportional to the cosine of the angle between this incident
direction and the surface normal, independent of the
direction of reflection. While relatively little physical
motivation was given for this law when it was first published
over 200 years ago, it has been adopted by the computer
vision and computer graphics communities primarily because
it serves as a reasonably accurate and computationally
simple approximation for describing diffuse reflection
under a number of conditions.

A prevalent class of materials encountered both in common
experience and in vision/robotics environments is
inhomogeneous dielectrics, which include plastics, ceramics,
and rubber. Almost all diffuse reflection from these
materials physically arises from subsurface multiple scattering
of light caused by subsurface inhomogeneities in
index of refraction. In this paper we model inhomogeneous
dielectric material as a collection of scatterers contained
in a uniform dielectric medium with index of refraction
different from that of air. We propose a simple
model of diffuse reflected intensity resulting from the process
of incident light refracting into the dielectric medium,
producing a subsurface diffuse intensity distribution from
multiple internal scattering, and then refraction of this
subsurface diffuse intensity distribution back out into air.
See Figure 1. In [19], [16] we formally derived and empirically
verified that if light is incident with radiance, $L$, at
incidence angle, $\psi$, through a small solid angle, $d\omega$, on a
smooth dielectric surface, then

$\varrho L \times (1 - F(\psi, n)) \times \cos\psi \times \left(1 - F\left(\sin^{-1}\left(\frac{\sin\phi}{n}\right), 1/n\right)\right) d\omega$

describes the diffuse reflected radiance into emittance angle
(i.e., viewing angle), $\phi$. The terms $F$ refer to the Fresnel
reflection coefficients [11], $n$ is the index of refraction
of the dielectric medium, and $\varrho$ is the total diffuse albedo.
We show that the total diffuse albedo, $\varrho$, is directly related
to both the single scattering albedo describing the
proportion of energy reradiated upon each subsurface single
scattering, and the index of refraction $n$.
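
The expression above can be evaluated numerically with a short sketch such as the following. It is an illustration only: the function names are our own, the albedo value is supplied by the caller rather than derived, and the Fresnel term uses the standard unpolarized dielectric formula written in an algebraically equivalent form that avoids the tangent.

```python
import math

def fresnel_unpolarized(angle, n_rel):
    """Unpolarized Fresnel reflection coefficient for relative index n_rel."""
    s, c = math.sin(angle), math.cos(angle)
    a = math.sqrt(max(n_rel * n_rel - s * s, 0.0))   # clamps tiny negative round-off
    f_perp = ((a - c) / (a + c)) ** 2
    f_par = ((n_rel * n_rel * c - a) / (n_rel * n_rel * c + a)) ** 2
    return 0.5 * (f_perp + f_par)

def diffuse_radiance(radiance, psi, phi, n, albedo, d_omega):
    """Diffuse reflected radiance for incidence psi and emittance phi (radians)."""
    phi_inside = math.asin(math.sin(phi) / n)        # emittance angle inside the medium
    transmitted_in = 1.0 - fresnel_unpolarized(psi, n)
    transmitted_out = 1.0 - fresnel_unpolarized(phi_inside, 1.0 / n)
    return albedo * radiance * transmitted_in * math.cos(psi) * transmitted_out * d_omega
```

With n = 1.5 the outgoing factor vanishes at grazing emittance, because the internal angle reaches the critical angle there; a Lambertian model has no such viewing-angle dependence.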


The Lambertian term, $\cos\psi$, embodied in the above
expression arises from a combination of subsurface diffuse
scattering described by Chandrasekhar’s analysis using
radiative transfer theory [2], and a radiometric correction
from distortion of infinitesimal angles due to Snell refraction
across the air-dielectric boundary. Over ranges of incident
and reflected angles where the Fresnel coefficients,
$F$, are insignificantly varying, this offers a formal proof
that Lambert’s Law does approximate well under these
conditions. However, as a result of varying Fresnel coefficients
there are a number of conditions where diffuse reflection
from smooth inhomogeneous dielectrics seriously
deviates from Lambertian behavior.

In the past decade the computer vision and computer
graphics communities have become increasingly aware of
accurate modeling of the reflectance properties of materials,
particularly with respect to the specular component.
The works by Torrance and Sparrow [14], and
Beckmann and Spizzichino [1] have been amongst the
most popular in providing computer vision and graphics
researchers with an accurate modeling of the specular
component of reflection from rough materials [3], [5], [12],
[8]. Only very recently has there been consideration of
non-Lambertian diffuse reflection within the vision and
graphics community. The paper by He and Torrance et al.
[4] proposes a comprehensive full-wave model for light
reflection including the description of a directional diffuse
component produced from diffraction and interference
effects when the wavelength of light is comparable
to the size of surface roughness elements. This model
assumes that diffuse reflection originating from subsurface
scattering within inhomogeneous dielectrics is Lambertian
and contributes to a uniform diffuse component. Torrance
[13] is currently exploring full-wave solutions for light
incident on dielectric-air boundaries from subsurface
scattering. Tagare and deFigueiredo [12] propose as part of
their multiple lobed reflectance model for machine vision
a functional approximation to the Chandrasekhar diffuse
reflection law [2]. While we also use the Chandrasekhar
diffuse reflection law for diffuse subsurface scattering, it
is not nearly accurate for materials without consideration
of various dielectric-air boundary effects. Oren and Nayar
[9] have been studying non-Lambertian diffuse reflection
effects for rough surfaces assuming a statistical distribution
of Lambertian reflecting facets along with masking,
shadowing, and interreflection. Apart from analysis of
reflected intensity distributions for diffuse reflection, Shafer
[10] proposed a dichromatic color reflectance model for
diffuse and specular components, and Wolff [15] proposed
a polarization reflectance model involving the diffuse
reflection component for inhomogeneous dielectrics.

This paper is a combined summary of most of the results
from the papers [19], [17], [16], [18]. For exact details
of the derivations in this paper consult [19], [16], [18].

2 CHANDRASEKHAR DIFFUSE REFLECTION
LAW FROM MULTIPLE SCATTERING

Chandrasekhar [2] formally derived a number of expressions
for diffuse reflection from multiple scattering under
different single scattering conditions. While the original
application was to transmission and diffuse reflection of
light from stellar and planetary atmospheres, the same
physical principles apply to subsurface scattering of light
within dielectrics. We are in particular interested in
Chandrasekhar’s derivation for diffuse reflection assuming that
single particle scattering is isotropic. Chandrasekhar assumes
that the scattering particles are plane parallel in
that the optical properties (e.g., particle density) are uniform
within parallel planar layers. The geometry of incident
and reflected light is referred to with respect to the
orientation of this plane of uniformity. In the case of a
semi-infinite medium of scatterers (i.e., an opaque dielectric
surface) the result for the reflected radiance according
to diffuse reflection from multiple scattering, assuming
an isotropic single scattering distribution, is expressed in
terms of the Chandrasekhar H-function:

$\frac{\rho}{4\pi} L \frac{\mu_{inc}}{\mu_{ref} + \mu_{inc}} H_\rho(\mu_{inc}) H_\rho(\mu_{ref}) \, d\omega$

where $\mu_{inc}$ and $\mu_{ref}$ are the directional cosines of incident
and reflected light with respect to the normal of the
plane of uniformity, respectively, $\rho$ is the single scattering
albedo (representing the proportion of energy reradiated
upon each subsurface single scattering), and $L$ is the incident
radiance. The Chandrasekhar H-functions are dependent
upon the single scattering albedo, $\rho$. An $N$th order
approximation to the Chandrasekhar H-function can
be expressed:

$H_\rho(\mu) = \frac{1}{\mu_1 \cdots \mu_N} \frac{\prod_{i=1}^{N} (\mu + \mu_i)}{\prod_{\alpha} (1 + k_\alpha \mu)}$

defined in terms of the positive zeros, $\mu_i$, of the even Legendre
polynomial of order $2N$, and the positive roots, $k_\alpha$,
of the associated characteristic equation:

$1 = \rho \sum_{j=1}^{N} \frac{a_j}{1 - \mu_j^2 k^2}$

(the $a_j$ are the Gaussian quadrature weights associated with the $\mu_j$).

3 DIFFUSE REFLECTION FROM
SMOOTH DIELECTRICS

The Chandrasekhar diffuse reflection law alone does
not provide a complete physically accurate description
of diffuse reflection from inhomogeneous dielectric surfaces.
While gaseous molecules of stellar or planetary atmospheres
are separated by empty space, particles within
an inhomogeneous dielectric surface are assumed to be
separated by a uniform medium with index of refraction
different from that of air. As we will be considering opaque
objects, the assumption of a semi-infinite medium of scatterers
is very well approximated. The parallel planes along
which the optical properties of subsurface particles are
uniform are assumed to be parallel to the smooth planar
surface boundary. There are at least 3 ways in which a
dielectric medium will alter a subsurface diffuse intensity
distribution with respect to its smooth boundary with air.
The first 2 ways are a result of Snell’s law [11], and the
third is due to Fresnel attenuation:

•  From Snell’s law the angle at which light energy is
incident upon (reflected from) the plane of uniformity
for the subsurface particles will be different than the
angle at which light is incident upon (reflected from)
the smooth surface boundary. See Figure 2.

•  As a result of Snell’s law the solid angle through
which light energy is incident upon (reflected from)
the plane of uniformity for the subsurface particles
will be different than the solid angle through which
light is incident upon (reflected from) the smooth surface
boundary.

•  Because of Fresnel attenuation for transmission of
light from air into dielectric and vice versa, light energy
incident upon the plane of uniformity for the
subsurface particles will be less than light energy originally
incident upon the smooth surface boundary,
and light energy transmitted back out through the
surface boundary will be less than light energy produced
by subsurface diffuse scattering.


FIGURE 3

Figure 2 defines incident angle ψ and emittance angle
φ, in air, with respect to the normal of the smooth
dielectric surface boundary. The overline notation refers
to the corresponding angles inside the dielectric medium.
This correspondence is made according to the following
expressions for Snell's law:

$$\bar{\psi} = \sin^{-1}\!\left(\frac{\sin\psi}{n}\right), \qquad \bar{\phi} = \sin^{-1}\!\left(\frac{\sin\phi}{n}\right) \qquad (1)$$

where  n  is  the  index  of  refraction  of  the  dielectric.  This 
is  typically  in  the  range  of  1.4  to  2.0  for  most  glasses, 
plastics,  paint  and  ceramic. 

The Fresnel coefficients, varying between 0 and 1 inclusive,
describe the attenuation of incident light radiance
upon specular reflection from a planar material interface
of size larger than the wavelength of the incident light.
The Fresnel coefficients are dependent upon the angle of
incidence and the index of refraction of the material. They
are also dependent upon the polarization state of light.

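The Fresnel coefficient, F(ψ, n), for unpolarized light
incident at angle ψ upon a dielectric with simple index of
refraction, n, is given by:

$$F(\psi, n) = \tfrac{1}{2}\left(F_\perp + F_\parallel\right) \qquad (2)$$

where

$$F_\perp = \frac{a^2 - 2a\cos\psi + \cos^2\psi}{a^2 + 2a\cos\psi + \cos^2\psi}$$

$$F_\parallel = \frac{a^2 - 2a\sin\psi\tan\psi + \sin^2\psi\tan^2\psi}{a^2 + 2a\sin\psi\tan\psi + \sin^2\psi\tan^2\psi}\, F_\perp$$

$$a^2 = n^2 - \sin^2\psi$$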
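Equations 1 and 2 are easy to evaluate numerically. The following Python sketch assumes the standard Fresnel form for a non-absorbing dielectric, with a² = n² − sin²ψ; the function names `refraction_angle` and `fresnel_unpolarized` are illustrative and are not taken from the paper.

```python
import math


def refraction_angle(psi, n):
    """Snell's law (equation 1): angle inside the dielectric for angle psi in air."""
    return math.asin(math.sin(psi) / n)


def fresnel_unpolarized(psi, n):
    """Fresnel reflection coefficient for unpolarized light (equation 2).

    psi is the angle of incidence in radians and n the relative index of
    refraction (pass 1/n for light travelling from the dielectric into air).
    Assumption: non-absorbing dielectric, so a^2 = n^2 - sin^2(psi).
    """
    s, c = math.sin(psi), math.cos(psi)
    a2 = n * n - s * s
    if a2 <= 0.0:
        # At or beyond the critical angle (only possible for n < 1):
        # total internal reflection.
        return 1.0
    a = math.sqrt(a2)
    f_perp = (a2 - 2 * a * c + c * c) / (a2 + 2 * a * c + c * c)
    st = s * math.tan(psi)
    f_par = f_perp * (a2 - 2 * a * st + st * st) / (a2 + 2 * a * st + st * st)
    return 0.5 * (f_perp + f_par)


if __name__ == "__main__":
    n = 1.7  # hypothetical index of refraction, chosen for illustration
    for deg in (0, 30, 60, 80):
        print(deg, fresnel_unpolarized(math.radians(deg), n))
```

Passing 1/n as the second argument gives the coefficient for light leaving the dielectric, which is the form F(·, 1/n) used in the diffuse reflection expressions.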


e  = 


ei 


\-K 


where  _  _ 

p  Cp{cosil>,cos<l)) 

Pi  =  —j - =7 -  = 

n  cos  y> 


4ir»»*  Pint  +  Pr,J 


f/7 

^  =  j 

Jo 


dielectric  medium  into  air  contributing  to  diffuse  reflec¬ 
tion,  and  some  is  specularly  reflected  back  into  the  dielec¬ 
tric  medium  to  be  rescattered  once  again,  and  so  on.  The 
expression  for  g  in  equation  4  is  the  result  of  an  infinite 
geometric  series  describing  the  sum  of  scattering  distribu¬ 
tions  from  successive  internal  specular  reflections.  It  turns 
out that g is only weakly dependent upon imaging geometry
(within 3%) and is interpreted as the surface diffuse
albedo. References [19], [16] show a plot of total diffuse
albedo, g, vs. single scattering albedo, ρ.

4  EXPERIMENTAL  RESULTS 

Figures  3  and  4  show  experimental  results  for  an  opti¬ 
cally  smooth  piece  of  compressed  white  magnesium  oxide 
ceramic.  The  ceramic  was  measured  with  a  stylus  pro- 
filometer  to  have  a  variation  in  height  profile  no  greater 
than  half  the  wavelength  of  green  light.  Using  a  Brewster 

Optically Smooth White Ceramic, Emittance Angle = 0°


See Siegel and Howell [11] for a formal derivation.

The details of the derivation of diffuse reflection from
smooth dielectric surfaces are given in [19], [16]. It is shown
that if light is incident with radiance, L, at incidence angle,
ψ, through a small solid angle, dω, on a smooth dielectric
surface, then the reflected radiance is

$$g L \times (1 - F(\psi, n)) \times \cos\psi \times \left(1 - F(\sin^{-1}(\tfrac{\sin\phi}{n}), 1/n)\right) d\omega \qquad (3)$$

where  the  total  diffuse  albedo,  g,  is  given  by: 


(4) 


F(^’,  1/n)  Cp(cos^',  1.0)2)rsin  . 

It  was  stated  that  the  derivation  of  the  diffuse  reflec¬ 
tion  formula  in  this  paper  includes  the  physical  modeling 
of  diffuse  reflection  produced  from  incident  light  refracting 
into  the  dielectric  medium  (depicted  in  Figure  1),  multi¬ 
ply  scattering  amongst  subsurface  inhomogeneities,  and 
refracting  back  out  into  air.  In  fact,  this  is  a  first  order 
effect. As light is refracting out into air from the dielectric
medium, a significant amount of light is specularly
reflected back into the dielectric, particularly for internal
specular angles above the critical angle where specular
reflection is 100%. This critical angle is $\sin^{-1}(1/n)$, which
is typically about 40° for common dielectrics. Light
specularly reflected back into the dielectric medium multiply
scatters  once  again,  and  the  term  K  describes  the  propor¬ 
tion  of  light  radiance  that  is  internally  rescattered.  Some 
of  the  light  radiance  that  is  rescattered  refracts  from  the 


FIGURE  4 


Angle of Emittance


angle technique the index of refraction of the ceramic was
determined to be n = 1.7. Our diffuse reflectance model
enables the empirical measurement of the single scattering
albedo, ρ, by empirically determining the ratio of the
strengths of the specular and diffuse components of
reflection at known angles of incidence and emittance. The
ratio of the specular to the diffuse reflection component
was measured very near normal incidence and emittance
and compared with the ratio in expression 7. The value
of g can be derived from this, and in turn, ρ can be
computed (see [19], [16], [18] for details). The empirically
determined specular to diffuse ratio is accounted for by
a single scattering albedo, ρ, just above 0.95, which is
remarkably energy conserving.

In Figures 3 and 4, dashed curves represent predicted
Lambertian diffuse reflection, and solid curves represent the
diffuse reflection law of expression 3 with single scattering
albedo ρ = 0.95 and index of refraction n = 1.7 (it
makes little difference for 1.0 > ρ > 0.9, 1.4 < n < 2.0
with respect to the shape of the diffuse reflection curve).
Observe how in actuality diffuse reflection goes to zero as
viewer angle goes to 90°! Also, for large angles of incidence
the percentage error between Lambert's Law and
empirical measurements becomes quite large, exceeding 50%
above 80°. The proposed diffuse reflection law, expression
3, never deviates more than 3% from the empirical
data while there exist sharp deviations from Lambertian
behavior. The results show that the Lambertian model
can be assumed to within about 5% accuracy if angle of
incidence and angle of emittance are simultaneously no
larger than 50°. Outside of this constraint range serious
deviations start to occur.

Figure 5 shows an ordinary scene where the Lambertian
model strikingly breaks down altogether and yet is
explained with high accuracy by the diffuse reflection
model proposed in this paper. A ceramic coffee cup of
cylindrical body shape is illuminated from the left side
by a point source. Starting from the left occluding contour
and going right, the angle of incidence starts at 0° and
increases. The Lambertian model predicts that image
intensities should decrease going to the right. The image
intensities in fact increase to a maximum intensity at about
65° surface orientation and then begin to slowly decrease.
The reason is that φ = 90° − ψ; that is, the emittance
angle starts at 90° at the occluding contour and decreases
going right. Looking at the graph in Figure 4, diffuse
reflection increases sharply as angle of emittance decreases from
90°. At about surface orientation 65° (ψ = 25°, φ = 65°)
the decrease of diffuse reflection with respect to increasing
angle of incidence starts to overtake the increase of
diffuse reflection with respect to decreasing angle of
emittance, and a maximum occurs. (Note the graph in Figure
5 is only for the diffuse component and does not include
measurement of the specular component which occurs at
relative orientation −45° in the picture of the cup.)
According to Figure 5 the qualitative shape of the true diffuse
reflection curve (solid) is entirely different (e.g., it is
not even monotonic) from that for the Lambertian model
(dashed). The percentage error of the Lambertian model
is also very high for frontal surface orientations where the
angle of incidence is large. This behavior of diffuse reflection
occurs for a reasonably large range of lighting
configurations and can perhaps even aid in the detection of
dielectric occluding contours.
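The comparison with Lambert's Law and the intensity maximum on the cup can be explored numerically. The Python sketch below evaluates expression 3 under two labeled assumptions: the total diffuse albedo g is treated as a constant (the text describes it as only weakly dependent on geometry), and the Fresnel coefficient takes the standard non-absorbing dielectric form. All function names are hypothetical.

```python
import math


def fresnel(psi, n):
    """Unpolarized Fresnel reflection coefficient (non-absorbing dielectric)."""
    s, c = math.sin(psi), math.cos(psi)
    a2 = n * n - s * s
    if a2 <= 0.0:
        return 1.0  # total internal reflection
    a = math.sqrt(a2)
    f_perp = ((a - c) / (a + c)) ** 2
    st = s * math.tan(psi)
    return 0.5 * f_perp * (1.0 + ((a - st) / (a + st)) ** 2)


def diffuse_radiance(psi, phi, n, g=1.0):
    """Expression 3 per unit incident radiance and solid angle, g held constant."""
    phi_inside = math.asin(math.sin(phi) / n)  # Snell's law for the emitted ray
    return (g * (1.0 - fresnel(psi, n)) * math.cos(psi)
            * (1.0 - fresnel(phi_inside, 1.0 / n)))


def ratio_to_lambertian(psi, phi, n):
    """Expression 3 divided by a cos(psi) law matched at normal incidence/emittance."""
    lambertian = diffuse_radiance(0.0, 0.0, n) * math.cos(psi)
    return diffuse_radiance(psi, phi, n) / lambertian


def cup_peak_incidence_deg(n, step_deg=0.5):
    """Incidence angle (degrees) of maximum diffuse radiance when phi = 90 - psi,
    the cup geometry described above."""
    candidates = [k * step_deg for k in range(1, int(90 / step_deg))]
    return max(candidates, key=lambda d: diffuse_radiance(
        math.radians(d), math.radians(90.0 - d), n))


if __name__ == "__main__":
    n = 1.7
    print(ratio_to_lambertian(math.radians(50), math.radians(50), n))
    print(cup_peak_incidence_deg(n))
```

A ratio of 1 means the model agrees with the Lambertian prediction; values well below 1 at large emittance angles correspond to the falloff toward zero seen in Figure 4.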

Figure  6a  shows  an  actual  white  billiard  ball  illumi¬ 
nated  by  two  point  light  sources  orthogonal  to  viewing, 
one  from  the  left  side  and  one  from  the  right  side.  Figure 
6b  shows  a  computer  graphics  rendering  of  a  sphere  illumi¬ 
nated  by  the  same  configuration  of  2  point  light  sources 
assuming  Lambert’s  diffuse  reflectance  Law,  while  Fig¬ 
ure  6c  shows  the  same  computer  graphics  rendering  of  a 
sphere  using  the  diffuse  reflectance  law  proposed  in  this 
paper. While both shadow boundaries with respect to
the left and right light sources coincide along the vertically
oriented great circle at the front of the sphere, there
appears to be a “shadow band” of darker (i.e., smaller)
intensity values about this shadow boundary due to the
high falloff of diffuse reflectance at high angles of incidence
near 90°. Observe that realistically this “shadow
band” is in fact significantly wider in Figure 6a than
predicted by the Lambert Law in Figure 6b. When rendered
with  the  diffuse  reflectance  law  proposed  in  this  paper  in 
Figure  6c,  the  actual  size  of  the  “shadow  band”  is  more 
accurately  predicted.  Again,  this  “shadow  band”  is  not 
actually  a  shadow  but  rather  smaller  intensity  values,  and 
this  clearly  illustrates  the  sharper  drop  off  of  diffuse  inten¬ 
sity  values  at  higher  angles  of  incidence  than  predicted  by 
Lambert’s  Law  and  yet  predicted  by  our  diffuse  reflectance 
model (see again Figure 3). The implication for realistic
rendering of diffuse reflection in computer graphics is
that diffuse reflection produces significantly larger, darker
regions near shadow boundaries (i.e., at high angles of
incidence) than predicted by Lambert's Law. This is
particularly true for shadow boundaries occurring at frontal
surface  orientations  relative  to  the  viewer  where  geometric 
foreshortening  of  surface  area  is  not  significant. 

Figures  7a,  7b,  and  7c  show  grey  level  representations 
of  isophote  curves  (i.e.,  image  curves  with  equal  intensity) 
corresponding  respectively  to  Figures  6a,  6b  and  6c.  Lam¬ 
bert’s  Law  predicts  for  this  configuration  of  light  sources 
illuminating  a  sphere  that  equal  reflected  radiance  occurs 
for  points  forming  concentric  circles  on  the  sphere  about 
the  left-most  and  right-most  occluding  contour  points. 
These concentric circles of equal reflected radiance
orthographically project onto straight isophote lines as depicted
in Figure 7b, with maximum diffuse reflectance occurring
at the left-most and right-most occluding contour points
where the angle of incidence is zero. Figure 7a, which is
an actual depiction of the isophotes of Figure 6a, shows
that in fact lines of equal image intensity severely curve
near the occluding contour of the sphere. Maximum diffuse
reflection occurs at the center of the closed elliptical
isophotes near the left-most and right-most occluding
contours, illustrating a 2-D version of the effect depicted
in Figure 5. Figure 7c shows the isophotes rendered using
the diffuse reflectance model proposed in this paper, which
are remarkably similar to the actual isophotes in Figure
7a (except for the isophotes perturbed by the specularities).
Comparing Figures 7a, 7b and 7c shows very clearly
how  our  diffuse  reflectance  model  accurately  predicts  re¬ 
flectance  features  that  are  significantly  deviant  from  Lam¬ 
bertian  behavior. 

5  COMBINED DIFFUSE AND SPECULAR REFLECTION

An  important  feature  of  our  diffuse  reflectance  model 
is  that  it  predicts  the  surface  diffuse  albedo  for  dielectrics 
purely  in  terms  of  physical  parameters.  Inhomogeneous 
dielectric  surfaces  exhibit  both  diffuse  and  specular  reflec¬ 
tion  components.  Previously,  combined  diffuse  and  spec¬ 
ular  reflection  has  been  modeled  as  a  sum  of  scaled  diffuse 
and  specular  terms,  with  the  scaling  factors  (i.e.,  diffuse 
albedo  and  specular  albedo)  determined  from  experimental 
fitting for each particular surface. We propose the following
combined reflectance model for diffuse and specular
reflected radiance from smooth surfaces:

$$L\, g\, [1 - F(\psi, n)] \cos\psi\, \left[1 - F(\sin^{-1}(\tfrac{\sin\phi}{n}), 1/n)\right] d\omega \;+\; L\, F(\psi, n)\, \delta(\psi - \phi)\, \delta(\theta_0 + \pi - \theta), \qquad (5)$$

for small incident solid angle, dω, at incidence angle ψ,
incident azimuth angle θ₀, emittance angle φ and emitted
azimuth angle θ. The diffuse albedo, g, is that computed
from equation 4 and is not just simply a scaling factor but
is determined from single scattering albedo, ρ, and index of
refraction, n.
FIGURE 5


The combined diffuse and specular reflectance model,
expression 5, can be used to formally answer the question:
How bright is a specularity? Taking the ratio of the
strengths of the specular to diffuse reflection components
of expression 5:

$$\frac{F(\psi, n)}{g\, [1 - F(\psi, n)] \cos\psi\, \left[1 - F(\sin^{-1}(\tfrac{\sin\psi}{n}), 1/n)\right] d\omega}, \qquad (6)$$

which is independent of uniform incident radiance, L.
Since F(x, n) and F(x, 1/n) are monotonically increasing
with respect to increasing x it should be clear that diffuse
reflection is at a maximum when ψ = 0°, φ = 0° and
specular reflection is at a minimum when ψ = 0°. Therefore,
the ratio expression 6 is a minimum when
viewing-illumination geometry is at normal incidence and viewing.
The following ratio constitutes a physical lower bound on
the ratio of specular to diffuse reflected radiance:

$$\frac{F(0^\circ, n)}{g\, [1 - F(0^\circ, n)]\, [1 - F(0^\circ, 1/n)]\, d\omega}. \qquad (7)$$

The total diffuse albedo, g, is largest for conservative
scattering when the single scattering albedo ρ = 1.0. Using
expression 4 to compute g at ρ = 1.0, the physical lower
bound expressed by 7 for the ratio of specular to diffuse
reflected radiance evaluated for different indices of
refraction is (dω in steradians):

0.0838/dω,   0.211/dω,   0.370/dω


These expressions represent physical lower bounds below
which it is not physically possible to have brightness
contrast  between  a  specularity  and  surrounding  diffuse  re¬ 
flection  on  a  smooth  dielectric  surface.  Since  practically 
all  dielectrics  have  n  >  1.4,  this  particular  lower  bound 
can  serve  as  a  very  conservative  general  formula  for  ruling 
out  specularities  in  a  scene. 
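As an illustration of how expression 7 might be used to rule out specularities, the Python sketch below computes the lower bound from n, g and dω. It relies on the fact that at normal incidence the Fresnel coefficient reduces to ((n − 1)/(n + 1))² and is the same for n and 1/n. The value of g must be supplied by the caller, since equation 4 is not evaluated here; the function names and the sample numbers are hypothetical.

```python
def normal_incidence_fresnel(n):
    """Fresnel reflection coefficient at 0 degrees; identical for n and 1/n."""
    return ((n - 1.0) / (n + 1.0)) ** 2


def specular_to_diffuse_lower_bound(n, g, d_omega):
    """Expression 7: smallest physically possible specular/diffuse radiance ratio.

    n        index of refraction of the dielectric
    g        total diffuse albedo (assumed known, e.g. from equation 4)
    d_omega  incident solid angle in steradians
    """
    f0 = normal_incidence_fresnel(n)
    return f0 / (g * (1.0 - f0) ** 2 * d_omega)


def could_be_specularity(measured_ratio, n, g, d_omega):
    """A bright region whose contrast with the surrounding diffuse reflection
    falls below the bound cannot be a specularity on a smooth dielectric."""
    return measured_ratio >= specular_to_diffuse_lower_bound(n, g, d_omega)


if __name__ == "__main__":
    # Illustrative numbers only: n = 1.5, g = 0.35, d_omega = 0.01 sr.
    print(specular_to_diffuse_lower_bound(1.5, 0.35, 0.01))
```

Because the bound scales as 1/dω, a smaller light source (smaller incident solid angle) raises the minimum contrast a true specularity must show.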


6  CONCLUSION 

The primary result of this paper is a simple closed form
expression (equation 3), derived from first physical
principles, which accurately describes diffuse reflection from a
smooth dielectric surface and explains the physical origin
of diffuse albedo (equation 4). The results presented in
this paper have bearing on virtually any technique in
computer vision that relies upon the Lambertian assumption
applied to dielectric surfaces, including shape from shading,
shape and/or roughness determination from multiple
light source illumination (e.g., photometric stereo) and
shape from interreflection. It is impossible to reference all
of the related works but the book by Horn and Brooks [6]
contains a number of applicable papers. The results of this
paper make it possible to analyze the precise conditions
under  which  it  is  reasonable  to  assume  the  Lambertian 
model  for  a  particular  technique,  and  the  conditions  un¬ 
der  which  the  Lambertian  model  breaks  down.  This  more 
general  diffuse  reflectance  model  provides  a  more  solid 
physical  foundation  upon  which  to  develop  accurate  ob¬ 
ject  feature  extraction  techniques  in  computer  vision.  Our 
model  can  be  used  to  explain  the  relative  strengths  of  the 
specular  and  diffuse  reflection  components  from  smooth 
inhomogeneous  dielectric  surfaces  purely  in  terms  of  the 
physical  parameters  of  the  surface  itself  [18].  This  in  turn 
can  be  used  to  describe  the  relative  brightness  of  specu¬ 
larities  on  dielectric  surfaces. 
Abstract

The more general capabilities of polarization vision for
image understanding motivate the building of camera
sensors that automatically sense and process polarization
information. Described in this paper are a variety of
designs for polarization camera sensors that not only sense
partially linearly polarized light, but some of which
computationally process polarization information at pixel
resolution to produce a visualization of sensed polarization,
and/or a visualization of physical information directly
related to sensed polarization. Described are designs that
include the use of radial polarization filters, liquid crystals,
beamsplitters, and VLSI. All designs discussed are
currently patent pending.

1  Introduction:  Polarization  Vision  and 
Polarization  Camera  Computational 
Sensors 

As human beings we naturally think of vision in terms
of perception of intensity and color. Polarization of light
might appear to be of little relevance or benefit to
automated vision systems simply because the human visual
system is almost completely oblivious to this property of
light. In the context of Physics-Based Vision there is a
compelling motivation to study polarization vision: the extra
physical dimensions of polarization, beyond that of
intensity of light, carry added information about a world
scene that in turn provides a richer set of descriptive
physical constraints. As a result, the derivation of some
important low-level and high-level descriptions of imaged
scenes, which may be infeasible or very difficult to obtain
from intensity information alone, can be made possible or
can be immensely simplified from analysis of polarization.
These include important visual tasks like material
classification according to relative electrical conductivity (e.g.,
dielectric/metal), specular and diffuse reflection component
analysis, identification of specular reflection, color
constancy, and image region and image edge segmentations.
A detailed description of a variety of polarization-based
vision methods is contained in [4], [5], [7], [6], [2].

As Computer Vision algorithms have been designed to
perform intensity and color vision so have the video camera
sensors (e.g., CCD, CID) that are used in Computer
Vision laboratories and for image understanding
applications. A criticism that has sometimes been leveled at
polarization-based vision methods is the inconvenience of
obtaining polarization component images by having to
place a linear polarizing filter in front of an intensity camera
sensor and mechanically rotating this filter by hand or
by motor into different orientations. This inconvenience is
simply a result of commercially available camera sensors
being geared towards taking intensity images instead of
polarization images. In our conception, polarization vision
is no more a “multiple view” problem than is color
vision, and a camera sensor can be developed that can
automatically sense polarization components and even
automatically compute physical scene properties that are
directly related to this polarization information. Such a
polarization camera sensor was originally suggested in [6],
and in the past year in the Computer Vision Laboratory
at Johns Hopkins we have built and are continuing to
develop a variety of designs for such sensors. We discuss
in this paper a number of these designs for polarization
cameras that sense and process partially linearly polarized
light information.

The polarization state of light characterizes its complete
description as an electromagnetic wave, apart from
wavelength. Polarization is a more general physical description
of light than intensity, which characterizes its energy. For
instance, intensity can be derived from a linear sum of
polarization components. Practically all light occurring in
robotic/vision environments and naturally occurring light
fields is partially linearly polarized, which means that the
polarization state of such light can be represented by the
superposition of unpolarized and completely linearly
polarized component states (this includes unpolarized and
linearly polarized states themselves). A state of partially
linearly polarized light can be uniquely measured by sensing
light after passing through a dichroic material which
absorbs all component orientations of polarization except
along one axis (i.e., only one axis of polarization is
transmitted through the material). If a dichroic filter is made
so that all parts of the filter have the same transmission
axis (e.g., a standard polarizing filter) then the transmitted
radiance of light through the filter as a function of
angular orientation of the transmission axis will be a
sinusoid with periodicity of 180° as depicted in Figure 1.
This sinusoid can be experimentally recovered by taking
transmitted radiance measurements for 3 or more unique
orientations of the transmission axis. The transmitted
radiance sinusoid can be completely described by the
parameters I_max, I_min, and the phase, θ, of the sinusoid,
which represents its relative horizontal translation in the
graph of Figure 1 (e.g., the angular orientation at which
I_min occurs with respect to 0° on the polarizer vernier).
Another set of three parameters which characterizes
partially linearly polarized light and is of direct importance
to polarization-based image understanding is:

$$\text{(partial polarization)} \quad \frac{I_{max} - I_{min}}{I_{max} + I_{min}}$$

$$\text{(total intensity)} \quad I_{max} + I_{min} \qquad (1)$$

$$\text{(phase)} \quad \theta.$$
It can be shown that partial polarization represents the
fraction between 0 and 1.0 that the linearly polarized
component makes up of partially linearly polarized light [3].
The relative phase, θ, represents the angular orientation of
the linear component of polarization. Thus partially
linearly polarized light provides 2 extra physical dimensions
of sensory information beyond intensity of light.

Transmitted  Radiance  Sinusoid 


The three parameters in equations 1 describing partially
linearly polarized light can be recovered by transmitted
radiance measurements I_0, I_45, I_90 taken at 0°,
45°, and 90° angular orientation of the transmission axis,
respectively:

$$\theta = \tfrac{1}{2} \tan^{-1}\!\left(\frac{I_0 + I_{90} - 2 I_{45}}{I_{90} - I_0}\right),$$

if (I_90 < I_0) [ if (I_45 < I_0) θ = θ + 90° else θ = θ − 90° ]

$$I_{max} + I_{min} = I_0 + I_{90}, \qquad (2)$$

$$\frac{I_{max} - I_{min}}{I_{max} + I_{min}} = \frac{I_{90} - I_0}{(I_{90} + I_0) \cos 2\theta}$$

Of  course,  3  other  angular  orientations  could  be  used,  and 
more  than  3  angular  orientations  can  be  used  to  overcon¬ 
strain  the  transmitted  radiance  sinusoid.  We  have  found 
that  the  above  measurements  perform  very  well. 
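A minimal Python sketch of this recovery follows. It assumes the transmitted radiance has the form I(α) = A − B cos 2(α − θ), with θ the orientation at which I_min occurs, which is consistent with equations 2. Instead of the explicit quadrant test, the sketch uses `math.atan2` to resolve the phase, and it computes partial polarization from the sinusoid amplitude so that no division by cos 2θ is needed; this is a substitution for illustration, not the procedure stated in the text. The function names are hypothetical.

```python
import math


def transmitted_radiance(i_max, i_min, theta_deg, alpha_deg):
    """Model sinusoid: minimum at axis orientation theta, period 180 degrees."""
    mean = 0.5 * (i_max + i_min)
    amp = 0.5 * (i_max - i_min)
    return mean - amp * math.cos(2.0 * math.radians(alpha_deg - theta_deg))


def polarization_from_three(i0, i45, i90):
    """Recover (partial polarization, total intensity, phase in degrees)
    from measurements at 0, 45 and 90 degrees."""
    total = i0 + i90                      # I_max + I_min
    s = i0 + i90 - 2.0 * i45              # (I_max - I_min) * sin(2 theta)
    c = i90 - i0                          # (I_max - I_min) * cos(2 theta)
    theta = 0.5 * math.degrees(math.atan2(s, c)) % 180.0
    partial = math.hypot(s, c) / total if total > 0.0 else 0.0
    return partial, total, theta


if __name__ == "__main__":
    measurements = [transmitted_radiance(3.0, 1.0, 30.0, a) for a in (0, 45, 90)]
    print(polarization_from_three(*measurements))
```

For unpolarized light both s and c vanish, so the partial polarization is zero and the returned phase is arbitrary.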

Visualizing  Polarization 


FIGURE 2 (length of vector = partial polarization; intensity = I_max + I_min)

Because  humans  do  not  observe  polarization  directly 
except  with  the  aid  of  special  filters,  it  is  beneficial  for 
a  polarization  camera  to  produce  some  kind  of  visualiza¬ 
tion  for  representing  sensed  polarization  information  (e.g., 
intensity-color representation) for scene analysis. We
utilize a hue-saturation-intensity visualization for partially
linearly polarized light (Wolff and Mancini [8]). Such a
scheme was suggested by Bernard and Wehner [1] as a
functional similarity between polarization vision and color
vision for biological vision systems. Figure 2 shows a
natural one-to-one mapping of a state of partial linear
polarization into a hue, saturation (i.e., excitation purity), and
intensity, derived respectively from the orientation of the
plane of the completely linear polarized component, the
partial polarization, and the intensity of the light.
Therefore, in a polarization image, unpolarized light appears
achromatic and regions that are significantly partially
polarized appear chromatically saturated. The intensity of
light in a polarization image is simply the pixel intensity
itself, regardless of color, and can be easily processed by
intensity-based vision methods. This distinctly
demonstrates how a polarization image is a generalization of a
gray level intensity image. A number of color polarization
images using this visualization are displayed on the last
page of this article.

The use of partially linearly polarized light for image
understanding is explained in detail in [5], [7], [6], [2].
Only a few main points will be summarized here.
Significant partial polarization (i.e., above 10%) in a scene can
be due to specular reflection, and/or diffuse reflection from
inhomogeneous dielectric objects near occluding contours.
The transmitted radiance sinusoids for the specular and
diffuse reflection components are respectively 90° out of
phase. With respect to the plane determined by the
surface normal and the viewing vector, the maximum
mitted  radiance  for  specular  reflection  occurs  for  orienta¬ 
tion  perpendicular  to  this  plane,  while  for  diffuse  reflection 
at  occluding  contours  of  inhomogeneous  dielectric  objects 
the  maximum  transmitted  radiance  occurs  for  orientation 
parallel  to  this  plane.  This  is  an  important  physical  prin¬ 
ciple  that  can  be  exploited  to  help  distinguish  between 
partial  polarization  due  to  specular  reflection  and  diffuse 
reflection.  On  smooth  and  mildly  rough  surfaces  the  phase 
of  the  transmitted  radiance  sinusoid  gives  surface  normal 
constraint  information  [4],  [6],  [7].  The  pattern  of  trans¬ 
mitted  radiance  sinusoid  phases  from  specular  reflection 
occurring  at  multiple  surface  orientations  on  an  object 
gives  physical  shape  cues  that  can  be  exploited  for  object 
recognition. 

Another important mode of physical information for
interpreting objects in a scene is identification of intrinsic
material classification. It turns out that if the specular
angle of incidence is between 30° and 80°, and the
specular component of reflection is strong relative to the
diffuse component, the quantity I_max/I_min, derived from
transmitted radiance sinusoid parameters, is a very
reliable discriminator for varying levels of electrical
conductivity. This ratio for most metals varies between 1.0 and
2.0 while for dielectrics this ratio is above 3.0. The theory
of this is explained in [5], [6].
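The thresholds quoted above suggest a simple per-pixel rule, sketched below in Python. The function name, the string labels, and the handling of ratios between 2.0 and 3.0 (reported as undetermined) are assumptions for illustration; the sketch also presumes the specular component is known to dominate at the pixel.

```python
def classify_material(i_max, i_min, specular_angle_deg):
    """Label a specular pixel as 'metal', 'dielectric' or 'undetermined'
    from the ratio I_max / I_min of its transmitted radiance sinusoid."""
    # The ratio is only described as reliable for 30 to 80 degrees incidence.
    if not 30.0 <= specular_angle_deg <= 80.0:
        return "undetermined"
    if i_min <= 0.0:
        # Completely linearly polarized light: the ratio is unbounded.
        return "dielectric" if i_max > 0.0 else "undetermined"
    ratio = i_max / i_min
    if ratio > 3.0:
        return "dielectric"
    if 1.0 <= ratio <= 2.0:
        return "metal"
    return "undetermined"


if __name__ == "__main__":
    print(classify_material(1.5, 1.0, 55.0))
    print(classify_material(5.0, 1.0, 55.0))
    print(classify_material(5.0, 1.0, 10.0))
```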

Whether a polarization camera computes a visualization
of sensed polarization information at each pixel, or
computes a visualization of physical information (e.g.,
dielectric/metal composition) at each pixel related to sensed
polarization, a polarization camera is inherently a
computational sensor. It should be fully realized that, because
intensity is a compression of polarization component
information, a polarization camera can function as
a conventional intensity camera, so that intensity vision
methods can be implemented by such a camera either
alone or together with polarization-based vision methods.
As intensity-based methods are physical instances of
polarization-based methods, a camera sensor geared
towards polarization vision does not in any way exclude
intensity vision; it only generalizes it, providing more physical
input to an automated vision system! Adding color
sensing capability to a polarization camera makes it
possible to sense the complete set of electromagnetic
parameters of light incident on the camera.
2  Polarization Camera Using Radial Polarization Filters


Figure  1  depicts  the  magnitude  of  light  radiance  of 
transmitted  partially  linearly  polarized  light  through  a 
particular  transmission  axis  of  a  dichroic  material,  as  this 
transmission  axis  is  rotated.  Most  standard  polarizing  fil¬ 
ters  are  made  from  a  dichroic  material  with  constantly 
aligned  transmission  axis  across  the  material.  Using  a 
standard  polarizing  filter,  recovery  of  the  transmitted  ra¬ 
diance  sinusoid  at  each  pixel  requires  rotation  of  this 
axis  and  serially  grabbing  at  least  3  images  respective  to 
unique  orientations. 

There is a simple way of implementing a “1-big-pixel”
polarization camera utilizing a radial polarization filter
that allows recovery of the transmitted radiance sinusoid
in a single image. Figure 3a depicts the concentric circle
configuration of transmission axes on such a filter.
Assuming the same state of partial linear polarization striking
across this filter, the magnitude of transmitted radiance
along each circular transmission axis will be a complete
sinusoid for every 180° of circular arc. Figure 3b shows
linearly polarized light (produced by the square filter)
passing through a circular radial polarization filter. The
pattern looks something like two pairs of paint brush bristles
emanating from the center of the filter. The darkest axis
going through the center of the circular filter in Figure 3b
is where linear polarized light is being extinguished by
tangents to circular transmission axes perpendicular to the
orientation of the linearly polarized light. The brightest
axis, oriented 90° to the darkest axis, is where the
orientation of linear polarized light is parallel to tangents of
circular transmission axes.


FIGURE  3b 


Radial polarization filters are sometimes termed “axis-finders”
as the darkest axis tells exactly the orientation of
the transmission axis of a linear polarizing filter. However,
we are suggesting here that they be used for a more general
purpose: automating the computation of the state
of partially linearly polarized light. A radial polarization
filter makes it possible to simultaneously measure Imax,
Imin, and the orientations (i.e., phase) at which these occur,
90° relative to one another. Clearly, from equations 1 the
partial polarization can be easily computed. In fact, the
pattern produced by a radial polarization filter is quite
an intuitive visualization for partial linear polarization:
the contrast between darkest and brightest axes is
proportional to the partial polarization (unpolarized light is
simply a uniform intensity across the filter), and for nonzero
partial polarization the orientation of the darkest axis
corresponds to the orientation of the linearly polarized
component. It is straightforward to perform image processing
operations that will extract this information from
a pattern such as the one in Figure 3b across a large set of
pixels. A low cost “1-big-pixel” polarization camera can
operate simply by mounting a radial polarization filter on
a camera lens and focusing on a scene with nearly uniform
partial linear polarization within the field of view. At
higher cost, multiple radial polarization filters can be utilized
for higher resolution measurement of partial linear
polarization across a field of view.
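
The image processing step for the radial filter can be sketched as follows. This is only an illustrative sketch, not the procedure used by the authors: it assumes intensities have been sampled at evenly spaced position angles around one full ring centered on the filter, and it takes partial polarization to be (Imax - Imin)/(Imax + Imin), which is assumed to agree with equations 1. The function name `fit_ring_sinusoid` and the synthetic ring are hypothetical.

```python
import math

def fit_ring_sinusoid(angles_deg, intensities):
    """Fit I(phi) = c0 + c1*cos(2*phi) + c2*sin(2*phi) to ring samples.

    Assumes samples evenly spaced over a full 360-degree ring, so the
    least-squares fit reduces to a discrete Fourier projection.
    """
    n = len(intensities)
    c0 = sum(intensities) / n
    c1 = 2.0 / n * sum(i * math.cos(2.0 * math.radians(a))
                       for a, i in zip(angles_deg, intensities))
    c2 = 2.0 / n * sum(i * math.sin(2.0 * math.radians(a))
                       for a, i in zip(angles_deg, intensities))
    amplitude = math.hypot(c1, c2)
    i_max, i_min = c0 + amplitude, c0 - amplitude
    # Position angle of the brightest axis, folded into [0, 180).
    brightest = (0.5 * math.degrees(math.atan2(c2, c1))) % 180.0
    darkest = (brightest + 90.0) % 180.0
    partial_polarization = (i_max - i_min) / (i_max + i_min)
    return i_max, i_min, darkest, partial_polarization

# Synthetic ring (hypothetical data): brightest axis at 120 degrees.
angles = [float(k) for k in range(360)]
ring = [100.0 + 40.0 * math.cos(2.0 * math.radians(a - 120.0)) for a in angles]
i_max, i_min, darkest, rho = fit_ring_sinusoid(angles, ring)
```

Since the darkest axis corresponds to the orientation of the linearly polarized component, `darkest` is the quantity of interest for phase, and `rho` for partial polarization.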


3  Polarization Camera Using Liquid Crystals

Obtaining the transmitted radiance sinusoid by rotating
a polarizing filter in front of an intensity camera sensor
is a mechanically active process that produces optical
distortion and is difficult to fully automate. Unless the axis
perpendicular to the plane of the polarizing filter is exactly
aligned with the optic axis of the camera, small shifts in
projection onto the image plane occur between different
orientations of the polarizing filter. At intensity
discontinuities in a scene, significant shifts in image intensity are
observed, giving the false interpretation of reflected partial
polarization even if it does not exist.


FIGURE  4 


Figure 4 shows a liquid crystal polarization camera
that has been designed and built at Johns Hopkins using
a CCD intensity camera with a fixed polarizer and two
Twisted Nematic (TN) liquid crystals mounted in front.
The idea behind this liquid crystal polarization camera is
very simple. Nothing mechanically rotates; the polarizer
remains fixed while the TN liquid crystals electro-optically
rotate the plane of the linear polarized component of
reflected partially linearly polarized light. The unpolarized
component is not affected. In general the transmitted
radiance sinusoid can be recovered by the relative rotation
of the plane of linear polarization with respect to the
polarizer. Each TN liquid crystal is binary in the sense that
it rotates the plane of linear polarization either by a fixed n
degrees, 0° < n ≤ 90°, which is determined upon fabrication,
or by 0 degrees (i.e., no twist). Two TN liquid crystals
are used, one at n = 45°, and the other at n = 90°,
to ensure at least 3 samplings of the transmitted radiance
sinusoid. Components of partial linear polarization are
imaged at pixel resolution under full automatic computer
control, and these are processed on a Datacube MV-20
board programmable via ImageFlow software from a SUN
workstation. For details see Wolff and Mancini [8]. One
program on the Datacube MV-20 computes from polarization
component images a hue-saturation-intensity visualization
at each pixel for partial linear polarization. Another
program automatically computes dielectric/metal
composition by thresholding. Our liquid crystal
polarization camera can generate up to 2.5 polarization
images a second. The main timing bottleneck is the
relaxation time of 100 ms for each of the liquid crystals to
switch states. With the most current faster liquid crystals
we can at least double the rate of polarization images per
second, and we intend to incorporate these newer liquid
crystals in our implementation. A nice feature of our
liquid crystal polarization camera is that with the Datacube
MV-20 board, it is a programmable computational
sensor in that sensed polarization components can be
processed in a variety of ways.
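
To make the role of the three samplings concrete, the following sketch recovers the transmitted radiance sinusoid at one pixel from samples at relative rotations of 0°, 45°, and 90°. It is not the Datacube program described above. It assumes the sinusoid has the form I(t) = mean + amp·cos(2(t − phase)), with phase the orientation at which Imax occurs; the function name and sample values are hypothetical.

```python
import math

def sinusoid_from_three_samples(i_0, i_45, i_90):
    """Recover (I_max, I_min, phase_deg) from samples at 0, 45, 90 degrees.

    Assumed model: I(t) = mean + amp * cos(2 * (t - phase)).
    """
    mean = 0.5 * (i_0 + i_90)   # samples 90 degrees apart average to the mean
    a = i_0 - mean              # equals amp * cos(2 * phase)
    b = i_45 - mean             # equals amp * sin(2 * phase)
    amp = math.hypot(a, b)
    phase_deg = (0.5 * math.degrees(math.atan2(b, a))) % 180.0
    return mean + amp, mean - amp, phase_deg

# Hypothetical pixel: mean 100, amplitude 50, I_max oriented at 30 degrees.
i_0 = 100.0 + 50.0 * math.cos(math.radians(2.0 * (0.0 - 30.0)))
i_45 = 100.0 + 50.0 * math.cos(math.radians(2.0 * (45.0 - 30.0)))
i_90 = 100.0 + 50.0 * math.cos(math.radians(2.0 * (90.0 - 30.0)))
i_max, i_min, phase = sinusoid_from_three_samples(i_0, i_45, i_90)
```

Three samples suffice because the sinusoid has exactly three unknowns: its mean, amplitude, and phase.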

Color Figures C1, C2, C3, C4, and C5 at the back of
this article show polarization images taken with our liquid
crystal polarization camera depicting partial linear
polarization at pixel resolution in the hue-saturation-intensity
visualization scheme defined in Figure 2.
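
A hue-saturation-intensity visualization of this kind might be sketched as below. The exact mapping is defined in Figure 2 and is not reproduced here, so the choices in this sketch are assumptions: hue is taken as twice the phase (so that orientations 90° apart receive complementary hues), saturation as the partial polarization, and value as a normalized intensity. The zero point of the hue scale is arbitrary in this sketch.

```python
import colorsys

def polarization_to_rgb(phase_deg, partial_polarization, intensity):
    """Map a polarization state to an RGB triple in [0, 1].

    Assumed mapping (not taken from Figure 2): hue = 2 * phase,
    saturation = partial polarization, value = normalized intensity.
    """
    hue = ((2.0 * phase_deg) % 360.0) / 360.0
    saturation = min(max(partial_polarization, 0.0), 1.0)
    value = min(max(intensity, 0.0), 1.0)
    return colorsys.hsv_to_rgb(hue, saturation, value)

# Orientations 90 degrees apart map to complementary colors.
rgb_a = polarization_to_rgb(0.0, 1.0, 1.0)
rgb_b = polarization_to_rgb(90.0, 1.0, 1.0)
# Unpolarized light has zero saturation and appears gray.
rgb_gray = polarization_to_rgb(37.0, 0.0, 0.5)
```

Under this mapping, unpolarized regions appear achromatic and strongly polarized regions appear vividly colored, with the color indicating orientation.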

Figure C1 shows how a polarization image provides
important information about a scene that would be very
difficult and perhaps impossible to deduce from an intensity
image. The left intensity image of Figure C1 shows what
apparently are 2 mugs in a scene. Looking closely at the
intensity image reveals that there is some difference
between the 2 mugs; the left mug has its letters reversed.
The only visual cues telling that the left mug is simply a
reflection are very high level features such as the reversal
of recognizable high level features (e.g., alphabet letters)
or the edge of the glass mirror. Otherwise the reflected
intensity (and color) of the 2 mugs look essentially the
same. This type of problem occurs in vision fairly
frequently, such as when stray specular glare from objects
gives the false interpretation that real edges actually exist
there. Consider the problem of an autonomous land vehicle
viewing a scene part of which is reflected by a lake or
river. How does the vehicle know which are the “real”
elements of the scene? How does a mobile robot know when
it is running into a glass door, or, if navigating according
to edge cues, which are geometric edge cues as opposed
to specular edge cues? The right polarization image in
Figure C1 shows that the left mug has Cyan chromaticity
implying significant partial polarization. Cyan chromaticity
is also observed at specular highlights on the right mug
as well. (The very bright centers of specularities saturate
the camera so that pixels record gray level 255 regardless
of the state of the TN liquid crystals. This gives the flat
response of unpolarized light, when in fact the reflected light
from these areas is significantly partially polarized. This is a
limitation of the dynamic range of the SONY XC-77 CCD
camera being used, and NOT our polarization vision
algorithm.) Significant partial polarization is also observed at
the occluding contour of the right mug as Red color. Note
that the hue colors Cyan and Red are complementary colors
indicative of transmitted radiance sinusoids 90° out of
phase.

Figure C2 shows the intensity and polarization images
of a cylindrical cup illuminated with an extended light
source so as to produce specular reflection from a number
of different surface orientations. The different color hues
shown in the polarization image correspond to specular
plane surface orientation constraints. In this example,
Cyan color hue corresponds to specular planes oriented
vertically in the image while the complementary color hue,
Red, would correspond to specular planes oriented
horizontally in the image. Almost the entire spectrum of color
hues is displayed here. Figure C3 shows intensity and
polarization images of one hemisphere of a plastic sphere
illuminated with an extended light source. While the
polarization image does not give completely unique surface
orientation information, the pattern of specular plane
constraints gives enough rudimentary shape information to
distinguish different shape classes for object recognition.
For instance, on a cylindrical shape the lines of constant
color hue are parallel to one another (Figure C2) while
on a spherical shape lines of constant color hue mutually
intersect at a point (Figure C3). Besides being useful in
sort-by-shape systems in manufacturing, this suggests that
outdoor objects illuminated by skylight serving as an extended
illuminator may also be distinguishable by shape class.

Figure C4 shows the polarization image of a white
billiard ball under point source illumination. Chromaticity
at the occluding contour shows partial polarization of
diffuse reflection. According to theory, Imin of the transmitted
radiance sinusoid occurs for orientation parallel to the
occluding contour edge, and the hues correspond to these
edge orientations around the ball.

Figure C5 shows the polarization image of a pond
surrounded by rocks, grass and trees. Even though skylight
is partially polarized, at most angles specular reflection of
skylight off of water has Imin of the transmitted radiance
sinusoid closely aligned with the surface normal to the
water, which since it is fluid is aligned with gravity. In this
color hue coordinate system, the Green color hue is for
Imin occurring nearly vertical in this image. While there
is a noted shift in color hue for ripples in the pond where
surface orientation is perturbed, the water has a very
distinct reflected polarization signature against the reflected
polarization signature of trees and grass, which has less
chromaticity (i.e., less partial polarization) and variegated
color hue (i.e., a wide range of polarization phase). Note
also the significant partial polarization from the rocks.

Figure 5a shows a circuit board with solder metal,
dielectric, and solder metal covered with a translucent
dielectric material. The circuit board is illuminated with
an extended light source so that a strong specular
component is reflected from all object points into our liquid
crystal polarization camera. Figure 5b shows an image
where Imax/Imin is derived from partial linear polarization
at each pixel (and scaled in the range 0-255). Darker
regions in Figure 5b represent higher electrical conductivity,
lighter regions represent dielectric, intermediate re-

FIGURE 5b

4  Polarization Camera Using Beamsplitter

A  common  design  for  high  quality  color  cameras  is  to 
use  a  beamsplitter  that  directs  equal  amounts  of  incom¬ 
ing  light  onto  3  separate  CCD  chips  for  red,  green,  and, 
blue.  A  similar  idea  can  be  used  to  direct  light  onto  mul¬ 
tiple  CCD  chips,  each  chip  covered  by  a  uniquely  oriented 
polarizing  filter.  Unfortunately  the  polarizing  properties 
of  most  common  kinds  of  beamsplitters  can  be  variable 
across  standard  wide  fields  of  view. 

Figure 6 shows a 2-CCD chip polarization camera (i.e.,
using 2 CCD cameras) utilizing a polarizing plate
beamsplitter, built in the Johns Hopkins Computer Vision
Laboratory. The simplicity of this design stems from the use
of a special coating on a glass plate producing a
beamsplitter that affects the polarization of transmitted and
reflected light in a nearly constant known way across a
fairly wide range of angles (i.e., ±20°). The polarization
state of reflected and transmitted light is affected in a
linearly independent way by the plate beamsplitter. If the
component of polarization parallel to the floor is
represented by P, and the component of polarization vertical
to the floor is represented by S, then:

$$aP + bS = I_{transmitted}$$

$$(1 - a)P + (1 - b)S = I_{reflected}$$

where $a + b = 1$, $a, b > 0$, $a \neq b$. The coefficients $a, b$
are dependent upon the coating on the beamsplitter. This
results in the solution

$$S = \frac{I_{transmitted}(1 - a) - a I_{reflected}}{b - a}$$

$$P = \frac{I_{transmitted}(1 - b) - b I_{reflected}}{a - b}$$

If the P and S directions happen to coincide with the
directions of the maximum and minimum polarization
components, or, if the specular plane for specular reflection
from an object surface is known, then the partial
polarization and phase can be computed (i.e., the transmitted
radiance sinusoid can be computed). Otherwise, just the
P and S component magnitudes are known with respect
to the mutually orthogonal directions parallel and
perpendicular to the floor.
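
The recovery of P and S from the two CCD measurements is a 2×2 linear solve, sketched below. The sketch assumes the beamsplitter equations have the form aP + bS = I_transmitted and (1 − a)P + (1 − b)S = I_reflected; the function name and the coefficient values in the example are hypothetical, not properties of the actual coating.

```python
def solve_p_s(i_transmitted, i_reflected, a, b):
    """Solve a*P + b*S = I_t and (1-a)*P + (1-b)*S = I_r for (P, S).

    The determinant of the system is (a - b), so a == b has no solution.
    """
    if a == b:
        raise ValueError("a == b: non-polarizing beamsplitter, P and S not separable")
    p = (i_transmitted * (1.0 - b) - b * i_reflected) / (a - b)
    s = (i_transmitted * (1.0 - a) - a * i_reflected) / (b - a)
    return p, s

# Hypothetical coating coefficients and component magnitudes.
a, b = 0.8, 0.2
p_true, s_true = 10.0, 4.0
i_t = a * p_true + b * s_true
i_r = (1.0 - a) * p_true + (1.0 - b) * s_true
p, s = solve_p_s(i_t, i_r, a, b)
```

The closer a is to b, the smaller the determinant and the more measurement noise is amplified in the recovered components.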


FIGURE  6 


By adding a single TN liquid crystal to Figure 6 with
the 2 CCD chips and beamsplitter, the P and S
components can be measured with respect to two sets of
mutually orthogonal orientations. With 0 degree twist, the P and
S component orientations are parallel and perpendicular
to the page; with n degree twist the P and S components
are n degree rotations of parallel and perpendicular to the
page. As long as n does not equal 90 degrees, the transmitted
radiance sinusoid is being sampled at 4 unique points
and can be uniquely recovered with any 3 of these points.
There are obvious extensions using three CCD chips at
the expense of more difficult registration problems.

Polarization  information  from  the  2-CCD  polarization 
camera  is  processed  on  our  Datacube  MV-20  board  using 
the  formula 


$$\frac{|P - S|}{P + S} \qquad (3)$$


In the case where the P and S directions are aligned
with the principal polarization component directions (i.e.,
the directions of Imax and Imin), formula 3 is the partial
polarization (equations 1). In general, formula 3 is less
than or equal to the partial polarization of light, and even
though P and S may not be aligned with the principal
directions at a pixel, they can give a measure of
“polarization contrast” which can be useful. With the added
single TN liquid crystal, the full transmitted radiance
sinusoid can be recovered at each pixel, and processing of
this polarization information is similar to the way it is
done for the liquid crystal polarization camera.
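
The claim that the polarization contrast never exceeds the partial polarization can be checked numerically. The sketch below reads formula 3 as |P − S|/(P + S), which is an assumption, and models P and S as samples of the transmitted radiance sinusoid at 0° and 90°; the function names and test values are hypothetical.

```python
import math

def polarization_contrast(p, s):
    """|P - S| / (P + S), the reading of formula 3 assumed here."""
    return abs(p - s) / (p + s)

def p_s_components(i_max, i_min, phase_deg):
    """P (0 degrees) and S (90 degrees) samples of the radiance sinusoid."""
    mean = 0.5 * (i_max + i_min)
    amp = 0.5 * (i_max - i_min)
    p = mean + amp * math.cos(2.0 * math.radians(0.0 - phase_deg))
    s = mean + amp * math.cos(2.0 * math.radians(90.0 - phase_deg))
    return p, s

# Partial polarization of the hypothetical light: (150 - 50) / (150 + 50).
partial_polarization = (150.0 - 50.0) / (150.0 + 50.0)
contrasts = {phase: polarization_contrast(*p_s_components(150.0, 50.0, phase))
             for phase in (0.0, 20.0, 45.0, 70.0, 90.0)}
```

The contrast equals the partial polarization when P and S are aligned with the principal directions and drops to zero when they sit 45° away from them.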

Clearly if a = b, there is no solution for the equations
solving for P and S above since the simultaneous linear
equations to be solved are the same. The case where a = b
represents a non-polarizing beamsplitter since the
transmitted and reflected beams both have the same polarization
state as the incident polarization state, but one half
the radiance of the incident beam. The only way
polarization components can be resolved in this case is to place
polarizing filters over the CCD chips themselves, each chip
having a unique orientation. This can be a feasible design
for a polarization camera using a non-polarizing
beamsplitter that operates over a sufficiently wide field of view.




5  Polarization  Camera  Chips 

In collaboration with Prof. Andreas Andreou at Johns
Hopkins we are in the process of developing self-contained
VLSI versions of polarization cameras that sense
complete states of partial linear polarization on-chip, compute
the state of partial linear polarization, and compute
visualization or physical information related to sensed
polarization. VLSI offers very high computational throughput so
that VLSI polarization cameras can enable operations at
very high speeds.

6  Conclusion 

We  have  introduced  polarization  camera  computational 
sensors  that: 

•  Compute sensed polarization from either an image
pattern (e.g., radial polarization filters), a sequence
of polarization component images (e.g., liquid crystal
polarization camera), or, in parallel from multiple
polarization component states (e.g., multiple CCD
cameras with beamsplitter, self-contained VLSI chip).

•  Compute  a  visualization  of  polarization  information 
sensed  at  each  pixel  (e.g.,  hue-saturation-intensity 
representation  for  partially  linearly  polarized  light). 

•  Compute  a  visualization  of  physical  information  di¬ 
rectly  related  to  the  state  of  sensed  polarization  at 
each  pixel  (e.g.,  metal/dielectric  classification). 
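
The metal/dielectric classification named in the last item can be sketched as a per-pixel threshold on Imax/Imin, following the description of Figure 5b in which low ratios indicate higher conductivity and high ratios indicate dielectric. The threshold value used here is a placeholder chosen only for illustration, and the function name is hypothetical.

```python
def classify_material(i_max, i_min, ratio_threshold=2.0):
    """Label a pixel 'metal' or 'dielectric' by thresholding I_max / I_min.

    ratio_threshold is a hypothetical value; a real system would
    calibrate it for the imaging geometry and materials involved.
    """
    if i_min <= 0.0:
        return "dielectric"  # ratio is unbounded: maximally polarized
    ratio = i_max / i_min
    return "metal" if ratio < ratio_threshold else "dielectric"

labels = [classify_material(110.0, 100.0),   # weakly polarized reflection
          classify_material(300.0, 100.0)]   # strongly polarized reflection
```

A practical implementation would also need to exclude saturated pixels, where the recorded ratio is not meaningful.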

We feel that there are considerable advantages to building
a polarization camera sensor geared towards doing
polarization vision. Polarization cameras have more general
capabilities than standard intensity cameras, and there
already exist polarization vision methods that can
significantly benefit and enhance a number of application
areas such as aerial reconnaissance, autonomous navigation
(e.g., UGV), target recognition, inspection, and
manufacturing and quality control. Polarization cameras make
polarization vision methods more accessible to these
application areas and others.

Abstract

The  Lambertian  model  is  one  of  the  most  widely 
used  models  in  machine  vision.  For  several  real- 
world  objects,  the  Lambertian  model  can  prove 
to  be  a  very  inaccurate  approximation  to  the  dif¬ 
fuse  component.  While  the  brightness  of  a  Lam¬ 
bertian  surface  is  independent  of  viewing  direc¬ 
tion,  the  brightness  of  a  rough  diffuse  surface  in¬ 
creases  as  the  viewing  direction  approaches  the 
source  direction.  In  this  paper,  we  develop  a 
comprehensive  model  that  predicts  reflectance 
from  rough  diffuse  surfaces.  The  model  accounts 
for  complex  geometric  and  radiometric  phenom¬ 
ena  such  as  masking,  shadowing,  and  interreflec¬ 
tions  between  points  on  the  rough  surface.  The 
proposed  model  may  be  viewed  as  a  generaliza¬ 
tion  of  the  Lambertian  model  and  describes  re¬ 
flectance  characteristics  that  are  not  captured  by 
existing  models.  We  have  conducted  several  ex¬ 
periments  on  samples  of  rough  diffuse  surfaces, 
such  as,  plaster  and  sand.  All  of  these  surfaces 
demonstrate  significant  deviation  from  Lamber¬ 
tian  behavior.  The  reflectance  measurements 
obtained  are  in  strong  agreement  with  the  re¬ 
flectance  predicted  by  our  model.  We  conclude 
with  a  brief  discussion  on  the  implications  of 
these  results  on  machine  vision. 

1  Introduction 

Image  brightness  values  are  closely  related  to 
the  reflectance  properties  of  points  in  the  scene. 
Hence,  accurate  reflectance  models  are  funda¬ 
mental  to  the  advancement  of  machine  vision. 
A  surface  with  Lambertian  reflectance  appears 
equally  bright  from  all  directions.  This  model  for 
diffuse  reflection  is  one  of  the  most  widely  used 

•This  research  was  supported  in  part  by  DARPA  Contract 
No.  DACA  76-92-C-0007  and  in  part  by  the  NSF  Research 
Initiation  Award. 


assumptions  in  machine  vision.  It  is  used  explic¬ 
itly  in  the  case  of  shape  recovery  techniques  such 
as  shape  from  shading  and  photometric  stereo. 
It  is  also  implicitly  used  by  vision  techniques 
such  as  binocular  stereo  and  motion  detection 
to  solve  the  correspondence  problem. 

For  several  real-world  objects,  however,  the 
Lambertian  model  can  prove  to  be  a  poor  and 
inadequate  approximation  to  the  diffuse  compo¬ 
nent.  In  the  areas  of  machine  vision,  remote 
sensing,  and  computer  graphics,  each  picture  el¬ 
ement  (pixel)  can  represent  a  surface  area  with 
substantial  roughness.  Though  the  Lambertian 
assumption  is  often  reasonable  when  looking  at 
a  small  planar  surface  element,  the  roughness  of 
the  total  surface  covered  by  a  pixel  causes  it  to 
behave in a non-Lambertian manner. This deviation
from Lambertian reflectance is significant
for very rough surfaces, and increases with the
angle of incidence. In this paper, we develop
a comprehensive model that predicts reflectance
from rough diffuse surfaces, and provide experimental
results that support the model. The proposed
reflectance model may be viewed as
a vast generalization of the Lambertian model.

Prior to developing the diffuse reflectance model,
we conducted a detailed survey of related work
in the areas of applied physics and geophysics.
Though the topic of rough diffuse surfaces has
been extensively studied, a complete model such
as the one presented here has not been developed.
In 1924, Opik [Opik, 1924] designed an
empirical reflectance model to describe the
non-Lambertian behavior of the moon. In 1941,
Minnaert [Minnaert, 1941] modified Opik’s model to
obtain the following reflectance function:

$$f_r = \frac{k + 1}{2\pi} (\cos\theta_i \cos\theta_r)^{(k - 1)} \qquad (0 \le k \le 1)$$

This function was designed to obey Helmholtz’s
reciprocity principle but is not based on any
theoretical foundation. It predicts that the radiance
of non-Lambertian diffuse surfaces is symmetrical
with respect to the surface normal direction.
We will show in this paper that this assumption
is incorrect. Hapke and van Horn [Hapke
and Horn, 1963] also obtained reflectance
measurements from rough diffuse surfaces by varying
the source direction for a fixed viewer direction.
Their measurements show that the peak of the
radiance function is shifted from the peak position
expected for a Lambertian surface.
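
The two properties attributed to Minnaert's function can be verified directly. The sketch below assumes the form f_r = ((k + 1)/(2π)) (cos θ_i cos θ_r)^(k − 1) with angles in radians; the function name is hypothetical. Because the function depends only on the two zenith angles, it is symmetric about the surface normal, and swapping the angles leaves it unchanged (reciprocity).

```python
import math

def minnaert_brdf(theta_i, theta_r, k):
    """Minnaert reflectance function (assumed form), angles in radians."""
    if not 0.0 <= k <= 1.0:
        raise ValueError("k must lie in [0, 1]")
    return (k + 1.0) / (2.0 * math.pi) * (
        math.cos(theta_i) * math.cos(theta_r)) ** (k - 1.0)

# Reciprocity: exchanging source and viewer angles gives the same value.
forward = minnaert_brdf(0.3, 1.0, 0.6)
reverse = minnaert_brdf(1.0, 0.3, 0.6)
# With k = 1 the function reduces to the constant 1/pi, which is the
# Lambertian BRDF for unit albedo.
lambertian_case = minnaert_brdf(0.3, 1.0, 1.0)
```

For k < 1 the value grows as either angle approaches grazing, but always identically on both sides of the normal, which is the symmetry the paper argues against.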

The studies cited above were attempts to design
reflectance models based on measured
reflectance data. In contrast, Smith [Smith, 1967]
and Buhl et al. [Buhl et al., 1968] attempted to
develop a theoretical model for diffuse reflection
from rough surfaces. These efforts were motivated
primarily by the reflectance characteristics
of the moon. Infrared emissions from the moon
indicate that the surface of the moon reflects
more light back in the direction of the source (the
sun) than in the normal direction (like Lambertian
surfaces) or in the forward direction (like
specular surfaces). This phenomenon is sometimes
referred to as backscattering¹. Smith modeled
the roughness of the moon as a random process
and assumed each point on the surface to
be Lambertian in reflectance. However, Smith’s
analysis was confined to the plane of incidence
and is not easily extendable to reflections outside
the plane of incidence. Buhl et al. [Buhl
et al., 1968] modeled the surface as a collection
of spherical cavities. They accounted for
interreflections using this surface model, but did
not present a complete reflectance model that
accounts for masking and shadowing effects for
arbitrary angles of reflection and incidence.

This paper presents a general and complete
model for diffuse reflectance. This model can be
applied to isotropic as well as anisotropic rough
surfaces, and can handle arbitrary source and
viewer directions. Further, it takes into account
complex geometrical effects such as masking,
shadowing, and interreflections between points
on the rough surface. We begin by modeling
the surface as a collection of long symmetric
V-cavities. Each V-cavity has two opposing facets
and each facet is assumed to be much larger than
the wavelength of incident light. This surface model
was used by Torrance and Sparrow [Torrance
and Sparrow, 1967] to describe specular reflection
from rough surfaces. Here, we assume the
facets are Lambertian in reflectance². First, we
develop a reflectance model for anisotropic surfaces
with one type (facet-slope) of V-cavities,
and with all cavities aligned in the same orientation
on the surface plane. This result is
then used to develop a reflectance model for
the more general case of isotropic surfaces that
have Gaussian facet-slope distributions with zero
mean ($\mu = 0$) and arbitrary standard deviation
($\sigma$). The standard deviation represents the
macroscopic roughness of the surface. Figure
1 shows three images of spheres rendered using
the proposed reflectance model. In all three
cases, the sphere is illuminated from the viewer
direction. In the first case, $\sigma = 0$, and hence
the sphere is Lambertian in reflectance. As the
roughness value increases, the sphere begins to
appear flatter. In the extreme roughness case
shown in Figure 1c, the sphere looks like a flat
disc with near constant brightness. This phenomenon
has been observed and reported in the
case of the full moon.

¹A different backscattering mechanism produces a sharp
peak close to the source direction (see [Hapke and Horn, 1963,
Oetking, 1966, Shibata et al., 1981, Tagare and deFigueiredo,
1991]). This is not the mechanism discussed in this paper.

Figure 1: Images of spheres rendered using the proposed
reflectance model for rough surfaces; (a) Lambertian
sphere ($\sigma = 0$); (b) rough Lambertian sphere
($\sigma = 30°$); (c) rough Lambertian sphere ($\sigma = 70°$).
We present several experimental results that
demonstrate the accuracy of our diffuse reflectance
model. The experiments were conducted
on real samples such as wall plaster and
sand. In all cases, the reflectance predicted by
the model was found to be in strong agreement
with the measurements. These results illustrate
that the deviation from Lambertian behavior can
be very significant. For example, when the angle
of incidence is 75°, the brightness can vary
by a factor of three when the viewing direction
is varied. The results presented in this paper
demonstrate two points that are fundamental to
computer vision: (a) Several real-world objects
have diffuse components that are significantly
non-Lambertian. (b) The reflectance of such objects
cannot be accurately described using any
of the previous reflectance models.

²In cases where the facets have a specular component in
addition to the diffuse component, the model presented in this
paper can be used in conjunction with the Torrance-Sparrow
model [Torrance and Sparrow, 1967].

Figure 2: Geometry used to define radiometric
terms.


2  Radiometric Definitions

In this section, we define radiometric concepts
that will be extensively used in the remainder of
this paper. These concepts are discussed in detail
in [Nicodemus et al., 1977]. Figure 2 shows
a surface element $dA$ illuminated from the direction
$s = (\theta_i, \phi_i)$ and viewed by a sensor (image
pixel, for example) in the direction $v = (\theta_r, \phi_r)$.
We use $\theta$ to denote zenith angles and $\phi$ to denote
azimuth angles. The sensor subtends an
infinitesimal solid angle $d\omega_r$ from any point on
the surface.

The light energy reflected by the surface patch is
proportional to the light incident on the patch.
Irradiance is defined as the light flux incident per
unit area of the surface:

$$E(\theta_i, \phi_i) = \frac{d\Phi_i(\theta_i, \phi_i)}{dA} \qquad (1)$$

This is the directional irradiance of the surface as
it represents the light energy incident from the
direction $(\theta_i, \phi_i)$. The brightness measured by
the sensor is proportional to the radiance of the
surface patch in the direction $(\theta_r, \phi_r)$. Surface
radiance is defined as:

$$L_r(\theta_r, \phi_r; \theta_i, \phi_i) = \frac{d^2\Phi_r(\theta_r, \phi_r; \theta_i, \phi_i)}{dA \cos\theta_r \, d\omega_r} \qquad (2)$$

It is the flux radiated by the surface per unit
solid angle, per unit foreshortened area. It depends
on the direction of illumination and the
direction of the sensor. The relationship between
the irradiance and radiance of the surface
is determined by its reflectance properties. The
bi-directional reflectance distribution function (BRDF)
is defined as the ratio of the radiance to the irradiance:

$$f_r(\theta_r, \phi_r; \theta_i, \phi_i) = \frac{dL_r(\theta_r, \phi_r; \theta_i, \phi_i)}{dE(\theta_i, \phi_i)} \qquad (3)$$

All of the above definitions are general, in that
they are valid for surfaces with any reflectance
characteristics. A special type of reflectance that
is widely discussed in computer vision is Lambertian
reflectance. A Lambertian surface is an
ideal diffuser whose radiance is independent of
the viewing direction of the sensor; it appears
equally bright from all directions. Its BRDF is
$f_r = \rho / \pi$ where the constant $\rho$ is the albedo of the
surface.

3  Surface Roughness Model

All surface models found in applied physics and
geophysics literature can be divided into two
broad categories. In the first case, the surface
is modeled as a random process (see [Beckmann,
1965, Wagner, 1966, Smith, 1967]). Using
this approach, it is difficult to derive a reflectance
model for arbitrary source and viewer
directions as well as to analyze interreflections.
In the second category, surfaces are assumed
to be composed of several elements with some
primitive shape, for example, spherical cavities,
V-cavities, holes, etc. (see [Buhl et al., 1968,
Torrance and Sparrow, 1967]). As we show in
this paper, the effects of shadowing, masking,
and interreflection need to be modeled in order
to obtain an accurate reflectance model. To accomplish
this, we use the roughness model proposed
by Torrance and Sparrow [Torrance and
Sparrow, 1967] that assumes the surface to be
composed of long symmetric V-cavities (see Figure
3). Each cavity is composed of two facets.
The width of each facet is assumed to be small
compared to its length. The roughness of the
surface is determined by a probability function
used to model the distribution of facet slopes.

We  assume  each  facet  area  da  is  small  compared 
to  the  area  d A  of  the  surface  patch  that  is  imaged 
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Figure  3:  Surface  modeled  as  a  collection  of  V- 
cavities. 

by  a  single  pixel.  Hence,  each  pixel  includes  a 
very  large  number  of  facets.  Further,  the  facet 
area  is  large  compared  to  the  wavelength  A  of 
incident  light  and  hence  geometrical  optics  can 
be  used  to  derive  the  reflectance  model.  The 
above  assumptions  can  be  summarized  as;  A^  •< 
da  C  dA.  The  facets  could  be  relatively  small 
as  in  the  case  of  sand  and  plaster,  or  large  as  in 
the  case  of  outdoor  scenes  of  terrain. 


Slope-Area Probability Distribution:

We denote the slope and orientation of each facet in the V-cavity model as
$(\theta_a, \phi_a)$. Torrance and Sparrow have assumed all facets to have
equal area $da$. They use the distribution $n(\theta_a, \phi_a)$ to represent
the number of facets per unit surface area that have the normal
$\mathbf{a} = (\theta_a, \phi_a)$. Here, we use a probability distribution to
represent the fraction of the surface area that includes facets of a
particular slope. We refer to this as the slope-area distribution
$P(\theta_a, \phi_a)$. The facet number distribution and the slope-area
distribution can be related as follows:

$$P(\theta_a,\phi_a) = \frac{dA\; n(\theta_a,\phi_a)\, da \cos\theta_a}{dA} = n(\theta_a,\phi_a)\, da \cos\theta_a \tag{4}$$

The slope-area probability distribution is easier to use than the facet-number
distribution in the following derivation of the reflectance model. For
isotropic surfaces, we simply have $n(\theta_a,\phi_a) = n(\theta_a)$ and
$P(\theta_a,\phi_a) = P(\theta_a)$ since the distributions are rotationally
symmetric with respect to the global surface normal.


4 Reflectance Model

In this section, we derive a reflectance model for rough diffuse surfaces. The
V-cavity model is used to describe the surface geometry and each facet on the
surface is assumed to be Lambertian in reflectance. The following three types
of surfaces with different slope-area probability distributions are examined.

(a) Uni-directional Single-Slope Distribution: This distribution results in a
non-isotropic surface where all facets have the same slope and all cavities
are aligned in the same direction.

(b) Isotropic Single-Slope Distribution: Here, all facets have the same slope
but they are uniformly distributed in orientation on the surface.

(c) Gaussian Distribution: This is the most general case examined, where the
slope-area distribution is assumed to be normal with mean zero. The roughness
of the surface is determined by the standard deviation of the normal
distribution.

The reflectance model obtained for each of the above surface types is used to
derive the succeeding ones.

The Projected Radiance:

Consider a surface area $dA$ that is imaged by a single sensor element in the
direction $\mathbf{v} = (\theta_r, \phi_r)$ and illuminated by a distant point
light source in the direction $\mathbf{s} = (\theta_i, \phi_i)$. The area is
composed of a very large number of symmetric V-cavities. Each V-cavity is
composed of two facets with the same slope but facing in opposite directions.
Our approach is to compute the radiance contribution of each facet on the
surface. Then, the total radiance of the surface patch can be computed as an
aggregate of the contributions of all facets. Consider the flux reflected by a
facet with area $da$ and normal $\mathbf{a} = (\theta_a, \phi_a)$. As shown in
Figure 3, the projected area on the surface occupied by the facet is
$da \cos\theta_a$. Hence, while computing the contribution of the facet to the
radiance of the surface patch, we need to use the projected area
$da \cos\theta_a$ and not the actual facet area $da$. The radiance
contribution of the facet thus determined is what we call the projected
radiance of the facet and is given by:

$$L_{rp}(\theta_a,\phi_a) = \frac{d\Phi_r(\theta_a,\phi_a)}{(da \cos\theta_a)\,\cos\theta_r\, d\omega_r} \tag{5}$$

For ease of description, we have dropped the source and viewing directions
from the notations for radiance and flux in the above expression.


Total  Radiance: 


Now consider the slope-area distribution of facet orientations given by
$P(\theta_a, \phi_a)$. The total radiance of the surface can be obtained as
the average of the projected radiance $L_{rp}(\theta_a, \phi_a)$ of all the
facets on the surface:

$$L_r(\theta_r,\phi_r;\theta_i,\phi_i) = \int_{\theta_a=0}^{\pi/2} \int_{\phi_a=0}^{2\pi} P(\theta_a,\phi_a)\, L_{rp}(\theta_a,\phi_a)\, d\phi_a\, d\theta_a \tag{6}$$

4.1  Model  for  Uni-directional  Single- 
Slope  Distribution 

The first surface type we consider has all facets with the same slope
$\theta_a$. Further, all the V-cavities are aligned in the same direction. The
results obtained for this anisotropic surface geometry will later be used in
the analysis of isotropic surfaces.

Radiance  from  a  Lambertian  Facet: 

Consider a Lambertian facet that is fully illuminated (no shadowing) and is
completely visible (no masking) from the sensor direction. The radiance of the
facet is proportional to its irradiance and is given by:

$$L_r(\theta_a,\phi_a) = \frac{\rho}{\pi}\, E(\theta_a,\phi_a) \tag{7}$$

The irradiance of the facet is
$E(\theta_a,\phi_a) = E_0 \langle \mathbf{s},\mathbf{a} \rangle$, where $E_0$
is the irradiance when the surface is illuminated head-on. Using the
definition of radiance, the flux reflected by the facet in the sensor
direction is obtained as:

$$d\Phi_r = \frac{\rho}{\pi}\, E_0\, \langle \mathbf{s},\mathbf{a} \rangle \langle \mathbf{v},\mathbf{a} \rangle\, da\, d\omega_r \tag{8}$$

Substituting the above expression for reflected flux in equation 5, we obtain
the projected radiance for the facet:

$$L_{rp}(\theta_a,\phi_a) = \frac{\rho}{\pi}\, E_0 \cos\theta_i \cos\theta_a \left(1 + \tan\theta_i \tan\theta_a \cos(\phi_a - \phi_i)\right) \left(1 + \tan\theta_r \tan\theta_a \cos(\phi_a - \phi_r)\right) \tag{9}$$

This  expression  clearly  indicates  that  the  pro¬ 
jected  radiance  of  a  tilted  Lambertian  facet  is 
not  equal  in  all  viewing  directions.  Hence,  a 
rough Lambertian surface composed of tilted
facets reflects in a non-Lambertian manner; its
radiance varies with the viewing direction. This
phenomenon  is  observed  even  in  the  absence  of 
masking,  shadowing,  and  interreflection  effects. 
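As a minimal numerical sketch of equation 9, the function below evaluates the projected radiance of one unobstructed facet. The function name and the default values $\rho = 1$ and $E_0 = 1$ are illustrative assumptions, not part of the paper. Angles are in radians.

```python
import math

def projected_radiance(theta_a, phi_a, theta_i, phi_i, theta_r, phi_r,
                       rho=1.0, e0=1.0):
    """Equation 9: projected radiance of a fully lit, fully visible facet."""
    t = math.tan(theta_a)
    source_term = 1.0 + math.tan(theta_i) * t * math.cos(phi_a - phi_i)
    view_term = 1.0 + math.tan(theta_r) * t * math.cos(phi_a - phi_r)
    return (rho / math.pi) * e0 * math.cos(theta_i) * math.cos(theta_a) \
        * source_term * view_term

# A facet tilted by 20 degrees toward azimuth 0, lit from 30 degrees on the
# same side, and viewed at 40 degrees from the two opposite azimuths.
tilt = math.radians(20)
toward = projected_radiance(tilt, 0.0, math.radians(30), 0.0,
                            math.radians(40), 0.0)
away = projected_radiance(tilt, 0.0, math.radians(30), 0.0,
                          math.radians(40), math.pi)
print(toward, away)
```

The two printed values differ for the tilted facet, whereas setting `theta_a = 0` gives the same value for every viewing direction, which is the Lambertian case.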


4.1.1  Geometric  Attenuation  Factor 

If  the  surface  is  illuminated  and  viewed  head-on, 
all  of  the  facets  are  fully  illuminated  and  visi¬ 
ble.  For  larger  angles  of  incidence  and  reflec¬ 
tion,  however,  facets  are  shadowed  and  masked 


by  adjacent  facets  (see  Figure  4).  In  the  case  of 
shadowing,  a  facet  is  only  partially  illuminated 
as  the  adjacent  facet  on  the  V-cavity  casts  a 
shadow  on  it.  In  the  case  of  masking,  the  facet 
is  only  partially  visible  to  the  sensor  as  its  ad¬ 
jacent  facet  occludes  it.  Both  these  geometrical 
phenomena  affect  the  projected  radiance  of  the 
facet and hence must be taken into account. Torrance and Sparrow [Torrance and
Sparrow, 1967] have studied these effects while deriving their reflectance
model for specular facets. We analyze the same effects here for Lambertian
facets. The result is a geometrical attenuation factor (GAF) that lies between
zero and unity. It represents the reduction in the projected radiance of a
facet due to masking and shadowing effects. Consider the masked and shadowed
regions shown in Figure 4. If the visible area of a facet is smaller than the
illuminated area, the masking effect dominates. On the other hand, if the
illuminated area is smaller than the visible area, the shadowing effect
dominates. We denote the length and width of the facet by $l$ and $w$,
respectively. Further, $m_s$ and $m_v$ are the sections of the facet that are
shadowed and masked, respectively. The area of the facet which is both
illuminated and visible is $l \cdot Min[w - m_s,\; w - m_v]$. The GAF is
obtained by dividing this expression by the area $w\,l$ of the facet:

$$GAF = Min\left[1 - \frac{m_s}{w},\; 1 - \frac{m_v}{w}\right] \tag{10}$$

Figure 4: Masking and shadowing effects.


We need to express the GAF in terms of the source direction
$(\theta_i, \phi_i)$ and the sensor direction $(\theta_r, \phi_r)$. For any
given set of source and sensor angles, the shadowed and masked regions $m_s$
and $m_v$ can be derived using trigonometry. This derivation has already been
presented in [Blinn, 1977] and hence we will not discuss it here. With
$\mathbf{n}$ denoting the global surface normal, the final expression for the
GAF is found to be:

$$GAF = Max\left[0,\; Min\left[1,\; \frac{2\langle \mathbf{n},\mathbf{a} \rangle \langle \mathbf{n},\mathbf{s} \rangle}{\langle \mathbf{a},\mathbf{s} \rangle},\; \frac{2\langle \mathbf{n},\mathbf{a} \rangle \langle \mathbf{n},\mathbf{v} \rangle}{\langle \mathbf{a},\mathbf{v} \rangle}\right]\right] \tag{11}$$

Projected Radiance and GAF:

The projected radiance of a Lambertian facet is obtained by simply multiplying
the geometric attenuation factor with the projected radiance obtained under
the assumption of no masking and shadowing effects. Table 1 details the GAF
and the corresponding projected radiance for all cases of shadowing and
masking. Note that the projected radiance is denoted as $L^1_{rp}$; the
superscript 1 is used to indicate that the radiance is due to direct
illumination by the source. In the next section, we will use $L^2_{rp}$ to
denote the radiance due to interreflections (multiple reflections).
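The attenuation can be sketched directly from the direction vectors. The code below is an illustration under stated assumptions: the helper names are hypothetical, the global normal is taken as the z axis, and the factor is read as the smaller of the shadowing and masking terms limited to the range [0, 1], in line with equation 10. A facet turned away from the source or the sensor is given zero attenuation factor.

```python
import math

def direction(theta, phi):
    """Unit vector for zenith angle theta and azimuth phi (radians)."""
    return (math.sin(theta) * math.cos(phi),
            math.sin(theta) * math.sin(phi),
            math.cos(theta))

def dot(u, w):
    """Inner product of two 3-vectors."""
    return sum(ui * wi for ui, wi in zip(u, w))

def gaf(s, v, a, n=(0.0, 0.0, 1.0)):
    """Geometric attenuation factor of a facet with normal a."""
    sa, va = dot(s, a), dot(v, a)
    if sa <= 0.0 or va <= 0.0:
        return 0.0  # facet faces away from the source or the sensor
    an = dot(a, n)
    shadow_term = 2.0 * dot(s, n) * an / sa
    mask_term = 2.0 * dot(v, n) * an / va
    return max(0.0, min(1.0, shadow_term, mask_term))

n = (0.0, 0.0, 1.0)
a = direction(math.radians(30), 0.0)      # facet tilted toward azimuth 0
s = direction(math.radians(20), 0.0)      # source
v = direction(math.radians(75), 0.0)      # grazing view from the facet side
print(gaf(n, n, a), gaf(s, v, a))
```

For the grazing view the masking term is the smaller one, and multiplying the unobstructed projected radiance by this factor reproduces the masking row of Table 1.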


4.1.2  Interreflection  Factor 

In our reflectance model, we also account for interreflections: light rays
bouncing between adjacent facets. These effects are significant for rough
surfaces with relatively high albedo values. We have the task of modeling
interreflections in the presence of masking and shadowing effects. In the case
of Lambertian surfaces, the energy in an incident light ray diminishes rapidly
with each interreflection bounce. Hence, we model only two-bounce
interreflections and ignore subsequent bounces. The experimental results show
that this approximation is a good one. Interreflections are more significant
for very rough surfaces and less so for surfaces with mostly "open"
V-cavities. Since the long V-cavity model is only an approximation to the real
surface, we propose to use a parameter $\chi$, where $0 \le \chi \le 1$, to
represent the coefficient of the interreflection component of radiance. The
real surface may not have very long cavities and hence the parameter $\chi$
can be adjusted to account for such discrepancies. For very rough surfaces
with deep cavities, $\chi \approx 1$, and for relatively open and shorter
cavities (as in the case of terrain), $\chi \approx 0.5 - 0.75$.

In the following discussion, we refer to surface radiance due to direct
illumination by the source as $L^1_r$ and the radiance due to the second
bounce (first interreflection) as $L^2_r$. We will use the same superscripts
for the projected radiance. The two-bounce interreflection component for a
Lambertian facet can be expressed as [Koenderink and van Doorn, 1983]:

$$L^2_r(\mathbf{x}) = \frac{\rho}{\pi} \iint K(\mathbf{x},\mathbf{y})\, L^1_r(\mathbf{y})\, d\mathbf{y} \tag{12}$$

where $\mathbf{x}$ is a point on the facet whose interreflection component is
determined as an integral of the radiance of all points $\mathbf{y}$ on the
adjacent facet. $K(\mathbf{x},\mathbf{y})$ is called the kernel and represents
the geometrical relationship between $\mathbf{x}$ and $\mathbf{y}$. Since the
V-cavity is very long compared to its width, it can be viewed as a
two-dimensional shape with translational symmetry. For such shapes, the
interreflection component can be determined as an integral over the
two-dimensional cross-section of the shape. The above interreflection equation
is therefore reduced to:

$$L^2_r(x) = \frac{\rho}{\pi} \int K'(x,y)\, L^1_r(y)\, dy \tag{13}$$

where $x$ and $y$ are the shortest distances of points $\mathbf{x}$ and
$\mathbf{y}$ from the intersection of the two facets (see Figure 5). $K'$ is
the kernel for the translational symmetry case and is derived in [Forsyth and
Zisserman, 1989] to be:

$$K'(x,y) = \frac{\pi \sin^2(2\theta_a)}{2}\, \frac{xy}{\left(x^2 + 2xy\cos(2\theta_a) + y^2\right)^{3/2}} \tag{14}$$
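Equations 13 and 14 can be checked numerically. The sketch below is illustrative only: the function names are hypothetical, the integral is approximated with the midpoint rule, and the whole adjacent facet (from the cavity bottom to a width `width`) is assumed to be illuminated, so masking and shadowing limits are ignored.

```python
import math

def kernel(x, y, theta_a):
    """Equation 14: interreflection kernel K'(x, y) of a long V-cavity."""
    c = math.cos(2.0 * theta_a)
    s2 = math.sin(2.0 * theta_a) ** 2
    return 0.5 * math.pi * s2 * x * y / (x * x + 2.0 * x * y * c + y * y) ** 1.5

def second_bounce(x, theta_a, rho, l1, width=1.0, steps=2000):
    """Equation 13 by the midpoint rule.

    l1(y) is the first-bounce radiance at distance y along the adjacent facet.
    """
    dy = width / steps
    total = 0.0
    for k in range(steps):
        y = (k + 0.5) * dy
        total += kernel(x, y, theta_a) * l1(y)
    return rho / math.pi * total * dy

# Right-angle cavity (facet slope 45 degrees) with uniform first-bounce radiance.
print(second_bounce(0.5, math.pi / 4, 0.8, lambda y: 1.0))
```

For a right-angle cavity with uniform first-bounce radiance $L^1_r$, the integral has the closed form $\frac{\rho}{2} L^1_r \left(1 - x/\sqrt{x^2 + w^2}\right)$, which gives a convenient check on the numerical result.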

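We know that the orientation of the considered facet is
$\mathbf{a} = (\theta_a, \phi_a)$ and the orientation of the adjacent facet is
$\mathbf{a}' = (\theta_a, \phi_a + \pi)$. The limits of the integral in the
interreflection equation are determined by the masking and shadowing of the
facets. Let $m_v$ be the width of the facet which is visible to the viewer.
Let $m_s$ be the width of the adjacent facet that is illuminated. As in
section 4.1.1, expressions can be obtained for the visible and illuminated
sections:

$$\frac{m_v}{w} = Max\left[0,\; Min\left[1,\; \frac{\langle \mathbf{a}',\mathbf{v} \rangle}{\langle \mathbf{a},\mathbf{v} \rangle}\right]\right] \tag{15}$$

$$\frac{m_s}{w} = Max\left[0,\; Min\left[1,\; \frac{\langle \mathbf{a},\mathbf{s} \rangle}{\langle \mathbf{a}',\mathbf{s} \rangle}\right]\right] \tag{16}$$

Figure 5: Interreflections in a V-cavity.

Table 1: Projected radiance of a facet for different masking/shadowing conditions.

No masking or shadowing:
  GAF = 1
  $L^1_{rp} = \frac{\rho}{\pi} E_0 \frac{\langle \mathbf{a},\mathbf{s} \rangle \langle \mathbf{a},\mathbf{v} \rangle}{\langle \mathbf{a},\mathbf{n} \rangle \langle \mathbf{n},\mathbf{v} \rangle} = \frac{\rho}{\pi} E_0 \cos\theta_i \cos\theta_a \left(1 + \tan\theta_a \tan\theta_i \cos(\phi_a - \phi_i)\right) \left(1 + \tan\theta_a \tan\theta_r \cos(\phi_a - \phi_r)\right)$

Masking:
  GAF = $\frac{2\langle \mathbf{n},\mathbf{v} \rangle \langle \mathbf{n},\mathbf{a} \rangle}{\langle \mathbf{v},\mathbf{a} \rangle}$
  $L^1_{rp} = \frac{\rho}{\pi} E_0\, 2\langle \mathbf{a},\mathbf{s} \rangle = \frac{\rho}{\pi} E_0 \cos\theta_i \cos\theta_a\, 2\left(1 + \tan\theta_a \tan\theta_i \cos(\phi_a - \phi_i)\right)$

Shadowing:
  GAF = $\frac{2\langle \mathbf{n},\mathbf{s} \rangle \langle \mathbf{n},\mathbf{a} \rangle}{\langle \mathbf{s},\mathbf{a} \rangle}$
  $L^1_{rp} = \frac{\rho}{\pi} E_0 \frac{2\langle \mathbf{n},\mathbf{s} \rangle \langle \mathbf{a},\mathbf{v} \rangle}{\langle \mathbf{v},\mathbf{n} \rangle} = \frac{\rho}{\pi} E_0 \cos\theta_i \cos\theta_a\, 2\left(1 + \tan\theta_a \tan\theta_r \cos(\phi_a - \phi_r)\right)$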


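By using the following change of variables: $r = x/w$, $t = y/w$, we get the
projected radiance due to two-bounce interreflections as:

$$L^2_{rp} = \left(\frac{\rho}{\pi}\right)^2 E_0\, \frac{\langle \mathbf{a}',\mathbf{s} \rangle \langle \mathbf{a},\mathbf{v} \rangle}{\langle \mathbf{a},\mathbf{n} \rangle \langle \mathbf{v},\mathbf{n} \rangle} \int_{t=m_s/w}^{1} \int_{r=m_v/w}^{1} K'(r,t)\, dr\, dt \tag{17}$$

Using equation 14, the integral is evaluated as:

$$\frac{\pi}{2} \left[ d\left(1, \frac{m_v}{w}\right) + d\left(1, \frac{m_s}{w}\right) - d\left(\frac{m_s}{w}, \frac{m_v}{w}\right) - d(1,1) \right] \tag{18}$$

where:

$$d(x,y) = \sqrt{x^2 + y^2 + 2xy\cos(2\theta_a)} \tag{19}$$

We refer to the above term as the interreflection factor (IF). Then, the
interreflection component of the projected radiance of a facet with
orientation $(\theta_a, \phi_a)$ can be written as:

$$L^2_{rp} = \left(\frac{\rho}{\pi}\right)^2 E_0 \cos\theta_i \cos\theta_a \left(1 - \tan\theta_a \tan\theta_i \cos(\phi_a - \phi_i)\right) \left(1 + \tan\theta_a \tan\theta_r \cos(\phi_a - \phi_r)\right) IF(\mathbf{v},\mathbf{s},\mathbf{a}) \tag{20}$$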


The total projected radiance of the facet is the sum of the projected radiance
due to source illumination (given in Table 1) and the interreflection
component:

$$L_{rp}(\theta_a,\phi_a) = L^1_{rp}(\theta_a,\phi_a) + \chi\, L^2_{rp}(\theta_a,\phi_a) \tag{21}$$

As discussed earlier in this section, we use the parameter $\chi$ to weight
the contribution due to interreflections to account for discrepancies between
the interreflections predicted by the V-cavity surface model and those of the
real surface. Note that the uni-directional single-slope surface we have
considered in this section has only two types of facets, with normals
$(\theta_a, \phi_a)$ and $(\theta_a, \phi_a + \pi)$. Hence, the radiance of
the surface for any given source direction and sensor direction is simply the
average of the projected radiance of the two facet types:

$$L_r = \frac{1}{2} \left[ L_{rp}(\theta_a,\phi_a) + L_{rp}(\theta_a,\phi_a + \pi) \right] \tag{22}$$

4.2 Model for Isotropic Single-Slope Distribution

We now consider a surface where all V-cavities have facets with the same slope
($\theta_a$), but the V-cavities are uniformly distributed in orientation
($\phi_a$) in the plane of the surface. The result is a surface with isotropic
roughness. The reflectance model derived for this surface is based on the
results obtained for the single-slope surface in the previous section. The
results obtained in this section are important as they can be used to obtain a
reflectance model for any isotropic surface.

Consider a surface with only one type of V-cavity, distributed uniformly in
the plane of the surface. From the previous section, we know the projected
radiance $L^1_{rp}(\theta_a, \phi_a)$ of a facet with normal
$\mathbf{a} = (\theta_a, \phi_a)$. All facets on the isotropic surface have
the same slope $\theta_a$ but different orientation $\phi_a$. Hence, the
radiance of the isotropic surface is obtained as an integral of the projected
radiance over $\phi_a$:

$$L^1_r(\theta_a) = \frac{1}{2\pi} \int_{\phi_a=0}^{2\pi} L^1_{rp}(\theta_a,\phi_a)\, d\phi_a \tag{23}$$
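The integral of equation 23 can also be evaluated numerically instead of in closed form. The sketch below is an illustration under stated assumptions: the names are hypothetical, the global normal is the z axis, only direct illumination is included, and the projected radiance of each facet is taken as the smallest of the three Table 1 expressions (the dominant obstruction), with zero contribution from facets that face away from the source or the sensor.

```python
import math

def unit(theta, phi):
    """Unit vector for zenith angle theta and azimuth phi (radians)."""
    return (math.sin(theta) * math.cos(phi),
            math.sin(theta) * math.sin(phi),
            math.cos(theta))

def dot(u, w):
    return sum(ui * wi for ui, wi in zip(u, w))

def facet_radiance(theta_a, phi_a, s, v, rho=1.0, e0=1.0):
    """Direct-illumination projected radiance of one facet (Table 1)."""
    a = unit(theta_a, phi_a)
    sa, va = dot(s, a), dot(v, a)
    if sa <= 0.0 or va <= 0.0:
        return 0.0  # self-shadowed or self-masked facet
    cos_i, cos_r = s[2], v[2]
    unobstructed = sa * va / (math.cos(theta_a) * cos_r)
    masked = 2.0 * sa
    shadowed = 2.0 * cos_i * va / cos_r
    return rho / math.pi * e0 * min(unobstructed, masked, shadowed)

def isotropic_radiance(theta_a, s, v, rho=1.0, e0=1.0, steps=3600):
    """Equation 23: average of the projected radiance over facet azimuth."""
    d_phi = 2.0 * math.pi / steps
    total = sum(facet_radiance(theta_a, (k + 0.5) * d_phi, s, v, rho, e0)
                for k in range(steps))
    return total / steps

s = unit(math.radians(60), 0.0)
back = isotropic_radiance(math.radians(30), s, s)          # view from the source side
forward = isotropic_radiance(math.radians(30), s,
                             unit(math.radians(60), math.pi))
print(back, forward)
```

With a facet slope of zero the function returns the Lambertian value $\frac{\rho}{\pi} E_0 \cos\theta_i$ for every viewing direction; for tilted facets the backward direction comes out brighter than the forward one.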

For lack of space, we will avoid detailed derivations and focus on the main
results obtained. We approach the problem as follows: Given a source direction
$(\theta_i, \phi_i)$ and a sensor direction $(\theta_r, \phi_r)$, we need to
find the ranges of facet orientation $\phi_a$ for which the facets are masked,
shadowed, masked and shadowed, and neither masked nor shadowed. The projected
radiance for each range is given in Table 1. The problem then is to decompose
the above integral into different parts, each corresponding to a different
masking/shadowing range$^1$. Using basic geometry, we have identified the
limits of the integrals corresponding to different ranges of
shadowing/masking. These limits are represented by the critical angles
$\phi^s_c$ (for shadowing) and $\phi^v_c$ (for masking). The critical angle
$\phi^s_c$ is related to the slope $\theta_a$ of all surface facets as
follows:

$$\phi^s_c = \begin{cases} \cos^{-1}\left(\dfrac{1}{\tan\theta_a \tan\theta_i}\right) & \text{if } \tan\theta_a \tan\theta_i > 1 \\ 0 & \text{otherwise} \end{cases} \tag{24}$$

The critical angle $\phi^v_c$ is determined using the same expression by
replacing $\theta_i$ with $\theta_r$. The critical angles are related to the
masking/shadowing ranges as shown in Table 2.


Using the above critical angle expressions, Table 2, and Table 1, we decompose
the integral of equation 23 into the sum of several integrals. Each integral
can be evaluated for any given viewer direction. However, for arbitrary
directions several cases arise and the results are not easy to use in
practice. Therefore, we have chosen to express the radiance of the surface for
any arbitrary viewing direction $(\theta_r, \phi_r)$ as a weighted sum of the
radiance $L^1_{rp\parallel}$ in the plane of incidence
($\phi_r = \phi_i,\ \phi_i + \pi$) and the radiance $L^1_{rp\perp}$ in the
perpendicular plane ($\phi_r = \phi_i \pm \pi/2$). We use the following
notation: $\alpha = Max[\theta_i, \theta_r]$ and
$\beta = Min[\theta_i, \theta_r]$; if $\alpha = \theta_i$,
$\phi^\alpha_c = \phi^s_c$, else $\phi^\alpha_c = \phi^v_c$, and the same
rules apply to $\phi^\beta_c$. The projected radiance in the plane of
incidence is:


-  tan^da  tana(l  — 


2^°-|-sin(2<>?) 


) 


M(0a',0,<i>r  -  <l>i)  =  < 


20c  .  a  28in0P  .  ^ 

-  tan  0a  — tan  /? 


if  cos  (0r  —  0i )  <  0 
if  cos(^,  —  0«)  ^  0 


Similarly, the projected radiance in the perpendicular plane is determined as:

$$L^1_{rp\perp}(\theta_a) = \frac{\rho}{\pi}\, E_0 \cos\theta_i \cos\theta_a \left[ 1 + A_3(\theta_a; \theta_r, \theta_i) \right] \tag{26}$$


A3(0aJr,0i) 


’  0 

2  - 1 - Z - 

ytan^  tan^ 

'  , - « - 

ytan^  tan^^a~l 

- 1— 

tan  \/tan^  ^,+tan^  Br 
r 


2 


if 0^+0^  >  f 


The radiance of the surface in any arbitrary direction is approximated as the
following weighted sum of $L^1_{rp\parallel}(\theta_a)$ and
$L^1_{rp\perp}(\theta_a)$:

$$L^1_{rp}(\theta_a) \approx |\cos(\phi_r - \phi_i)|\, L^1_{rp\parallel}(\theta_a) + \left(1 - |\cos(\phi_r - \phi_i)|\right) L^1_{rp\perp}(\theta_a) \tag{27}$$

This approximation was obtained by studying the expressions for the radiance
components in the two planes. It is very accurate in general, with a slight
over-estimation only for $\theta_r = \theta_i$ and $\theta_i \to \pi/2$. Using
the above linear combination of the projected radiance in the two planes, we
obtain the final expression for the projected radiance:

$$L^1_{rp}(\theta_a) = \frac{\rho}{\pi}\, E_0 \cos\theta_i \cos\theta_a \left[ 1 + \cos(\phi_r - \phi_i) \left( A_1(\theta_a; \alpha) \tan\beta + A_2(\theta_a; \beta, \phi_r - \phi_i) \right) + \left(1 - |\cos(\phi_r - \phi_i)|\right) A_3(\theta_a; \theta_r, \theta_i) \right] \tag{28}$$


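$$L^1_{rp\parallel}(\theta_a) = \frac{\rho}{\pi}\, E_0 \cos\theta_i \cos\theta_a \left[ 1 + \cos(\phi_r - \phi_i) \left( A_1(\theta_a; \alpha) \tan\beta + A_2(\theta_a; \beta, \phi_r - \phi_i) \right) \right] \tag{25}$$

where: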

Ai{9a,oi)  =  tan  dg  —  + 

TT 

$^1$ Imagine a V-cavity rotated about the global surface normal for any given
source and sensor direction. Various masking/shadowing conditions can be
observed.


Note that the above projected radiance is the same as the total radiance of
the surface, $L^1_r(\theta_a) = L^1_{rp}(\theta_a)$, since we have only one
type of facet slope on the surface ($P(\theta_a) = 1$). In the above
derivation, we have not considered multiple reflections, as the
interreflection component of equation 21 is difficult to integrate over all
cavity orientation angles $\phi_a$. However, in the next section we present an
approximation to the interreflection component for surfaces with Gaussian
slope-area distribution.


Table 2: Masking/shadowing and the critical angles.

Shadowing:
  Partial shadow:         $|\phi_a - \phi_i| < \phi^s_c$
  Complete self-shadow:   $|\phi_a - (\phi_i + \pi)| < \phi^s_c$
  No shadow:              otherwise

Masking:
  Partial masking:        $|\phi_a - \phi_r| < \phi^v_c$
  Complete self-masking:  $|\phi_a - (\phi_r + \pi)| < \phi^v_c$
  No masking:             otherwise


Once again, it is important to note that the radiance of the rough surface
considered here is not constant with respect to the viewing direction
$(\theta_r, \phi_r)$. In other words, it behaves in a non-Lambertian manner.
We will study this behavior more closely in the following section.

4.3 Model for Gaussian Slope Distribution

The surface considered above consists of V-cavities with a single facet slope.
Real surfaces can be modeled only if the slope-area distribution
$P(\theta_a, \phi_a)$ includes cavities with several different facet slopes.
If the surface roughness is isotropic, the slope-area distribution can be
described using a single parameter, namely $\theta_a$, since the parameter
$\phi_a$ for facet orientation in the surface plane is uniformly distributed.
The radiance of any isotropic surface can therefore be determined as:

$$L_r = \int_0^{\pi/2} P(\theta_a)\, L_{rp}(\theta_a)\, d\theta_a \tag{29}$$

where the source illumination (no interreflections) component of
$L_{rp}(\theta_a)$ is given by equation 28. Here, we assume the isotropic
distribution of facet slope to be Gaussian with mean $\mu$ and standard
deviation $\sigma$, i.e. $P(\theta_a; \sigma, \mu)$. Reasonably rough surfaces
can be described using a zero-mean ($\mu = 0$) Gaussian distribution. The
roughness of the surface is then determined by the parameter $\sigma$.
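A numerical evaluation of equation 29 for a zero-mean Gaussian can be sketched as follows. This is only an illustration: the function name is hypothetical, the single-slope radiance `lrp` is supplied by the caller, $\sigma$ is given in radians, and the Gaussian weights are assumed to be normalized so that they integrate to one over $[0, \pi/2]$ (the normalization is not legible in the text above).

```python
import math

def gaussian_slope_average(lrp, sigma, steps=2000):
    """Approximate equation 29 for a zero-mean Gaussian slope-area distribution.

    lrp(theta_a) returns the single-slope radiance for facet slope theta_a.
    """
    d_theta = 0.5 * math.pi / steps
    weighted = 0.0
    norm = 0.0
    for k in range(steps):
        theta_a = (k + 0.5) * d_theta
        weight = math.exp(-theta_a * theta_a / (2.0 * sigma * sigma))
        weighted += weight * lrp(theta_a)
        norm += weight
    return weighted / norm  # the midpoint-rule step cancels in the ratio

# Placeholder single-slope radiance, used only to exercise the integration.
print(gaussian_slope_average(math.cos, math.radians(20)))
```

In practice `lrp` would be the single-slope radiance of equation 28, or a numerical average over facet orientation, evaluated for fixed source and sensor directions.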

The reflectance model can be obtained by using the projected radiance
$L^1_{rp}(\theta_a)$ given by equation 28 and the Gaussian distribution
$P(\theta_a; \sigma, 0)$ in the integral of equation 29. The resulting
integral cannot be evaluated in closed form. Hence, we pursued a functional
approximation to the integral that is accurate for arbitrary surface roughness
and angles of incidence and reflection. In developing this approximation, we
carefully studied the functional form of $L^1_{rp}(\theta_a)$. This enabled us
to identify some basis functions that can be used in the approximation. Then,
we conducted a large set of numerical evaluations of the integral in equation
29 by varying the surface roughness $\sigma$, the angles of incidence
$(\theta_i, \phi_i)$, and the reflection angles $(\theta_r, \phi_r)$. These
simulations and the identified basis functions were used to arrive at an
accurate functional approximation for surface radiance. This procedure was
applied independently to the source illumination component as well as the
interreflection component.


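The final approximation results are given below. As before, we define
$\alpha = Max[\theta_r, \theta_i]$ and $\beta = Min[\theta_r, \theta_i]$. The
source illumination component of the radiance of a surface with roughness
$\sigma$ is:

$$L^1_r(\theta_i,\phi_i;\theta_r,\phi_r;\sigma) = \frac{\rho}{\pi}\, E_0 \cos\theta_i \left[ C_1(\sigma,\mu) + \cos(\phi_r - \phi_i) \left( C_2(\alpha;\sigma,\mu) - C_3(\beta;\phi_r - \phi_i;\sigma,\mu) \right) \tan\beta + \left(1 - |\cos(\phi_r - \phi_i)|\right) C_4(\alpha;\beta;\sigma,\mu) \tan\left(\frac{\alpha + \beta}{2}\right) \right] \tag{30}$$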

where  the  coefficients  are: 

C,  « 

TT  <7-^  4-  0.4 

C2  ^  0.4 
C3 


Ca 


.  0 

0.11( 


(r2 -1-0.21 

ifcos(0, -0.)<O 


otherwise 


(72-1-0.21 


The  functional  approximation  to  the  interreflec¬ 
tion  component  of  surface  radiance  is: 


Ll(9i,<^i\0rAr\(r)  =  p^Eo  cos  9i 


(t2  -1-  0.4 
a 


0.04- 


0.11cos(^r  ~  <^x)(cos  ^(sin  /?)'  ® 


(31) 


The two components are combined using the interreflection parameter $\chi$ to
obtain the total surface radiance:

$$L_r(\theta_i,\phi_i;\theta_r,\phi_r;\sigma) = L^1_r(\theta_i,\phi_i;\theta_r,\phi_r;\sigma) + \chi\, L^2_r(\theta_i,\phi_i;\theta_r,\phi_r;\sigma) \tag{32}$$

Finally, the bi-directional reflectance function of the surface is obtained
from its radiance and irradiance as
$f_r(\theta_i,\phi_i;\theta_r,\phi_r;\sigma) = L_r(\theta_i,\phi_i;\theta_r,\phi_r;\sigma) / E(\theta_i,\phi_i)$.
It is important to note that the approximation presented here obeys
Helmholtz's reciprocity principle.

In the next section, we present several experimental results that verify the
diffuse reflectance model presented here. We conclude this section with a
brief illustration of the main characteristics predicted by the model. Figure
6 shows the reflectance predicted by the model for a rough Lambertian surface
with roughness $\sigma = 70°$, albedo $\rho = 0.95$, and $\chi = 1$. The
radiance $L_r$ in the plane of incidence ($\phi_r = \phi_i,\ \phi_i + \pi$) is
plotted as a function of the reflection angle $\theta_r$ for the angle of
incidence $\theta_i = 75°$. Two curves are shown in the figure: one generated
by numerical integration of equation 29 and the second (plotted as a thin
line) obtained by the model approximation given by equations 30 and 31. Notice
that these radiance plots deviate substantially from Lambertian reflectance.
The surface radiance increases as the viewing direction approaches the source
direction. The plot can be divided into three sections. In the backward
(source) direction, the radiance is maximum and gets "cut off" due to strong
masking effects when $\theta_r$ exceeds $\theta_i$. This cut-off occurs
exactly at $\theta_r = \theta_i$ and is independent of roughness. In the
middle section of the plot, radiance varies as a scaled $\tan\theta_r$
function with constant offset. Finally, the interreflections dominate in the
forward direction, where most facets are self-shadowed and the visible facets
receive light mainly from adjacent facets. The deviation from pure Lambertian
behavior increases with the angle of incidence $\theta_i$.


Figure 6: Diffuse reflectance from a surface with roughness $\sigma = 70°$ for
angle of incidence $\theta_i = 75°$. The numerical integration is the thick
line and the functional approximation is the thin line.


5 Experiments

We have conducted several experiments to verify the accuracy of the
reflectance model presented here. The experimental set-up used to measure the
radiance of samples is shown in Figure 7. In the case of outdoor scenes, each
sensor element (pixel) typically includes a large surface area (several inches
in dimension and often more). Commercially available reflectance measurement
devices are applicable only to small samples. Hence, we developed our own
measurement device. Each sample is approximately 2 x 2 inches. It is imaged
using a 512 x 480 pixel CCD camera that is mounted at the end of a 6 foot long
beam. The other end of the beam is attached to a rotary stage to facilitate
precise variation of the viewing angle $\theta_r$. The sample is illuminated
using a 300 Watt incandescent light source. The illumination direction is
varied manually. Images of the sample are digitized and the radiance is
computed as the average brightness over all pixels that lie on the sample.
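The averaging step can be sketched in a few lines. This is an illustration only: the function name, the nested-list image layout and the binary mask marking the sample pixels are assumptions, and the result stands in for radiance only if the camera response is linear.

```python
def mean_sample_brightness(image, mask):
    """Average grey level over the pixels flagged as lying on the sample.

    image: rows of pixel brightness values.
    mask:  rows of truthy/falsy flags with the same layout as image.
    """
    total = 0.0
    count = 0
    for row_pixels, row_mask in zip(image, mask):
        for value, on_sample in zip(row_pixels, row_mask):
            if on_sample:
                total += value
                count += 1
    if count == 0:
        raise ValueError("mask selects no pixels")
    return total / count

# Hypothetical 2 x 3 image in which three pixels lie on the sample.
image = [[10, 20, 30],
         [40, 50, 60]]
mask = [[0, 1, 1],
        [0, 1, 0]]
print(mean_sample_brightness(image, mask))
```

Repeating this for each viewing angle $\theta_r$ gives one measured point per image on the radiance plots.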


Figure  7:  Sketch  and  photograph  of  the  set-up  used 
to  measure  reflectance. 

Figure 8 shows results obtained for a sample of wall plaster. The sample has
matte local reflectance properties but is very rough; it is exactly the type
of surface that our diffuse reflectance model characterizes. The reflectance
is represented by the radiance of the surface in the sensor direction. The
radiance of each sample is plotted as a function of the sensor direction
$\theta_r$ for different angles of incidence $\theta_i$. These measurements
are made in the plane of incidence ($\phi_r = \phi_i = 0$). The dots represent
measured radiance values while the solid lines are predictions obtained using
the reflectance model for Gaussian surface roughness. The roughness $\sigma$
was empirically selected to obtain the best match between measured and
predicted radiance.

Similar results are presented in Figures 9 and 10 for sample B (painted sand
paper) and sample C (white sand). For all three samples, the radiance
increases as the viewing direction $\theta_r$ approaches the source direction
$\theta_i$ (backward reflection). This is in contrast to the behavior of rough
specular surfaces, which reflect more in the forward direction, or Lambertian
surfaces, where the radiance does not vary with viewing direction. For all
three samples, the model predictions and experimental measurements match
remarkably well. In all cases, a small peak is noticed near the source
direction. This is a different phenomenon from the one described by our model;
it is the backscatter peak noticed by others [Oetking, 1966, Hapke and Horn,
1963]. In the case of sample C (sand), we see a small specular component in
the forward direction. This is due to the specular characteristics of
individual sand particles.

6  Implication  on  Machine  Vision 

We  conclude  this  paper  with  a  brief  discus¬ 
sion  on  the  implication  of  the  diffuse  reflectance 
model  presented  here  on  machine  vision  algo¬ 
rithms.  Techniques  such  as  shape  from  shad¬ 
ing  and  photometric  stereo  make  assumptions  re¬ 
garding  the  reflectance  properties  of  surfaces  in 
the  scene.  Incorrect  modeling  of  the  reflectance 
properties  naturally  leads  to  inaccurate  shape 
recovery.  The  reflectance  model  presented  here 
clearly  indicates  that  rough  diffuse  surfaces  can¬ 
not  be  assumed  to  be  Lambertian  in  reflectance; 
they  appear  brighter  in  the  backward  (source) 
direction  rather  than  the  surface  normal  direc¬ 
tion  or  the  forward  (specular)  direction.  Fur¬ 
ther,  this  deviation  from  Lambertian  behavior 
increases  with  the  roughness  of  the  surface  and 
the  angle  of  incident  light.  The  model  can  there¬ 
fore  be  used  to  improve  the  performance  of  shape 
recovery  methods.  It  can  also  be  used  to  recover 
the macroscopic roughness ($\sigma$) of surfaces from
diffuse  reflectance  measurements.  The  recovered 
roughness  can  be  used  to  predict  the  appearance 
of  the  surface  under  different  illumination  and 
viewer  directions. 
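The backward brightening described above can be illustrated numerically. The sketch below does not use the full model of this paper (its equations are not reproduced in this section); it assumes the simplified qualitative approximation that is commonly associated with this rough-diffuse model in later literature. The function name, the albedo value, and the chosen angles are illustrative assumptions.

```python
import math

def rough_diffuse_radiance(theta_i, theta_r, dphi, sigma, albedo=0.9, e0=1.0):
    """Radiance of a rough diffuse surface (assumed qualitative approximation).

    All angles and sigma are in radians; dphi = phi_r - phi_i.
    sigma = 0 reduces to the Lambertian case.
    """
    s2 = sigma * sigma
    a = 1.0 - 0.5 * s2 / (s2 + 0.33)
    b = 0.45 * s2 / (s2 + 0.09)
    alpha = max(theta_i, theta_r)
    beta = min(theta_i, theta_r)
    return (albedo / math.pi) * e0 * math.cos(theta_i) * (
        a + b * max(0.0, math.cos(dphi)) * math.sin(alpha) * math.tan(beta)
    )

theta_i = math.radians(60)   # angle of incidence
sigma = math.radians(30)     # assumed roughness
# Viewer at 60 degrees on the source side (backward) and opposite side (forward).
backward = rough_diffuse_radiance(theta_i, math.radians(60), 0.0, sigma)
forward = rough_diffuse_radiance(theta_i, math.radians(60), math.pi, sigma)
lambertian = rough_diffuse_radiance(theta_i, math.radians(60), 0.0, 0.0)
print(backward, lambertian, forward)
```

Under these assumptions the rough surface is brighter than a Lambertian one when viewed from the source side and darker when viewed from the opposite side, which is the qualitative behavior reported for samples A, B, and C.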


Figure 8: Reflectance measurement and reflectance
model (using $\sigma = 50^\circ$, x = 0.75) plots for wall plaster
(sample A). Radiance is plotted as a function of
sensor direction ($\theta_r$) for different angles of incidence
($\theta_i = 30^\circ$, $45^\circ$, $60^\circ$).


Figure 9: Reflectance measurement and reflectance
model (using $\sigma = 70^\circ$, x = 0.50) plots for painted
sand-paper (sample B).


Figure 10: Reflectance measurement and reflectance
model (using $\sigma = 70^\circ$, x = 0.50) plots for sand
(sample C).


Figure 11 compares the reflectance map of a
Lambertian surface with that of a rough Lambertian
surface with $\sigma = 70^\circ$. It is interesting
to  note  that  the  rough  Lambertian  surface  pro¬ 
duces  a  reflectance  map  that  appears  very  sim¬ 
ilar  to  the  linear  reflectance  map  hypothesized 
for  the  lunar  surface  (our  model  predicts  that 
this  linearity  of  the  reflectance  map  exists  only 
when  viewer  direction  is  close  to  source  direc¬ 
tion).  In  a  sense,  the  reflectance  model  presented 
here  establishes  a  continuum  from  pure  Lamber¬ 
tian  to  lunar  reflectance  by  simply  varying  the 
roughness  of  the  surface.  Finally,  the  surface 
roughness  model  used  to  derive  the  reflectance 
model is the same as the one used by the popular
Torrance and Sparrow [Torrance and Sparrow,
1967] model for specular reflection from rough
surfaces.  Hence,  reflectance  from  general  rough 
surfaces  may  be  modeled  as  a  linear  combination 
of  the  Torrance-Sparrow  model  and  the  model 
proposed  in  this  paper. 


Figure 11: Reflectance maps for (a) Lambertian surface
($\rho = 0.9$) and (b) rough Lambertian surface ($\sigma = 70^\circ$,
$\rho = 0.90$, X = 0.75). For both maps the angles of
incidence are $\theta_i = 10^\circ$ and $\phi_i = 45^\circ$. Note the similarity
between the second map and the well-known
linear reflectance map previously suggested for lunar
reflectance.


References 

[Beckmann, 1965] P. Beckmann. Shadowing of
random rough surfaces. IEEE Transactions on
Antennas and Propagation, AP-13:384-388,
1965.

[Blinn, 1977] J. F. Blinn. Models of light reflection
for computer synthesized pictures.
ACM  Computer  Graphics  (SIGGRAPH  11), 
19(10):542-547,  1977. 

[Buhl et al., 1968] D. Buhl, W. J. Welch, and
D.  G.  Rea.  Reradiation  and  thermal  emission 
from  illuminated  craters  on  the  lunar  surface. 
Journal  of  Geophysical  Research,  73(16):5281- 
5295,  August  1968. 

[Forsyth  and  Zisserman,  1989]  D.  Forsyth  and 
A.  Zisserman.  Mutual  illumination.  Proc. 
Conf.  Computer  Vision  and  Pattern  Recogni¬ 
tion,  pages  466-473,  1989. 

[Hapke and Horn, 1963] B. W. Hapke and
Hugh Van Horn. Photometric studies of complex
surfaces, with applications to the moon.
Journal of Geophysical Research, 68(15):4545-4570,
August 1963.


[Koenderink and van Doorn, 1983] J. J. Koenderink
and A. J. van Doorn. Geometrical
modes as a general method to treat diffuse
interreflections in radiometry. Journal of the
Optical Society of America, 73(6):843-850,
1983.

[Opik, 1924] E. Opik. Photometric measures of
the moon and the earth-shine. Publications
de L’Observatorie Astronomical de
L’Universite de Tartu, 26(1):1-68, 1924.

[Minnaert,  1941]  M.  Minnaert.  The  reciprocity 
principle  in  lunar  photometry.  Astrophysical 
Journal, 93:403-410, 1941.

[Nicodemus  et  al.,  1977]  F.  E.  Nicodemus,  J.  C. 
Richmond,  and  J.  J.  Hsia.  Geometrical  Con¬ 
siderations  and  Nomenclature  for  Reflectance. 
National  Bureau  of  Standards,  October  1977. 
Monograph  No.  160. 

[Oetking, 1966] P. Oetking. Photometric studies
of diffusely reflecting surfaces with application
to the brightness of the moon. Journal of
Geophysical Research, 71(10):2505-2513, May
1966.

[Shibata et al., 1981] T. Shibata, W. Frei, and
M.  Sutton.  Digital  correction  of  solar  illu¬ 
mination  and  viewing  angle  artifacts  in  re¬ 
motely  sensed  images.  Machine  Processing  of 
Remotely  Sensed  Data  Symposium,  pages  169- 
177,  1981. 

[Smith,  1967]  B.  G.  Smith.  Lunar  surface  rough¬ 
ness:  Shadowing  and  thermal  emission.  Jour¬ 
nal  of  Geophysical  Research,  72(16):4059- 
4067,  August  1967. 

[Tagare  and  deFigueiredo,  1991]  H.  D.  Tagare 
and  R.  J.  P.  deFigueiredo.  A  theory  of  pho¬ 
tometric  stereo  for  a  class  of  diffuse  non- 
Lambertian  surfaces.  IEEE  Transactions  on 
Pattern  Analysis  and  Machine  Intelligence, 
13(2):133-152,  February  1991. 

[Torrance and Sparrow, 1967] K. Torrance and
E. Sparrow. Theory for off-specular reflection
from rough surfaces. Journal of the Optical
Society of America, 57:1105-1114, September
1967.

[Wagner,  1966]  R.  J.  Wagner.  Shadowing  of  ran¬ 
domly  rough  surfaces.  Journal  of  the  Acous¬ 
tical  Society  of  America,  41(1):138-147,  June 
1966. 




Separation  of  Reflection  Components 
Using  Color  and  Polarization 

Shree  K.  Nayar,  Xi-Sheng  Fang,  and  Terrance  Boult  * 

Center  for  Research  in  Intelligent  Systems,  Department  of  Computer  Science, 
Columbia  University  New  York,  N.Y.  10027,  U.S.A. 


Abstract 

Specular  reflections  and  interreflections  produce 
strong  highlights  in  brightness  images.  These 
highlights can cause vision algorithms such as
segmentation, shape from shading, binocular
stereo, and motion detection to produce erroneous
results. We present an algorithm for separating
the specular and diffuse components of reflection
from images. The method uses color and
polarization,  simultaneously,  to  obtain  strong 
constraints  on  the  reflection  components  at  each 
image  point.  Polarization  is  used  to  locally  de¬ 
termine  the  color  of  the  specular  component, 
constraining the diffuse color at a pixel to a
one-dimensional linear subspace. This subspace is
used  to  determine  neighboring  pixels  whose  color 
is  consistent  with  a  given  pixel.  Diffuse  color  in¬ 
formation  from  consistent  neighbors  is  used  to 
determine  the  diffuse  color  of  the  pixel.  In  con¬ 
trast  to  previous  separation  algorithms,  the  pro¬ 
posed  method  can  handle  highlights  that  have  a 
varying  diffuse  component  as  well  as  highlights 
that  include  regions  with  different  reflectance 
and  material  properties.  We  present  several  ex¬ 
perimental  results  obtained  by  applying  the  al¬ 
gorithm  to  complex  scenes  with  textured  objects 
and  strong  interreflections. 

1  Introduction 

Reflection  of  light  from  surfaces  can  be  classi¬ 
fied  into  two  broad  categories:  diffuse  and  spec¬ 
ular.  The  diffuse  component  results  from  light 
rays  penetrating  the  surface,  undergoing  multi¬ 
ple  reflections  and  refractions,  and  re-emerging 

*This work was supported in part by DARPA contract
DACA-76-92-C-007 and NSF contract #IRI-90-57951.


at  the  surface.  This  component  is  distributed 
in  a  wide  range  of  directions  around  the  surface 
normal,  giving  the  surface  a  matte  appearance. 
If the viewing direction of an image sensor is varied,
diffuse reflections from scene points change
slowly and, in the ideal case of Lambertian surfaces,
do not change at all. The specular
component,  on  the  other  hand,  is  a  surface  phe¬ 
nomenon.  Light  rays  incident  on  the  surface  are 
reflected  such  that  the  angle  of  reflection  equals 
the  angle  of  incidence.  Even  for  marginally 
rough  surfaces,  the  specular  reflections  are  con¬ 
centrated  in  a  compact  lobe  around  the  specu¬ 
lar  direction.  This  concentration  of  light  energy 
causes  strong  highlights  in  brightness  images  of 
scenes.  These  highlights  can  cause  vision  algo¬ 
rithms  for  scene  segmentation  and  shading  anal¬ 
ysis  to  produce  erroneous  results.  If  the  sen¬ 
sor  direction  is  varied,  highlights  shift,  dimin¬ 
ish  rapidly,  or  suddenly  appear  in  other  parts  of 
the scene. This strong directional dependence
of specular reflection poses serious problems for
vision techniques such as binocular stereo and
motion detection. Hence, specularities are often
undesirable in images.

In  this  paper,  we  present  an  algorithm  that  sep¬ 
arates  the  diffuse  and  specular  components  of 
brightness  from  images.  Separation  of  reflection 
components  has  been  a  topic  of  active  research  in 
the  past  few  years.  We  discuss  only  those  efforts 
that  have  resulted  in  algorithms  that  have  been 
tested  on  real  images.  Most  of  this  work  is  based 
on  the  dichromatic  reflectance  model  proposed 
in [Shafer, 85]. The dichromatic model suggests
that,  in  the  case  of  dielectrics  (non-conductors), 
the diffuse component and the specular component
generally have different spectral distributions.
Hence, the color of an image point can
be  viewed  as  the  sum  of  two  vectors  with  differ¬ 
ent  directions  in  color  space.  Using  this  model, 
[Klinker,  88]  and  [Gershon,  87]  independently 
observed that the color histogram of an object with
uniform  diffuse  color  takes  the  shape  of  a  skewed 
T  with  two  limbs.  One  limb  corresponds  to 
purely  diffuse  points  on  the  object,  which  have 
the  same  color  but  differ  in  magnitude,  and  the 
second  limb  represents  a  highlight  region.  They 
proposed  algorithms  for  automatically  identify¬ 
ing  the  two  limbs  and  used  the  directions  of  the 
limbs  to  separate  the  diffuse  and  specular  com¬ 
ponents  at  each  object  point.  Later,  [Bajcsy,  et. 
al.,  90]  showed  that  the  color  histogram  of  an  ob¬ 
ject  could  have  additional  limbs  that  correspond 
to  highlights  caused  by  interreflections  between 
objects. More recently, [Lee, 91] proposed moving
a sensor and applying spectral differencing to
color  histograms  of  consecutive  images  to  iden¬ 
tify  specular  points  in  the  image.  This  method 
however  does  not  compute  accurate  estimates  of 
the  specular  component  at  each  image  point. 
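The vector-sum view of the dichromatic model can be made concrete with a small sketch. It assumes, purely for illustration, that the diffuse and specular color directions are already known; finding those directions is the hard part that the histogram-based methods above address. The function name and the example colors are hypothetical.

```python
import numpy as np

def split_dichromatic(pixel, diffuse_dir, specular_dir):
    """Solve pixel ~= m_d * diffuse_dir + m_s * specular_dir for (m_d, m_s).

    Returns the diffuse and specular color vectors of the pixel.
    """
    basis = np.column_stack([diffuse_dir, specular_dir]).astype(float)  # 3 x 2
    coeffs, *_ = np.linalg.lstsq(basis, np.asarray(pixel, dtype=float), rcond=None)
    m_d, m_s = coeffs
    return m_d * basis[:, 0], m_s * basis[:, 1]

diffuse_dir = np.array([0.8, 0.3, 0.1])    # assumed reddish object color
specular_dir = np.array([1.0, 1.0, 1.0])   # assumed white illuminant
pixel = 0.5 * diffuse_dir + 0.4 * specular_dir
diffuse_part, specular_part = split_dichromatic(pixel, diffuse_dir, specular_dir)
print(diffuse_part, specular_part)
```

The decomposition is unique only when the two directions differ, which is why a gray object under white light defeats color-based separation.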

All  of  the  above  algorithms  rely  solely  on  color 
information  to  separate  specular  and  diffuse  re¬ 
flections  from  images.  Since  the  separation  is 
not  possible  when  an  image  point  is  treated  in 
isolation,  these  methods  analyze  the  anatomy  of 
color  histograms.  Two  major  limitations  result 
from  the  above  approach.  First,  real  scenes  are 
complex  and  include  objects  with  texture  and 
varying  reflectance.  Color  histograms  of  such 
scenes are generally unpredictable and a set of
linear clusters such as the skewed T is unlikely.
Second, all points on the highlight region are assumed
to have the same diffuse component (color
and magnitude). Even for an object with uniform
reflectance, this assumption is valid only if
the  object  surface  is  very  smooth.  In  the  case 
of  rough  surfaces,  the  highlights  spread  over  a 
wider  range  of  surface  normals  and  the  specular 
limb  of  the  skewed  T  does  not  have  a  well-defined 
direction. 

Recently, [Wolff and Boult, 91] proposed a
polarization-based method for separating specular
and diffuse components from gray-level (black
and white) images. Details of this method
will be discussed later. Their polarization-based
method  assumes  that  the  diffuse  component 
(color and magnitude) is constant over the entire
highlight region. They also assume that the
material  type  and  surface  normal  do  not  vary 
within  the  highlight  region.  Using  these  assump¬ 
tions,  an  estimate  for  a  constant  diffuse  term  is 
obtained. We will show later that the assumptions
made in [Wolff and Boult, 91] are often
not  practical  in  the  context  of  real  scenes. 

This  paper  presents  a  new  algorithm  for  the  sep¬ 
aration  of  specular  and  diffuse  reflection  compo¬ 
nents  from  images.  This  algorithm  uses  color 
and polarization simultaneously to obtain new
constraints on the reflection components. As a
result, it does not suffer from many of the problems
associated with previous methods based on
either color or polarization. We assume that the
scene  consists  of  dielectric  objects.  This  leads 
to  two  assumptions:  (a)  the  dichromatic  model 
is  applicable,  and  (b)  the  specular  component 
is  polarized  while  the  diffuse  component  is  not. 
The  restrictions  imposed  by  these  assumptions 
are  discussed  in  subsequent  sections.  The  pro¬ 
posed  algorithm  can  estimate  specular  compo¬ 
nents  that  result  not  only  from  direct  source  illu¬ 
mination  but  also  interreflections  between  points 
in  the  image.  We  show  that,  under  reason¬ 
able  assumptions,  polarization  information  can 
be  used  to  obtain  the  color  of  the  specular  com¬ 
ponent  independently  for  each  point  with  a  spec¬ 
ular  component.  For  each  such  point  the  result  is 
a  line  in  color  space  on  which  the  diffuse  vector 
must  lie.  This  line  imposes  strong  constraints 
on  the  color  of  the  diffuse  component  of  that  im¬ 
age  point.  Neighboring  diffuse  colors  that  satisfy 
these  constraints  are  used  to  compute  the  diffuse 
component  of  the  image  point. 
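The line constraint can be sketched geometrically. If the pixel color is I and the specular color direction is k, the candidate diffuse vectors are I - t k with t >= 0. The check below is one plausible way to test whether a neighbor's diffuse color direction is compatible with that line; it is an illustration under stated assumptions, not the consistency test defined later in the paper. The function name and the angular tolerance are hypothetical.

```python
import numpy as np

def is_consistent(pixel, specular_dir, neighbor_dir, tol_deg=2.0):
    """True if neighbor_dir could be the diffuse color direction of `pixel`.

    The neighbor is accepted when its direction lies (within tol_deg) in the
    plane spanned by pixel and specular_dir, and pixel = m * neighbor_dir +
    t * specular_dir holds with m > 0 and t >= 0.
    """
    pixel = np.asarray(pixel, dtype=float)
    k = np.asarray(specular_dir, dtype=float)
    d = np.asarray(neighbor_dir, dtype=float)
    normal = np.cross(pixel, k)
    scale = np.linalg.norm(normal) * np.linalg.norm(d)
    if scale == 0.0:
        return False  # degenerate: pixel color parallel to specular color
    off_plane = np.degrees(np.arcsin(min(1.0, abs(normal @ d) / scale)))
    if off_plane > tol_deg:
        return False
    # Least-squares solution of pixel = m * d + t * k.
    (m, t), *_ = np.linalg.lstsq(np.column_stack([d, k]), pixel, rcond=None)
    return bool(m > 0.0 and t >= -1e-9)

d_true = np.array([0.8, 0.3, 0.1])   # assumed diffuse color direction
k = np.array([1.0, 1.0, 1.0])        # assumed specular color direction
pixel = 0.5 * d_true + 0.4 * k
same_material = is_consistent(pixel, k, d_true)
other_material = is_consistent(pixel, k, [0.1, 0.3, 0.8])
print(same_material, other_material)
```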

Since  the  specular  color  of  each  image  point 
is  computed  independently,  our  approach  has 
the following advantages over previous methods:
(a)  The  diffuse  component  is  not  assumed  to  be 
constant  under  the  highlight  region;  (b)  The 
Fresnel  ratio  (which  depends  on  the  material 
properties  and  the  angle  of  incidence)  can  vary 
over  highlight  regions;  and  (c)  the  specular  com¬ 
ponent  may  be  textured.  The  algorithm  requires 
that  each  image  point  has  a  few  (at  least  three) 
neighbors  that  have  the  same  computed  diffuse 
color  (that  is,  direction  in  color  space  but  not 
necessarily magnitude). Color Figure 1a (see
pages with color photos) shows a highlight that
spreads over an object with strong surface texture.
Color Figure 1b shows the diffuse image
of  the  object  recovered  using  the  proposed  algo¬ 
rithm.  In  the  experimental  section,  we  present 
several  other  results  obtained  by  applying  the 
algorithm  to  complex  scenes  with  multiple  high¬ 
lights  and  interreflections. 

2  Reflection  and  Interreflection 

We  begin  by  describing  the  mechanisms  involved 
in  the  processes  of  reflection  and  interreflection. 
Specular  reflections  cause  highlights  in  the  im¬ 
age  that  pose  serious  problems  for  a  variety  of 
vision  techniques.  The  objective  of  this  paper  is 
to  remove  specularities  due  to  single  reflections 
as  well  as  interreflections. 

Figure  1  shows  two  points,  A  and  B,  in  a  scene. 
Reflection  from  the  point  A  has  two  components, 
namely, diffuse and specular.¹ The diffuse component
arises from the scattering of light rays
that  enter  the  surface  and  undergo  multiple  re¬ 
flections  and  refractions.  The  specular  compo¬ 
nent,  on  the  other  hand,  is  a  surface  phenomenon 
and  results  from  single  reflection  of  incident  light 
rays.  The  surface  may  be  assumed  to  be  com¬ 
posed  of  several  planar  elements,  or  facets,  where 
each facet has its own orientation. The result
is  a  specular  component  that  spreads  around  the 
specular  direction,  the  width  of  the  distribution 
depending  on  the  roughness  of  the  surface  [Tor¬ 
rance  and  Sparrow,  67]. 

Now  let  us  consider  the  phenomenon  of  inter- 
reflections.  Points  in  the  scene  receive  light  not 
only  from  the  light  sources  but  also  from  other 
scene  points.  Assume  that  point  B  reflects,  into 
the  sensor,  light  energy  from  the  direction  of 
point  A.  The  resulting  image  brightness  value 
can  be  viewed  as  the  linear  combination  of  four 
possible  interreflection  components:  (a)  diffuse- 
diffuse;  (b)  specular-diffuse;  (c)  diffuse-specular, 
and  (d)  specular-specular.  In  each  case,  the  first 
term  represents  the  component  received  from 
point  A  and  the  second  represents  the  compo¬ 
nent  reflected  by  point  B.  In  general,  point  B 
could  reflect  light  due  to  both  direct  illumination 
by  light  sources  as  well  as  interreflections  from 
other  scene  points.  We  consider  each  brightness 

¹ Recently, [Nayar, et. al., 91] proposed a reflectance framework
that includes three primary components of reflection: the
diffuse lobe, the specular lobe, and the specular spike. In this
paper, the two specular components are combined to yield
the specular component.


Figure  1:  Components  of  reflection  and  inter¬ 
reflection. 

value in the image as the sum of two components,
diffuse and specular. The specular component
can result either from direct illumination
by a light source or from diffuse-specular
or specular-specular interreflections. We assume
that the specular reflection received by the sensor
from any given point is either due to source
illumination or due to interreflections and not
both.  In  other  words,  any  given  scene  point  is 
positioned  and  oriented  to  specularly  reflect  es¬ 
sentially  from  either  a  light  source  or  another 
scene  point  but  not  both.  This  assumption  holds 
well  except  for  very  rough  surfaces. 

3  Polarization 

The  method  presented  in  this  paper  uses  a  polar¬ 
ization  filter  to  determine  the  color  of  the  specu¬ 
lar  component.  In  this  section,  we  present  a  brief 
overview  of  polarization  and  discuss  the  type  of 
surfaces  for  which  it  provides  useful  information. 
Detailed discussions on the theory of polarization
can be found in [Born and Wolf, 65]. In
the  field  of  machine  vision,  polarization  meth¬ 
ods  were  first  introduced  in  [Koshikawa,  79] 
who  used  ellipsometry  for  shape  interpretation 
and  recognition  of  glossy  objects.  More  recently, 
[Wolff and Boult, 91] examined the use of linear
polarization  for  highlight  removal  and  material 
classification. 

Figure 2 shows a surface element illuminated by
a source and imaged by a sensor. A polarization
filter is placed in front of the sensor. As
in the previous section, let the image brightness
value corresponding to the surface element be
written as $I = I_d + I_s$, where $I_d$ is the diffuse
component and $I_s$ is the specular component. The
linear polarization of each light wave is determined
by the direction of its electric field vector.
In general, the light energy (due to several
light waves) reflected by a surface may be partially
polarized. The extent of polarization depends
on several factors including the material
of the reflecting surface element, its orientation
with respect to the image sensor, and the types
of reflection mechanisms (specular or diffuse) at
work.

Figure 2: Surface element illuminated by a
source and imaged through a polarization filter.

The diffuse component of reflection tends to be
unpolarized.² In contrast, the specular component
tends to be partially polarized; rotation of
the polarization filter varies the specular component
as a cosine function, as shown in Figure
3. The specular component can be expressed as
the sum of a specular constant $I_{sc}$ and a specular
varying term that is a cosine function with
amplitude $I_{sv}$:

$$I = I_d + I_{sc} + I_{sv} \cos 2(\theta - \alpha) \qquad (1)$$

where $\theta$ represents the angle of the polarization
filter and $\alpha$ is the phase angle determined by
the projection of the normal of the surface element
onto the plane of the polarization filter.
The exact values of $I_{sc}$ and $I_{sv}$ depend on the
material properties and the angle of incidence.
This dependence is determined by the Fresnel
reflection coefficients $F_\perp(\eta, \psi)$ and $F_\parallel(\eta, \psi)$, which
represent the polarization of the reflected light
waves in the directions perpendicular and parallel
to the plane of incidence, respectively. The
relationship between $I_{sc}$ and $I_{sv}$ and the Fresnel
coefficients is:

$$\frac{I_{sc} + I_{sv}}{I_{sc} - I_{sv}} = \frac{F_\perp(\eta, \psi)}{F_\parallel(\eta, \psi)} = q \qquad (2)$$

The parameter $\eta$ is the complex index of refraction
of the surface medium, and depends on the
properties of the reflecting material. The parameter
$\psi$ is the angle of incidence.³

Figure 3: Image brightness plotted as a function
of polarization filter position.

² Note that this assumption does not hold near the occluding
contour of an object; see [Boult and Wolff, 91]. That paper
addresses the classification of scene edges based on their
polarization characteristics.

Note that the terms $I_d$ and $I_{sc}$ in equation 1
are constant and can be represented by a single
component $I_c = I_d + I_{sc}$ to obtain:
$I = I_c + I_{sv} \cos 2(\theta - \alpha)$. For any given position
$\theta_i$ of the polarization filter we have:

$$I_i = I_c + I_{sv} \cos 2(\theta_i - \alpha). \qquad (3)$$

This can also be represented as the dot product
of two vectors:

$$\mathbf{f}_i = (1, \cos 2\theta_i, \sin 2\theta_i)$$
$$\mathbf{v} = (I_c, I_{sv} \cos 2\alpha, I_{sv} \sin 2\alpha)$$
$$I_i = \mathbf{f}_i \cdot \mathbf{v}. \qquad (4)$$

Let M be the total number of discrete filter positions
used to obtain the image brightness values
$\{I_i \mid i = 1, 2, \ldots, M\}$. If $M = 3$, equation 4 yields
a linear system of equations in the components of
$\mathbf{v}$ that can be solved to
obtain the parameters $I_c$, $I_{sv}$, and $\alpha$. If $M > 3$,
we have an over-determined linear system that
can be solved to obtain more robust estimates of
$I_c$, $I_{sv}$, and $\alpha$ in the presence of image noise.⁴

³ For metals, the two Fresnel coefficients are nearly equal except
close to the grazing angle (when $\psi$ lies between 70 and 90
degrees). Thus, linear polarization based methods are generally
not effective for metals. For dielectrics (non-conductors),
however, the two Fresnel coefficients differ substantially except
for near-normal angles of incidence (when $\psi$ is less than 10-15
degrees).
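The linear solve described above can be sketched with an ordinary least-squares routine. The rows of the design matrix are the vectors of equation 4, and the three unknowns are the components of v, from which the constant term, the amplitude, and the phase follow. The function name and the synthetic measurements are illustrative assumptions; NumPy is used as a convenient solver, not because the paper prescribes it.

```python
import numpy as np

def fit_polarization(theta, brightness):
    """Least-squares fit of I_i = I_c + I_sv * cos(2 * (theta_i - alpha)).

    theta: filter angles in radians (at least three distinct positions).
    brightness: measured image brightness at those filter angles.
    Returns (I_c, I_sv, alpha) with I_sv >= 0 and alpha in (-pi/2, pi/2].
    """
    theta = np.asarray(theta, dtype=float)
    # Row i is f_i = (1, cos 2 theta_i, sin 2 theta_i).
    f = np.column_stack([np.ones_like(theta), np.cos(2 * theta), np.sin(2 * theta)])
    v, *_ = np.linalg.lstsq(f, np.asarray(brightness, dtype=float), rcond=None)
    i_c = v[0]
    i_sv = np.hypot(v[1], v[2])          # v[1] = I_sv cos 2a, v[2] = I_sv sin 2a
    alpha = 0.5 * np.arctan2(v[2], v[1])
    return i_c, i_sv, alpha

# Synthetic, noise-free measurements at four filter positions (assumed values).
theta = np.radians([0.0, 45.0, 90.0, 135.0])
true_params = (100.0, 30.0, np.radians(20.0))
measured = true_params[0] + true_params[1] * np.cos(2 * (theta - true_params[2]))
estimate = fit_polarization(theta, measured)
print(estimate)
```

With more than three filter positions the same call averages out image noise, since the system is solved in the least-squares sense.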

From $I_c$ and $I_{sv}$, we can obtain the maximum
and minimum values of image brightness as:

$$I_{min} = I_c - I_{sv}, \qquad I_{max} = I_c + I_{sv}. \qquad (5)$$


The degree of polarization at a scene point can
be determined as [Born and Wolf, 65]:

$$\rho = \frac{I_{max} - I_{min}}{I_{max} + I_{min}}. \qquad (6)$$


The degree of polarization lies between 0 and
1 and can be used during highlight removal to
classify points into those that are only diffuse
($\rho \ll 1$) and those that include a specular component.
However, this measure must be used
with care as both $I_{min}$ and $I_{max}$ include the constant
specular component $I_{sc}$ as well as the diffuse
component $I_d$; varying either $I_{sc}$ or $I_d$ has
the same effect on $\rho$.
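A short numerical sketch makes the caveat concrete. Following equations 5 and 6, the degree of polarization depends only on the sum of the diffuse and constant specular terms, so two points with very different diffuse content can produce the same value. The function name and the brightness values are hypothetical.

```python
def degree_of_polarization(i_d, i_sc, i_sv):
    """Degree of polarization from equations 5 and 6."""
    i_c = i_d + i_sc            # constant part of the brightness
    i_max = i_c + i_sv
    i_min = i_c - i_sv
    return (i_max - i_min) / (i_max + i_min)

# Same varying amplitude, opposite balance of diffuse and constant specular terms.
mostly_diffuse = degree_of_polarization(i_d=80.0, i_sc=20.0, i_sv=10.0)
mostly_specular = degree_of_polarization(i_d=20.0, i_sc=80.0, i_sv=10.0)
print(mostly_diffuse, mostly_specular)
```

Both calls return the same value, so the measure alone cannot tell how much of the constant brightness is diffuse.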


We conclude this section with a note on previous
work on highlight removal using polarization. A
method for computing $I_d$ and $I_s$ by rotating
the polarization filter is presented in [Wolff and
Boult, 91]. From the above discussion, we know
that $I_d$ and $I_{sc}$ are both constant. They can be
computed from $I_c$ only if we know the ratio of
the Fresnel coefficients, $q = F_\perp(\eta, \psi) / F_\parallel(\eta, \psi)$,
for the corresponding scene point. The Fresnel
coefficients are determined by the material properties
of the scene point as well as the angle of
incidence. Neither of these factors is known.
To constrain the problem, [Wolff and Boult, 91]
use all points (pixels) on a segmented highlight.
They assume that both the diffuse component
$I_d$ and the Fresnel ratio $q$ are constant under
the highlight region, and estimate them using
all points within the highlight region. This
assumption, however, is unrealistic for two reasons.
First, in real scenes, the diffuse component
$I_d$ within the highlight region may vary due to
⁴ This formulation of the problem using vectors saves substantial
computations compared to the non-linear formulation
of the type $a + b \sin^2(\theta - \alpha)$ used in [Boult and Wolff, 91] and
[Wolff, 90], which requires the use of iterative non-linear
estimation techniques.


the  curvature  of  the  surface  or  due  to  texture  on 
the  surface.  Secondly,  the  Fresnel  ratio  q  can¬ 
not  be  assumed  to  be  constant  since  the  angle  of 
incidence  can  vary  substantially  over  large  high¬ 
light  regions  resulting  from  extended  sources  in 
the  scene.  The  latter  of  these  problems  was  dis¬ 
cussed  in  [Wolff,  90]  but  no  solutions  were  pro¬ 
posed. 
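How strongly the Fresnel ratio depends on the angle of incidence can be checked with the standard Fresnel reflectance formulas. The sketch below assumes an air-to-dielectric interface, unpolarized incident light, and a real refractive index of 1.5; these values and the function name are illustrative assumptions, not parameters taken from the paper.

```python
import math

def fresnel_ratio(n, psi):
    """q = F_perp / F_par for an air-to-dielectric interface.

    n: real refractive index of the dielectric (assumed).
    psi: angle of incidence in radians, 0 < psi < pi/2.
    Returns math.inf at the Brewster angle, where F_par vanishes.
    """
    cos_i = math.cos(psi)
    cos_t = math.sqrt(1.0 - (math.sin(psi) / n) ** 2)   # Snell's law
    f_perp = ((cos_i - n * cos_t) / (cos_i + n * cos_t)) ** 2
    f_par = ((cos_t - n * cos_i) / (cos_t + n * cos_i)) ** 2
    return f_perp / f_par if f_par > 0.0 else math.inf

for deg in (10, 30, 45):
    print(deg, round(fresnel_ratio(1.5, math.radians(deg)), 2))
```

Near normal incidence the ratio stays close to 1, and it grows rapidly as the angle approaches the Brewster angle, which is consistent with the remark that q cannot be treated as constant over a large highlight.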

4  Color 

We  now  discuss  the  role  of  color  in  the  removal  of 
specularities.  We  present  an  overview  of  the  rep¬ 
resentation  of  diffuse  and  specular  components 
in  color  space  and  discuss  previous  work  on  the 
removal  of  highlights  using  color.  In  contrast  to 
gray-level images, color images represent the wavelength
($\lambda$) dependence of the light reflected by
a scene. Let $x(\lambda)$ be the spectral distribution of
the light reflected by a scene point, and $s(\lambda)$ represent
the response of the sensor to wavelength.

Typically, color images are obtained by using
three filters with responses $r(\lambda)$, $g(\lambda)$, and $b(\lambda)$
that have peaks close to the wavelengths that
humans perceive as “red,” “green,” and “blue.”
The resulting three brightness values measured
by a sensor element constitute the color vector
$\mathbf{I} = [I^r, I^g, I^b]$ for the corresponding point in
the scene. The three brightness values in the color
vector are related to the spectral distribution of
the reflected light as:

$$I^r = \int x(\lambda)\, r(\lambda)\, s(\lambda)\, d\lambda$$
$$I^g = \int x(\lambda)\, g(\lambda)\, s(\lambda)\, d\lambda \qquad (7)$$
$$I^b = \int x(\lambda)\, b(\lambda)\, s(\lambda)\, d\lambda$$

Each brightness value includes a diffuse component
and a specular component. Hence, in three-dimensional
color space we have the following
decomposition: $\mathbf{I} = \mathbf{I}_d + \mathbf{I}_s$.
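Equation 7 and the additive decomposition can be checked numerically. The sketch below approximates the three integrals by a sum over sampled wavelengths. The Gaussian filter shapes, the flat sensor response, and the two spectra are invented for illustration and do not describe any real camera.

```python
import numpy as np

wavelengths = np.linspace(400.0, 700.0, 301)     # nm, 1 nm spacing
d_lambda = wavelengths[1] - wavelengths[0]

def gaussian(center, width):
    """Bell-shaped curve over the sampled wavelengths (assumed filter shape)."""
    return np.exp(-0.5 * ((wavelengths - center) / width) ** 2)

# Hypothetical filter responses r, g, b and a flat sensor response s.
r, g, b = gaussian(600, 30), gaussian(540, 30), gaussian(460, 30)
s = np.ones_like(wavelengths)

def color_vector(x):
    """Approximate equation 7 by a Riemann sum for each of the three filters."""
    return np.array([np.sum(x * f * s) * d_lambda for f in (r, g, b)])

x_diffuse = gaussian(620, 40)                    # assumed reddish body reflection
x_specular = 0.3 * np.ones_like(wavelengths)     # assumed white surface reflection
color_total = color_vector(x_diffuse + x_specular)
color_diffuse = color_vector(x_diffuse)
color_specular = color_vector(x_specular)
print(color_total, color_diffuse + color_specular)
```

Because the integrals are linear in the spectrum, the color vector of the summed spectrum equals the sum of the two color vectors, which is the decomposition stated above.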

The  dichromatic  reflectance  model  [Shafer,  85] 
suggests  that,  for  dielectrics,  the  spectral  distri¬ 
bution  of  the  diffuse  component  is  determined  by 
the  colorant  in  the  surface  whereas  the  specular 
component  preserves  the  spectral  distribution  of 
the incident light. As a result, the two vectors
$\mathbf{I}_d$ and $\mathbf{I}_s$ generally have different directions in
color space. The two vectors will however have
the same direction if, for instance, a gray object
is illuminated by white light.


Figure 4: Diffuse and specular components in
color  space. 

As discussed in the introduction, [Klinker,
88] and [Gershon, 87] independently used the
dichromatic reflectance model to remove highlights
from images by identifying the two limbs of the skewed
T in color space.⁵ These methods are based on
two main assumptions: (a) the object is segmented
away from the scene and has a single
uniform  diffuse  color  and  (b)  the  diffuse  compo¬ 
nent  is  near  constant  within  the  highlight  region. 
These  two  assumptions  must  hold  for  a  skewed 
T  to  be  formed  in  color  space.  The  first  assump¬ 
tion  is  often  violated  in  real  scenes  where  objects 
may  be  textured  or  have  patches  with  different 
reflectance  properties.  The  second  assumption 
can  be  valid  only  if  the  surface  is  very  smooth, 
thus  producing  a  very  compact  highlight.  Even 
for  marginally  rough  surfaces,  the  highlight  is 
expected  to  include  object  points  with  a  range 
of  surface  orientations.  In  such  cases,  the  spec¬ 
ular  limb  of  the  skewed  T  spreads  into  a  wide 
cluster  in  color  space  that  is  difficult  to  separate 
from the diffuse limb. This last observation has
also been made by Novak and Shafer [Novak and
Shafer, 92].


⁵ Along the same lines, [Bajcsy, et. al., 90] observed that
the color histogram can include additional limbs that correspond
to highlights caused by interreflections. The highlight
removal algorithms of [Klinker, 88] and [Gershon, 87] could,
in theory, be modified to identify additional limbs and remove
specularities due to interreflections.


5  Removal  of  Specularities  using 
Color  and  Polarization. 

In  this  section,  we  develop  a  method  for  remov¬ 
ing  specular  reflections  from  images.  We  make 
the  following  assumptions. 

(A)  The  dichromatic  reflectance  model  applies  at 
each  point.  Hence  the  scene  consists  of  di¬ 
electric  objects,  and  therefore  specular  re¬ 
flections  and  interreflections  are  polarized 
while  the  diffuse  reflections  are  not.  Fur¬ 
ther,  the  color  of  the  incident  light  at  each 
scene  point  is  different  from  the  color  of  the 
material. 

(B) The specular interreflections result from either
the diffuse-specular mechanism or the
specular-specular mechanism and not both.
In the first case, the incident light is unpolarized
and hence the Fresnel ratio is simply
$q = F_\perp(\eta, \psi) / F_\parallel(\eta, \psi)$. In the second
case, the incident light is partially polarized
and the effective Fresnel ratio for the surface
point is $q = a F_\perp(\eta, \psi) / b F_\parallel(\eta, \psi)$. The
parameters $a$ and $b$ account for the partial
polarization of the incident light.

(C)  Fresnel  coefficients  F^.{T],1p)  and  F||(r/,  0) 
are  independent  of  the  wavelength  of  inci¬ 
dent  light.  This  assumption  is  reasonable 
(see  [Driscoll,  78])  since  we  are  assuming  di¬ 
electrics  and  operating  in  the  visible-light 
spectrum.  Assumptions  (B)  and  (C)  result 
in  the  Fresnel  ratio  q  being  equal  for  all  three 
color  bands. 

The color of the specular component is computed
locally at each pixel. This places strong constraints
on the diffuse component $I_d$. Neighboring
diffuse points that satisfy these constraints
are then used to compute the diffuse component.
This approach has the following advantages over
all of the previous methods for highlight removal:

•  In contrast to the previous methods based
either on color or polarization, we do not assume
that the diffuse component $I_d$ is constant
within each highlight region. In fact,
the surfaces could be textured with patches
of different materials underlying the highlights.

•  The specular color of each point is computed
independently. This local approach
does not require prior segmentation of either
highlights or objects in the scene. Further,
the highlights need not be compact;
the method can handle substantial surface
roughness conditions.

We describe the technique for specularity removal
by focusing on a single image point x.
The same procedure is applied independently to
all image points. The color vector for the image
point is $I = I_d + I_s$. Given the above assumptions,
the Fresnel ratio $q$ is the same for all three
color bands. Hence the cosine term in equation 4
will be in phase for the three color bands, and for the
polarization filter position $\theta_i$ we have the color
vector $I_i = I_c + I_{sv} \cos 2(\theta_i - \alpha)$. In our
experiments, we have used 6 or more polarizer
positions. This gives us an over-determined linear
system of equations that is solved to obtain
robust estimates of $I_c$, $I_{sv}$, and $\alpha$. The color vectors
corresponding to maximum and minimum
polarization (see equation 5) are:

$$I_{max} = I_c + I_{sv}$$
$$I_{min} = I_c - I_{sv}$$
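The over-determined fit described above can be sketched as follows. This is a minimal illustration for a single color band, not the authors' implementation: it assumes the model $I = I_c + I_{sv}\cos 2(\theta - \alpha)$ is rewritten as $I_c + a\cos 2\theta + b\sin 2\theta$, which is linear in $(I_c, a, b)$, and solves the 3x3 normal equations directly. The function name `fit_polarization` is hypothetical.

```python
import math

def fit_polarization(thetas, intensities):
    """Least-squares fit of I = Ic + Isv*cos(2*(theta - alpha)) for one band.

    The model is rewritten as Ic + a*cos(2*theta) + b*sin(2*theta), which is
    linear in (Ic, a, b); the normal equations are solved by Cramer's rule.
    Returns (Ic, Isv, alpha, Imax, Imin).
    """
    rows = [(1.0, math.cos(2 * t), math.sin(2 * t)) for t in thetas]
    # Normal equations M x = v, with M = R^T R and v = R^T I.
    M = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    v = [sum(r[i] * y for r, y in zip(rows, intensities)) for i in range(3)]

    def det3(m):
        return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
                - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
                + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

    d = det3(M)
    solution = []
    for col in range(3):
        # Replace one column of M by v (Cramer's rule).
        Mc = [[v[r] if c == col else M[r][c] for c in range(3)] for r in range(3)]
        solution.append(det3(Mc) / d)
    ic, a, b = solution
    isv = math.hypot(a, b)            # amplitude of the varying term
    alpha = 0.5 * math.atan2(b, a)    # phase of the cosine
    return ic, isv, alpha, ic + isv, ic - isv
```

Applying the function to each of the three bands gives the vectors $I_{max}$ and $I_{min}$ used in the rest of the section. At least three distinct polarizer angles are needed for the system to be solvable.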

Two tests are used to determine if the image
point x is to be processed any further. First, a
degree of polarization $\rho$ (expression 6) is computed
for each of the three color bands. If the
largest of the three $\rho$ estimates is less than a threshold
value $T_1$, the point is not sufficiently polarized
and is assumed to be purely diffuse. In this
case, the next image point is examined.

If the degree of polarization of the point x is
greater than $T_1$, the angle $\beta$ subtended by the
vector $k = I_{max} - I_{min}$ from the origin O is computed
(see Figure 5). If $\beta$ is less than a threshold
$T_2$, the color of the specular component is very
similar to that of the diffuse component $I_d$, and
the dichromatic model cannot be used with confidence.
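The two tests can be sketched as below. Two assumptions are made here and should be checked against the full paper: the degree of polarization is taken in its standard form $(I_{max} - I_{min})/(I_{max} + I_{min})$ (expression 6 is not reproduced in this section), and $\beta$ is read as the angle at the origin between the vectors $I_{min}$ and $I_{max}$. The function names and return labels are hypothetical.

```python
import math

def degree_of_polarization(i_max, i_min):
    # Assumed standard definition; expression 6 is not shown in this section.
    total = i_max + i_min
    return (i_max - i_min) / total if total > 0 else 0.0

def angle_between(u, v):
    """Angle (radians) between two colour vectors, seen from the origin."""
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    if nu == 0.0 or nv == 0.0:
        return 0.0
    c = sum(a * b for a, b in zip(u, v)) / (nu * nv)
    return math.acos(max(-1.0, min(1.0, c)))

def classify_point(I_max, I_min, T1, T2):
    """Return 'diffuse', 'ambiguous' or 'process' for one image point."""
    rho = max(degree_of_polarization(a, b) for a, b in zip(I_max, I_min))
    if rho < T1:
        return "diffuse"      # not sufficiently polarized
    beta = angle_between(I_max, I_min)
    if beta < T2:
        return "ambiguous"    # specular and diffuse colours too similar
    return "process"
```

Only points labeled `"process"` go on to the diffuse-component computation described next.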

On the other hand, if the point x is polarized and
its $\beta$ value is not small, we proceed to compute
its diffuse component $I_d$. If we can determine $I_{sc}$,
then the diffuse component can be computed as
$I_d = I_c - I_{sc}$. Once this is done, the specular
component $I_s$ can be separated from any one of
the images $I_i$. We can also obtain an approximation
to the specular image we would see without
a polarizer by subtracting $I_d$ from $\frac{1}{2}(I_{max} + I_{min})$.


Recall that the specular components $I_{sc}$ and $I_{sv}$
satisfy $(I_{sc} + I_{sv})/(I_{sc} - I_{sv}) = q$. Unfortunately, the
Fresnel ratio $q$ is not known as it depends
on the material properties and the angle of incidence.
Though we have estimates of $I_c$ and
$I_{sv}$, we do not have a simple way of determining
$I_d$. The color measurements $I_i$ obtained by
rotating the polarization filter lie on a straight
line L (see Figure 5) in color space. The diffuse
component $I_d$ is unaffected by rotations of the
polarizer; only the specular component $I_s$ varies.
The specular component varies along a straight
line since the cosine functions in the three color
bands are in phase (assumptions B and C).


Figure 5: Using neighboring points and the specular
line constraint to compute the diffuse component $I_d$.

Though we are unable to compute the diffuse
component locally, the specular line gives useful
constraints on the diffuse component. These
constraints are applied to the neighboring pixels
of x to obtain the diffuse component $I_d$.
Assume that the diffuse component of x corresponds
to the point P. The position of P on the
specular line L can be parametrized as follows:
$P = I_{min} - p\,k$, where $p$ ($0 < p < \tilde{p}$) is the distance
of P from $I_{min}$ as shown in Figure 5, and $\tilde{p}$
is defined as:

$$\tilde{p} = \min\left(\frac{I_{min}^r}{k^r},\ \frac{I_{min}^g}{k^g},\ \frac{I_{min}^b}{k^b}\right) \qquad (8)$$

$\tilde{p}$ determines the point $\tilde{P}$ that is the intersection
of the specular line L with one of the three
planes of the color space. In the example shown
in Figure 5, the specular line intersects the
R-G plane. In general, however, L could intersect
any one of the three planes that constitute
the color space. Define $\mathcal{P}_1$ to be the plane that
passes through the points (O, $I_{min}$, $I_{max}$). The
expression for $\mathcal{P}_1$ is $A\,r + B\,g + C\,b = 0$,
where:

$$A = k^g I_{min}^b - k^b I_{min}^g$$
$$B = k^b I_{min}^r - k^r I_{min}^b \qquad (9)$$
$$C = k^r I_{min}^g - k^g I_{min}^r$$
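Equations 8 and 9 can be sketched as two small helpers. This is an illustrative reading, not the authors' code: the function names are hypothetical, and bands with a non-positive component of $k$ are skipped in the minimum on the assumption that the line never reaches the corresponding color-space plane in that case.

```python
def line_parameter_bound(I_min, k):
    """Upper bound on p (equation 8): the first p at which
    P = I_min - p*k reaches one of the planes r=0, g=0 or b=0.
    Bands with k <= 0 are skipped (assumption, see lead-in)."""
    return min((m / kc for m, kc in zip(I_min, k) if kc > 0),
               default=float("inf"))

def plane_through_origin(I_min, k):
    """Coefficients (A, B, C) of the plane A*r + B*g + C*b = 0 through
    the origin, I_min and I_max = I_min + k (equation 9): the cross
    product k x I_min."""
    kr, kg, kb = k
    mr, mg, mb = I_min
    return (kg * mb - kb * mg, kb * mr - kr * mb, kr * mg - kg * mr)
```

For example, with $I_{min} = (60, 40, 20)$ and $I_{max} = (100, 100, 100)$ the blue band gives the smallest ratio, so the specular line meets the R-G plane first, as in the case drawn in Figure 5.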


Since we do not have sufficient constraints to
compute the diffuse component $I_d$ of the point x
from the color measurements $I_i$, we use neighboring
image points that satisfy the following conditions:

(1) A neighboring image point y can be used if
we know its diffuse component Q. This
occurs if y has a low degree of polarization $\rho$
and hence can be assumed to be purely diffuse,
or if its diffuse component has already
been computed.

(2) In color space, the vector Q must lie on the
plane $\mathcal{P}_1$. Further, it must lie between the
vectors $I_{min}$ and $\tilde{P}$, since the diffuse vector
$I_d$ of the point x lies on the line L between
the points $I_{min}$ and $\tilde{P}$. Note, however, that
Q can lie inside or outside the triangle (O,
$I_{min}$, $\tilde{P}$).

If these conditions are satisfied, the neighboring
point y is assumed to have the same diffuse color
as the point x. Then, the line passing through
the origin and Q intersects the specular line L at P, an
estimate of the diffuse component of x.


Due to noise in the color and polarization measurements,
the diffuse component Q of a neighboring
point is not expected to exactly satisfy
the above conditions. To accommodate such
discrepancies, we compute the angle $\gamma$ subtended
by Q with respect to $\mathcal{P}_1$ (see Figure 5):

$$\sin\gamma = \frac{A Q^r + B Q^g + C Q^b}{\sqrt{A^2 + B^2 + C^2}\ \|Q\|} \qquad (10)$$


If $\gamma$ is larger than a threshold value $T_3$, Q is not
used any further (see footnote). If $\gamma$ is small ($\gamma < T_3$), the
point P is computed by extending the vector S
to intersect the line L as shown in Figure 5. This
can be done without computing the projection S
of Q on the plane $\mathcal{P}_1$. Consider the plane $\mathcal{P}_2$ that
passes through the points (O, Q, P). It can be
expressed as $D\,r + E\,g + F\,b = 0$, where:

$$D = Q^g P^b - Q^b P^g$$
$$E = Q^b P^r - Q^r P^b \qquad (11)$$
$$F = Q^r P^g - Q^g P^r$$

Footnote: Note that we have used the angle $\gamma$ rather than the distance
of Q from $\mathcal{P}_1$. This is because a neighboring point may

Since the planes $\mathcal{P}_1$ and $\mathcal{P}_2$ are perpendicular,
we have $AD + BE + CF = 0$, which can be
expanded using equations 9 and 11. By substituting
the expression for P given by equation 8
in the expansion, a solution for the line parameter
$p$ is directly obtained:

$$p = \frac{A\,(Q^g I_{min}^b - Q^b I_{min}^g) + B\,(Q^b I_{min}^r - Q^r I_{min}^b) + C\,(Q^r I_{min}^g - Q^g I_{min}^r)}{A\,(Q^g k^b - Q^b k^g) + B\,(Q^b k^r - Q^r k^b) + C\,(Q^r k^g - Q^g k^r)} \qquad (12)$$
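Equations 10 and 12 amount to a few dot and cross products, as the sketch below shows. It is an illustration under assumptions rather than the authors' code: `estimate_p` is a hypothetical name, the threshold $T_3$ is taken to be an angle in radians, and a degenerate denominator (Q's projection parallel to the specular line) is treated as "no estimate".

```python
import math

def cross(u, v):
    return (u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0])

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def estimate_p(Q, I_min, I_max, T3):
    """Line parameter p for one neighbouring diffuse colour Q, or None.

    n = k x I_min is the normal (A, B, C) of plane P1 (equation 9).
    Q is rejected if its angle to P1 exceeds T3 (equation 10);
    otherwise p = n.(Q x I_min) / n.(Q x k) (equation 12).
    """
    k = tuple(a - b for a, b in zip(I_max, I_min))
    n = cross(k, I_min)
    norm_n, norm_q = math.sqrt(dot(n, n)), math.sqrt(dot(Q, Q))
    if norm_n == 0.0 or norm_q == 0.0:
        return None
    sin_gamma = max(-1.0, min(1.0, dot(n, Q) / (norm_n * norm_q)))
    if abs(math.asin(sin_gamma)) > T3:
        return None
    denominator = dot(n, cross(Q, k))
    if abs(denominator) < 1e-12:
        return None
    return dot(n, cross(Q, I_min)) / denominator
```

A neighbor whose color is a scaled copy of the true diffuse color $I_{min} - p\,k$ lies exactly in the plane and returns that same $p$, whatever its brightness.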


The above process is repeated for all neighboring
diffuse components $Q_j$ that satisfy conditions
(1) and (2). The result is a set of estimates
$\{p_j \mid j = 1, 2, \ldots, N\}$. If $N < T_4$ (we use $T_4 = 3$
in our implementation) there are not enough
neighboring diffuse components to compute a robust
estimate of $I_d$ for the point x. If $N > T_4$,
the mean and standard deviation of $p_j$ are computed
as:

$$\bar{p} = \frac{\sum_j w_j\, p_j}{\sum_j w_j}, \qquad \sigma_p = \sqrt{\frac{\sum_j (\bar{p} - p_j)^2}{N - 1}} \qquad (13)$$

where the weight $w_j$ given to each $p_j$ equals the
magnitude of the corresponding diffuse component,
$\|Q_j\|$. The mean value $\bar{p}$ is accepted if
the standard deviation $\sigma_p$ is less than a threshold
$T_5$, i.e. the estimates $p_j$ form a compact cluster


have a very small diffuse component that lies close to the origin
O and as a result is also close to plane $\mathcal{P}_1$. Such a point
could have relatively large errors due to image noise and must
not be used in the computation of $I_d$.

Color Figure 3: Recovery algorithm applied to a very complex scene. (a) Original ($I_{avg}$) image. (b)
Recovered diffuse image $I_d$. (c) Recovered specular component $I_s$. (d)-(f) show a close-up of the
blue torus from the lower left corner of the scene.


on the line L. This constraint is used to ensure
that different diffuse colors in the neighborhood
of x that happen to lie close to the plane $\mathcal{P}_1$
are not used together to obtain an erroneous estimate
of $I_d$. Once $\bar{p}$ has been determined, the
diffuse component $I_d = P$ of the image point x
is obtained using equation 8.
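The acceptance step can be sketched as follows. This is a hedged illustration: the function names are hypothetical, the spread is computed unweighted about the weighted mean (one reading of equation 13), a neighborhood with exactly $T_4$ estimates is treated as sufficient, and no value of $T_5$ is assumed since none is given in this section.

```python
import math

def combine_estimates(ps, weights, T4, T5):
    """Weighted mean of the p_j if they form a compact cluster, else None.

    ps      : line-parameter estimates from the neighbours
    weights : ||Q_j|| for each neighbour
    T4      : minimum number of estimates (N == T4 is accepted here,
              which is an assumption)
    T5      : maximum allowed standard deviation
    """
    n = len(ps)
    if n < T4 or n < 2:
        return None
    mean = sum(w * p for w, p in zip(weights, ps)) / sum(weights)
    sigma = math.sqrt(sum((mean - p) ** 2 for p in ps) / (n - 1))
    return mean if sigma < T5 else None

def diffuse_component(I_min, I_max, p_mean):
    """I_d = I_min - p*k with k = I_max - I_min."""
    return tuple(m - p_mean * (x - m) for m, x in zip(I_min, I_max))
```

Weighting by $\|Q_j\|$ gives bright neighbors, whose colors are less affected by noise, more influence on the mean.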

Figure 6 compares the above algorithm with previous
techniques based on either polarization or
color. The comparison is done using the 40×40
image window shown in Color Figure 1(d). The
image window includes a highlight that spreads
over regions of different diffuse color. Figure
6(a) shows the histogram in (R,G,B) color space
for the image window. Note that the anatomy
of the histogram is very complex and does not
lend itself to the skewed T analysis proposed
in [Klinker, 88]. Figure 6(b) shows the cluster
obtained in the ($I_{min}$, $I_{max}$) space for the green
band of the image window (see footnote). The polarization-based
method for highlight removal proposed in
[Wolff and Boult, 91] is based on the assumption
that this cluster in ($I_{min}$, $I_{max}$) space is linear.
As seen in Figure 6(b), the cluster does not
form a straight line and hence the polarization-based
method is not applicable. Finally, Figure
6(c) shows the result obtained using the algorithm
proposed in this paper. The figure shows
$I_{min}$, $I_{max}$, and the specular line computed for
the center pixel of the window shown in Color
Figure 1(d). The color space constraints are used
to select neighboring diffuse colors and compute
the diffuse component of the pixel.

The  algorithm  proposed  in  this  section  is  applied 
to  all  points  in  the  image.  Not  all  image  points 
end  up  with  an  Id  estimate.  An  image  point 
may  lie  in  the  middle  of  a  very  large  highlight,  in 
which  case,  it  may  not  have  a  sufficient  number 
of  neighbors  with  diffuse  colors  that  satisfy  con¬ 
ditions  (1)  and  (2)  or  produce  a  compact  clus¬ 
ter  of  intersection  points  on  the  line  L.  Hence, 
we  apply  the  algorithm  repeatedly  to  the  image 
points.  This  iterative  approach  is  effective  in  the 
case  of  complex  scenes;  each  iteration  provides  a 
new  set  of  computed  diffuse  colors  thus  increas¬ 
ing  the  likelihood  of  finding  neighboring  diffuse 
colors in the next iteration. A large highlight, for
instance, shrinks in size with each iteration.

Footnote: The green band was selected as the average degree of
polarization for the image window is maximum in this band.

The iterations are discontinued when no new diffuse
estimates are obtained.
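The outer loop can be sketched as below. This is only a schematic of the stopping rule described above, under assumptions: `estimate` is a hypothetical callback standing in for the per-pixel computation of section 5, and new estimates are committed in a batch at the end of each pass, which the text does not specify.

```python
def iterate_until_stable(pixels, known, estimate, max_iters=1000):
    """Repeat the per-pixel estimation until a pass adds nothing new.

    pixels   : iterable of pixel identifiers
    known    : dict pixel -> diffuse colour (pure-diffuse points at start);
               updated in place
    estimate : callback estimate(pixel, known) -> colour or None
    Returns the number of passes that produced new estimates.
    """
    pixels = list(pixels)
    for iteration in range(max_iters):
        new = {}
        for px in pixels:
            if px not in known:
                value = estimate(px, known)
                if value is not None:
                    new[px] = value
        if not new:
            return iteration          # no new diffuse estimates: stop
        known.update(new)             # batch update (assumption)
    return max_iters
```

With a toy estimator that copies the value of any already-known neighbor, a run of unknown pixels between two known ones is filled in from both ends, one layer per pass, which mirrors the shrinking highlight described in the text.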

6  Experimentation 

This section presents experimental results obtained
using the proposed algorithm. We
begin with a very short description of the experimental
setup and a few details of the implementation
of the algorithm. This is followed
by results obtained by applying the algorithm
to scenes with textured objects, primary
(source) specularities, and secondary (interreflection)
specularities.

The experimental setup is important as we need both
registration between color bands and linearization
of the camera response. Details of our
setup can be found in [Nayar et al., 1993]. That
report also contains more implementation details
and experimental examples.

It is worth noting that the current setup has
some minor problems that might be overcome using
precision filter mounts and simple correction
methods. The first is that the movement of the
polarizing filter can cause slight shifts of the image
(approximately 1 pixel). The second is that
we have not corrected for chromatic aberration
of the lens. Finally, we are using CCD technology,
which is prone to blooming effects near
strong highlight regions. All of these problems
manifest themselves as errors in polarization fitting
(and hence specularity removal) in small (1-2 pixel)
neighborhoods of scene boundaries.

6.1  Implementation  Details 

For each scene, a set of color images is obtained
by rotating the polarization filter. The
images are first corrected using the calibration
data. Then, the polarization parameters $I_{min}$,
$I_{max}$, and $\alpha$ are computed for each color channel.
These parameters are computed using the
linear least squares (LS) fitting method. The results
of the polarization fitting are 6 images for
each color channel: $I_{max}$, $I_{min}$, $I_{avg}$ ($= \frac{1}{2}(I_{max} + I_{min})$),
$\rho$ (percent polarization), phase (the angle $\alpha$),
and RMSE (root mean square error in fitting).
Of these, only the $I_{max}$ and $I_{min}$ images are directly
used by the specularity removal algorithm.
The others are used only to debug the algorithm
and analyze the results. The $I_{avg}$ image is what
would be obtained without a polarizer but with
a 50% neutral density filter instead. The RMSE
gives the error in fitting the image data to equation 1,
and most pixels in the image have fitting
errors that are less than 0.3 gray levels (see footnote).

The algorithm requires labeling of points as
purely diffuse or partially specular, which is done
by using the degree of polarization $\rho$. Since
$\rho$ depends on $I_{min}$, the noise level in $\rho$ varies
with $I_{min}$. Rather than using a fixed threshold
$T_1$ on $\rho$ to identify partially specular points,
our implementation uses a threshold that varies
with $I_{min}$. $T_1$ varies from around 5% for points
with $I_{min} > 200$ to around 10% for points with
$I_{min} < 20$.

The algorithm has three other thresholds that affect
its performance: the threshold $T_2$ for the angle
$\beta$, the threshold $T_3$ for the angle $\gamma$, and $T_5$ for
the standard deviation $\sigma_p$. The algorithm is not
too sensitive to $T_2$ and $T_5$, which have been set
at 0.08 and $\arccos(0.99)$, respectively. The angle
threshold $T_3$, which determines if points lie close
to the plane $\mathcal{P}_1$, has a strong effect on the quality
of computed results as well as the computation
time. The current implementation starts with
a relatively small threshold value ($T_3 = 0.02$), and
doubles it after every 10 iterations.
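The $T_3$ schedule described above is simple enough to write down directly. The sketch below assumes iterations are counted from zero, so that the first ten passes use the starting value; the function name is hypothetical.

```python
def t3_schedule(iteration, start=0.02, period=10):
    """Angle threshold T3 for a given pass (0-based): starts at `start`
    and doubles after every `period` iterations."""
    return start * 2 ** (iteration // period)
```

Starting tight and relaxing later lets the most reliable neighbors fix diffuse colors first; looser matches are only admitted once the strict ones are exhausted.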

6.2  Experimental  Results 

In each of the following examples, the images
obtained after fitting the polarization parameters
are used to compute the diffuse color image
$I_d$. This image is then subtracted from the average
color image $I_{avg}$ to obtain the specular color
image $I_s$. $I_{avg}$ is the color image we would obtain
if the polarization filter were not used. The
first example is shown in Color Figure 1(a). It is
the $I_{avg}$ image of a mug with a flowered pattern
on it. The petals of the flower are of different
colors and within each petal there is a moderate
amount of diffuse color variation. Along the
middle of the mug is a large highlight. Color
Figure 1(c) shows the $I_{min}$ image obtained after
the polarization fitting process. This image represents
the best image (i.e. minimum specular
component) obtainable by simply rotating the
polarization filter. Color Figure 1(b) shows the
diffuse image $I_d$ computed using the proposed algorithm.
Color Figures 1(d)-1(g) show close-up
views of the $I_{avg}$, diffuse image, specular image,
and $I_{min}$, respectively, for a 40×40 image window
on the flower cup. Clearly, the algorithm
was successful in separating the two reflection
components despite the texture underlying the
highlight region.

Footnote: If the data is not consistent with the cosine model, one or
more of the assumptions made by the algorithm are violated.
Hence, pixels whose RMSE value is greater than 6 gray levels
are marked as outliers and are not used for specularity removal.
For reasons mentioned above, fitting near scene edges is generally
unreliable and hence errors in the specularity removal
occur mainly within 2-3 pixels around an edge.

The second example includes strong interreflection
effects. Color Figure 2(a) shows the original
image ($I_{avg}$) of a scene including a blue
plastic plate and a part of the Macbeth color
chart. Color patches on the chart are reflected
by the plate. There are pieces of plastic tape
(some dark reddish and others black) stuck on
the plate. Also visible is a film canister, which
interreflects portions of the color chart as well as
the surrounding environment. Color Figure 2(b)
and Color Figure 2(c) show the diffuse and specular
components computed by the algorithm.
We see that, despite the strong interreflections,
these images are quite accurate. We do, however,
see that the primary highlight on the left
side of the plate has not been completely removed.
This may have been caused by very high brightness
values in the highlight region for which the
sensor calibration is not reliable.

The  final  example  is  shown  in  Color  Figure  3. 
Color  Figure  3(a)  shows  the  lavg  image  of  a  com¬ 
plex  scene  for  which  the  algorithm  partially  fails 
in  some  regions.  Color  Figure  3(b)  and  3(c)  show 
the  diffuse  and  specular  components  computed 
by  the  algorithm.  The  primary  highlights  on 
the  blue  and  red  tori  are  accurately  removed. 
Color  Figure  3(d)  -  3(f)  show  details  of  the  re¬ 
sults  obtained  for  the  blue  torus.  The  diffuse 
and  specular  components  are  well  separated  in 
the  highlight  region.  Errors  in  the  separation 
are  however  seen  on  the  occluding  boundary  of 
the  torus.  This  results  from  the  strong  polariza¬ 
tion  of  the  diffuse  component  on  the  occluding 
boundary;  the  assumption  that  the  diffuse  com¬ 
ponent  is  unpolarized  is  violated. 

Some of the interreflections on the left wall of
the red box, such as the interreflection of the
white cup in the upper left corner of the scene,
are also removed. However, the results are not as
good for the interreflections of the blue torus and
the marker pen, for two reasons.
First, the diffuse red color in these
interreflection areas is very bright, resulting in
a low (3-4%) degree of polarization. Hence,
many of the pixels are incorrectly labeled as “diffuse.”
Second, because of the geometry of the
scene, the walls of the box interreflect between
themselves, producing specular reflections that
have the same color (red) as the diffuse component.
As a result, the specular and diffuse colors
could not be separated even when the algorithm
thresholds were lowered to force a computation.
Hence, it is not that the algorithm has removed
too much red in the interreflections of the blue
torus and the pen, but rather it was unable to
remove the red specular reflection from the red
wall. This is a manifestation of the dichromatic
reflectance assumption made by the algorithm.

Figure 6: (a) The color histogram for a 40×40 window of the cup image shown in Color Figure 1(d).
The anatomy of the histogram is too complex to identify the skewed T used in [Klinker, 88]. (b)
Cluster in ($I_{min}$, $I_{max}$) space used by the polarization-based method proposed in [Wolff and Boult, 91].
This cluster does not form a straight line, an assumption that the previous method is based on. (c)
Separation of diffuse and specular components of the center pixel of the window using the proposed
method based on color and polarization.

Abstract

This  research  is  intended  to  enable  recognition 
in  real  operational  scenes  with  multi-sensor  im¬ 
ages  by  implementing  generic  image  elements 
determined  from  the  physics  of  image  forma¬ 
tion.  Segmentation  aims  at  a  piecewise  com¬ 
posite  description  of  the  image  surface,  i.e. 
smooth  surfaces  bounded  by  discontinuities. 

A  major  failing  is  description  of  extended 
boundaries  of  the  image  surface.  Estimation  of 
extended  edges  is  limited  by  accuracy  of  local 
edge  estimation  (edgel)  in  angle  and  transverse 
position.  Angular  accuracy  of  usual  edgel  esti¬ 
mation  is  so  bad  that  it  is  usually  ignored.  A 
new  operator  with  improved  accuracy  has  led 
to  preliminary  extended  edges  that  are  greatly 
improved. 

To incorporate image features in Bayesian inference,
a statistical performance model is essential:
it provides the conditional probabilities
needed for evidential accrual. Theoretical models
have been derived; they are verified by truth
from  synthesized  imagery.  For  a  previous  op¬ 
erator,  results  were  verified  from  truth  in  one 
image.  Preliminary  results  have  been  obtained 
for  other  common  edge  estimation  operators. 

1.  Introduction 

This  research  is  intended  to  enable  recognition  in  real 
operational  scenes  with  multi-sensor  images  by  imple¬ 
menting  generic  image  elements  determined  from  the 
physics  of  image  formation.  Segmentation  is  common 
across  many  types  of  images  in  the  sense  that  image  for¬ 
mation  involves  only  a  few  physical  mechanisms  from 
visible  to  IR;  surface  reflection,  body  reflection,  emis¬ 
sion.  The  role  of  segmentation  is  to  describe  the  image 
surface  compactly  as  a  piecewise  smooth  composite  sur¬ 
face,  i.e.  smooth  surfaces  bounded  by  discontinuities. 

Footnote: This research was supported in part by a contract
from the Air Force, F30602-92-C-0105 through RADC from
DARPA SISTO, “Model-based Recognition of Objects in
Complex Scenes: Spatial Organization and Hypothesis
Generation”.


With  range  images,  a  piecewise  image  surface  descrip¬ 
tion  is  directly  a  piecewise  description  of  the  visible  ob¬ 
ject  surface.  With  intensity  images,  a  piecewise  image 
description  is  a  step  toward  inference  of  a  piecewise  ob¬ 
ject  surface  description. 

Image  surface  segmentation  generates  piecewise 
smooth  surfaces  bounded  by  observable  intensity  discon¬ 
tinuities  that  result  from  discontinuities  of  surface  geom¬ 
etry,  of  illumination,  or  reflectance.  The  intensity  surface 
may  have  discontinuities  at  a  point  or  along  a  curve.  Dis¬ 
continuities  along  curves  are  edges  of  the  image  intensity 
surface.  Edges  of  the  image  surface  are  typically  found 
by  a  local-to-global  process  of  local  estimation  of  discon¬ 
tinuities  followed  by  hierarchical  linking  into  extended 
boundaries. Complexity of linking is exponential in the error
of edgel orientation and transverse position. Errors
of stereo depth estimates or dimensional measurements
made from image extended curves are linear in errors of
edgel orientation and position.

Little  progress  has  been  made  on  linking  of  extended 
boundaries.  Although  a  display  of  edgels  can  be  inter¬ 
preted  by  a  human  observer,  the  set  of  edgels  are  un¬ 
connected  and  are  not  extended  edges;  only  the  human 
perceives  extended  edges  that  enable  interpretation.  In 
this  paper,  a  new  Wang-Binford  edgel  estimator  is  dis¬ 
cussed  which  has  led  to  effective  edgel  linking  with  a 
preliminary  algorithm. 

Almost  all  work  on  local  segmentation  of  image  dis¬ 
continuities  has  been  intended  for  step  discontinuities 
between  level,  planar  image  surfaces.  For  typical  images, 
only  about  a  third  or  less  of  discontinuities  are  of  this 
limited  class.  Typical  edgel  estimation  algorithms  are 
sensitive  to  shading,  both  in  estimating  spurious  step 
edges  where  there  are  none,  and  in  making  inaccurate 
estimates  at  edges  because  of  biases  that  result  from  an 
inaccurate  model.  Some  workers  ignore  edgel  angle  in¬ 
formation  in  using  edge  data.  The  algorithm  described 
here  makes  more  accurate  estimates  in  the  presence  of 
shading. 

[Herskovits and Binford, 1970] point out that several types
of discontinuities of the image surface are common in
real images, corresponding to step, delta and crease
(“roof”) discontinuities defined along curves. Also, discontinuities at
points (spots) are important. This research has produced
an algorithm for edgel estimation for delta function discontinuities
along curves, not described here.

The local-to-global structure is motivated by complexity,
i.e. subdividing to describe small disks by simple
functions that are computationally tractable. The edgel
describes a step with parameters that are: a) orientation;
b) transverse position; and c) contrast.

The  Roberts  cross  and  Canny  operators  are  based  on 
the  gradient  of  intensity.  The  Canny  operator  is  a  one¬ 
dimensional  operator  with  a  heuristic  extension  to  two 
dimensions.  Both  have  bias  in  orientation  and  position 
in  the  presence  of  shading.  The  Binford-Horn  operator 
was  approximately  a  directional  second-derivative. 

For  recognition  in  complex  scenes,  sensor  and  infor¬ 
mation  fusion  is  based  on  Bayesian  inference  that  de¬ 
pends  on  conditional  probabilities  of  data  given  hypothe¬ 
ses  [Binford  87a].  An  extensive  statistical  model  of  per¬ 
formance  has  been  developed,  with  theoretical  analysis 
verified  by  truth  from  synthesized  images.  For  a  previ¬ 
ous  version  of  the  operator,  the  model  was  verified  by 
truth  from  a  real  image. 


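$$G_x(x,y) = \frac{\partial}{\partial x}\left[I(x,y) * G_\sigma(x,y)\right] = I(x,y) * \left[\frac{\partial}{\partial x} G_\sigma(x,y)\right]$$

where

$$G_\sigma(x,y) = \frac{1}{2\pi\sigma^2}\, e^{-\frac{x^2 + y^2}{2\sigma^2}}$$

is the 2-dimensional Gaussian smoothing function, and
$*$ represents the 2-dimensional convolution.

Moreover, the 2-dimensional convolution can be decomposed
into two successive 1-dimensional Gaussian
convolutions.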


2.  Wang-Binford  Operator 


2.4  Detection  and  Localisation 


2.1  Definition  of  a  Step  Edge 
The profile of an ideal step edge can be expressed as

$$S(x) = A_1\, U_{-1}(x)$$

where $A_1$ is a scalar and $U_{-1}(x)$ is the unit step function
defined as

$$U_{-1}(x) = \begin{cases} 1 & \text{for } x \ge 0 \\ 0 & \text{for } x < 0 \end{cases}$$

But the intensity image is blurred by the optical system
and the impulse response of the sensor. The blurred
edge can be approximated by Gaussian convolution, expressed
as:

$$S_b(x) = A_1 \left(U_{-1}(x) \otimes G_b(x)\right)$$

where $G_b(x)$ is a Gaussian function with zero mean
and standard deviation $\sigma_b$, and $\otimes$ represents the 1-dimensional
convolution. This is shown in Figure 1.


For  a  blurred  step  edge,  the  direction  of  its  gradient 
is  normal  to  the  edge  and  the  transverse  profile  of  the 
gradient  magnitude  is  a  Gaussian  function. 
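The claim that the transverse gradient profile of a blurred step is a Gaussian can be checked numerically. The sketch below is an illustration only: it uses the closed form of a step convolved with a Gaussian (an error function) and compares its numerical derivative with the Gaussian itself. The helper names and the sample values are hypothetical.

```python
import math

def blurred_step(y, amplitude, sigma):
    """Step of height `amplitude` convolved with a zero-mean Gaussian
    of standard deviation `sigma` (closed form via the error function)."""
    return amplitude * 0.5 * (1.0 + math.erf(y / (sigma * math.sqrt(2.0))))

def gaussian(y, sigma):
    return math.exp(-0.5 * (y / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def numeric_derivative(f, y, h=1e-5):
    """Central-difference approximation of f'(y)."""
    return (f(y + h) - f(y - h)) / (2.0 * h)

def max_profile_error(amplitude, sigma, samples):
    """Largest gap between d/dy of the blurred step and amplitude*G(y)."""
    return max(
        abs(numeric_derivative(lambda t: blurred_step(t, amplitude, sigma), y)
            - amplitude * gaussian(y, sigma))
        for y in samples
    )
```

Because convolving two Gaussians gives a Gaussian whose variance is the sum of the two variances, the same check applies after additional Gaussian smoothing, with the combined standard deviation in place of `sigma`.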

Without  loss  of  generality,  we  can  assume  the  step 
edge  coincides  with  the  x-axis.  Its  intensity  function  is 
expressed  as 

S(x,y)  =  A(U.i(y)»Gi(y)) 

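Figure 1. Profile of a Blurred Step Edge

Canny solved the one-dimensional variational problem
of optimizing a combination of signal-to-noise and position
location [Canny, 1983]. The result was approximately
a Gaussian derivative.

After the Gaussian derivative, the gradient is
$(G_x(x,y), G_y(x,y))$, with

$$G_x(x,y) = \left[S(x,y) \otimes \frac{\partial}{\partial x} G_m(x)\right] \otimes G_m(y)$$
$$= \left[A\left(U_{-1}(y) \otimes G_b(y)\right) \otimes \frac{\partial}{\partial x} G_m(x)\right] \otimes G_m(y)$$
$$= 0 \otimes G_m(y)$$
$$= 0$$

and

$$G_y(x,y) = \left[S(x,y) \otimes G_m(x)\right] \otimes \frac{\partial}{\partial y} G_m(y)$$
$$= \left[A\left(U_{-1}(y) \otimes G_b(y)\right) \otimes G_m(x)\right] \otimes \frac{\partial}{\partial y} G_m(y)$$
$$= \left[A\, U_{-1}(y) \otimes G_b(y)\right] \otimes \frac{\partial}{\partial y} G_m(y)$$
$$= A \frac{\partial}{\partial y}\left[U_{-1}(y) \otimes G_b(y) \otimes G_m(y)\right]$$
$$= A \frac{\partial}{\partial y}\left(U_{-1}(y) \otimes G_e(y)\right)$$
$$= A\, G_e(y)$$

where $G_e$ is a Gaussian with standard deviation $\sigma_e = \sqrt{\sigma_b^2 + \sigma_m^2}$,
$\sigma_m$ being the standard deviation of the smoothing Gaussian $G_m$.

However, if this edge is shaded, an extra term has to
be added to the gradient. Since the shading effect is
smooth, we can assume this extra term is constant within
a local area, and the perturbed gradient function turns
out to be

$$(G_x(x,y), G_y(x,y)) = (g_x,\ A\,G_e(y) + g_y)$$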

As  shown  in  Figure  2,  the  direction  of  the  gradient 
veers.  If  an  edge  detector  depends  on  the  gradient  di¬ 
rection,  like  the  Canny  Operator  does,  one  would  expect 
it  to  have  poor  performance  whenever  shading  is  intro¬ 
duced.  It  is  difficult  to  remove  this  extra  component 
from  the  gradient  if  no  information  about  the  shading  is 
known  in  advance. 
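
The effect of a constant shading term can be checked numerically. The sketch below is only an illustration under assumed values: the contrast `A`, the width `sigma_e`, and the shading gradient `(gx, gy)` are hypothetical, and the shading term is assumed small and of the same sign as the edge gradient. It evaluates the perturbed gradient across the edge and reports how far the gradient direction veers from the edge normal and where the gradient magnitude peaks.

```python
import math

def gaussian(y, sigma):
    """Zero-mean Gaussian with standard deviation sigma."""
    return math.exp(-y * y / (2.0 * sigma * sigma)) / (math.sqrt(2.0 * math.pi) * sigma)

def shaded_gradient(y, A, sigma_e, gx, gy):
    """Gradient of a blurred step edge lying on the x-axis, plus a constant shading term (gx, gy)."""
    return gx, A * gaussian(y, sigma_e) + gy

# Hypothetical values, chosen only for illustration.
A, sigma_e = 100.0, 1.5
gx, gy = 5.0, 2.0

ys = [0.25 * k for k in range(-12, 13)]          # transverse positions; y = 0 is the true edge
grads = [shaded_gradient(y, A, sigma_e, gx, gy) for y in ys]
magnitudes = [math.hypot(Gx, Gy) for Gx, Gy in grads]
directions = [math.degrees(math.atan2(Gy, Gx)) for Gx, Gy in grads]

peak_y = ys[magnitudes.index(max(magnitudes))]   # where the gradient magnitude peaks
veer_at_edge = 90.0 - directions[ys.index(0.0)]  # deviation from the edge normal, in degrees
```

With these values the direction at the edge is off the normal by several degrees, while the magnitude peak stays on the true edge.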


Figure 2. Gradients of a Shaded Step Edge (panels: Gradient of a Step Edge; Gradient of the Shaded Edge)

Even though the gradient vector is perturbed, the gradient magnitude function still keeps the information we need for unbiased localization. In the above case, its gradient magnitude function can be written as

\[ G_m(x,y) = \sqrt{ g_x^2 + \left( A\,G_e(y) + g_y \right)^2 } \]

The peak of $G_m(x,y)$ coincides with the x-axis, that is, it coincides with the true edge. The gradient magnitude function is still constant along the x direction and is an even function in the y direction. The gradient magnitude extracts the orientation and position of the edge without bias. Moreover, even though the transverse profile of the gradient magnitude function is no longer a Gaussian function, a Gaussian function can still be used as a good approximation that will cause only a small bias in the estimation of the contrast, which is less important than bias in orientation and position.

To detect the edge, that is, to detect the peak of the gradient magnitude function, the measured gradient magnitude values are fit with a parametric surface, which represents the gradient magnitude function of a step edge with parameters $\theta$, $u$ (transverse position), and $G_{m0}$ (the gradient magnitude value at the edge). The fitting is done with a 3 by 3 support. Edge detection is the solution of:

\[ \min_{\theta,\, u,\, G_{m0}} \sum_{i,j=-1}^{1} \left( G_m(x+i,\, y+j) - G_{m0}\, e^{-\frac{(-\sin\theta\, i + \cos\theta\, j - u)^2}{2\sigma_e^2}} \right)^2 \]

where $\theta$ is the orientation of the edgel, $u$ is the transverse position shift from the center pixel, and $G_{m0}$ is proportional to the contrast of the edgel (as shown in Figure 3).

Figure 3. Parameters of a Blurred Step Edge

This is a nonlinear fitting problem. The $\sigma_e$, which depends on both $\sigma_b$ and $\sigma_m$, is assumed to be known before the fitting. It can be estimated from the image easily. This nonlinear fitting problem can even be simplified to a linear one as follows.

The logarithm of the gradient magnitude of the blurred step edge is a quadratic:


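\begin{align*}
\log\left( G_{m0}\, e^{-\frac{(-\sin\theta\, i + \cos\theta\, j - u)^2}{2\sigma_e^2}} \right)
&= \log(G_{m0}) - \frac{(-\sin\theta\, i + \cos\theta\, j - u)^2}{2\sigma_e^2} \\
&= a i^2 + b i j + c j^2 + d i + e j + f
\end{align*}

with

\begin{align*}
a &= -\frac{1}{2\sigma_e^2} \sin^2\theta \\
b &= \frac{1}{\sigma_e^2} \sin\theta \cos\theta \\
c &= -\frac{1}{2\sigma_e^2} \cos^2\theta \\
d &= -\frac{1}{\sigma_e^2} \sin\theta \cdot u \\
e &= \frac{1}{\sigma_e^2} \cos\theta \cdot u \\
f &= \log(G_{m0}) - \frac{u^2}{2\sigma_e^2}
\end{align*}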

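Therefore, a linear fit minimizes the objective function

\[ \sum_{i,j=-1}^{1} \left( \log G_m(x+i,\, y+j) - (a i^2 + b i j + c j^2 + d i + e j + f) \right)^2 \]

and the parameters of the edgel can be obtained from the coefficients of the parabolic function with

\begin{align*}
\theta &= \tan^{-1}\left( \frac{c - a + \sqrt{(a-c)^2 + b^2}}{b} \right) \\
u &= -\frac{-d \sin\theta + e \cos\theta}{2(a+c)} \\
\log(G_{m0}) &= f - (a+c)\,u^2
\end{align*}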
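
The linear fit can be sketched in a few lines. This is a minimal illustration and not the authors' implementation: the function names are hypothetical, the least-squares problem is solved through the normal equations with a small Gaussian-elimination routine, and the orientation is recovered as half of `atan2(b, a - c)`, which follows from the definitions of `a`, `b`, and `c` (the orientation is determined only up to a half turn, with the sign of `u` flipping accordingly).

```python
import math

def solve(M, v):
    """Solve M x = v by Gaussian elimination with partial pivoting."""
    n = len(v)
    aug = [M[r][:] + [v[r]] for r in range(n)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[piv] = aug[piv], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[r][k] -= factor * aug[col][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x

def fit_edgel(log_gm):
    """Fit a quadratic to log gradient magnitudes on a 3 by 3 support.

    log_gm maps (i, j), with i, j in {-1, 0, 1}, to log G_m(x+i, y+j).
    Returns (theta, u, G_m0).
    """
    basis = [[i * i, i * j, j * j, i, j, 1.0] for (i, j) in log_gm]
    values = list(log_gm.values())
    # Normal equations of the linear least-squares problem.
    M = [[sum(row[p] * row[q] for row in basis) for q in range(6)] for p in range(6)]
    v = [sum(row[p] * val for row, val in zip(basis, values)) for p in range(6)]
    a, b, c, d, e, f = solve(M, v)
    theta = 0.5 * math.atan2(b, a - c)        # b ~ sin(2 theta), a - c ~ cos(2 theta)
    u = (-d * math.sin(theta) + e * math.cos(theta)) / (-2.0 * (a + c))
    g_m0 = math.exp(f - (a + c) * u * u)
    return theta, u, g_m0

def synthetic_log_gm(theta, u, g_m0, sigma_e):
    """Noise-free log gradient magnitudes of a blurred step edge (for checking the fit)."""
    out = {}
    for i in (-1, 0, 1):
        for j in (-1, 0, 1):
            t = -math.sin(theta) * i + math.cos(theta) * j - u
            out[(i, j)] = math.log(g_m0) - t * t / (2.0 * sigma_e ** 2)
    return out
```

On noise-free synthetic data the fit returns the generating parameters, and `sigma_e` never enters `fit_edgel`, since `a + c` equals $-1/(2\sigma_e^2)$.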
A linear solution gains efficiency and avoids convergence problems. It is not necessary to know $\sigma_e$ before the fitting. Logarithms are computed by table lookup. Thus, the Wang-Binford operator has computation cost comparable to the Canny operator.

2.5  Thresholding 


Edges occur only at a small part of the image. Edge fitting is done only at local maxima of the gradient magnitude that are above threshold.

The threshold is determined by a constant false alarm rate (CFAR), that is, a constant probability of detecting an edgel where there is actually no edge. The variance of the gradient magnitude is determined from measurement of the variance of pixel intensity. The noise at pixel (i, j), $n(i,j)$, is approximately an independent, identically distributed (i.i.d.) Gaussian random variable with zero mean and standard deviation $\sigma_n$. Its pdf can be expressed as

\[ p_n(n) = \frac{1}{\sqrt{2\pi}\,\sigma_n}\, e^{-\frac{n^2}{2\sigma_n^2}} \]


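When there is no signal around (x, y), the gradient becomes:

\[ G_x(x,y) = \sum_{i,j=-\infty}^{\infty} \frac{i}{2\pi\sigma_m^4}\, e^{-\frac{i^2+j^2}{2\sigma_m^2}}\, n(x+i,\, y+j) \]

and

\[ G_y(x,y) = \sum_{i,j=-\infty}^{\infty} \frac{j}{2\pi\sigma_m^4}\, e^{-\frac{i^2+j^2}{2\sigma_m^2}}\, n(x+i,\, y+j) \]
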

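Since $G_x(x,y)$ and $G_y(x,y)$ are linear combinations of Gaussian random variables, they are again Gaussians. The pdf of the gradient $(G_x(x,y), G_y(x,y))$ is:

\[ f_{grad_x,\, grad_y}(x, y) = \frac{1}{2\pi \|\Lambda\|^{1/2}} \exp\left( -\tfrac{1}{2} \begin{pmatrix} x & y \end{pmatrix} \Lambda^{-1} \begin{pmatrix} x \\ y \end{pmatrix} \right) \]

where

\[ \Lambda = \begin{pmatrix} \sigma_g^2 & 0 \\ 0 & \sigma_g^2 \end{pmatrix} \]

and $\sigma_g^2$ is the variance of each gradient component.

The pdf of the gradient magnitude is:

\[ f_{G_m}(x) = \frac{x}{\sigma_g^2}\, e^{-\frac{x^2}{2\sigma_g^2}}, \qquad x \ge 0 \]

If the false positive rate is set to be $\alpha$, only $(\alpha \cdot 100)\%$ of the gradient magnitude values in the background can be greater than the threshold $g_T$. That is,

\[ 1 - \alpha = \int_0^{g_T} \frac{x}{\sigma_g^2}\, e^{-\frac{x^2}{2\sigma_g^2}}\, dx = 1 - e^{-\frac{g_T^2}{2\sigma_g^2}} \]

From the above equation, the threshold $g_T$ has value:

\[ g_T = \left[ 2\sigma_g^2 \ln\left( \frac{1}{\alpha} \right) \right]^{1/2} \]
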
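
The CFAR threshold can be computed as sketched below. This is an illustrative sketch rather than the authors' code: the function names are hypothetical, and the gradient variance is obtained by summing the squared weights of a sampled Gaussian-derivative kernel, which assumes i.i.d. pixel noise. For a continuous kernel the same sum evaluates to $\sigma_n^2 / (8\pi\sigma_m^4)$; that closed form is a derivation made here, not a formula quoted from the text.

```python
import math

def gradient_noise_variance(sigma_n, sigma_m, radius=None):
    """Variance of one gradient component under i.i.d. pixel noise.

    Equals sigma_n^2 times the sum of squared Gaussian-derivative kernel weights.
    """
    if radius is None:
        radius = int(math.ceil(5 * sigma_m))   # kernel truncation, an assumption
    total = 0.0
    for i in range(-radius, radius + 1):
        for j in range(-radius, radius + 1):
            w = -i / (2.0 * math.pi * sigma_m ** 4) * math.exp(-(i * i + j * j) / (2.0 * sigma_m ** 2))
            total += w * w
    return sigma_n ** 2 * total

def cfar_threshold(sigma_g2, alpha):
    """Threshold g_T = sqrt(2 sigma_g^2 ln(1/alpha)) for false positive rate alpha."""
    return math.sqrt(2.0 * sigma_g2 * math.log(1.0 / alpha))

def rayleigh_tail(g, sigma_g2):
    """P(gradient magnitude > g) when both components are N(0, sigma_g^2)."""
    return math.exp(-g * g / (2.0 * sigma_g2))
```

For example, `cfar_threshold(gradient_noise_variance(2.0, 1.5), 0.01)` gives the magnitude that background noise exceeds with probability 0.01 under these assumptions.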


2.6  Details  of  Implementation 


To compute the threshold for edge detection, the value of $\sigma_n$, the standard deviation of pixel intensity, is needed. It varies from image to image. It can be measured from the sensor with controlled scenes. It can be estimated from an existing image to a good approximation.

At any pixel of the image, the intensity value consists of signal with noise. To separate the noise from the signal, consider the approximation to the Laplacian:

\[ K(x,y) = I(x,y) - \tfrac{1}{4} \left[ I(x-1,y) + I(x+1,y) + I(x,y-1) + I(x,y+1) \right] \]

If the intensity surface of the signal approximates a plane locally, then $K(x,y)$ consists only of a noise term. With the assumption of i.i.d. noise, the variance of $K$ becomes:

\[ \text{Variance of } K(x,y) = \left( 1 + 4 \cdot (0.25)^2 \right) \sigma_n^2 = 1.25\, \sigma_n^2 \]

That is,

\[ \sigma_n = 0.8944 \sqrt{\text{Variance of } K(x,y)} \]
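
A minimal sketch of this noise estimate follows. The function names and the test image (a tilted plane with added Gaussian noise) are assumptions for illustration; a real image would also contain edges, where the locally planar assumption fails and the residuals would need to be screened.

```python
import math
import random

def estimate_sigma_n(img):
    """Estimate the pixel noise standard deviation from the Laplacian-like residual K."""
    rows, cols = len(img), len(img[0])
    residuals = []
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            neighbours = img[r - 1][c] + img[r + 1][c] + img[r][c - 1] + img[r][c + 1]
            residuals.append(img[r][c] - 0.25 * neighbours)
    mean = sum(residuals) / len(residuals)
    variance = sum((k - mean) ** 2 for k in residuals) / len(residuals)
    return math.sqrt(variance / 1.25)          # same as 0.8944 * sqrt(variance)

def noisy_plane(rows, cols, sigma_n, rng):
    """Hypothetical test image: a planar intensity surface plus i.i.d. Gaussian noise."""
    return [[10.0 + 0.5 * r + 0.3 * c + rng.gauss(0.0, sigma_n) for c in range(cols)]
            for r in range(rows)]
```

On a 60 by 60 noisy plane with $\sigma_n = 3$, the estimate should land close to 3, because the planar signal cancels exactly in $K$.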

A digital image is sampled at discrete pixels. Both the impulse response function of camera pixels and that of the Gaussian derivative contribute to discretization. The effect of discretization is large for small values of the width of the Gaussian derivative. This effect has been examined theoretically and experimentally.
Figure 4.1.a Intensity Image

Figure 4.2.a Intensity Image

Figure 4.1.c Canny Operator

Figure 4.2.c Canny Operator

Figure 4.1.d Wang-Binford Operator

Figure 4.2.d Wang-Binford Operator
The $\sigma_m$ of the smoothing filter has to be carefully chosen to avoid this effect. It has been claimed in [Canny, 1983] that the minimum value of the $\sigma$ of the Gaussian smoothing function, which is applied before sampling, is approximately equal to the sampling interval r. With a similar analysis, the value of $\sigma_e$, which depends on both $\sigma_b$ and $\sigma_m$, has to be bigger than 1 to avoid the discretization effect. This conclusion is supported by empirical data.

3.  Statistical  Model  of  the  Detected  Edgel 

The  first  order  statistical  model  of  this  operator  is  its 
bias  vector.  The  second  order  statistical  model  is  its 
variance.  Together  they  specify  a  Gaussian  probability 
model  for  the  operator  to  use  in  performance  analysis 
and  in  Bayesian  inference. 

Noise  in  the  intensities  of  image  pixels  causes  the  po¬ 
sition,  orientation,  and  contrast  of  the  estimated  edgel 
to  fluctuate.  This  statistical  model  can  be  estimated  by 
either  experiment  or  theory.  We  did  it  in  both  ways.  It 
is found that the pdf of the fluctuation can be approximated by a 3-dimensional Gaussian distribution, with mean vector m = 0 and diagonal covariance matrix $\Lambda$.

3.1  Experimental  Method 

A program was written to generate intensity images of a step edge with given position, orientation, and contrast, blurred by a known Gaussian impulse response function. Random noise with given energy $\sigma_n^2$ was added to the image values, and the fluctuation of $\theta$, $u$, and $\log(G_{m0})$ was measured. This process was repeated 5000 times to estimate the covariance for orientation, position, and gradient as a function of edge contrast.

Orientations were 0, 11.25, 22.5, 33.75, and 45 degrees, and contrast was 4, 8, 16, 32, 64, 128, and 256. The result of this experimental approach shows that the pdf of the fluctuation is approximated by a 3-dimensional Gaussian distribution with mean m = 0 and diagonal covariance $\Lambda$. The model is insensitive to orientation, while the standard deviation of the fluctuation is proportional to $\sigma_n / G_{m0}$, as shown in Figure 5.

A  similar  simulation  was  done  for  circular  edges  for 
radius  5,  10,  20,  40,  80,  160,  and  1600  pixels.  The  re¬ 
sult  is  about  the  same  as  the  straight  edge  case  when 
the  radius  is  bigger  than  10  pixels.  When  the  radius  is 
smaller  than  10  pixels,  one  extra  constant  term  has  to  be 
added to the standard deviation of $\log(G_{m0})$, as shown
in  Figure  6. 

3.2  Theoretical  Analysis 

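We use a perturbation method to get the analytic model as follows.

In detection, parameters of the edgel minimize the objective function:

\[ \Phi(P, W) = \sum_{i,j=-1}^{1} \left[ \log(G_m(x+i,\, y+j)) - (a i^2 + b i j + c j^2 + d i + e j + f) \right]^2 \]

where $P$ stands for the coefficients a, b, c, d, e, f of the parabolic function, and $W$ stands for the set of measured data $J(x,y) = \log(G_m(x,y))$.

That is, the parameters are solutions for $P$ such that:

\[ \Psi(P, W) = \frac{\partial \Phi(P, W)}{\partial P} = 0 \]

Denote $P_0$ as the solution of $\Psi(P, W_0) = 0$, where $W_0$ stands for the noise-free case. When noise is added, it perturbs the measured data by a small amount $\delta W$, which consequently causes a small fluctuation $\delta P$ in the coefficient space. The relationship between $\delta W$ and $\delta P$ is given by:

\[ \Psi(P_0 + \delta P,\; W_0 + \delta W) = 0 \]

By expanding the above equation to first order, we have

\[ \frac{\partial \Psi}{\partial P}\, \delta P + \frac{\partial \Psi}{\partial W}\, \delta W = 0 \]

and thus

\[ \delta P = -\left[ \frac{\partial \Psi}{\partial P} \right]^{-1} \frac{\partial \Psi}{\partial W}\, \delta W \]

Moreover, the parameter vector of the edgel, $\Theta$, is a function of $P$ with $\Theta = M(P)$. Thus:

\[ \delta\Theta = \frac{\partial M(P)}{\partial P}\, \delta P = -\frac{\partial M(P)}{\partial P} \left[ \frac{\partial \Psi}{\partial P} \right]^{-1} \frac{\partial \Psi}{\partial W}\, \delta W = \Omega\, \delta W \]

The 1st and 2nd moments of $\delta\Theta$ can be computed as:

\begin{align*}
E(\delta\Theta) &= \Omega \cdot E(\delta W) \\
E(\delta\Theta\, \delta\Theta^T) &= \Omega\, E(\delta W\, \delta W^T)\, \Omega^T
\end{align*}

Therefore, the statistical model of $\delta\Theta$ follows from the statistical model of $\delta W$.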


3.2.1  Statistical  Model  for  the  Measured  Data 

With the linearity of convolution and the noise model above, the fluctuation of the gradient is caused by noise only and is independent of the signal. That is:

\[ \delta G_x(x,y) = \iint \frac{-u}{2\pi\sigma_m^4}\, e^{-\frac{u^2+v^2}{2\sigma_m^2}}\, N(x-u,\, y-v)\, du\, dv \]

\[ \delta G_y(x,y) = \iint \frac{-v}{2\pi\sigma_m^4}\, e^{-\frac{u^2+v^2}{2\sigma_m^2}}\, N(x-u,\, y-v)\, du\, dv \]

where $N(i,j)$ is the noise component at pixel (i, j). It has been assumed to be an i.i.d. Gaussian random variable with zero mean and constant standard deviation $\sigma_n$.

Since integration is a linear operation, $\delta G_x(x,y)$ and $\delta G_y(x,y)$ are also Gaussians. With some straightforward computations, we have:

\[ E(\delta G_x(x,y)) = E(\delta G_y(x,y)) = 0 \]

and

\begin{align*}
E(\delta G_x(x,y)\, \delta G_x(\bar{x},\bar{y})) &= \frac{\sigma_n^2}{8\pi\sigma_m^4} \left( 1 - \frac{(x-\bar{x})^2}{2\sigma_m^2} \right) e^{-\frac{(x-\bar{x})^2 + (y-\bar{y})^2}{4\sigma_m^2}} = N_{xx}(x-\bar{x},\, y-\bar{y}) \\
E(\delta G_x(x,y)\, \delta G_y(\bar{x},\bar{y})) &= -\frac{\sigma_n^2}{16\pi\sigma_m^6}\, (x-\bar{x})(y-\bar{y})\, e^{-\frac{(x-\bar{x})^2 + (y-\bar{y})^2}{4\sigma_m^2}} = N_{xy}(x-\bar{x},\, y-\bar{y}) \\
E(\delta G_y(x,y)\, \delta G_y(\bar{x},\bar{y})) &= \frac{\sigma_n^2}{8\pi\sigma_m^4} \left( 1 - \frac{(y-\bar{y})^2}{2\sigma_m^2} \right) e^{-\frac{(x-\bar{x})^2 + (y-\bar{y})^2}{4\sigma_m^2}} = N_{yy}(x-\bar{x},\, y-\bar{y})
\end{align*}

Now, investigate the dependence of $\log(G_m(x,y))$ on noise. By definition:

\[ J(x,y) = \log(G_m(x,y)) = \log\left( \sqrt{G_x(x,y)^2 + G_y(x,y)^2} \right) \]

Therefore,

\[ \delta J(x,y) = \frac{G_x(x,y)\, \delta G_x(x,y) + G_y(x,y)\, \delta G_y(x,y)}{G_x(x,y)^2 + G_y(x,y)^2} \]

Again, since $\delta J(x,y)$ is a linear combination of two Gaussian random variables, $\delta G_x(x,y)$ and $\delta G_y(x,y)$, it is also a Gaussian random variable with:

\begin{align*}
E(\delta J(x,y)) &= 0 \\
E(\delta J(x,y)\, \delta J(\bar{x},\bar{y})) &= \frac{Q(x-\bar{x},\, y-\bar{y})}{G_m(x,y)\, G_m(\bar{x},\bar{y})}
\end{align*}

where:

\[ Q(u,v) = \sin^2\theta\, N_{xx}(u,v) - 2 \sin\theta \cos\theta\, N_{xy}(u,v) + \cos^2\theta\, N_{yy}(u,v) \]

3.2.2  Statistical Model for $\theta$, $u$, and $\log(G_{m0})$

The statistical model of $\delta W$, combined with the knowledge of $\partial M / \partial P$ and of the sensitivity of the fitted coefficients $P$ to the measured data $W$, gives the statistical model of $\delta\Theta$. The result shows that the pdf of $[\delta\theta, \delta u, \delta(\log(G_{m0}))]$ can be approximated by a 3-dimensional Gaussian distribution with zero mean vector and diagonal covariance. Moreover, the standard deviations of $\delta\theta$, $\delta u$, and $\delta(\log(G_{m0}))$ are found to be proportional to $\sigma_n / G_{m0}$.

These two models, from synthesized data and from the theoretical analysis, agree for a straight step edge and for circular arcs with radius of curvature greater than 10 pixels (see Figure 5). When the radius is smaller than 10 pixels, a constant term, $\delta(\log(G_{m0}))$, has to be added to the standard deviation, as shown in Figure 6, and the estimates of $u$ and $\log(G_{m0})$ are slightly biased.

A more complicated statistical model for $\delta W$ has to be made for edges with large curvature. This part is not finished yet. However, the model for the straight edge is still good enough for most of the image.

4.  Conclusion

In this paper, we demonstrated an algorithm for detection and estimation of step edges which is insensitive to shading. Even though this operator is designed especially for straight step edges, it can also be applied to most curved edges. The detected edgels are unbiased in orientation and position. The resulting improved accuracy enables linking into extended edges and improved accuracy of estimation of image dimensions.

The statistical edgel model was constructed and verified. This model is valuable in the Bayesian network for combining evidence. Even though this model is good enough for most images, we are planning to build a more complete model for highly curved step edges.

This operator is still not effective for many image features, e.g. lines or spots. A complete set of operators has to be developed to deal with other basic image features.
Abstract

Computer  vision  algorithms  are  composed  of  dif¬ 
ferent  sub-algorithms  often  applied  in  sequence. 
Determination  of  the  performance  of  a  total  com¬ 
puter  vision  algorithm  is  possible  if  the  perfor¬ 
mance  of  each  of  the  sub-algorithm  constituents 
is  given.  The  performance  characterization  of  an 
algorithm  has  to  do  with  establishing  the  cor¬ 
respondence  between  the  random  variations  and 
imperfections  in  the  output  data  and  the  ran¬ 
dom  variations  and  imperfections  in  the  input 
data. In the paper by Ramesh and Haralick [1], theoretical models for the random perturbations in the output of a vision sequence involving edge finding, edge linking, and line fitting were presented. They modelled the process that describes the breakage of a true model line segment
by  a  renewal  process  with  alternating  line  and 
gap  intervals.  However,  their  paper  assumed  in¬ 
dependence  of  gradient  estimates  obtained  from 
neighboring  pixel  locations. 

In this paper we show how one can relax the independence assumptions and derive perturbation models that include the effects of correlation between neighboring gradient estimates. Under the assumption that the ideal data is corrupted with additive, independent Gaussian noise, we derive expressions that describe the relationship between an edge gradient estimate at a given location and an edge gradient estimate for a neighboring pixel. We illustrate how the model line breakage process can be modeled as a Markov process whose parameters are functions of the true edge gradient, the edge operator's neighborhood size, and the noise variance. Furthermore, we derive theoretical expressions for the mean positional error as a function of the neighborhood operator window size, noise variance, the width of the true ramp edge, and the true edge gradient. We also outline an experimental protocol used for evaluating edge pixel positioning errors and discuss the results obtained from the experiments.

1  Introduction 

Computer vision algorithms are composed of different sub-algorithms often applied in sequence. Determination of the performance of a total computer vision algorithm is possible if the performance of each of the sub-algorithm constituents is given. The performance characterization of an algorithm has to do with establishing the correspondence between the random variations and imperfections in the output data and the random variations and imperfections in the input data. In the paper by Ramesh and Haralick [1], theoretical models for the random perturbations in the output of a vision sequence involving edge finding, edge linking, and line fitting were presented. They modelled the process that describes the breakage of a true model line segment by a renewal process with alternating line and gap intervals. However, their paper assumed independence of gradient estimates obtained from neighboring pixel locations.

In this paper we show how one can relax the independence assumptions and derive perturbation models that include the effects of correlation between neighboring gradient estimates. Under the assumption that the ideal data is corrupted with additive, independent Gaussian noise, we derive expressions that describe the relationship between an edge gradient estimate at a given location and an edge gradient estimate for a neighboring pixel. We illustrate how the model line breakage process can be modeled as a Markov process whose parameters are functions of the true edge gradient, the edge operator's neighborhood size, and the noise variance. Furthermore, we derive theoretical expressions for the mean positional error as a function of the neighborhood operator window size, noise variance, the width of the true ramp edge, and the true edge gradient. This paper is organized as follows. The first section provides a brief review of results discussed in [1]. The second section provides an analysis of the positional error introduced by gradient based edge operators. The third section provides a discussion of a perturbation model for the edge output that takes into account the dependence of estimates at neighboring pixels. A subsequent section provides a discussion of theoretical performance measure plots.

2  Review  of  previous  results 

In this section we review some of the results outlined in the paper by Ramesh and Haralick [1]. Our work extends the results given in [1]. Ramesh and Haralick [1] describe a theoretical model by which pixel noise can be successively propagated through an edge labelling algorithm, an edge linking algorithm, and a boundary gap filling algorithm. Assuming an edge idealization of a linear ramp edge and i.i.d. Gaussian random perturbations on pixel grayvalues, they show how one could model the breakage of a true line segment as a renewal process with alternating segment and gap intervals. They show that if one ignores the dependencies between adjacent gradient estimates, then the segment and gap interval lengths are exponentially distributed with parameters $\lambda_1$ and $\lambda_2$ that are related to the true edge gradient, the neighborhood operator size, and the gradient threshold employed. They also show how the output after a gap filling operation could still be modeled as an alternating renewal process and derive the length distributions for the segment and gap intervals after the operation.

In reality, there is an overlap between the edge detector neighborhoods centered around pixels, and hence there is some dependence between gradient estimates obtained for neighboring windows. In addition, if one assumes that the noise at each pixel is locally dependent, then the correlation in the noise would introduce correlation in the gradient estimates. Furthermore, the analysis in [1] did not include positional errors. These positional errors are of significance if one wishes to analyze higher-level matching algorithms.

In other work [2], we focussed on performing theoretical model-based comparison of gradient based edge finding schemes and mathematical morphology based edge finding schemes. The
performance  analysis  was  done  by  assuming  an 
ideal  edge  model  and  a  noise  model  and  by  deriv¬ 
ing  expressions  for  probability  of  false  alarm  and 
probability  of  misdetection  of  edge  pixels.  Under 
the  Gaussian  noise  model  assumption,  the  theory 
indicated  that  the  morphological  edge  detector  is 
superior  to  conventional  gradient  based  edge  de¬ 
tectors,  that  label  edges  based  on  gradient  mag¬ 
nitude,  when  a  size  3  by  3  window  was  used.  We 
performed  experiments  to  validate  our  theoreti¬ 
cal  results  and  the  empirical  plots  indicated  that 
the  morphological  edge  operator  was  also  superi¬ 
or  when  a  5  by  5  window  is  used.  However,  the 
theoretical  plots  did  not  confirm  this  because  the 
theory  provided  only  an  upperbound.  In  [2]  we 
also  included  comparisons  of  results  obtained  for 
real  images.  A  simple  analysis  of  hysteresis  link¬ 
ing  was  also  done  in  this  paper  and  it  was  shown 
that  hysteresis  linking  improves  the  performance 
of  the  edge  operators. 

3  Positional Error Analysis of Gradient Based Edge Detectors

In this section we derive the expression for the mean error in the edge pixel location. We consider an ideal edge model that has a one-dimensional intensity profile of a ramp. Specifically, the intensity profile is defined by:

\begin{align*}
I(x) &= a + Gx && \text{for } x = -(K-1)/2, \ldots, (K-1)/2 && (1) \\
&= a - G(K-1)/2 && \text{for } x < -(K-1)/2 && (2) \\
&= a + G(K-1)/2 && \text{for } x > (K-1)/2
\end{align*}

We assume that the edge detection is performed by computing the gradient by fitting a planar surface to the grayscale values as in [1]. In 1-dimension this problem is equivalent to fitting a line to the data for each 1 by K neighborhood. There are two kinds of errors that are introduced in the fit: one is the systematic bias due to the approximation of the function $I(x)$ by a linear fit in the 1 by K neighborhood, and the other is the error due to the additive noise in the input. Let $G(x)$ be the gradient estimate obtained when the least squares fit is performed for the window of ideal data $I(i)$, $i = x - (K-1)/2, \ldots, x + (K-1)/2$. Clearly, $G(0)$, the gradient estimate when the neighborhood overlaps the entire ramp, is equal to the true slope $G$. Also, $G(x)$ for $|x| > K$ is equal to zero. In addition, one can note that $G(x)$ is a symmetric function since $I(x)$ is symmetric. When the discrete samples are corrupted with additive i.i.d. Gaussian noise with zero mean and variance $\sigma^2$, the estimates for the gradient values, $\hat{G}(x)$, are normal random variables with true mean $G(x)$ and variance $\sigma^2 / \sum_i i^2$, where the sum is taken over values of $i = -(K-1)/2, \ldots, (K-1)/2$. Neighboring gradient estimates, $\hat{G}(x)$ and $\hat{G}(x+j)$, are dependent random variables because of the overlap in the neighborhoods used during the estimation procedure.
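
The variance and the dependence between neighboring estimates can be illustrated with a short sketch. It assumes an odd window size K and the standard least-squares slope kernel $i / \sum_i i^2$; the function names are hypothetical, and the covariance routine simply sums products of kernel weights over the samples two windows share, which is what i.i.d. noise implies.

```python
def slope_kernel(K):
    """Least-squares slope weights for a 1 by K window centered at 0 (K odd)."""
    half = (K - 1) // 2
    idx = list(range(-half, half + 1))
    s2 = sum(i * i for i in idx)
    return [i / s2 for i in idx]

def slope_estimates(signal, K):
    """Slope estimate at every position where the full window fits."""
    w = slope_kernel(K)
    half = (K - 1) // 2
    return [sum(w[t] * signal[x - half + t] for t in range(K))
            for x in range(half, len(signal) - half)]

def covariance(K, lag, sigma2):
    """Covariance of two slope estimates whose windows are `lag` samples apart."""
    w = slope_kernel(K)
    return sigma2 * sum(w[t] * w[t - lag] for t in range(lag, K))
```

For K = 5 the sum of squares is 10, so `covariance(5, 0, sigma2)` is `sigma2 / 10`, matching the variance stated above; at lag 1 the covariance is `0.04 * sigma2`, and it vanishes once the windows no longer overlap.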

We show in the appendix that if we view the sequence of $2K - 1$ random variables $\hat{G}(x)$, $x = -(K-1), \ldots, K-1$, as a random vector $\hat{G}$, then $\hat{G}$ is distributed as a multivariate Gaussian random variable with mean vector $G(x)$ and covariance matrix $A \Sigma A'$, where the matrix $A$ is obtained from the fitting kernel coefficients as described in the appendix and $\Sigma$ is the covariance matrix of the additive noise vector, which is assumed to be $\sigma^2 I$. The matrix $A$ captures the dependence between the adjacent gradient estimates.

In order to compute the error in the edge pixel position, we assume that the pixel with the maximum gradient magnitude along the gradient direction is labelled as an edge, while all the other pixels are labelled as non-edge pixels. That is, the edge pixel's index is $ep$ when:

\[ \hat{G}(ep) > \hat{G}(x) \qquad \forall x \ge -(K-1),\; x \le (K-1),\; x \ne ep \]

Hence the probability that the location $i$ is labelled as edge is given by a multivariate integral with appropriate limits specified by the gradient threshold used. That is, the probability is given by the expression:

\[ Prob(ep = i) = \int_{x_i = T}^{\infty} \int \cdots \int_{j \ne i} \Phi(G(x), A \Sigma A')\, dx_i\, dx_j \; + \int_{x_i = -\infty}^{-T} \int \cdots \int_{j \ne i} \Phi(G(x), A \Sigma A')\, dx_i\, dx_j \qquad (3) \]

where $\Phi$ is the multivariate normal distribution function. The expression has two terms because the threshold $T$ is actually on the absolute value of the gradient. The mean error in the edge pixel location is then given by:

\[ \mu = \sum_{i=0}^{K-1} i \cdot Prob(ep = i) \qquad (4) \]

4  Boundary Model incorporating Dependencies in Estimates


In the previous section we addressed how errors in grayvalues propagate to errors in pixel locations at the output of the edge operator. An alternating renewal process with gaps and edge segments ([1]) is used to describe the breakage of a true model segment into short edge segments and gaps. Under the assumption that the gradient across the edge is constant along the true model boundary and ignoring the dependence between gradient estimates from local neighbours, it was seen in [1] that the edge segment lengths and the gap lengths can be approximated as exponential distributions. In this section we illustrate how the boundary model given in [1] can be extended to include the dependencies due to correlation of gradient estimates.

We assume the ideal model for the intensity profile across the boundary to be a ramp edge with constant gradient as one walks along the model line. In addition we assume that the samples are corrupted with i.i.d. additive Gaussian noise. It is shown in the appendix how one can relate the gradient estimate obtained at a particular location, $(r, c)$, to the gradient estimate at a nearby location, $(r+k, c+j)$. Using these relationships one can derive the expression for the probability that the gradient estimate $\hat{G}(r+k, c+j)$ is greater than the threshold $T$ given that the gradient estimate $\hat{G}(r, c) > T$. For simplicity, we can assume that the ramp edge is oriented across the column direction. In this situation we are interested in modelling the relationship between the gradient estimates at successive rows. This scenario is equivalent to the examination of the gradient estimates in neighboring pixels, as one walks along the true model line. We visualize the sequence of edge labels (1's and 0's) as we walk along the model line as a series of binary random variables. It is easily seen that if we are dealing with independent Gaussian noise at each pixel, the gradient estimate $\hat{G}(r, c)$ is dependent on the previous estimates $\hat{G}(r-k, c)$, $k = 1, \ldots, K$. In this sense, the binary edge sequence forms a binary Kth order Markov chain. The Markov chain can be specified by the conditional probabilities $Prob(X_r = 1 \mid X_{r-k} = 1)$, $k = 1, \ldots, K$, and $Prob(X_r = 1 \mid X_{r-1} = 1, \ldots, X_{r-j} = 1)$, $j = 2, \ldots, K$. These probabilities can be easily derived from the joint distribution for the $\hat{G}(\cdot)$'s by computing the appropriate multivariate integral. For example, $Prob(X_r = 1 \mid X_{r-k} = 1)$ is equal to $Prob(\hat{G}(r,c) > T \mid \hat{G}(r-k,c) > T)$ and is given by $Prob(\hat{G}(r,c) > T,\ \hat{G}(r-k,c) > T) / Prob(\hat{G}(r-k,c) > T)$. The numerator in the above expression is obtained by integrating the joint distribution of $\hat{G}(r,c)$ and $\hat{G}(r-k,c)$ with limits of integration from $T$ to $\infty$. The denominator is the integral of the distribution function for $\hat{G}(r-k,c)$.
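
The pairwise conditional probability can be evaluated numerically as sketched below. This is a sketch under stated assumptions, not the appendix derivation: the two estimates are taken to be jointly Gaussian with equal mean and variance, the joint tail probability is reduced to a one-dimensional integral and evaluated with Simpson's rule, and `overlap_correlation` encodes an assumed correlation of $(K-k)/K$ for K by K windows that are k rows apart under i.i.d. noise. All function names are hypothetical.

```python
import math

def norm_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def norm_sf(z):
    """P(Z > z) for a standard normal variable."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))

def joint_exceed(a, rho, steps=4000):
    """P(Z1 > a, Z2 > a) for a standard bivariate normal with correlation rho, |rho| < 1."""
    lo = max(a, -10.0)
    hi = max(lo, 0.0) + 10.0
    scale = math.sqrt(1.0 - rho * rho)

    def integrand(z):
        # density of Z1 at z times P(Z2 > a | Z1 = z)
        return norm_pdf(z) * norm_sf((a - rho * z) / scale)

    h = (hi - lo) / steps
    total = integrand(lo) + integrand(hi)
    for n in range(1, steps):
        total += (4 if n % 2 else 2) * integrand(lo + n * h)
    return total * h / 3.0

def conditional_edge_prob(mean_g, T, sigma_g, rho):
    """P(G(r,c) > T | G(r-k,c) > T) for two estimates with common mean and std."""
    a = (T - mean_g) / sigma_g
    return joint_exceed(a, rho) / norm_sf(a)

def overlap_correlation(K, k):
    """Assumed correlation between estimates from K by K windows k rows apart."""
    return max(K - k, 0) / K
```

With zero correlation the conditional probability reduces to the unconditional one; positive correlation from overlapping windows raises it, which is the dependence the Markov chain model is meant to capture.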

5  Protocol for image generation (for edge pixel accuracy) and evaluation

Synthetic images of size 51 rows by 51 columns were generated with step edges at various orientations passing through the center pixel $(R, C) = (26, 26)$ in the image. The gray value, $I(r,c)$, at a particular pixel, $(r, c)$, in the synthetic image was obtained by using the following function, where $\rho = (r - R)\cos(\theta) + (c - C)\sin(\theta)$:

\begin{align*}
I(r,c) &= I_{min}, && \rho < 0 && (5) \\
&= I_{max}, && \text{otherwise.}
\end{align*}

$I_{min}$ and $I_{max}$ are the gray values to the left and right of the step edge. The variables $R$ and $C$ designate a point in the image on which the step edge boundary lies. In our experiments we set $I_{min}$ to be 100 and $I_{max}$ to be 200. We used orientation ($\theta$) values of 0, 15, ..., 175 degrees. To generate ramp edges, we averaged images containing the step edges with a kernel of size 4 by 4 so that the resulting ramps are 5 pixels wide. To these ramp edge images we added Gaussian noise to obtain images with various signal to noise ratios. We define signal to noise ratio as:

\[ SNR = 20 \log_{10} \frac{\sigma_s}{\sigma_n} \qquad (6) \]

where $\sigma_s$ is the standard deviation of the gray values in the input image and $\sigma_n$ is the noise standard deviation. We used SNR values of 0, 5, 10, and 20 dB. They correspond to $\sigma_s / \sigma_n$ values of 1, 1.78, 3.162, and 10 respectively. Ground truth edge images were generated by using the following function, where $\rho = (r - R)\cos(\theta) + (c - C)\sin(\theta)$:

\begin{align*}
I_1(r,c) &= 0, && \rho < -0.5 \\
&= 1, && \text{otherwise.} \\
I_2(r,c) &= 0, && \rho < 0.5 \\
&= 1, && \text{otherwise.} \\
I_g(r,c) &= I_1(r,c) \ \text{exor} \ I_2(r,c) && && (7)
\end{align*}
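
The image generation protocol can be sketched as follows. Several details are assumptions made for illustration: pixel indices are taken to be 1-based so that (26, 26) is the center of a 51 by 51 image, the 4 by 4 average is computed only where the kernel fits entirely inside the image (so the ramp image is 48 by 48), and all function names are hypothetical.

```python
import math
import random

def rho(r, c, theta_deg, R=26, C=26):
    """Signed distance of pixel (r, c) from the edge through (R, C)."""
    t = math.radians(theta_deg)
    return (r - R) * math.cos(t) + (c - C) * math.sin(t)

def step_image(theta_deg, size=51, i_min=100.0, i_max=200.0):
    """Ideal step edge image, equation (5), with 1-based pixel indices."""
    return [[i_min if rho(r, c, theta_deg) < 0 else i_max
             for c in range(1, size + 1)] for r in range(1, size + 1)]

def box_average(img, k=4):
    """k by k averaging over the region where the kernel fits (turns steps into ramps)."""
    rows, cols = len(img), len(img[0])
    return [[sum(img[r + dr][c + dc] for dr in range(k) for dc in range(k)) / (k * k)
             for c in range(cols - k + 1)] for r in range(rows - k + 1)]

def std(values):
    m = sum(values) / len(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / len(values))

def add_noise(img, snr_db, rng):
    """Add Gaussian noise so that 20 log10(sigma_s / sigma_n) equals snr_db."""
    sigma_s = std([v for row in img for v in row])
    sigma_n = sigma_s / (10.0 ** (snr_db / 20.0))
    noisy = [[v + rng.gauss(0.0, sigma_n) for v in row] for row in img]
    return noisy, sigma_n

def ground_truth(theta_deg, size=51):
    """One-pixel-wide ground truth edge: exclusive-or of two shifted step labellings."""
    out = []
    for r in range(1, size + 1):
        row = []
        for c in range(1, size + 1):
            p = rho(r, c, theta_deg)
            i1 = 0 if p < -0.5 else 1
            i2 = 0 if p < 0.5 else 1
            row.append(i1 ^ i2)
        out.append(row)
    return out
```

For a 0 degree edge, the averaged image passes through the five levels 100, 125, 150, 175, and 200 across the edge, and the ground truth marks a single row of 51 edge pixels.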

The operators employed included the gradient based operator (gradient computed using the slope facet model) and the morphological blur-minimum operator discussed in [3]. In the blur-minimum morphological edge detector a pixel is assigned an edge label if the edge strength computed is above a given threshold $T$. The edge strength $I_e$ is given by the equation:

$$I_e = \min\{I_i - erosion(I_i, disk(r)),\; dilation(I_i, disk(r)) - I_i\} \qquad (8)$$

where $I_i$ is the input image and $r$ is the radius of the disk that is used as the structuring element in the morphological erosion/dilation operations. We used 5 by 5 neighborhoods for the edge operator and the blur-minimum operator. The edge accuracy evaluation proceeded as follows. The edge pixel location error $E$ is defined as the distance along the gradient direction from the true edge pixel to the nearest labelled edge pixel (if one exists in the edge detector output). A given ground truth edge pixel is assumed to be missing in the detector output if there are no edge pixels in the detector output within an interval centered on the ground truth edge pixel. The interval is oriented along the gradient direction and the number of pixels in the interval is equal to the edge operator width. We will refer to this interval as the "valid zone" for each pixel.
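As an illustration of equation (8) and of the location error, the following sketch works on a one-dimensional profile taken along the gradient direction. It is a simplification under stated assumptions: a flat 1-D window stands in for the disk structuring element, the blur step of the operator in [3] is omitted, and the names `blur_min_strength` and `location_error` are hypothetical.

```python
def blur_min_strength(profile, radius):
    """Equation (8) on a 1-D profile: min(I - erosion(I), dilation(I) - I)."""
    strength = []
    for i, value in enumerate(profile):
        # Flat window of the given radius, clipped at the profile ends.
        window = profile[max(0, i - radius): i + radius + 1]
        strength.append(min(value - min(window), max(window) - value))
    return strength

def location_error(labels):
    """Distance from the zone center to the nearest label 1, or None if missed.

    `labels` holds the detector output sampled along the gradient direction
    over the valid zone, so its length equals the operator width (assumed odd).
    """
    center = len(labels) // 2
    hits = [abs(i - center) for i, v in enumerate(labels) if v]
    return min(hits) if hits else None

ramp = [100, 100, 100, 125, 150, 175, 200, 200, 200]
strength = blur_min_strength(ramp, radius=2)
labels = [int(s > 30) for s in strength]   # threshold T = 30
```

On this noise-free ramp the strength peaks at the ramp center, so a threshold of 30 labels only the center pixel and the location error is zero.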

In addition to the computation of edge pixel location error as given above, we also compute the following statistics from the output image. We visualize the edge and non-edge labellings encountered as one walks along the valid zone as a sequence of alternating 0 and 1 runs. We compute the mean and variance of the lengths of the gaps and the edge segments. In the ideal case, when there is no error, the edge segment lengths will have a mean value of 1 and a variance of zero, whereas the gap segment lengths will have a mean value equal to $\lfloor W/2 \rfloor$, where $W$ is the window operator neighborhood size. At low levels of edge gradient threshold the edge detector responses are thick regions and the edge segment length values may vary from 1 to $W$. The segment length and gap length statistics capture this aspect. Given the true ground truth segment, the edge segment length and gap length statistics, and a value for the probability of misdetection of the edge operator, we can generate a realization of the edge detector response by following the procedure outlined in the appendix.
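The run and gap statistics described above can be computed from a single valid zone as in the sketch below. The function name `run_length_stats` and the dictionary layout are assumptions made for illustration; in practice the lengths would be pooled over all valid zones before taking the mean and variance.

```python
from itertools import groupby

def run_length_stats(labels):
    """Mean and variance of edge-run (1) and gap (0) lengths in a 0/1 sequence."""
    runs = {0: [], 1: []}
    for value, group in groupby(labels):
        runs[value].append(len(list(group)))

    def mean_var(lengths):
        if not lengths:
            return None, None
        m = sum(lengths) / len(lengths)
        return m, sum((x - m) ** 2 for x in lengths) / len(lengths)

    return {"edge": mean_var(runs[1]), "gap": mean_var(runs[0])}

ideal = run_length_stats([0, 0, 1, 0, 0])   # W = 5, single-pixel response
thick = run_length_stats([0, 1, 1, 1, 0])   # low threshold, thick response
```

For the ideal zone with W = 5 the edge run has mean 1 and variance 0, and the gaps have mean 2, which is the floor of W/2.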

6  Plots  and  Discussion 

The results obtained from the experiments are given in Figures 1 through 6. The curves were obtained by taking the running mean of adjacent samples, with a window size of 5. The results shown in the plots were obtained after 10 replications.

Figures 1 and 4 illustrate how the mean length of the run of edge pixels varies with edge strength threshold for the morphological operator and the gradient based operator. It is clear from the plots that as the edge strength threshold is increased the run length drops to a value of 1. When the gradient threshold is high, fewer pixels are labelled as edge pixels in the output and hence the runs encountered are narrow. The plots also show that as the signal to noise ratio increases from -5 to +20 dB the slope of the curve increases. This is because the noise smooths the ideal run-length profile. Ideally, we expect the run length to be a linear function of the threshold (since the input consists of linear ramp edges).

Figures 2 and 5 illustrate how the mean gap length varies with the edge strength threshold. As expected, the mean gap length increases monotonically with the edge strength threshold. In the ideal case, we expect the mean gap length to be a linear function of the edge strength threshold; in the presence of a large degree of noise this ideal function is blurred.

Figures 3 and 6 illustrate how the mean edge pixel positional error varies with edge strength threshold. It is clear that the error drops to zero when the signal to noise ratio is high. When the signal to noise ratio is 0 or 5 dB the mean error is as much as 0.5 pixels. A comparison of the plots for the morphological and gradient based operators indicates that the gradient based scheme is superior for signal to noise levels of 0 dB and higher. The two schemes have comparable errors when the signal to noise ratio is -5 dB. The conclusion in [2] was that the morphological operator had superior false alarm vs misdetection characteristics. The experiments here point out that the morphological operator has poorer localization performance.

In a subsequent paper we will attempt to compare the empirical results obtained with theoretical results by utilizing the theoretical expressions for the mean pixel positioning error. The exact expression for the distribution of pixel error for the morphological edge operator discussed in [3] still needs to be worked out. We are also in the process of evaluating the noisy edge generation procedure, which utilizes statistics similar to those in our experiments, to see how closely it models errors that occur in real images.


Figure 1: Plot of mean edge run length vs edge strength threshold for various signal to noise ratios. Orientation of the true edge was 15 degrees; window size 5 by 5; morphological operator.

Figure 2: Plot of mean gap run length vs edge strength threshold for various signal to noise ratios. Orientation of the true edge was 15 degrees; window size 5 by 5; morphological operator.

Figure 3: Plot of mean pixel positional error vs edge strength threshold for various signal to noise ratios. Orientation of the true edge was 15 degrees; window size 5 by 5; morphological operator.

Figure 4: Plot of mean edge run length vs edge strength threshold (gradient threshold) for various signal to noise ratios. Orientation of the true edge was 15 degrees; window size 5 by 5; gradient operator.

Figure 5: Plot of mean gap run length vs edge strength threshold (gradient threshold) for various signal to noise ratios. Orientation of the true edge was 15 degrees; window size 5 by 5; gradient operator.

Figure 6: Plot of mean pixel positional error vs edge strength threshold (gradient threshold) for various signal to noise ratios. Orientation of the true edge was 15 degrees; window size 5 by 5; gradient operator.


7 Conclusion


In this paper we provided extensions of results provided in [1] and illustrated how one can relax the independence assumptions to derive random perturbation models that include the effects of correlation between neighboring gradient estimates. Under the assumption that the ideal data is corrupted with additive, independent Gaussian noise, we derived expressions that describe the relationship between an edge gradient estimate at a given location and an edge gradient estimate for a neighboring pixel. We illustrated how the line breakage process can be modeled as a Markov process whose parameters are functions of the true edge gradient, the edge operator's neighborhood size, and the noise variance. Furthermore, we derived theoretical expressions for the mean positional error as a function of the neighborhood operator window size, the noise variance, the width of the true ramp edge, and the true edge gradient. We also outlined an experimental protocol used for evaluating edge pixel positioning errors. The results from the experiments illustrate that gradient based edge schemes are superior (when edge localization is of interest) to the morphological scheme discussed in [3]. We also provided an algorithm to generate synthetic noisy edge images that utilizes the statistics used in the experiments.


References 

[1] V. Ramesh and R. M. Haralick, "Random Perturbation Models and Performance Characterization in Computer Vision," Proceedings of the CVPR conference, June 1992.

[2] V. Ramesh and R. M. Haralick, "Performance evaluation of edge operators," Special Session on Performance Evaluation of Modern Edge Operators, Orlando Machine Vision and Robotics Conference, 20-24 April 1992.

[3] J. S. Lee, R. M. Haralick, and L. G. Shapiro, "Morphologic Edge Detection," IEEE Journal of Robotics and Automation, Vol. RA-3, No. 2, April 1987.




8  Appendix  1 


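We assume that the gray values in each neighborhood in the input image can be approximated by computing a planar fit. We assume further that each pixel in the input image is corrupted with additive Gaussian noise, $\eta(R, C)$, with zero mean and variance $\sigma^2$. Let $\hat{\alpha}_{R,C}$, $\hat{\beta}_{R,C}$, $\hat{\gamma}_{R,C}$ denote the estimates for the coefficients of the best fitting plane that approximates the N by M neighborhood surrounding the center pixel specified by row and column coordinates $(R, C)$. In this appendix we derive expressions describing the relationship between the estimates $\hat{\alpha}_{R+k,C+j}$ and $\hat{\alpha}_{R,C}$ for $\alpha_{R+k,C+j}$ and $\alpha_{R,C}$. Let $\alpha$, $\beta$, and $\gamma$ be the true plane coefficients. It can be easily shown that $\hat{\alpha}_{R,C}$ and $\hat{\beta}_{R,C}$ are equal to:

$$\hat{\alpha}_{R,C} = \alpha + \frac{\sum_{r=-N}^{N} \sum_{c=-M}^{M} r\,\eta(R+r, C+c)}{\sum_{r=-N}^{N} \sum_{c=-M}^{M} r^2} \qquad (9)$$

$$\hat{\beta}_{R,C} = \beta + \frac{\sum_{r=-N}^{N} \sum_{c=-M}^{M} c\,\eta(R+r, C+c)}{\sum_{r=-N}^{N} \sum_{c=-M}^{M} c^2} \qquad (10)$$

Now, we have:

$$\hat{\alpha}_{R+k,C+j} = \alpha + \frac{\sum_{r=-N}^{N} \sum_{c=-M}^{M} r\,\eta(R+k+r, C+j+c)}{\sum_{r=-N}^{N} \sum_{c=-M}^{M} r^2} \qquad (11)$$

The difference, $\hat{\alpha}_{R+k,C} - \hat{\alpha}_{R,C}$, is given by:

$$\frac{1}{\sum_{r=-N}^{N} \sum_{c=-M}^{M} r^2} \left[ \sum_{C'=C-M}^{C+M} \sum_{R'=R-N}^{R-N+k-1} (R - R')\,\eta(R',C') + \sum_{C'=C-M}^{C+M} \sum_{R'=R+N+1}^{R+N+k} (R' - R - k)\,\eta(R',C') - k \sum_{C'=C-M}^{C+M} \sum_{R'=R-N+k}^{R+N} \eta(R',C') \right] \qquad (12)$$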


If we visualize the two overlapping neighborhoods, the first term in the above equation corresponds to the difference contributed by the nonoverlapping portion of the left mask, the second term corresponds to the contribution from the nonoverlapping portion of the right mask, and the third term corresponds to the contribution from the overlapped portion of the two masks.

Similarly, the difference $\hat{\beta}_{R+k,C} - \hat{\beta}_{R,C}$ can be shown to be the ratio of

$$\sum_{R'=R+N+1}^{R+N+k} \sum_{C'=C-M}^{C+M} (C' - C)\,\eta(R',C') - \sum_{R'=R-N}^{R-N+k-1} \sum_{C'=C-M}^{C+M} (C' - C)\,\eta(R',C') \qquad (13)$$

to $\sum_{r=-N}^{N} \sum_{c=-M}^{M} c^2$.


The above results can be summarized in a compact fashion using matrix notation. Let $V_{\hat{\alpha}}$ denote a vector consisting of all the estimated $\hat{\alpha}$'s. Let $A$ denote the matrix whose rows contain the values of the kernel used to estimate $\hat{\alpha}$. Let $E_{\eta}$ denote the vector consisting of the additive Gaussian random variables $\eta(R, C)$. It can easily be seen that $V_{\hat{\alpha}} = A E_{\eta} + \alpha \mathbf{1}$. For example, the matrix $A$ for $(\hat{\alpha}_{R,C}, \hat{\alpha}_{R+1,C})'$ is given by the transpose of the following matrix:

$$S \begin{pmatrix} -N & 0 \\ -N+1 & -N \\ \vdots & \vdots \\ -1 & -2 \\ 0 & -1 \\ 1 & 0 \\ \vdots & \vdots \\ N & N-1 \\ 0 & N \end{pmatrix} \qquad (14)$$

where $S = 1 / \sum_{r=-N}^{N} \sum_{c=-M}^{M} r^2$ and the vector of $\eta$'s is given by $(\eta_{R-N,C}, \ldots, \eta_{R+N+1,C})'$.

If the $\eta$'s are Gaussian random variables with zero mean and covariance matrix $\Sigma$, then the vector $V_{\hat{\alpha}}$ is distributed as a multivariate Gaussian random variable with mean $\alpha \mathbf{1}$ and covariance matrix $A \Sigma A'$.
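To make the covariance result concrete, the sketch below builds the two kernel rows of $A$ for neighboring estimates and evaluates $A \Sigma A'$ for independent noise. It is a simplified illustration: it assumes a single column of pixels (so the kernel is one-dimensional), $\Sigma = \sigma^2 I$, and the helper names `alpha_kernels` and `covariance` are hypothetical.

```python
def alpha_kernels(n):
    """Rows of A for (alpha_R, alpha_{R+1}) over the samples eta_{R-n} .. eta_{R+n+1}."""
    s = 1.0 / sum(r * r for r in range(-n, n + 1))
    ramp = [s * r for r in range(-n, n + 1)]
    # The second kernel is the first one shifted by one pixel.
    return [ramp + [0.0], [0.0] + ramp]

def covariance(a, sigma2):
    """A Sigma A' for Sigma = sigma2 * identity (independent noise)."""
    return [[sigma2 * sum(x * y for x, y in zip(row_i, row_j)) for row_j in a]
            for row_i in a]

cov = covariance(alpha_kernels(2), sigma2=1.0)
correlation = cov[0][1] / cov[0][0]
```

With a five-sample kernel and unit noise variance, each estimate has variance 0.1 and neighboring estimates have covariance 0.04, so even independent pixel noise yields correlated gradient estimates.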


9 Appendix 2 - Procedure for generation of Noisy Edge images

The procedure for generation of noisy edge response images is explained below. First a ground truth edge image is created. This image is then perturbed to obtain the noisy edge image. Let

G(r,c) denote the pixel values in the ground truth ideal edge image,
WSIZE denote the edge operator width,
PMIS denote the probability of misdetection,
Gap-mean denote the mean gap length,
Gap-variance denote the gap length variance,
Edge-run-mean denote the mean edge run length,
Edge-run-variance denote the edge run length variance.

Then:

for each pixel (R,C) in ground truth image
{
  if ( G(R,C) == EDGELABEL )
  {
    /* Ground truth pixel is an edge pixel */

    X = uniform();   /* Generate a uniform random number between 0 and 1 */

    if ( X >= PMIS )
    {
      /* Edge pixel is to be perturbed according to the edge run and
         gap statistics */

      /* Currently, there are two modes of noisy edge generation. Mode 1
         corresponds to the generation of gaps and edge runs in the
         direction of edge gradient when the user provides the mean and
         variance values for the edge run lengths and gap lengths. */

      if ( MODE1 )
      {
        /* Generate a gap that is of size less than the width of the edge
           operator. The gap has to satisfy this constraint because the
           edge pixel is deemed to lie within the edge zone line (oriented
           along the edge gradient direction and of width WSIZE pixels).
           Gsample is a procedure that generates a sample from a Gaussian
           distribution. */

        AX[0] = Gsample(Gap-mean, Gap-variance);
        while ( AX[0] >= WSIZE )
          AX[0] = Gsample(Gap-mean, Gap-variance);

        AY[0] = Gsample(Edge-run-mean, Edge-run-variance);
        CURPOS = AX[0] + AY[0];

        if ( CURPOS > WSIZE )
        {
          /* Output data consists of a gap followed by a truncated
             edge run */
        }
        else
        {
          i = 1;
          while ( CURPOS < WSIZE )
          {
            AX[i] = Gsample(Gap-mean, Gap-variance);
            AY[i] = Gsample(Edge-run-mean, Edge-run-variance);
            CURPOS += (AX[i] + AY[i]);
            i++;
          }
          /* Output data consists of alternating gaps followed by runs */
        }
      }

      /* MODE2 corresponds to the generation of an edge pixel run from
         the specification of the edge positional errors. */

      if ( MODE2 )
      {
        X = Gsample(Edge-run-mean, Edge-run-variance);
        Y = Gsample(pixel-error-mean, pixel-error-variance);

        /* Output data now consists of a run of length X centered around
           (R,C) + (r',c'). The values r',c' are the coordinates of the
           pixel that is of distance Y from (R,C) along the line (oriented
           along the gradient direction). */
      }
    }
    else { /* Edge pixel was missed */ }
  }
}  /* End of "for each pixel in ground truth image" */
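A runnable sketch of Mode 1 for a single valid zone is given below. It is one possible reading of the pseudocode rather than the authors' implementation: the names `gsample`, `mode1_zone` and `perturb_pixel` are hypothetical, Gaussian samples are rounded to whole pixels and clipped at zero, every edge run is forced to be at least one pixel long, and a pixel is treated as missed with probability PMIS. The gap mean is assumed to be well below the operator width, since otherwise the resampling loop may run for a long time.

```python
import random

def gsample(mean, variance, rng):
    """Gaussian sample rounded to a whole number of pixels and clipped at 0."""
    return max(0, round(rng.gauss(mean, variance ** 0.5)))

def mode1_zone(wsize, gap_mean, gap_var, run_mean, run_var, rng):
    """Fill one valid zone of `wsize` pixels with alternating gaps (0) and edge runs (1)."""
    zone = []
    while len(zone) < wsize:
        gap = gsample(gap_mean, gap_var, rng)
        while gap >= wsize:                 # the gap must be shorter than the operator width
            gap = gsample(gap_mean, gap_var, rng)
        run = max(1, gsample(run_mean, run_var, rng))
        zone.extend([0] * gap + [1] * run)
    return zone[:wsize]                     # truncate whatever spills past the zone

def perturb_pixel(wsize, p_mis, stats, rng):
    """Return None if the edge pixel is missed, else the labels along its valid zone."""
    if rng.random() < p_mis:
        return None
    return mode1_zone(wsize, *stats, rng)

rng = random.Random(1)
zone = mode1_zone(5, 2.0, 0.0, 1.0, 0.0, rng)   # zero variance: deterministic pattern
```

With zero variances the zone is a fixed pattern of two-pixel gaps and one-pixel runs, which makes the truncation at WSIZE easy to inspect.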
Shape from Shadows under Error

David Yang and John R. Kender*


0  Abstract 

Kender [Kender and Smith, 1987] presented a method for obtaining shape information from shadows cast at multiple angles of illumination. While the method is correct for perfect data, it may not converge to a solution as a result of noise in the data, digitization of the image, or thresholding errors when determining the shadows. This work extends this shape from shadows algorithm to work in the presence of these factors. The constraints in the original algorithm use trigonometric rules to relate the heights of discrete points (i.e., pixels) based on whether or not the points are in shadow. We supplement these constraints with rules on how shadows in one image constrain those in other images of the same scene and with the same viewpoint, but with different illumination angles. A heuristic is devised to filter the data to satisfy the constraints and produce a solution, and some results are presented for errorful synthetic and real images.

1  Introduction 

While shadows hide information in parts of an image (e.g., see [Beckmann, 1965]), researchers have also taken advantage of the relationship between a shadow and the region of the image casting the shadow. The various aspects of this relationship have been used for boundary segmentation [Hambrick and Loew, 1985], spline estimation of surface shape [Hatzitheodorou, 1990], structure hypothesis verification [Irvin and McKeown, 1989], bounding of a discrete surface profile [Kender and Smith, 1987], surface roughness measurements [Maerz et al., 1990], structure height estimation [Paine, 1981], scene registration and stereo matching [Perlant and McKeown, 1990], surface gradient measurement under limited conditions [Shafer, 1985], object boundary determination [Thompson et al., 1987], line drawing interpretation [Waltz, 1975], and microscopic object size measurement [Williams and Wyckoff, 1944].

Unlike intensity-based algorithms such as shape from shading, shadow-based algorithms do not require knowledge of the reflectance properties of the surface of interest. Implicit in this is the ability to handle surfaces with non-uniform reflectance properties, though multi-colored surfaces and highly specular surfaces may require adjustments in the thresholding process. Below, we give more details on how shadows have been used to determine object heights and surface profiles.

1.1  Shape  from  shadows 

Given the direction of the sun and a reference surface a known distance from the camera, the height of structures on this surface can be determined by basic trigonometry [Paine, 1981]. Specifically, the height of a structure casting a shadow equals the length of the shadow multiplied by $\cot(\theta_i)$, where $\theta_i$ is the incident angle of illumination.
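As a small worked example of this rule, the sketch below computes a height from a shadow length. The function name `structure_height` is hypothetical, and the incident angle is taken in radians and measured from the vertical, so that $\pi/2$ corresponds to light on the horizon.

```python
import math

def structure_height(shadow_length, incident_angle):
    """Height = shadow length * cot(theta_i), with theta_i in radians from the vertical."""
    return shadow_length / math.tan(incident_angle)

# A 10 pixel shadow under 45 degree illumination implies a height of about 10 pixels.
height = structure_height(10.0, math.pi / 4)
```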

Kender and Smith [Kender and Smith, 1987] use multiple lighting positions and trigonometric rules to generate a profile of a surface based on self-shadowing. Their algorithm generalizes the above rule to handle the possibility of self-shadowing and enhances the algorithm with 3 additional constraints. The extra constraints permit the estimation of heights over the entire surface, rather than just the few peak points which can be estimated with the single rule. The only restrictions placed on the surface are that it can be described by a function $z(x, y)$ and that it is on a reference surface. The first requirement supports a simple model of the surface, while the latter requirement allows the accurate determination of the incident angles of illumination.

Since this paper will use much of the terminology of the previous work, we will summarize it here. As can be seen in one possible set-up in fig. 1, the surface of interest is placed on a reference surface, and lighting is placed at several angles in the xz plane and the yz plane. Ideally, the lighting should be collimated and far enough from the surface to approximate lighting from an infinite distance. Intuitively, the lighting in the xz plane is similar to discrete positions of the sun at the equator; thus, east refers to the positive direction along the x-axis, and north, south, and west refer to their respective directions. This also inspired the term suntrace, which is the bit matrix (or vector if only a single row or column of the image is being reconstructed) for a given image with a specified illuminant position indicating whether each pixel is shadowed or unshadowed. Equivalently, a suntrace is the overhead view of the thresholded image. Fig. 2b shows the suntrace corresponding to the situation in fig. 2a.

Fig. 2a shows a sample one-dimensional curve with illumination from the west. Since $x_2$ is in shadow, and $x_1$ is the closest unshadowed point to the west, $x_1$ is said to shadow $x_2$. Furthermore, if $x_2$ is not shadowed for any higher western illumination, $x_1$ is called the last shadower of $x_2$. From here on, the last shadower will just be called the shadower. Since $x_1$ is also the closest unshadowed point to the west of the unshadowed $x_3$, if $x_3$ is shadowed for all lower western illuminations, $x_1$ is called the failing shadower of $x_3$. For the given shadow region between $x_1$ and $x_3$, $x_2$ is called the last shadowed point, while $x_3$ is called the first unshadowed point. $x_0$ marks the western boundary of the image, while $x_4$ marks the eastern boundary.

The 4 constraints can be classified as upper or lower, depending on whether they relate to upper or lower bounds on the height. A forward-upper constraint uses the bound on a shadower to constrain the bound on a shadowed point. A forward-lower constraint does likewise for a failing shadower and a point it fails to shadow. Backward constraints work in the opposite direction. The constraints are as follows:

forward-upper: $u(x) \le u(ls(x)) - |ls(x) - x| \cdot sls(x)$
forward-lower: $l(x) \ge l(fs(x)) - |fs(x) - x| \cdot sfs(x)$
backward-upper: $u(fs(x)) \le u(x) + |fs(x) - x| \cdot sfs(x)$
backward-lower: $l(ls(x)) \ge l(x) + |ls(x) - x| \cdot sls(x)$

where

$u(x)$ = current upper bound on $x$
$l(x)$ = current lower bound on $x$
$ls(x)$ = shadower of $x$
$fs(x)$ = failing shadower of $x$
$sls(x)$ = slope of illumination when $x$ is last shadowed
$sfs(x)$ = slope of illumination when $x$ is first unshadowed
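The four constraints translate directly into bound updates. The sketch below is an illustration under stated assumptions, not the original implementation: bounds are stored in dictionaries keyed by pixel position, each function returns the tightened bound for the point being constrained, and all function and argument names are hypothetical.

```python
INF = float("inf")

def forward_upper(u, x, ls_x, sls_x):
    """u(x) <= u(ls(x)) - |ls(x) - x| * sls(x): tightened upper bound for x."""
    return min(u[x], u[ls_x] - abs(ls_x - x) * sls_x)

def forward_lower(l, x, fs_x, sfs_x):
    """l(x) >= l(fs(x)) - |fs(x) - x| * sfs(x): tightened lower bound for x."""
    return max(l[x], l[fs_x] - abs(fs_x - x) * sfs_x)

def backward_upper(u, x, fs_x, sfs_x):
    """u(fs(x)) <= u(x) + |fs(x) - x| * sfs(x): tightened upper bound for fs(x)."""
    return min(u[fs_x], u[x] + abs(fs_x - x) * sfs_x)

def backward_lower(l, x, ls_x, sls_x):
    """l(ls(x)) >= l(x) + |ls(x) - x| * sls(x): tightened lower bound for ls(x)."""
    return max(l[ls_x], l[x] + abs(ls_x - x) * sls_x)

# Point 0 has known height 0; it last shadows point 4 at slope 1/2
# and first fails to shadow point 6 at slope 1/4.
u = {0: 0.0, 4: INF, 6: INF}
l = {0: 0.0, 4: -INF, 6: -INF}
u[4] = forward_upper(u, 4, 0, 0.5)
l[6] = forward_lower(l, 6, 0, 0.25)
```

In this small example the shadowed point 4 can be no higher than -2, while the unshadowed point 6 can be no lower than -1.5.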

Fig.  3  shows  synthesized  suntrace  data  for  a  sim¬ 
ple  curve  and  the  resulting  bounds  on  the  shape. 

Hatzitheodorou [Hatzitheodorou, 1990] worked on the related problem of finding the continuous function $f(x, y)$ which satisfies the forward-upper constraints given by the shadow data and minimizes the possible error. His approach was to approximate the actual surface in a shadow region with a spline function. Noise is handled by considering a further constraint on the shape of the solution, e.g., a smoothness constraint.

1.2  Outline  for  rest  of  paper 

While the shape from shadows algorithm always converges and produces correct bounds on the surface shape when the data is perfect, it may not converge when there are errors in the suntraces. As will be shown, even an error of a single pixel is enough to prevent convergence in the unmodified algorithm. The next section explains the problems faced by the algorithm when dealing with errors in the data. Section 3 describes the heuristic. Section 4 presents results using both synthetic and real imagery. The final section summarizes and suggests future work.

2  Non-convergence 

The relaxation algorithm described in [Kender and Smith, 1987] propagates constraints until no further changes in any bounds are possible. Inconsistencies between 2 or more images, however, may result in a cycle of constraints with a point indirectly constraining its own upper bound to be lower, or its lower bound to be higher. To be concise, from here on, only cycles which indicate such a multiple-point inconsistency will be called cycles. Below, we give an example cycle and then cover two special cases.

2.1  Example  cycle 

Fig. 4c shows a simple cycle with three points extracted from the suntrace data in Fig. 4a-b. For purposes of the example, assume the upper bound on $x_1 = 3$ is initially 0, and the upper bound on all other points is $+\infty$. The first constraint in the cycle shows $x_1 = 3$ shadowing $x_2 = 25$ when $\theta_i = \mathrm{arccot}(1/3)$. Since $x_1$ has an upper bound of 0, the forward constraint yields a new upper bound on $x_2$ of $-7\frac{1}{3}$. The second constraint is another forward constraint, this time with the light on the eastern horizon, producing an upper bound on $x_3 = 13$ of $-7\frac{1}{3}$. The last constraint is backward-upper since $x_4 = x_1 = 3$ is the failing shadower for 13 when $\theta_i = \mathrm{arccot}(2/3)$ and the illumination is from the west. The result is a new upper bound on $x_4 = x_1$ of $-\frac{2}{3}$. This will prevent termination of the algorithm unless adjustments to the data are made. We call $-\frac{2}{3}$ the error in the cycle.
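The arithmetic of this cycle can be checked with exact fractions. The sketch below simply applies the forward-upper and backward-upper rules to the three points named in the text; the dictionary `u` and the variable `cycle_error` are illustrative names, and the slope of illumination is taken as $\cot(\theta_i)$, so the horizon corresponds to slope 0.

```python
from fractions import Fraction

INF = float("inf")
u = {3: Fraction(0), 25: INF, 13: INF}      # upper bounds keyed by pixel position

# x1 = 3 shadows x2 = 25 at slope cot(theta_i) = 1/3 (forward-upper).
u[25] = min(u[25], u[3] - abs(3 - 25) * Fraction(1, 3))
# x2 = 25 shadows x3 = 13 with the light on the eastern horizon, slope 0 (forward-upper).
u[13] = min(u[13], u[25] - abs(25 - 13) * Fraction(0))
# x1 = 3 is the failing shadower of 13 at slope 2/3 (backward-upper).
u[3] = min(u[3], u[13] + abs(3 - 13) * Fraction(2, 3))

cycle_error = u[3] - Fraction(0)            # change in the bound after one trip round the cycle
```

Each further trip around the cycle lowers the bound on $x_1$ by the same amount, which is why propagation never terminates on this data.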

2.2  Mutual  shadowers 

If we are trying to reconstruct a curve, two points cannot shadow each other (for different light positions) unless they each shadow the other when the light is on one of the opposing horizons. Otherwise, the forward-upper or backward-lower constraints would form a cycle. Intuitively, such an error would imply that one of the points was higher than the other while the other was at least as high. Equivalently, this constraint implies mutual shadowing is valid only when the two points are of the same height. This is reflected in the equations for the forward-upper and backward-lower constraints.

In fig. 3, the mutual shadowing of points $x_1 = 0$ and $x_2 = 3$ can be seen in the suntraces for $\theta_i = \pi/2$ for western and eastern illuminations. Since both suntraces are for illumination at the horizon, there is no error. The reconstruction indeed estimates both points to be of the same height; note that upper and lower bounds converge at both points. Thus, when valid, mutual shadowing is useful in tightening the bounds.

2.3  Single-point  inconsistencies 

If $x_1$ is unshadowed with the light on a given side of the surface (e.g., the right side), then for any higher light position on the same side, $x_1$ cannot become shadowed again. Fig. 4a shows a case with the western suntraces for $\theta_i = \mathrm{arccot}(2/3)$ and $\theta_i = \mathrm{arccot}(1)$ at $x_1 = 15$. By summing the effect of the forward-upper constraint when $x_1$ is shadowed with the effect of the backward-upper constraint when $x_1$ is unshadowed, it can be seen that the upper bound of $x_1$ and its shadower ($x_2 = 14$) will be lowered indefinitely. Note that in a different surface model, $x_1$ could legitimately become shadowed again. To see why, consider standing up a slice of swiss cheese on its end and moving a light source vertically up past it on one side. The holes in the cheese will result in unshadowed regions within the shadow cast by the cheese. As the light moves up, these unshadowed regions will become shadowed.
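The indefinite lowering can be seen by iterating the two constraints. The sketch below rests on an inference that the text does not state explicitly: the point at 15 is taken to be unshadowed at the lower light position (slope 2/3) and shadowed by 14 at the higher one (slope 1), which is the combination that makes the data inconsistent. The names `u` and `history` are illustrative.

```python
from fractions import Fraction

u = {14: Fraction(0), 15: Fraction(0)}      # upper bounds keyed by pixel position
history = []
for _ in range(3):
    # forward-upper: 14 shadows 15 at slope 1 (assumed pairing, see lead-in)
    u[15] = min(u[15], u[14] - abs(14 - 15) * Fraction(1))
    # backward-upper: 14 is the failing shadower of 15 at slope 2/3
    u[14] = min(u[14], u[15] + abs(14 - 15) * Fraction(2, 3))
    history.append(u[14])
```

Under these assumptions every pass lowers both bounds by 1/3, so no fixed point is ever reached.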

3  Adjustment  Heuristic 

The chosen adjustment heuristic localizes the blame for an error. It starts by correcting mutual shadowing and single-point inconsistent shadowing errors. A local maximum is chosen as the starting point, and both the upper and lower bounds for this point are set to 0. From this point, constraint propagation as in [Kender and Smith, 1987] is performed. Each time a constraint changes a bound on a point, a check for a cycle in the constraints is done. If a cycle is detected, then a point in the cycle is chosen, and the shadowing data for that point is adjusted to either fix the cycle or at least alleviate the problem. The algorithm is explained in detail for processing data for a single vertical slice, and considerations for handling the whole surface are covered as well.

3.1  Initial  adjustments 

It is noted that when the light is incident from a given direction, all pixels along that boundary of the image should be unshadowed. If not, this may indicate that these pixels are shadowed by an off-screen object, which could produce errors or overly loose bounds in the results. We assume no such shadowing occurs and mark all boundary pixels on a given side of the image as unshadowed for all suntraces with illumination from that direction.

The next step fixes single-point inconsistencies. Here again there is a problem: if an unshadowed point becomes shadowed again, it is impossible to determine exactly when the point should first be unshadowed in the absence of further information. Favoring the unshadowed status will tend to create reconstructions with spikes. Favoring the shadowed status will tend to create holes in the reconstruction. To avoid both, it may be necessary to first use a low-pass filter on points involved in single-point inconsistencies. Currently, we favor the unshadowed status and do no filtering.
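Favoring the unshadowed status can be expressed as a one-pass rule over the shadow flags of a single pixel. In the sketch below the function name `favor_unshadowed` and the flag convention (1 for shadowed, 0 for unshadowed, ordered from the lowest to the highest light position on one side) are assumptions made for illustration.

```python
def favor_unshadowed(states):
    """Remove single-point inconsistencies by keeping a pixel unshadowed once it is unshadowed.

    `states` holds the shadow flags of one pixel (1 = shadowed, 0 = unshadowed),
    ordered from the lowest to the highest light position on one side of the surface.
    """
    fixed, seen_unshadowed = [], False
    for s in states:
        seen_unshadowed = seen_unshadowed or s == 0
        fixed.append(0 if seen_unshadowed else s)
    return fixed

repaired = favor_unshadowed([1, 1, 0, 1, 0])   # the second-to-last flag is inconsistent
```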

For  mutual  shadowing,  it  is  usually  impossible 
to  know  which  pixel  is  wrong.  If  only  one  of  the 
points  is  on  the  image  boundary,  it  can  be  rea¬ 
sonably  assumed  that  the  other  point  is  wrong, 
though  it  is  possible  that  there  should  be  a  shad- 
ower  of  one  of  these  points  in  between  them. 
Otherwise,  our  current  approach  arbitrarily  fa¬ 
vors  the  point  closer  to  the  image  boundary  for 
the  direction  in  which  it  is  unshadowed.  This 
point  is  then  marked  as  unshadowed  for  the  sun- 
traces  in  which  it  was  originally  marked  as  shad¬ 
owed  by  the  other  point.  Note  that  this  does  not 
risk creation of a single-point inconsistency since
it  marks  a  point  as  unshadowed  in  all  suntraces 
with  the  light  to  one  side  of  the  surface. 

If  lighting  from  both  east-west  and  north-south 
directions  is  used,  mutual  shadowing  can  involve 
many  more  than  two  points  and  constraints,  but 
this  can  become  very  complicated  to  handle  as  a 
special  case.  We  leave  such  cases  to  the  general 
processing  of  multiple-point  inconsistencies. 

3.2  Adjusting  a  shadow  edge 

We  first  note  that  for  the  1-dimensional  case  of  a 
single  vertical  slice,  the  only  points  that  need  to 
be  considered  are  those  points  which  are  shad- 
owers,  last  shadowed  points  or  first  unshadowed 
points  in  at  least  one  suntrace.  As  a  result  of 
the  assumption  that  no  points  are  shadowed  by 


off-screen  objects,  this  includes  all  points  along 
any  image  boundary  in  the  direction  of  one  of 
the  light  positions  used.  The  bounds  for  all 
other  points  can  be  interpolated  from  the  bounds 
for these points. This can produce a substantial
speedup for relatively smooth surfaces (i.e.,
with few shadowers) and few illuminant positions
(i.e., with few last shadowed or first unshadowed
points).

An  adjustment  for  a  forward-upper  constraint 
can  be  made  by  shortening  the  length  of  the 
shadow  (starting  at  the  end  away  from  the  shad- 
ower),  thus  raising  the  upper  bound  of  the  newly 
unshadowed  points  and,  when  constraints  are 
propagated,  of  all  the  other  points  in  the  cy¬ 
cle.  Similarly,  shortening  the  shadow  involved 
in  a  backward-lower  constraint  will  decrease  the 
lower  bound  of  the  failing  shadower.  On  the 
other  hand,  adjustments  for  either  forward-lower 
or  backward-upper  constraints  require  lengthen¬ 
ing  the  shadow. 

The  amount  by  which  to  shorten  or  lengthen  a 
shadow  depends  on  the  total  amount  by  which 
the  cycle  lowers  the  upper  bound  (or  raises  the 
lower  bound)  of  each  point  in  the  cycle,  the  light 
position  for  the  given  constraint,  adjacent  light 
positions  from  the  same  direction,  and  possi¬ 
bly  the  light  position  of  the  following  constraint 
in  the  cycle.  For  example,  consider  a  point  x 
which  is  constrained  by  a  forward-upper  con¬ 
straint  from  the  east,  and  x  constrains  the  next 
point  in  the  cycle  with  a  forward-upper  con¬ 
straint  from  the  north.  Shrinking  the  shadow 
for  the  eastern  suntrace  will  leave  x  shadowed 
at  the  next  lower  light  position.  If  x's  shadower 
before the adjustment is still x's shadower, the
effect  of  the  adjustment  on  the  error  in  the  cy¬ 
cle  is  based  on  the  difference  between  the  slopes 
of  both  light  positions.  On  the  other  hand,  if 
x constrains the next point in the cycle with a
backward-upper  constraint  to  the  west,  the  effect 
of  shrinking  the  shadow  for  the  eastern  suntrace 
can  easily  be  greater  than  in  the  previous  case. 
Nonzero  slopes  of  the  failing  shadowers  to  the 
west  for  the  points  which  are  newly  marked  as 
unshadowed  will  raise  the  upper  bound  of  the 
next  point  in  the  cycle  even  more. 

Since each constraint is associated with a particular
light position, any adjustment to the data
for this light position should be made compatible
with data for other light positions. If a shadow
is to be shrunk, the newly unshadowed points
must not be shadowed for any higher light position
in the same direction, to avoid creating a
single-point inconsistency. If a shadow is to be
expanded, a corresponding check must be done
with respect to lower light positions.
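
The two compatibility checks can be written as a pair of small predicates. This is a sketch under stated assumptions: the suntraces for one illumination side are rows of booleans ordered from the lowest to the highest light position (True means shadowed), and the function names are hypothetical rather than taken from the paper.

```python
from typing import List


def can_unshadow(suntraces: List[List[bool]], k: int, x: int) -> bool:
    """True if pixel x may be marked unshadowed at light position k.

    Shrinking a shadow is safe only when x is already unshadowed at
    every higher light position from the same direction.
    """
    return all(not suntraces[h][x] for h in range(k + 1, len(suntraces)))


def can_shadow(suntraces: List[List[bool]], k: int, x: int) -> bool:
    """True if pixel x may be marked shadowed at light position k.

    Expanding a shadow is safe only when x is already shadowed at
    every lower light position from the same direction.
    """
    return all(suntraces[low][x] for low in range(k))
```

An adjustment routine would call one of these for each pixel it intends to flip and skip, or propagate, the change when the check fails.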

3.3  Choosing  edge  to  adjust 

To  minimize  the  amount  of  shadow  data  ad¬ 
justed,  points  involved  in  constraints  of  steeper 
slopes  are  preferred  since  small  adjustments  pro¬ 
duce  bigger  changes  in  the  bounds.  It  is  as¬ 
sumed  that  the  position  of  shadower  points  is 
more  likely  to  be  correct  than  the  position  of 
the far shadow boundary. Consequently, adjustments
are not made to shadower points after the
mutual  shadowing  errors  are  fixed.  This  implies 
an  adjustment  will  not  create  a  shadower. 

3.4  From  one  dimension  to  two 

Once both north-south and east-west illuminations
are used, it is possible to determine a true
2-D map of z(x,y). The 4 trigonometric constraints
must now be applied in the 2 extra directions
along the new axis. While the extension
of  the  basic  height  estimation  algorithm  to  2  di¬ 
mensions  is  trivial,  the  interaction  of  these  con¬ 
straints  does  complicate  the  heuristic.  Specifi¬ 
cally,  when  constraints  along  two  axes  are  used, 
cycles  may  have  to  include  points  which  do  not 
belong  to  the  set  of  shadowers,  last  shadowed 
points,  or  first  unshadowed  points.  This  can 
happen  when  consecutive  constraints  in  a  cycle 
operate  along  different  axes.  It  is  still  desirable 
not  to  adjust  (or  create)  shadowers,  so  when  a 
point  in  the  middle  of  the  shadow  is  in  a  cycle, 
it  is  necessary  to  find  the  last  shadowed  point  of 
the  shadow  at  which  to  make  the  adjustment. 

4  Experimental  results 

For the real images, we used a set-up similar
to that in fig. 1, using uncollimated illumination.
We have successfully used the heuristic on
both synthetic and real errorful data for 1 dimension,
and on synthetic errorful data for 2 dimensions.
For the synthetic images, ideal suntraces
were derived from a discrete curve or surface.


Errors were then introduced into the suntraces,
and the corrupted suntraces were used as input
to the modified algorithm. As described earlier,
fig. 4 shows data for a simple step-function curve
where there is a multiple-point inconsistency. By
lengthening the shadow region in the western
suntrace where θ_i = arccot(2/3) by just 1 pixel
to the right, the reconstruction in fig. 4d can
be obtained. Fig. 5 shows an 8x8 2-dimensional
case. Again, the fix is a very slight change.

Figs. 6-8 show sets of 1-dimensional reconstructions
along the east-west direction for a wooden
half-arch, some egg carton-shaped packaging
material, and a crumpled sheet of aluminum foil,
materials with very different reflectance properties.
For clarity, not all row reconstructions
are shown. The images of the original objects
have illumination incident from the left at
arccot(1/3). The aperture was generally set a little
too wide to facilitate thresholding. The angles
of incident light used were π/2, arccot(1/3),
arccot(2/3), arccot(1), and arccot(10).

For the rows tested for the wooden block, there
were no multiple-point inconsistencies for some
of the rows north and south of the block, and
from 2 to 7 such inconsistencies in the other rows.
On a lightly loaded Sun 4, the CPU time needed
for one row, as measured by the Unix time command,
was no more than 0.3 seconds. The real-world
time needed was in all cases no more
than 2.3 seconds. For the packaging material,
the number of multiple-point inconsistencies
detected varied from 31 to 151, though some cycles
were found more than once. The CPU time for
processing one row varied from 2 to 15 seconds,
and the real-world time varied from 13 seconds
to 1 minute and 22 seconds. For the foil, the
number of cycles detected varied from 3 to 76.
The CPU time for processing one row varied from
1 to 7 seconds, and the real-world time varied
from 9 to 42 seconds. The upper bounds are
generally more accurate than the lower bounds,
especially for certain points on the far left and
far right sides of the reconstruction. Points on
one side of the image are sometimes in shadow
for all light positions from the other side. Such
a point will not be constrained by any
forward-lower constraint. If the point does not shadow
any points, it will not be constrained by any
backward-lower constraint, either, in which case
the lower bound will remain at −∞.

It  is  unsurprising  that  the  packaging  material 
caused  the  most  trouble.  The  material  is  spongy, 
so  besides  the  general  sinusoidal  shape,  there  is 
roughness  along  the  entire  surface.  This  kind 
of  texture,  plus  the  darker  color  of  the  material, 
leads  to  more  thresholding  errors.  There  are  also 
more local maxima which shadow reasonably-sized
areas. Backward-upper constraints produce
an upper bound on the constrained point
that  is  at  least  as  high  as  the  bound  on  the  con¬ 
straining  point,  so  all  upper  bound  cycles  must 
include  at  least  one  shadower.  The  same  is  true 
for  forward-lower  constraints  and  lower  bound 
cycles.  In  the  case  of  sinusoidal  surfaces  like  the 
packaging  material,  the  sets  of  points  in  the  cy¬ 
cles  tend  not  to  intersect  since  the  shadowers  are 
not  themselves  shadowed  except  for  light  posi¬ 
tions  near  the  horizon.  Fixing  one  cycle  is  thus 
unlikely  to  fix  other  inconsistencies. 

5  Discussion  and  future  work 

The  most  immediate  goal  is  to  either  prove  that 
the  heuristic  in  fact  always  converges,  or  deter¬ 
mine  a  general  description  of  cases  where  it  does 
not.  As  it  turns  out,  the  decision  to  choose  the 
point  where  the  greatest  effect  can  be  achieved 
can  result  in  non-convergence.  This  is  possible 
since  an  adjustment  that  raises  the  upper  bound 
of  some  points  will  tend  to  raise  the  lower  bound 
as  well.  The  fix  for  an  upper  bound  cycle  may 
thus  create  a  lower  bound  cycle.  However,  we 
have  not  experienced  this  problem  in  our  exper¬ 
iments.  If  we  treat  the  set  of  suntraces  for  the 
given  images  as  a  state  within  a  state  space,  then 
the  current  approach  can  be  seen  as  a  kind  of 
depth-first  search  in  that  a  next  state  is  chosen 
by  fixing  the  first  cycle  found  and  considering 
only  one  of  the  possible  fixes  for  the  cycle.  Guar¬ 
anteeing  convergence  probably  requires  adding 
some  breadth  to  the  search. 

Ideally,  we  would  like  to  find  a  method  for 
quickly  determining  if  a  set  of  suntrace  data  con¬ 
tains  a  constraint  cycle  (i.e.,  without  actually 
propagating  constraints  until  a  cycle  is  found), 
and  if  it  does,  what  is  the  closest  set  of  suntrace 
data  which  does  not.  The  closest  set  would  be 
chosen  using  some  definition  of  minimal  adjust¬ 
ment  to  the  actual  data. 

Figure 1: One possible set-up for shape from
shadows

Figure 2: (a) A sample curve; the shaded region
is the shadowed portion for the given angle of
illumination. (b) The corresponding suntrace.

Figure 3: (a) Suntraces for western illuminations;
top to bottom, angles of illumination = π/2,
arccot(1/3), arccot(2/3), arccot(1). (b) Suntraces
for eastern illuminations; top to bottom, angles
of illumination = π/2, arccot(1/3). (c) Reconstruction;
solid line is upper bound, dashed is
lower bound.

Figure 4: (a) Suntraces for western illuminations;
top to bottom, angles of illumination =
π/2, arccot(1/3), arccot(2/3), arccot(1) [Introduced
errors are boxed here and in part (b)]. (b)
Suntraces for eastern illuminations; top to bottom,
angles of illumination = π/2, arccot(1/3).
(c) Constraint cycle revealing error. (d) Reconstruction.

Figure 5: Suntraces and surface reconstruction
of errorful synthetic 8x8; angles of illumination
= π/2, arccot(1/3), and where included,
arccot(2/3), arccot(1) [Introduced errors are boxed]

Figure 6: Reconstruction of wood half-arch profiles
[Viewing angle for reconstruction is π/4, as
in the next 2 figures]

Figure 8: Reconstruction of foil surface profiles

Abstract

This  paper  introduces  a  technique  for  ex¬ 
tracting  structure  and  motion  using  direction- 
ally  selective  matches  between  linear  features. 

A  world-centered  coordinate  system  is  used 
to  make  these  computations  without  the  in¬ 
termediate  calculation  of  depth.  In  order 
to  constrain  the  possible  structure  and  mo¬ 
tion  configurations,  we  assume  that  the  three- 
dimensional  direction  of  gravity  relative  to 
each image frame is known. The direction of
gravity, along with the directionally selective
linear feature matches, forms a set of quadratic
equations which can be used to determine
structure and motion.

1  Introduction 

The  extraction  of  environmental  structure  and  motion 
from  a  sequence  of  two-dimensional  images  is  a  com¬ 
mon  problem  in  computer  vision.  Typically  solutions  to 
this  problem  are  expressed  in  camera-centered  coordi¬ 
nate  systems  where  environmental  geometry  is  specified 
by  the  depth  along  an  image  feature’s  ray  of  projection. 
Unfortunately,  parameters  computed  from  this  camera- 
centered  representation  are  dependent  upon  the  depth  to 
environmental  features.  This  leads  to  erroneous  results 
for  objects  located  far  from  the  camera. 

The  recently  introduced  factorization  method  [Tomasi 
and  Kanade,  1990;  Tomasi  and  Kanade,  1992;  Boult  and 
Brown,  1992]  has  attempted  to  overcome  the  disadvan¬ 
tages  associated  with  a  camera-centered  representation. 
This  method  uses  a  world-centered  coordinate  system, 
along  with  an  orthogonal  projection  assumption,  in  order 
to  compute  shape  and  motion  without  the  intermediate 
calculation  of  depth.  A  matrix  of  image  measurements 
is  constructed  by  making  point  correspondences  between 
image  frames.  The  matrix  is  then  factored  into  a  shape 
matrix  and  a  motion  matrix  using  Singular  Value  De¬ 
composition. 
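
The core of that factorization step can be sketched in a few lines. This is only an illustration of the rank-3 SVD split described above, assuming a measurement matrix `W` with two rows per frame and one column per tracked point; the function name is hypothetical, and the metric (orthonormality) normalization that fixes the remaining 3x3 ambiguity is deliberately omitted.

```python
import numpy as np


def factorize(W: np.ndarray):
    """Split a 2F x P measurement matrix into motion (2F x 3) and shape (3 x P).

    The rows are first registered by subtracting their means, which
    removes the translation under orthographic projection. The result
    is determined only up to an invertible 3 x 3 transformation.
    """
    W_reg = W - W.mean(axis=1, keepdims=True)
    U, s, Vt = np.linalg.svd(W_reg, full_matrices=False)
    root = np.sqrt(s[:3])            # share the top 3 singular values
    motion = U[:, :3] * root         # scales the columns of U
    shape = root[:, None] * Vt[:3]   # scales the rows of V^T
    return motion, shape
```

For noise-free orthographic data the product `motion @ shape` reproduces the registered measurements exactly; with noise it is the best rank-3 approximation in the least-squares sense.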

One problem with the factorization method is that it
relies upon accurate point correspondences between
image frames. This paper introduces a method of extracting
shape and motion from directionally selective linear
feature correspondences. This line-based algorithm
is capable of reconstructing shape and motion without
computing depth as an intermediate step. In addition to
the orthogonal projection assumption, we assume that the
three-dimensional direction of gravity is known relative to
each image in a motion sequence.

*This research is supported by the Advanced Research
Projects Agency of the Department of Defense and is
monitored by the U.S. Army Topographic Engineering Center
under contract No. DACA76-92-C-0016.

The algorithm begins by searching for the orientation
of one of the lines in the environment. This is a
one-dimensional search over 180°, constrained by the
projection of the line on one of the image planes. Each
candidate line orientation, along with the position of gravity,
forms  a  set  of  quadratic  equations  which  constrain  all 
the  other  lines,  as  well  as  the  rotation  between  image 
frames.  An  error  measure  is  computed  from  the  derived 
line  orientations  and  used  to  evaluate  each  shape  and 
motion  configuration.  Once  the  line  orientations  and 
parameters  of  rotation  have  been  derived,  the  relative 
positions  of  the  lines  can  also  be  computed  from  simple 
linear  equations. 

The  remainder  of  this  section  introduces  the  notation 
used  throughout  this  paper.  Section  2  shows  how  to 
derive  line  orientation  and  camera  rotation  from  a  se¬ 
quence  of  two-dimensional  images.  Section  3  presents  a 
set  of  linear  equations  which  can  be  used  to  solve  for  the 
relative  line  positions.  The  algorithms  presented  in  the 
paper  are  applied  to  synthetic  data  and  the  results  are 
presented  in  Section  4.  Finally,  concluding  remarks  are 
given  in  Section  5. 

1.1  Notation 

The notation used throughout this paper is shown in
Figure 1. An image frame at time $f$ is delineated by
unit vectors $i_f$, $j_f$, and $k_f$. A three-dimensional
environmental line is represented by a unit vector $d_s$,
specifying the line direction, and a point on the line $p_s$.
Line $(d_s, p_s)$ is projected orthographically onto image
frame $f$. The direction of the projected line is represented
by its unit normal $\hat{n}_{fs}$; $\hat{p}_{fs}$ refers to the
projection of $p_s$. The direction of gravity will be referred
to as $\hat{g}_f$. The two-dimensional parameters $\hat{n}_{fs}$
and $\hat{p}_{fs}$, as well as the three-dimensional parameter
$\hat{g}_f$, are all expressed in the coordinate system of image
frame $f$. All other parameters are specified relative to the
world coordinate system. When $\hat{n}_{fs}$ is specified in the
world coordinate system it will be referred to as $n_{fs}$.

Figure 1: Coordinate systems

In the following section we present a method of solving
for the line orientations $d_s$, as well as the parameters of
rotation $i_f$, $j_f$, and $k_f$. Section 3 shows how these initial
quantities can be used to fix the relative positions of the
lines within the world coordinate system by solving for
a point $p_s$ on each line.

2  Line  Orientation  and  Camera 
Rotation 

In  this  section  we  present  a  method  of  solving  for  the 
three-dimensional  line  orientations  and  parameters  of  ro¬ 
tation  from  a  sequence  of  two-dimensional  images.  The 
algorithm  begins  by  searching  for  the  orientation  of  one 
of  the  lines.  This  is  a  one  dimensional  search  over  180°, 
constrained  by  the  projection  of  the  line  on  one  of  the  im¬ 
age  planes.  Each  candidate  line  orientation,  along  with 
the  position  of  gravity,  forms  a  set  of  quadratic  equations 
which  constrain  all  the  other  lines,  as  well  as  the  rotation 
between  image  frames.  An  error  measure  is  computed 
from  the  derived  line  orientations  and  used  to  evaluate 
each  shape  and  motion  configuration.  Section  2.1  shows 
how to solve for the position within the world coordinate
system of the line normals ($n_{fs}$) associated with
the  candidate  line.  Section  2.2  shows  how  the  candidate 
normals  can  be  used  to  solve  for  the  normals  to  all  the 
other  visible  lines.  These  line  normals  are  then  used  in 
Section  2.3  to  estimate  the  line  orientations  and  camera 
rotations. 

2.1  Candidate  Line  Normals 

The  algorithm  begins  by  searching  for  the  orientation 
of  one  of  the  lines.  A  candidate  line  is  used  to  con¬ 
strain  the  position  of  all  the  other  three-dimensional  lines 
so  that  a  particular  shape  and  motion  arrangement  can 
be  evaluated.  The  first  step  in  this  process  is  to  solve 
for  the  position  in  the  world  coordinate  system  of  the 
candidate line's normals. Let $d_1$ be the candidate line.


Figure  2:  Normals  are  determined  by  intersecting  a  plane 
with  a  circular  cone 


Since the line normals $n_{f1}$ were formed by orthographic
projection, they must be perpendicular to the line $d_1$.
Therefore, one constraint is that the vectors $n_{f1}$ must
lie within the plane perpendicular to $d_1$. An additional
constraint is provided by the gravity vector $\hat{g}_f$. The
angle between $\hat{n}_{f1}$ and $\hat{g}_f$ must be the same as
the angle between $n_{f1}$ and the direction of gravity in the
world coordinate system, $g_w$. These two constraints can be
used to solve for $n_{f1}$. Figure 2 shows the geometry of
these two constraints. Each normal ($n_{f1}$) is determined
by intersecting a plane with a circular cone. The plane
is defined by $d_1$. The cone is constructed by rotating a
vector about the direction of gravity at the appropriate
angle. Since the origin of the cone lies within the plane,
the intersection of the plane with the cone results in two
lines. There are only two possible solutions since the
normals are known to be unit vectors.

The constraints described above will now be examined
in more detail. As stated earlier, the direction of gravity
$\hat{g}_f$ relative to the line normals $\hat{n}_{f1}$ is known.
This results in the following relationship

$$n_{f1} \cdot g_w = \hat{n}_{f1} \cdot \hat{g}_f \qquad (1)$$

where $g_w$ is the direction of gravity in the world
coordinate system. Letting $g_w = (0, -1, 0)$ we can simplify
Equation 1

$$n_{f1y} = -\hat{n}_{f1} \cdot \hat{g}_f \qquad (2)$$

In addition to the angle constraint we know that $n_{f1}$
lies within the plane defined by $d_1$. This constraint is
expressed as

$$n_{f1} \cdot d_1 = 0 \qquad (3)$$

Finally, we know that the magnitude of each normal
vector ($n_{f1}$) equals one

$$\|n_{f1}\| = 1 \qquad (4)$$

Equations 2, 3, and 4 can be combined into a single
quadratic equation, resulting in two feasible solutions
for each normal vector.
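
One way to carry out this combination is sketched below. It is an illustrative closed-form solution, not the authors' code: the function name and argument layout are hypothetical, and it assumes $g_w = (0, -1, 0)$ so that the y-component of the unknown normal is fixed by Equation 2. With the y-component known, Equation 3 becomes a line in the x-z plane and Equation 4 a circle, and their intersection gives the two candidates.

```python
import math
from typing import List, Sequence, Tuple

Vec3 = Tuple[float, float, float]


def normals_from_constraints(y: float, v: Sequence[float], t: float = 0.0) -> List[Vec3]:
    """Unit vectors n with n[1] == y and dot(n, v) == t.

    For the candidate line, pass y = -dot(n_hat_f1, g_hat_f) (Equation 2),
    v = d_1 and t = 0 (Equation 3). Returns zero or two solutions
    (the two coincide when the line is tangent to the circle).
    """
    a, b = v[0], v[2]
    q = a * a + b * b
    if q == 0.0:
        return []                      # v is parallel to gravity: degenerate
    rhs = t - y * v[1]                 # a*nx + b*nz must equal rhs
    disc = (1.0 - y * y) - rhs * rhs / q
    if disc < 0.0:
        return []                      # plane and cone do not intersect
    cx, cz = a * rhs / q, b * rhs / q  # point of the line closest to the origin
    h = math.sqrt(disc / q)            # step along the line direction (-b, a)
    return [(cx - b * h, y, cz + a * h), (cx + b * h, y, cz - a * h)]
```

The same routine would also serve for a pair of cones sharing an origin, by passing a known unit vector as `v` and the required dot product as `t`.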

2.2  Additional  Line  Normals 

The next step in the extraction of line orientation and
rotation is to solve for the position within the world
coordinate system of the rest of the line normals. This is
accomplished by using the candidate line normals. The
idea is essentially the same as in the previous section.
Two constraints can be formulated from the given geometry.
The first constraint is given by the gravity vector
$\hat{g}_f$, and is identical to the constraint presented in the
previous section. The angle between $\hat{n}_{fs}$ and $\hat{g}_f$
must be the same as the angle between $n_{fs}$ and the direction
of gravity in the world coordinate system. The second
constraint is that the angle between an image normal
vector $\hat{n}_{fs}$ and the candidate image normal vector
$\hat{n}_{f1}$ must be the same as the angle between the associated
world coordinate normal vectors $n_{fs}$ and $n_{f1}$. These two
constraints can be used to solve for all the additional
normal vectors $n_{fs}$. The constraints are shown geometrically
in Figure 3. The solution for a normal vector $n_{fs}$
is essentially the result of intersecting two circular cones.
One cone is the result of rotating a vector about the
direction of gravity. The other cone results from rotating
a vector about the candidate normal vector $n_{f1}$. The
intersection of two circular cones which share the same
origin is two lines. Once again, the normals are known
to be unit vectors, resulting in two solutions.

Figure 3: Normals are determined by intersecting two
circular cones

The following equations result from the above analysis.
The constraint resulting from the gravity vector $\hat{g}_f$
is identical to the one presented in Section 2.1. Therefore,
from Equation 2 we can write

$$n_{fsy} = -\hat{n}_{fs} \cdot \hat{g}_f \qquad (5)$$

The second constraint relates the line normals $n_{fs}$ to the
candidate line normals as follows

$$n_{fs} \cdot n_{f1} = \hat{n}_{fs} \cdot \hat{n}_{f1} \qquad (6)$$

Finally, we know that the magnitude of each normal
vector ($n_{fs}$) equals one

$$\|n_{fs}\| = 1 \qquad (7)$$

Equations 5, 6, and 7 can be combined into a single
quadratic equation, resulting in two feasible solutions
for each normal vector.


2.3  Parameter  Estimation 

Once the normal vectors ($n_{fs}$) have been derived, the
process of estimating the line orientations and rotational
parameters is trivial. The line orientations ($d_s$) are easily
estimated from their associated normals ($n_{fs}$) using the
following equation

$$d_s \cdot n_{fs} = 0 \qquad (8)$$

$d_s$ can be estimated with a minimum of two non-collinear
normal vectors. When more vectors are available, $d_s$
can be solved for using a linear least-squares technique.
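
A common way to pose this least-squares problem is to stack the normals of one line as rows of a matrix and take the right singular vector with the smallest singular value. The sketch below assumes that formulation (the paper does not say which least-squares technique was used), and the function name is hypothetical. The sign of the returned direction is arbitrary, since a line direction and its negative satisfy Equation 8 equally well.

```python
import numpy as np


def line_direction(normals) -> np.ndarray:
    """Unit vector d minimizing the sum over frames of (d . n_fs)^2.

    `normals` holds the world-coordinate normals of one line, one row
    per frame; at least two non-collinear rows are required.
    """
    N = np.asarray(normals, dtype=float)
    # Full SVD so that V^T is 3 x 3 even when only two normals are given.
    _, _, Vt = np.linalg.svd(N)
    return Vt[-1]  # right singular vector of the smallest singular value
```

With exactly two normals this reduces to their normalized cross product, up to sign.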
The rotational parameters are also easily obtained from
the normal vectors $n_{fs}$. Three linear equations can be
formulated for the three rotational parameters $i_f$, $j_f$,
and $k_f$

$$i_f \cdot n_{fs} = \hat{n}_{fsx}$$
$$j_f \cdot n_{fs} = \hat{n}_{fsy}$$
$$k_f \cdot n_{fs} = 0$$

There are also additional constraints available. One of
these constraints is that the vectors must be orthonormal

$$i_f = j_f \times k_f$$
$$j_f = k_f \times i_f$$
$$k_f = i_f \times j_f$$
$$\|i_f\| = \|j_f\| = \|k_f\| = 1$$

Additional constraints can be derived from the relationship
between the rotational vectors and gravity as was
done in Sections 2.1 and 2.2. These constraints are

$$i_f \cdot g_w = \hat{g}_{fx}$$
$$j_f \cdot g_w = \hat{g}_{fy}$$
$$k_f \cdot g_w = \hat{g}_{fz}$$

Of course, not all of the equations presented above are
independent, and not all are necessary. Currently we use
the following subset of equations. Initially $k_f$ is determined
using a least squares formulation of

$$k_f \cdot n_{fs} = 0 \qquad (9)$$

The technique presented in Section 2.1 is then used to
solve for $i_f$ with the following equations

$$i_f \cdot k_f = 0 \qquad (10)$$

$$i_f \cdot g_w = \hat{g}_{fx} \qquad (11)$$

Finally $i_f$ and $k_f$ are used to solve for $j_f$

$$j_f = k_f \times i_f \qquad (12)$$

Equation 8 is used to solve for the line orientations
$d_s$. Equations 9, 10, 11, and 12 are used to solve for
the rotational parameters $i_f$, $j_f$, and $k_f$. The following
section shows how to use these derived parameters to
solve for the relative positions of the line segments, thus
completing the spatial reconstruction.

3  Line  Position 

The final step in the line segment reconstruction is to
solve for the line segment positions relative to the world
coordinate system. Initial assumptions about the position
of the image frames relative to the world coordinate
system are made, allowing a simple linear solution to
the problem. The position of each line is represented
by a point $p_s$ which is chosen arbitrarily. The world
coordinate system will be positioned at the center of image
frame 1. The points $\hat{p}_{1s}$ are then chosen arbitrarily,
$\hat{p}_{1s} = (x_s, y_s)$. We assume that all the image planes
intersect along line $d_1$. This means that the position of
each image plane is given by $p_1 + \alpha_f d_1$ where $\alpha_f$ is a
parametric scale factor.

Figure 4: The first and last frames from a 20 image
sequence

Each line position $p_s = (x_s, y_s, z_s)$ consists of one
unknown $z_s$. The solution for $z_s$ is trivial. Each point $p_s$
is constrained to lie within the planes perpendicular to
$n_{fs}$. These planes are positioned by choosing some
arbitrary point on the projection of each line, and then
determining the position of that point within the world
coordinate system. Let $q$ be the point in world coordinates

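$$q = p_1 + [\, i_f \;\; j_f \,] \cdot (\hat{p}_{fs} - \hat{p}_{f1}) \qquad (13)$$

The equation of the plane is then written as

$$n_{fsx}(x_s - q_x - \alpha_f d_{1x}) + n_{fsy}(y_s - q_y - \alpha_f d_{1y}) + n_{fsz}(z_s - q_z - \alpha_f d_{1z}) = 0 \qquad (14)$$

The two unknowns in this equation are $z_s$ and $\alpha_f$. $\alpha_f$
can be removed from the equation, and a least squares
solution can be found for $z_s$.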

4  Results 

The algorithm presented in this paper was implemented
and tested on several sequences of synthetic data. The
first and last frames from a 20 image sequence are shown
in Figure 4. Figure 5 shows 10 frames from the sequence
(every other frame is displayed). This data was produced
by random rotations and translations. The rotational
parameters $i_f$ and $j_f$ associated with this sequence
of motion are shown in Figures 6 and 7. The
correct rotational values are displayed as solid lines, and
the derived values are displayed as dotted lines. All errors
are the result of perspective projection. Notice that
the Y-component of $i_f$ is errorless. This is because this
component is derived from the relationship between the
image frames and the gravity vector ($\hat{g}_f$) as shown in
Equation 11. Thus the Y-component is unaffected by
the perspective projection errors.

Figure 5: 10 image frames from a 20 image sequence

Figure 6: The components of $i_f$ for a 20 frame sequence.
The correct values are shown with solid lines, and the
derived values are shown with dotted lines.

The  derived  line  orientations  and  parameters  of  rota¬ 
tion  were  then  used  to  reconstruct  the  line  positions  as 
discussed  in  Section  3.  A  top  view  of  the  original  data 
is  shown  in  Figure  8.  The  reconstructed  data  is  shown 
in  Figure  9.  Once  again  the  errors  are  the  result  of  per¬ 
spective  projection. 

5  Conclusion 

The technique presented in this paper is an early attempt
at constructing linear feature based depth-independent
motion algorithms. The work has only been tested on
synthetic data, and it is not clear what effect perspective
projection and other forms of noise will have. However,
since the formulation involves linear least squares
estimation, it appears that it will be robust. The ability to
deal with occlusion is also straightforward in this
overconstrained system. Occluded line normals ($n_{fs}$) are null
vectors and therefore have no effect on the least squares
solution. Notice that the first frame shown in Figure 4
contains occluded lines.

Figure 7: The components of $j_f$ for a 20 frame sequence.
The correct values are shown with solid lines, and the
derived values are shown with dotted lines.

Figure 8: Top view of the house data

One  drawback  of  this  method  is  that  the  three- 
dimensional  direction  of  gravity  is  required.  This  mea¬ 
surement  can  be  provided  by  a  gravity  sensor,  but  we 
would  like  to  relax  this  restriction.  One  way  to  remove 
the  gravity  vector  from  the  algorithm  is  to  replace  the  di¬ 
Figure 9: Top view of the reconstructed house

rection  of  gravity  with  another  consistent  direction.  For 
example,  for  an  object  that  consistently  moves  in  one 
direction  (such  as  a  vehicle),  the  gravity  vector  can  be 
replaced  by  a  vector  specifying  this  direction  (the  for¬ 
ward  vehicle  direction). 

There  are  several  areas  for  future  work: 

•  Test  this  algorithm  on  noisy  data  and  if  necessary 
develop  a  more  robust  formulation  that  will  work 
well  in  the  presence  of  errors,  including  the  errors 
introduced  from  perspective  projection. 

•  Test  the  algorithm  on  real  image  sequences. 

•  Integrate  this  rotation  based  method  with  the 
translation  based  method  discussed  in  [Lawton, 
1982].  In  this  case  the  gravity  vector  is  replaced  by 
a  direction  of  translation  vector.  The  integration  of 
these  two  methods  will  probably  be  accomplished 
through  temporal  filtering  using  the  Kalman  Filter. 
Abstract

Our  goal  is  to  reconstruct  both  the  shape  and  re¬ 
flectance  properties  of  surfaces  from  multiple  images. 
We  argue  that  an  object-centered  representation  is 
most appropriate for this purpose because it naturally
accommodates multiple sources of data, multiple
images  (including  motion  sequences  of  a  rigid  ob¬ 
ject),  and  self-occlusions.  We  then  present  a  spe¬ 
cific  object-centered  reconstruction  method  and  its 
implementation.  The  method  begins  with  an  ini¬ 
tial  estimate  of  surface  shape  (provided  by  trian¬ 
gulating  the  result  of  conventional  stereo  or  other 
means).  The  surface  shape  and  reflectance  prop¬ 
erties  are  then  iteratively  adjusted  to  minimize  an 
objective  function  that  combines  information  from 
multiple  input  images.  The  objective  function  is  a 
weighted  sum  of  “stereo,”  shading,  and  smoothness 
components,  where  the  weight  varies  over  the  sur¬ 
face.  For  example,  the  stereo  component  is  weighted 
more  strongly  where  the  surface  projects  onto  highly 
textured  areas  in  the  images,  and  less  strongly  oth¬ 
erwise.  Thus,  each  component  has  its  greatest  in¬ 
fluence  where  its  accuracy  is  likely  to  be  greatest. 
Experimental  results  on  both  synthetic  and  real  im¬ 
ages  are  presented. 

1  Introduction 

The  problem  of  recovering  the  shape  and  re¬ 
flectance  properties  of  a  surface  from  multi¬ 
ple  images  has  received  considerable  attention 
[6, 20, 35, 44]. This is a key problem not only in

'Support  for  this  research  was  provided  by  various 
contracts  from  the  Defense  Advanced  Research  Projects 
Agency. 


developing  general-purpose  vision  systems,  but 
also  in  specialized  areas  such  as  the  generation 
of  Digital  Elevation  Models  from  aerial  images 
[5, 12, 26, 53].

In  this  paper,  we  view  the  recovery  problem 
as  one  of  finding  an  object-centered  description 
of a surface from a set of input images that is
sufficiently  complete,  in  terms  of  its  geometric 
and  radiometric  properties,  that  it  is  possible 
to  generate  an  image  of  the  surface  from  any 
viewpoint.  In  particular,  the  description  should 
be  sufficiently  complete  to  reproduce  the  input 
images  to  within  a  certain  tolerance,  given  mod¬ 
els  of  the  cameras,  their  relative  locations,  and 
expected  noise. 

Our  surface  reconstruction  method  uses  an 
object-centered  representation,  specifically,  a 
triangulated  3-D  mesh  of  vertices.  Such  a  rep¬ 
resentation  accommodates  the  two  classes  of 
information  mentioned  above,  as  well  as  mul¬ 
tiple  images  (including  motion  sequences  of 
a  rigid  object)  and  self-occlusions.  We  have 
chosen  to  model  the  surface  material  using 
the  Lambertian  reflectance  model  with  variable 
albedo,  though  generalizations  to  specular  sur¬ 
faces  would  be  relatively  straightforward.  Con¬ 
sequently,  the  natural  choice  for  the  monocular 
information  source  is  shading,  while  intensity  is 
the  natural  choice  for  the  image  feature  used 
in  multi-image  correspondence.  Not  only  are 
these  the  natural  choices  given  a  Lambertian 
reflectance  model,  they  are  also  complementary 
[7,  30]:  intensity  correlation  is  most  accurate 
wherever  the  input  images  are  highly  textured, 
whereas  shading  is  most  accurate  when  the  in¬ 
put  images  are  untextured. 
The reconstruction method is to minimize
an  objective  function  whose  components  de¬ 
pend  on  the  input  images  and  some  measure  of 
the  complexity  of  the  3-D  mesh.  The  method 
starts  with  an  initial  estimate  for  the  mesh 
derived  from  the  triangulation  of  conventional 
stereo  results,  and  uses  a  standard  optimization 
technique  called  conjugate  gradient  descent  to 
minimize  the  objective  function.  The  image- 
dependent  components  of  the  objective  func¬ 
tion  are  related  to  the  two  sources  of  informa¬ 
tion  mentioned  above.  We  take  advantage  of 
the  complementary  nature  of  the  information 
sources  by  weighting  the  components  at  each 
facet  of  the  triangulated  mesh  according  to  the 
amount  of  texturing  within  the  area  of  the  im¬ 
ages  that  the  facet  projects  to.  The  projection 
uses  a  hidden-surface  algorithm  to  take  occlu¬ 
sions  into  account. 

In  the  following  section,  we  describe  related 
work  and  our  contributions  in  this  area.  Fol¬ 
lowing  this  we  discuss  some  of  the  key  issues 
in  multi-image  surface  reconstruction  and  how 
to  combine  different  sources  of  information  for 
such  purposes.  We  then  describe  in  detail  our 
specific  procedure,  discuss  the  behavior  of  our 
procedure  on  synthetic  data,  and  show  some  re¬ 
sults  on  real  images. 

2  Related  Work  and  Contri¬ 
butions 

Three-dimensional  reconstruction  of  visible  sur¬ 
faces  continues  to  be  an  important  goal 
of  the  computer  vision  research  community. 
Initially,  much  of  the  work  concentrated 
on 2½-dimensional image-centered reconstructions,
such as Barrow and Tenenbaum’s Intrinsic
Images [6] and Marr’s 2½-D Sketch [35].
These  preliminary  ideas  have  been  the  basis  for 
quite  successful  systems  for  recovering  shape 
and  surface  properties.  Some  have  used  sin¬ 
gle  sources  of  information,  such  as  sequences  of 
range  data  or  intensity  images  [3,  25],  stereo 
[12,  26,  52,  53],  and  shading  [21,  24,  44].  Oth¬ 
ers  have  combined  sources  of  information,  such 
as  shading  and  texture  [8],  features  and  stereo 
[23], focus, vergence, stereo, and camera calibration
[1]. See [2] for further discussions on


information  fusion. 

More  recently,  full  3-dimensional  models  have 
been  used,  such  as  3-D  surface  meshes  [46,  49], 
parameterized  surfaces  [40, 33],  particle  systems 
[42, 17], and volumetric models [36, 45, 37].

As with the 2½-dimensional representations,
3-D  representations  have  used  a  variety  of  sin¬ 
gle  image  cues  for  reconstruction,  such  as  sil¬ 
houettes  and  image  features  [9,  11,  47,  48,  50], 
range  data  [51],  stereo  [17],  and  motion  [41]. 
Liedtke [32] first uses silhouettes to derive an
initial  estimate  of  the  surface,  and  then  uses 
a  multi-image  stereo  algorithm  to  improve  on 
the  result.  Their  approach  to  deriving  an  ini¬ 
tial  estimate  for  the  mesh,  as  with  Szeliski  and 
Tonneson’s  approach  [42],  is  significantly  more 
powerful  than  the  one  we  use  in  this  paper.  This 
is  an  important  topic  for  future  research. 

Of  special  relevance  to  this  paper  is  research 
in  combining  stereo  and  shape  from  shading. 
Using 2½-dimensional representations, Blake et
al. [7] is the earliest reference we are aware
of that discusses the complementary nature of
stereo and shape from shading, but that paper
presents almost no experimental results.
Leclerc and Bobick [31] discuss the integration
of stereo and shape from shading, but
only  use  stereo  as  an  initial  condition  to  a  dis¬ 
crete  height  from  shading  algorithm.  Cryer  et 
al.  [10]  combine  the  high-frequency  information 
from  a  shape  from  shading  algorithm  with  the 
low-frequency  information  from  a  stereo  algo¬ 
rithm  using  filters  designed  to  match  those  in 
the  human  visual  system. 

Using full 3-D representations, Heipke [22]
integrates  the  two  cues,  but  assumes  that  the 
images  can  be  separated  beforehand  into  zones 
of  variable  albedo  (where  one  does  stereo)  and 
areas  of  constant  albedo  (where  one  does  shape 
from  shading).  This  is  in  contrast  to  our  ap¬ 
proach,  in  which  the  optimization  procedure  dy¬ 
namically  adapts  to  the  image  data. 

In  this  paper,  we  unify  the  idea  of  using  3-D 
meshes  to  integrate  information  from  multiple 
images  with  that  of  using  multiple  cues.  Our 
specific approach to this unification has led to
a  number  of  important  contributions: 

•  We  correctly  deal  with  occlusions  by  using 
a hidden surface algorithm during the reconstruction
process.

•  Our  technique  for  doing  stereo  avoids  the 
constant  depth  assumption  of  traditional 
correlation-based  stereo  algorithms,  effec¬ 
tively  using  variable-sized  windows  in  the 
images. 

•  Our  approach  to  shape  from  shading  is 
applicable  to  surfaces  with  slowly  varying 
albedo.  This  is  a  significant  advance  over 
traditional  approaches  that  require  con¬ 
stant  albedo. 

•  We propose a dynamic weighting scheme
for combining shape from shading and
stereo, and demonstrate on both synthetic
and real images that it leads to significantly
better results than using either cue alone.

To  demonstrate  the  validity  of  the  overall  ap¬ 
proach,  we  have  implemented  a  computation¬ 
ally  effective  optimization  procedure,  and  have 
demonstrated  that  it  finds  good  minima  of  the 
objective  function  on  both  synthetic  and  real 
images. 

3  Issues  in  Multi-Image  Sur¬ 
face  Reconstruction 

In  this  section,  we  briefly  discuss  some  of  the  key 
issues  in  multi-image  surface  reconstructions, 
and  outline  how  we  address  the  issues  in  this 
paper.  These  outlines  will  be  expanded  upon  in 
Section  4. 

3.1  Surface  Shape  and  its  Represen¬ 
tation 

Since  the  task  is  to  reconstruct  a  surface  from 
multiple  images  whose  vantage  points  may  be 
very  different,  we  need  a  surface  representation 
that  can  be  used  to  generate  images  of  the  sur¬ 
face  from  arbitrary  viewpoints,  taking  into  ac¬ 
count  self-occlusion,  self-shadowing,  and  other 
viewpoint-dependent  effects.  Clearly,  a  single 
image-centered  representation,  such  as  a  depth 
map,  is  inadequate  for  this  purpose.  Instead, 
an  object-centered  surface  representation  is  re¬ 
quired. 


There  are  many  object-centered  surface  rep¬ 
resentations  that  are  possible.  However,  there 
are  some  practical  issues  that  are  important  in 
choosing  an  appropriate  one.  First,  the  repre¬ 
sentation  should  be  general-purpose  in  the  sense 
that  it  should  be  possible  to  represent  any  con¬ 
tinuous  surface,  closed  or  open,  and  of  arbitrary 
genus.  Second,  it  should  be  relatively  straight¬ 
forward  to  generate  an  instance  of  a  surface 
from  standard  data  sets  such  as  depth  maps  or 
clouds  of  points.  Finally,  there  should  be  a  com¬ 
putationally  simple  correspondence  between  the 
parameters  specifying  the  surface  and  the  actual 
3-D  shape  of  the  surface,  so  that  images  of  the 
surface  can  be  easily  generated,  thereby  allow¬ 
ing  the  integration  of  information  from  multiple 
images. 

A  hexagonally  connected  mesh  of  3-D  ver¬ 
tices,  as  in  Figure  2,  is  an  example  of  a  surface 
representation  that  meets  the  criteria  stated 
above,  and  is  the  one  we  have  chosen  for  this  pa¬ 
per.  Such  a  mesh  defines  a  surface  composed  of 
three-sided  planar  polygons  that  we  call  trian¬ 
gular  facets,  or  simply  facets.  Triangular  facets 
are  particularly  easy  to  manipulate  for  image 
and  shadow  generation,  since  they  are  the  ba¬ 
sis  for  many  3-D  graphics  systems.  Hexagonal 
meshes  can  be  used  to  construct  virtually  arbi¬ 
trary  surfaces.  Finally,  standard  triangulation 
algorithms  can  be  used  to  generate  such  a  sur¬ 
face  representation  from  real  noisy  data  [18, 42]. 

3.2  Material  Properties  and  their 
Representation 

Objects  in  the  world  are  composed  of  many 
types  of  material,  and  the  material  type  can 
vary  across  the  object’s  surface  in  many  ways. 
The  key  issues,  therefore,  are  the  type  of  mate¬ 
rial  we  wish  to  consider,  and  how  its  variation 
across the surface is to be represented. In general,
one can represent a material type by its reflectance
function, which maps the wavelength
distribution  and  orientation  of  a  light  source, 
the  normal  to  the  surface,  and  the  viewing  di¬ 
rection  into  an  image  color.  This  function  is 
generally quite complex. However, there are reflectance
functions that are not only much simpler,
but are also quite common. Such functions
are modeled using only one or, at most, a few
parameters. Consequently, one can accurately
model  the  material  properties  of  a  surface  by 
representing  these  parameters  at  every  point  on 
the  surface. 

Probably  the  simplest,  and  most  common, 
such  function  is  the  Lambertian  reflectance 
function. For grey-level images, this function
has a single parameter, the albedo, which
is the ratio of outgoing to incoming light intensity;
moreover, the image intensity is independent
of viewpoint. For these reasons, we have chosen
to  restrict  ourselves  to  Lambertian  surfaces  in 
this  paper.  However,  because  we  use  a  full  3- 
D  representation,  a  generalization  to  specular 
surfaces  would  be  fairly  straightforward. 

Having chosen a specific reflectance function,
the remaining issue is how to represent the
spatially-varying parameter(s). In general, one
needs  to  be  able  to  represent  independent  pa¬ 
rameter  values  at  every  point  of  the  surface.  In 
terms  of  the  mesh  representation  of  the  surface, 
this  implies  some  type  of  spatial  sampling  of 
each  facet.  Given  the  finite  resolution  of  the 
images,  and  other  practical  considerations,  we 
have  chosen  to  use  two  types  of  spatial  sam¬ 
pling.  The  first  is  most  appropriate  when  the 
parameters  vary  quickly  across  the  surface,  and 
the  second  when  they  vary  more  slowly.  For 
the  former  case,  we  use  a  uniform  sampling  of 
each  facet,  where  the  inter-sample  spacing  cor¬ 
responds  roughly  to  no  more  than  one  or  two 
pixels in any of the images. For the latter case,
we  use  a  single  value  associated  with  each  facet. 

As  we  shall  see  later,  the  two  different  repre¬ 
sentations  are  used  somewhat  differently,  and 
the  choice  of  which  representation  to  use  is 
made  on  a  facet-by-facet  basis  as  a  function  of 
the  images. 

3.3  Information  Sources  for  Recon¬ 
struction 

There  are  a  number  of  information  sources  that 
are  available  for  the  reconstruction  of  a  surface 
and  its  material  properties.  Here,  we  consider 
two  classes  of  information. 

The  first  class  are  those  information  sources 
that require a single image, such as texture gradients,
shading, and occlusion edges. When using
multiple images and a full 3-D surface representation,
however, we can do certain things
that  cannot  be  done  with  a  single  image.  First, 
the  information  source  can  be  checked  for  con¬ 
sistency  across  all  images,  taking  into  account 
occlusions.  Second,  the  information  can  be  “av¬ 
eraged”  over  all  the  images,  when  the  source 
is  consistent  and  occlusions  are  taken  into  ac¬ 
count,  to  increase  its  sensitivity. 

The  second  class  are  those  information 
sources  that  require  at  least  two  images,  such 
as  the  triangulation  of  corresponding  points  be¬ 
tween  input  images  (given  camera  models  and 
their  relative  positions).  Generally  speaking, 
this  source  is  most  useful  when  corresponding 
points  can  be  easily  identified,  and  their  image 
positions  accurately  measured.  The  ease  and 
accuracy  of  this  correspondence  can  vary  sig¬ 
nificantly  from  place  to  place  in  the  image  set, 
and  depends  critically  on  the  type  of  feature 
used.  Consequently,  whatever  the  type  of  fea¬ 
ture  used,  one  must  be  able  to  identify  where  in 
the images that feature provides reliable correspondences,
and what accuracy one can expect.

The  image  feature  that  we  have  chosen  for 
correspondence  (though  it  is  by  no  means  the 
only  one  possible)  is  simply  intensity,  because 
the  Lambertian  reflectance  model  described  ear¬ 
lier  implies  that  the  image  intensity  of  a  surface 
point  is  independent  of  the  viewing  direction. 
Therefore,  corresponding  points  should  have  the 
same  intensity  in  all  images.  Clearly,  intensity 
can  only  be  a  reliable  feature  when  the  albedo 
varies  quickly  enough  on  the  surface  (and,  con¬ 
sequently,  the  images  are  highly  textured),  and 
the  search  space  is  sufficiently  narrow.  Other¬ 
wise,  there  would  be  significant  ambiguity  in  the 
correspondence  of  pixels  across  the  images. 

In contrast to our approach, traditional
correlation-based stereo methods use fixed-size
windows in images, which can only yield correct
results when the surface is tangential to the image
plane. Instead, we compare the intensities
as  projected  onto  the  facets  of  the  surface,  which 
is  equivalent  to  having  variable-shaped  windows 
in  the  images.  Consequently,  if  the  original  sur¬ 
face  is  well  modeled  by  a  mesh  surface,  the  re¬ 
construction  can  be  significantly  more  accurate. 
The  Hierarchical  Warp  Stereo  System  [39]  is  an¬ 
other  example  of  a  method  that  takes  into  ac¬ 
count  the  variable  shapes  of  windows  required 
for accurate reconstruction of a surface, though
it uses only an image-centered representation of
the  surface. 

As  for  the  monocular  information  source,  we 
have  chosen  to  use  shading.  There  are  a  number 
of  reasons  for  this.  First,  we  are  using  a  Lam¬ 
bertian  reflectance  model,  making  shading  a  rel¬ 
atively  simple  source  of  information.  Second, 
shading  is  most  reliable  when  the  albedo  varies 
slowly  across  the  surface,  which  is  the  natural 
complement  to  intensity  correspondence,  which 
requires  quickly  varying  albedo.  The  comple¬ 
mentary  nature  of  these  two  sources  should  al¬ 
low  us  to  accurately  recover  the  surface  geom¬ 
etry  and  material  properties  for  a  wide  variety 
of  images. 

In contrast to our approach, traditional uses
of  shading  information  assume  that  the  albedo 
is  constant  across  the  entire  surface,  which  is  a 
major  limitation  when  applied  to  real  images. 
We  overcome  this  limitation  by  improving  upon 
a  method  to  deal  with  discontinuities  in  albedo 
alluded  to  in  the  summary  of  [30,  31].  We  com¬ 
pute  the  albedo  at  each  facet  using  the  nor¬ 
mal  to  the  facet,  a  light-source  direction,  and 
the  average  of  the  intensities  projected  onto  the 
facet  from  all  images.  Since  we  use  the  aver¬ 
age  of  the  projected  intensities,  this  computed 
albedo  minimizes  the  mean  squared  error  be¬ 
tween  the  images  of  the  mesh  surface  and  the 
input  images.  The  variation  of  this  computed 
albedo across the surface is the actual information
source used to recover the surface. For example,
if the albedo of the real surface were
indeed constant, as in traditional shape-from-shading
problems, then the measured variation
in albedo would be zero for the correct mesh surface,
and we would have recovered both surface
shape and albedo. The distinct advantage of
this  approach  over  the  traditional  one  is  that  it 
can  deal  with  surfaces  whose  albedo  is  not  con¬ 
stant,  but  instead  varies  slowly  over  the  surface. 
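
The per-facet albedo computation described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes the Lambertian relation intensity = albedo × (N · L), a unit light-source direction, and a vertex ordering that makes the facet normal point toward the light. The names `facet_normal` and `facet_albedo` are hypothetical.

```python
import math

def facet_normal(p0, p1, p2):
    """Unit normal of the triangle (p0, p1, p2), from the cross product of two edges."""
    ux, uy, uz = (p1[k] - p0[k] for k in range(3))
    vx, vy, vz = (p2[k] - p0[k] for k in range(3))
    nx, ny, nz = uy * vz - uz * vy, uz * vx - ux * vz, ux * vy - uy * vx
    norm = math.sqrt(nx * nx + ny * ny + nz * nz)
    return (nx / norm, ny / norm, nz / norm)

def facet_albedo(vertices, light_dir, projected_intensities):
    """Estimate a facet's albedo from the mean of the intensities projected onto it.

    Assumes a Lambertian surface: intensity = albedo * (normal . light_dir).
    Returns None when the facet faces away from the light (assumed convention).
    """
    normal = facet_normal(*vertices)
    cos_theta = sum(n * l for n, l in zip(normal, light_dir))
    if cos_theta <= 0.0:
        return None
    mean_intensity = sum(projected_intensities) / len(projected_intensities)
    return mean_intensity / cos_theta
```

Under these assumptions, a correct mesh over a constant-albedo surface would yield the same value for every facet, so the spread of these values across neighboring facets is what the shading term would penalize.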

In  the  following  subsection,  we  describe  how 
these  two  sources  of  information  are  combined 
and  used  to  reconstruct  surfaces. 


3.4  Combining  and  Using  Informa¬ 
tion  Sources 

Simply  put,  our  approach  to  surface  reconstruc¬ 
tion  is  to  adjust  the  parameters  of  the  surface 
(in  the  case  of  the  mesh,  this  means  the  coor¬ 
dinates  of  the  vertices),  until  the  images  of  the 
surface  are  most  consistent  with  the  informa¬ 
tion  sources  described  above.  This  approach  re¬ 
quires  a  number  of  things.  First,  one  must  have 
an  initial  estimate  of  the  surface.  In  this  pa¬ 
per,  this  is  derived  from  a  standard  correlation- 
based  stereo  algorithm.  Second,  one  must  know 
the  light  source  direction,  camera  models,  and 
their  relative  positions  so  that  images  of  the  sur¬ 
face  can  be  generated  (we  assume  these  are  pro¬ 
vided  a  priori).  Third,  one  must  have  a  way 
of  quantifying  what  is  meant  by  “most  consis¬ 
tent  with  the  information  sources.”  Here,  we 
use  an  objective  function  that  is  a  linear  com¬ 
bination  of  components,  one  for  each  informa¬ 
tion  source,  whose  weights  are  determined  on  a 
facet-by-facet  basis  as  a  function  of  the  images. 
Finally, one must have a computationally effective
means of finding a surface, given the initial
estimate,  that  is  reasonably  close  to  the  best  of 
all  possible  surfaces  according  to  the  objective 
function. 

Our  combined  objective  function  has  three 
components,  two  of  which  were  mentioned 
above: an intensity correlation component, and
an  albedo  variation  component.  A  third  com¬ 
ponent  is  a  measure  of  the  smoothness  of  the 
surface.  The  first  two  components  are  weighted 
differently  at  each  facet  as  a  function  of  the  im¬ 
age  intensities  projected  onto  the  facet,  while 
the  surface  smoothness  component  has  the  same 
weight  everywhere,  but  is  typically  decreased  as 
the  iterations  proceed. 

Since  the  intensity  correlation  component  de¬ 
pends  on  the  difference  in  intensity  at  a  given 
point,  it  is  most  accurate  when  the  images 
are  highly  textured  in  the  areas  that  the  facet 
projects  to.  To  see  this,  consider  the  case  when 
the  images  have  constant  intensity  in  the  neigh¬ 
borhood  of  the  projected  facet:  the  difference 
in  intensity  will  be  a  constant,  independent  of 
small  variations  in  the  facet’s  position  or  ori¬ 
entation.  On  the  other  hand,  when  the  images 
are highly textured, small changes in the facet
can significantly change the value of this component.
Thus, we weight the intensity correlation
component most strongly for those facets in
which  the  projected  image  intensities  are  highly 
textured. 

Conversely,  the  albedo  variation  component 
is  most  accurate  when  the  intensities  within  a 
facet  vary  slowly.  This  is  because  we  are  assum¬ 
ing  that  the  albedo  varies  slowly  enough  across 
the  surface  that  a  constant-albedo  facet  is  a 
good  model  for  the  surface.  Since  the  facets  are 
planar,  this  should  produce  images  whose  inten¬ 
sities  are  constant  within  the  projected  facet. 
Thus,  we  weight  the  albedo  variation  compo¬ 
nent  most  strongly  when  the  projected  intensi¬ 
ties  within  a  facet  vary  slowly. 

Since rapidly changing albedos produce
highly  textured  image  regions,  our  weighting 
scheme,  in  effect,  turns  off  the  shading  com¬ 
ponent  and  turns  on  the  stereo  component  in 
such  regions.  Thus,  it  provides  the  shape  from 
shading  component  with  implicit  boundary  con¬ 
ditions  at  the  edge  of  regions  of  constant  albedo. 
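
The idea of switching between the two cues according to local texture can be illustrated with a small sketch. The weighting formula is not given in this section, so everything below is a hypothetical stand-in: texture is measured by the variance of the intensities projected onto a facet, and the soft switch `var / (var + var_scale)` and the parameter `var_scale` are assumptions chosen only to show the qualitative behavior.

```python
def intensity_variance(samples):
    """Population variance of the intensities projected onto one facet."""
    mean = sum(samples) / len(samples)
    return sum((s - mean) ** 2 for s in samples) / len(samples)

def facet_weights(samples, var_scale=100.0):
    """Return (stereo_weight, shading_weight) for one facet.

    Hypothetical rule: highly textured facets favor the intensity correlation
    (stereo) term; uniform facets favor the albedo variation (shading) term.
    var_scale must be positive and sets the variance at which both weigh 0.5.
    """
    var = intensity_variance(samples)
    stereo_weight = var / (var + var_scale)
    return stereo_weight, 1.0 - stereo_weight
```

With this rule a facet of uniform intensity gets weights (0, 1), so only shading acts on it, while a strongly textured facet is driven almost entirely by the correlation term.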

The surface smoothness component is required
as a stabilizing term because neither of
the above components is likely to be exactly correct:
the surfaces are not exactly Lambertian,
the camera positions are not exactly correct,
there is noise in the images, and so on. Currently,
we use the heuristic technique of starting
with a relatively large weight for the smoothness
component and decreasing it as the iterations
proceed. The theoretically optimal point at
which  the  smoothness  weight  should  no  longer 
be  decreased  is  still  an  open  question,  although 
a  single,  empirically  determined,  value  has  been 
used  with  great  success  across  all  of  the  images 
presented  in  this  paper  when  using  all  of  the 
components. 
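
The decreasing smoothness weight can be pictured with a simple schedule. The text does not state the decay rule or the final value, so the geometric decay and the `floor` parameter below are assumptions used only for illustration.

```python
def smoothness_schedule(initial, decay, floor, n_iterations):
    """Smoothness weights for successive optimization passes.

    Assumed rule: multiply the weight by `decay` after each pass and never
    let it drop below `floor` (the empirically chosen final value).
    """
    weights = []
    w = initial
    for _ in range(n_iterations):
        weights.append(max(w, floor))
        w *= decay
    return weights
```

For example, `smoothness_schedule(1.0, 0.5, 0.2, 5)` yields weights that halve on each pass until they reach the floor of 0.2.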

In  the  following  section,  we  describe  the  sur¬ 
face  representation  and  optimization  algorithm 
in  more  detail. 

4  Details  of  Surface  Model 
and  Optimization  Proce¬ 
dure 

As  discussed  in  the  previous  section,  our  ap¬ 
proach  to  recovering  surface  shape  and  re¬ 


flectance  properties  from  multiple  images  is  to 
deform  a  three-dimensional  representation  of 
the  surface  so  as  to  minimize  an  objective  func¬ 
tion.  The  free  variables  of  this  objective  func¬ 
tion  are  the  coordinates  of  the  vertices  of  the 
mesh  representing  the  surface,  and  the  process 
is  started  with  an  initial  estimate  of  the  surface. 
For  the  experiments  described  in  this  paper,  we 
have  derived  this  initial  estimate  by  triangu¬ 
lating  the  smooth  depth-map  generated  by  the 
correlation-based stereo algorithm described in
[19, 15]. Figure 1 illustrates the output of this
algorithm  on  a  synthetic  stereo  pair. 

Alternatively,  we  could  have  relied  on  more 
sophisticated  algorithms  that  can  triangulate 
noisy  laser  or  stereo  range-data  to  derive  our 
initial  estimates  [14,  18,  42].  All  these  meth¬ 
ods  tend  to  smooth  the  data  and  to  interpolate 
blindly  in  the  absence  of  data  so  that  their  out¬ 
put  needs  to  be  refined  by  algorithms  such  as 
ours. 

In  this  section,  we  describe  more  formally 
each  part  of  our  approach. 

4.1  Images  and  Camera  Models 

In  this  paper,  we  assume  that  images  are 
monochrome,  and  that  their  camera  models  are 
known  a  priori.  The  set  of  grey-level  images  is 
denoted $G = (g_1, g_2, \ldots, g_{n_g})$. A point in an
image is denoted $\mathbf{u} = (u, v)$, and the intensity
of point $\mathbf{u}$ in image $g_i$ is denoted $g_i(\mathbf{u})$. For non-integer
values of $\mathbf{u}$ we use bilinear interpolation
over  the  four  points  represented  by  the  floor  and 
ceiling  of  the  coordinates  of  u. 
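
The bilinear interpolation step can be sketched as follows. This is a minimal illustration rather than the authors' code: the function name `bilinear`, the list-of-rows image layout, and the convention that `u` indexes columns and `v` indexes rows are assumptions.

```python
import math

def bilinear(image, u, v):
    """Intensity at a non-integer point (u, v) of a grey-level image.

    image[row][col] holds the pixel values; (u, v) must lie inside the image.
    The four samples used are at the floor and ceiling of each coordinate.
    """
    u0, u1 = math.floor(u), math.ceil(u)
    v0, v1 = math.floor(v), math.ceil(v)
    a, b = u - u0, v - v0  # fractional offsets in [0, 1)
    top = (1 - a) * image[v0][u0] + a * image[v0][u1]
    bottom = (1 - a) * image[v1][u0] + a * image[v1][u1]
    return (1 - b) * top + b * bottom
```

At integer coordinates the floor and ceiling coincide and the function returns the pixel value itself.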

The projection of an arbitrary point $\mathbf{x} =
(x, y, z)$ in space into image $g_i$ is denoted $m_i(\mathbf{x})$.
There  are  well-known  methods  for  correcting 
both  geometric  and  radiometric  errors  in  im¬ 
ages,  as  surveyed  in  [4],  pp.  68-77.  Thus,  we 
assume  that  all  effects  of  lens  distortion  and  the 
like  have  been  taken  care  of  in  producing  the  in¬ 
put  images,  so  that  the  projection  of  a  surface 
into  an  image  is  well  modeled  by  a  perspective 
projection. Thus, $\mathbf{u} = m_i(\mathbf{x})$ can be written as:
Figure 1: (a,b) A synthetic stereo pair generated by texture-mapping a real image of the Martin-Marietta
ALV  test-site  onto  a  Digital  Elevation  Model  (DEM),  (c)  The  disparity  map  using  a  correlation-based 
algorithm.  The  black  areas  indicate  that  the  stereo  algorithm  could  not  find  a  match.  Elsewhere,  lighter 
greys  indicate  higher  elevations,  (d)  The  same  disparity  map  after  smoothing  and  interpolation. 


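$$\begin{bmatrix} U \\ V \\ W \end{bmatrix} = M_i \begin{bmatrix} x \\ y \\ z \\ 1 \end{bmatrix}, \qquad u = U/W, \quad v = V/W,$$

where $M_i$ is a three-by-four projection matrix.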
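
The perspective projection above can be sketched in a few lines, assuming the usual homogeneous form in which the three-by-four matrix maps (x, y, z, 1) to (U, V, W). The function name `project` and the nested-list matrix layout are assumptions.

```python
def project(M, x):
    """Project a 3-D point x = (x, y, z) into an image with a 3x4 matrix M.

    Computes (U, V, W) = M * (x, y, z, 1) and returns (U / W, V / W).
    """
    X = (x[0], x[1], x[2], 1.0)
    U, V, W = (sum(M[r][c] * X[c] for c in range(4)) for r in range(3))
    if W == 0:
        raise ValueError("point projects to infinity (W == 0)")
    return (U / W, V / W)
```

For instance, with a camera at the origin looking down the z axis and a focal length of 2, `M = [[2, 0, 0, 0], [0, 2, 0, 0], [0, 0, 1, 0]]` maps the point (1, 2, 4) to the image point (0.5, 1.0).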

4.2  Surface  Representation 

We represent a surface $S$ by a hexagonally-connected
set of vertices $V = (v_1, v_2, \ldots, v_{n_v})$
called a mesh. The position of vertex $v_j$ is specified
by its Cartesian coordinates $(x_j, y_j, z_j)$.
Figure  2  shows  such  a  mesh  as  a  wire  frame 
and  as  a  shaded  solid  surface. 

Each  vertex  in  the  interior  of  the  surface  has 
exactly  six  neighbors.  The  neighbors  of  vertex 
$v_j$ are consistently ordered in a clock-wise fashion.
Vertices on the edge of a surface may have
anywhere  from  two  to  five  neighbors. 
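
One way to build such a mesh is sketched below. It is an illustration under assumptions, not the authors' data structure: vertices are indexed by axial lattice coordinates (q, r) over a parallelogram-shaped patch, missing neighbors on the boundary are stored as `None`, and the names `NEIGHBOR_OFFSETS` and `build_hex_mesh` are hypothetical.

```python
import math

# Axial offsets of the six neighbors, listed clockwise when the y axis points up.
# Entry j and entry j + 3 are always on opposite sides of the vertex.
NEIGHBOR_OFFSETS = [(1, 0), (1, -1), (0, -1), (-1, 0), (-1, 1), (0, 1)]

def build_hex_mesh(n_q, n_r, spacing=1.0):
    """Build a flat, hexagonally connected mesh with equal-length facet sides.

    Returns (positions, neighbors): positions maps (q, r) to (x, y, z), and
    neighbors maps (q, r) to a clockwise list of six neighbor keys, with None
    where the neighbor falls outside the patch.
    """
    positions = {}
    for q in range(n_q):
        for r in range(n_r):
            x = spacing * (q + 0.5 * r)
            y = spacing * r * math.sqrt(3.0) / 2.0
            positions[(q, r)] = (x, y, 0.0)
    neighbors = {}
    for (q, r) in positions:
        ring = []
        for dq, dr in NEIGHBOR_OFFSETS:
            key = (q + dq, r + dr)
            ring.append(key if key in positions else None)
        neighbors[(q, r)] = ring
    return positions, neighbors
```

Interior vertices of this patch have six neighbors and boundary vertices have fewer, and storing each ring in a fixed clockwise order makes it cheap to look up the neighbor directly opposite a given one.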

Neighboring  vertices  are  further  organized 
into  triangular  planar  surface  elements  called 
facets, denoted $F = (f_1, f_2, \ldots, f_{n_f})$. The vertices
of a facet are also ordered in a clock-wise
fashion. In this work, we require that the initial
estimate of the surface have facets whose sides
are of equal length. The objective function described
below tends to maintain this equality,
but does not strictly enforce it. The representation
can be extended in a straightforward fashion
to support different surface resolutions by
sub-dividing  facets  (which  we  have  done).  How¬ 
ever,  facets  of  a  given  resolution  will  still  be  re¬ 
quired  to  have  approximately  equal  sides. 


4.3  Objective  Function 

The objective function $\mathcal{E}(S)$ that we use to recover
the surface is best described in two equations.
In the first equation,

$$\mathcal{E}(S) = \lambda_D \mathcal{E}_D(S) + \mathcal{E}_G(S), \tag{1}$$

$\mathcal{E}(S)$ is decomposed into a linear combination of
two components. The first component, $\mathcal{E}_D(S)$,
is  a  measure  of  the  deformation  of  the  surface 
from  a  nominal  shape,  and  is  independent  of  the 
images.  For  this  paper,  the  nominal  shape  is  a 
plane.  Higher-order  measures,  such  as  deforma¬ 
tion  from  a  sphere,  are  also  possible.  This  nom¬ 
inal  shape  represents  the  shape  that  the  surface 
would  take  in  the  absence  of  any  information 
from  the  images. 

The second component,

$\mathcal{E}_G(S) = \lambda_C \mathcal{E}_C(S) + \lambda_S \mathcal{E}_S(S)$, (2)

depends on the images, and is the one that
drives the reconstruction process. It is further
decomposed into a linear combination of the two
information sources described in the previous
section: a multi-image correlation component,
$\mathcal{E}_C(S)$, and a component that depends on the
shading of the surface, $\mathcal{E}_S(S)$.

These components, and their relative weights,
are described in more detail below.
Figure 2: The top row shows a hexagonal mesh as both a wireframe and a shaded surface. The bottom row
shows several images of a scene. In our approach, these images are projected onto the mesh using camera
models.


4.3.1  Surface  Deformation  Component 

As stated earlier, the surface deformation (or
smoothness) component is a measure of the deviation
of the mesh surface from some nominal
smooth shape. When the nominal shape is a
plane, we can approximate this as follows.

Consider a perfectly planar hexagonal mesh
for which the distances between neighboring
vertices are exactly equal. Recall that the mesh
is defined so that the neighbors of a vertex $v_i$ are
ordered in a clock-wise fashion, and are denoted
$v_{N_i(j)}$. If the hexagonal mesh were perfectly planar,
then the third neighbor over from the $j$-th
neighbor, $v_{N_i(j+3)}$, would lie on a straight line
with $v_i$ and $v_{N_i(j)}$. Given that the inter-vertex
distances are equal, this implies that the coordinates
of $v_i$ equal the average of the coordinates
of $v_{N_i(j)}$ and $v_{N_i(j+3)}$, for any $j$.

Given the above, we can write a measure of
the deviation of the mesh from a plane as follows:

$$\mathcal{E}_D(S) = \sum_{i=1}^{n_v} \sum_{j=1}^{3} \left[ (2x_i - x_k - x_{k'})^2 + (2y_i - y_k - y_{k'})^2 + (2z_i - z_k - z_{k'})^2 \right], \quad k = N_i(j), \; k' = N_i(j+3),$$

where the outer sum runs over the $n_v$ vertices of the mesh.

Note that this term is also equivalent to
the squared directional curvature of the surface
when the sides have approximately equal
lengths [27]. Also, this term can accommodate
multiple resolutions of facets by normalizing
each term by the nominal inter-vertex spacing
of the facets.
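
As a minimal sketch of the deformation measure (not the authors' code), the function below sums the squared deviation of each interior vertex from the midpoint of its three pairs of opposite neighbors. The function name and the choice to skip vertices with fewer than six neighbors are assumptions of this sketch; the text does not say how edge vertices are treated.

```python
from typing import Sequence, Tuple

Vec3 = Tuple[float, float, float]


def deformation_energy(vertices: Sequence[Vec3],
                       neighbors: Sequence[Sequence[int]]) -> float:
    """Sum of |2 v_i - v_k - v_k'|^2 over interior vertices i and j = 0, 1, 2,
    with k and k' the j-th and (j+3)-th ordered neighbors of vertex i."""
    total = 0.0
    for i, nbrs in enumerate(neighbors):
        if len(nbrs) != 6:
            continue  # assumption: edge vertices contribute nothing
        for j in range(3):
            k, k_opp = nbrs[j], nbrs[j + 3]
            for axis in range(3):
                d = (2.0 * vertices[i][axis]
                     - vertices[k][axis] - vertices[k_opp][axis])
                total += d * d
    return total
```

For a planar, equally spaced hexagon the value is zero. Lifting only the centre vertex by a height h makes each of the three opposite pairs contribute (2h)^2, so the total is 12 h^2.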

4.3.2  Multi-Image  Intensity  Correlation 

The  multi-image  intensity  correlation  compo¬ 
nent  is  the  sum  of  squared  differences  in  inten¬ 
sity  from  all  the  images  at  a  given  sample-point 
on  a  facet,  summed  over  all  sample-points,  and 
summed  over  all  facets.  This  component  is  pre¬ 
sented  in  stages  in  the  remainder  of  this  subsec¬ 
tion. 

First,  we  define  the  sample-points  of  a  facet 
by  taking  advantage  of  the  fact  that  all  points 
on  a  triangular  facet  are  a  convex  combination 
of its vertices. Thus, we can write the sample-points
$x_{k,l}$ of facet $f_k$ as:

$$x_{k,l} = \lambda_{l,1} x_{k,1} + \lambda_{l,2} x_{k,2} + \lambda_{l,3} x_{k,3}, \quad l = 3, 4, \ldots, n_s,$$

where $x_{k,1}$, $x_{k,2}$, and $x_{k,3}$ are the coordinates of
the vertices of facet $f_k$, and $\lambda_{l,1} + \lambda_{l,2} + \lambda_{l,3} = 1$.
In the top half of Figure 3(a), we see an example
of the sample points of a facet.

Next, we develop the sum of squared differences
in intensity from all images for a given
point $x$. Recall that a point $x$ in space is projected
into a point $u$ in image $g_i$ via the perspective
transformation $u = m_i(x)$. Consequently,
the sum of squared differences in intensity from
all the $n_i$ images, is:

$$\mu(x) = \frac{1}{n_i} \sum_{i=1}^{n_i} g_i(m_i(x)), \qquad \sigma^2(x) = \frac{1}{n_i} \sum_{i=1}^{n_i} \left( g_i(m_i(x)) - \mu(x) \right)^2.$$

Figure 3(a) illustrates the projection of a
sample-point of a facet onto several images.

The above definition of $\sigma^2(x)$ does not take
into account occlusions of the surface. To do
so, we use a “Facet-ID” image shown in Figure 4.
It is generated by encoding the index
$i$ of each facet $f_i$ as a unique color, and projecting
the surface into the image plane using a
standard hidden-surface algorithm. Thus, when
a sample-point from facet $f_k$ is projected into
an image, the index $k$ is compared to the index
stored in the Facet-ID image at that point.
If they are the same, then the sample-point is
visible in that image; otherwise, it is not. Let
$v_i(x) = 1$ when point $x$ is determined to be
visible in image $g_i$ by the method above, and
$v_i(x) = 0$ otherwise. Then, the correct form for
the sum of squared differences in intensity at a
point $x$ is:

$$\mu'(x) = \frac{\sum_{i=1}^{n_i} v_i(x)\, g_i(m_i(x))}{\sum_{i=1}^{n_i} v_i(x)}, \qquad \sigma'^2(x) = \frac{\sum_{i=1}^{n_i} v_i(x) \left( g_i(m_i(x)) - \mu'(x) \right)^2}{\sum_{i=1}^{n_i} v_i(x)}.$$

Finally, summing $\sigma'^2(x)$ over all sample-points
and over all facets yields the multi-image intensity
correlation component:

$$\mathcal{E}_C(S) = \sum_{k=1}^{n_f} c_k \sum_{l=3}^{n_s} \sigma'^2(x_{k,l}),$$

where $c_k$ is a number between 0 and 1 that
weights the contribution from each facet differently,
depending on the average degree of texturing
within a facet (see Section 4.3.4).
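
The two building blocks of this component, sample-points as convex combinations of facet vertices and a variance restricted to the images in which the point is visible, can be sketched as follows. The function names are hypothetical, and returning zero for a point that is visible in no image is an assumption of this sketch rather than something the text specifies.

```python
from typing import Sequence, Tuple

Vec3 = Tuple[float, float, float]


def sample_point(v1: Vec3, v2: Vec3, v3: Vec3,
                 lam: Tuple[float, float, float]) -> Tuple[float, ...]:
    """Convex combination of the three facet vertices (weights sum to 1)."""
    assert abs(sum(lam) - 1.0) < 1e-9 and min(lam) >= 0.0
    return tuple(lam[0] * a + lam[1] * b + lam[2] * c
                 for a, b, c in zip(v1, v2, v3))


def visible_variance(intensities: Sequence[float],
                     visible: Sequence[int]) -> float:
    """Variance of the intensities at one sample-point, counting only the
    images in which it is visible. intensities[i] plays the role of
    g_i(m_i(x)) and visible[i] the role of v_i(x), which is 0 or 1."""
    n_visible = sum(visible)
    if n_visible == 0:
        return 0.0  # assumption: an unseen point contributes nothing
    mean = sum(v * g for v, g in zip(visible, intensities)) / n_visible
    return sum(v * (g - mean) ** 2
               for v, g in zip(visible, intensities)) / n_visible
```

For example, with intensities 10, 12, and 50 where the third image is occluded, the mean is 11 and the variance is 1; if the occluded image were wrongly included, the variance would be dominated by the unrelated value 50.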

When  the  original  surface  giving  rise  to  the 
images  is  sufficiently  textured,  this  component 
should  be  smallest  when  the  surface  S  closely 
approximates  the  original  surface.  However, 
when  the  surface  has  constant,  or  nearly  con¬ 
stant,  albedo  this  component  would  be  small 
for  many  different  surfaces.  As  an  extreme  ex¬ 
ample  of  this  ambiguity,  consider  a  planar  sur¬ 
face  with  constant  albedo.  This  produces  im¬ 
ages  with  constant  intensity.  Thus,  this  compo¬ 
nent  will  not  be  able  to  constrain  the  shape  of 
the  surface,  since  the  difference  in  intensity  will 
be  zero  for  all  surfaces. 

An  example  of  using  only  the  intensity- 
correlation  and  smoothness  components  on  the 
synthetic  stereo  pair  of  Figure  1  is  shown  in  Fig¬ 
ure  5.  The  top  row  of  the  figure  depicts  the 
initial  surface  estimate.  Figures  5(a)  and  (b) 
are  shaded  images  of  the  mesh.  Figure  5(c)  de¬ 
picts  the  error  from  ground-truth  elevation  for 
the  left  image,  where  black  indicates  zero  error, 
and  white  indicates  an  error  corresponding  to  a 
few  pixels  in  disparity.  Figure  5(d)  depicts  the 
squared  difference  in  intensity  between  the  left 
image and the right image warped using the
disparity map. Note that the worst errors occur
along the steep ridge of the terrain, where the
constant-depth assumption of correlation-based
stereo is most strongly violated.

Figure 3: (a) Facets are sampled at regular intervals as illustrated here. We use the grey levels of the
projections of these sample points to compute the stereo score. (b) The albedo of each facet is estimated
using the facet normal $N$, the light source direction $L$ and the average grey level of the projection of the
facet into the images.

Figure 4: Illustration of the projection of a mesh, and the “Facet-ID” image used to accommodate occlusions
during surface reconstruction. (a) A shaded image of a mesh. (b) A wire-frame representation of the mesh
(bold white lines) and the sample-points in each facet (interior white points). (c) The “Facet-ID” image,
wherein the color at a pixel is chosen to uniquely identify the visible facet at that point (shown here as a
grey-level image).


The  bottom  row  of  Figure  5  illustrates  the  re¬ 
sult  of  the  optimization  procedure,  described  in 
Section  4.4,  using  only  the  intensity-correlation 
and  smoothness  components.  Note  that  the 
overall  error  in  both  elevation  and  intensity  is 
lower,  and  that  the  error  is  no  longer  concen¬ 
trated  along  the  ridge.  As  a  result,  the  ridge  is 
clearly  sharper  in  the  shaded  views. 


4.3.3  Shading 

The  shading  component  of  the  objective  func¬ 
tion  is  the  sum,  over  all  facets,  of  the  difference 
between  the  computed  albedo  of  the  facet  and 
the  computed  albedoes  of  all  of  its  neighbors. 
The  motivation  for  this  component,  and  its  pre¬ 
cise  form,  follow. 

Recall that the Lambertian reflectance model
defines the intensity $g$ at a point on a surface
with a unit surface normal $N$ as:

$g = \alpha (a + b\, N \cdot L)$, (3)

where $\alpha$ is the albedo of the surface, $a$ is the
magnitude of the ambient light, $b$ is the magnitude
of a point light source, and $L$ is the
direction of the point light source as depicted
in Figure 3(b).

Figure 5: (a,b) Two shaded views of the mesh derived from the smoothed disparity map of Figure 1(d). (c)
Deviations in altitude from the elevation data used to generate the synthetic pair. (d) Intensity error image,
created by warping the right image into the left image using the disparities corresponding to the elevations
of the mesh facets and computing the squared difference between these two images. (e,f,g,h) Corresponding
images after stereo optimization. Note that the ridge now appears much sharper in the shaded views, and
that the overall error is smaller and more evenly distributed.

Note  that  g  is  independent  of  the  viewing  di¬ 
rection.  Consequently,  if  we  were  to  image  a 
planar  Lambertian  facet  from  several  points  of 
view,  its  intensity  would  be  the  same  for  all  pix¬ 
els  in  the  projection  of  the  facet.  Conversely,  if 
we were to measure the average intensity $g_k$ of
all of the pixels within the projection of a facet
$f_k$, we could compute its albedo, $\alpha_k$, as follows:

$\alpha_k = \dfrac{g_k}{a + b\, N_k \cdot L}$, (4)

where $N_k$ is the unit normal of facet $f_k$.

This assumes, of course, that the facet is well-modeled
by a single albedo, and that the variation
in intensity is due only to noise. In this
paper, we assume that the ambient and direct
illumination (i.e., $a$, $b$, and $L$) are given, although
some of these parameters could be included
in the optimization, as was done in [31].
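
The albedo estimate is a direct inversion of the Lambertian model. The sketch below is illustrative only: the function name and argument names are hypothetical, the normal and light direction are normalized inside the function for convenience, and rejecting non-positive shading is an assumption of this sketch.

```python
import math
from typing import Sequence


def facet_albedo(mean_intensity: float,
                 normal: Sequence[float],
                 light_dir: Sequence[float],
                 ambient: float,
                 direct: float) -> float:
    """Albedo = mean facet intensity / (ambient + direct * N . L)."""
    n_len = math.sqrt(sum(c * c for c in normal))
    l_len = math.sqrt(sum(c * c for c in light_dir))
    cos_term = sum(n * l for n, l in zip(normal, light_dir)) / (n_len * l_len)
    shading = ambient + direct * cos_term
    if shading <= 0.0:
        # Assumption: a facet receiving no modelled light cannot be inverted.
        raise ValueError("facet receives no light under this model")
    return mean_intensity / shading
```

With ambient 0.2 and direct 0.8, a facet lit head-on with mean intensity 0.5 and a facet lit at 60 degrees with mean intensity 0.3 both yield an albedo of about 0.5, which is the behaviour the shading component relies on.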

The  average  intensity  gk  of  a  facet  is  com¬ 
puted  by  scanning  over  all  the  Facet-ID  images 
for  index  k,  and  taking  the  average  of  the  inten¬ 
sities  at  matching  points  in  the  corresponding 
images.  This  method  provides  an  inexpensive 
way of computing the average intensity while
taking occlusions into account.

Now, if the original surface had exactly constant
albedo, and if our mesh surface were
a good approximation to the original surface,
then the computed albedoes should be approximately
the same across all facets. Thus, some
measure of the variation in computed albedoes
would be a good measure of the correctness of
the mesh surface. If the albedo varies slowly
across the surface, we propose that an appropriate
measure of this variation is the difference
between the computed albedo at the facet and
the computed albedoes of all of its neighbors:

$$\mathcal{E}_S(S) = \sum_{k=1}^{n_f} (1 - c_k) \sum_{j \in N_f(k)} (1 - c_j) (\alpha_k - \alpha_j)^2,$$

where $N_f(k)$ is the set of indices of the facets
that are neighbors of facet $f_k$, and $c_k$ and $c_j$ are
numbers between 0 and 1 that depend on the
degree of texturing within facets $f_k$ and $f_j$.
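
A minimal sketch of this sum, assuming the per-facet albedoes, the texture weights, and the facet adjacency lists have already been computed (the function and argument names are hypothetical):

```python
from typing import Sequence


def shading_energy(albedo: Sequence[float],
                   c: Sequence[float],
                   facet_neighbors: Sequence[Sequence[int]]) -> float:
    """Sum over facets k and their neighbors j of
    (1 - c_k) * (1 - c_j) * (albedo_k - albedo_j)^2."""
    total = 0.0
    for k, nbrs in enumerate(facet_neighbors):
        inner = sum((1.0 - c[j]) * (albedo[k] - albedo[j]) ** 2 for j in nbrs)
        total += (1.0 - c[k]) * inner
    return total
```

Because each factor (1 - c) vanishes for a fully textured facet, any pair involving such a facet drops out of the sum, so the term is active only between weakly textured neighbors.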

An  example  of  using  only  the  shading  and 
smoothness  components  is  illustrated  in  Fig¬ 
ure  6.  Figure  6(a)  shows  a  shaded  view  of 
the  original  surface,  a  hemisphere  with  constant 
albedo.  Figures  6(b)  and  (c)  show  shaded  views 
of  the  initial  surface  estimate,  which  was  de¬ 
rived  by  adding  white  noise  to  the  vertex  co¬ 
ordinates  of  the  original  surface.  Figures  6(d) 
and  (e)  are  the  shaded  views  of  the  result  af¬ 
ter  optimization,  and  Figure  6(f)  is  the  albedo 
map  for  the  surface,  i.e.  the  intensity  in  the  im¬ 
age  represents  the  albedo  of  the  surface.  Note 
that  the  albedo  and  shape  are  well  recovered  ex¬ 
cept  near  the  edge  of  the  hemisphere  where  the 
image  intensity  varies  rapidly  across  the  image. 
This  is  because  the  approximation  we  use  in  the 
derivatives  of  this  component  is  that  the  mean 
intensity  within  a  facet  does  not  vary  signifi¬ 
cantly  in  the  neighborhood  of  a  facet,  which  is 
violated  for  facets  that  straddle  the  boundary. 
This  does  not  hurt  us  when  combining  shading 
with the stereo component since, as explained in
the  following  subsection,  we  turn  off  the  shading 
component  in  such  areas. 

4.3.4  Combining  the  Components 

Recall that the objective function $\mathcal{E}(S)$ is a linear
combination of three components:

$$\mathcal{E}(S) = \lambda_D \mathcal{E}_D(S) + \lambda_C \mathcal{E}_C(S) + \lambda_S \mathcal{E}_S(S),$$

where the last two components are themselves
linear combinations of subcomponents computed
on a per-facet basis:

$$\mathcal{E}_C(S) = \sum_{k=1}^{n_f} c_k \sum_{l=3}^{n_s} \sigma'^2(x_{k,l}), \quad (5)$$

$$\mathcal{E}_S(S) = \sum_{k=1}^{n_f} (1 - c_k) \sum_{j \in N_f(k)} (1 - c_j) (\alpha_k - \alpha_j)^2.$$

Thus, one needs to specify both the $\lambda$s, defining
the relative weights of the components, and the
$c_k$s, defining the relative weights of the facets in
each of these components.

The $\lambda$ weights are defined as follows:

$$\lambda_D = \frac{\lambda'_D}{\| \nabla \mathcal{E}_D(S^0) \|}, \qquad \lambda_C = \frac{\lambda'_C}{\| \nabla \mathcal{E}_C(S^0) \|}, \qquad \lambda_S = \frac{\lambda'_S}{\| \nabla \mathcal{E}_S(S^0) \|}, \quad (6)$$

where $S^0$ is the initial estimate of the surface,
and the $\lambda'$s are user defined weights. Normalizing
each component by the magnitude of its
initial gradient allows the components to have
roughly the same influence when the $\lambda'$s are
equal. Thus, the user can more easily specify
the relative contributions of each component in
an image-independent fashion. This normalization
scheme was used with great success in [16],
and is analogous to standard constrained optimization
techniques in which the various constraints
are scaled so that their eigenvalues have
comparable magnitudes [34].

As mentioned earlier, the $c_k$ weights are a
function of the degree of texturing in the intensities
projected within a facet $f_k$. A simple
measure of the degree of texturing within a
facet is the variance in intensity of all the pixels
projecting onto the facet, denoted $\sigma_k^2(S)$ (using
the Facet-ID image to accommodate occlusions).
We have found that using the logarithm
of $\sigma_k^2(S)$ yields the most stable results:

$c_k = a \log(1 + \sigma_k^2(S)) + b$, (7)

where $a$ and $b$ are normalizing factors chosen so
that the smallest $c_k$ is zero, and the largest is
one.
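
The two normalizations in this subsection can be sketched as below. The function names are hypothetical; the handling of the degenerate case in which all facets have the same variance is an assumption of this sketch, since the text does not cover it.

```python
import math
from typing import List, Sequence


def texture_weights(variances: Sequence[float]) -> List[float]:
    """c_k = a * log(1 + variance_k) + b, with a and b chosen so that the
    smallest weight is 0 and the largest is 1."""
    logs = [math.log(1.0 + v) for v in variances]
    lo, hi = min(logs), max(logs)
    if hi == lo:
        return [0.0 for _ in logs]  # assumption: no contrast, no texture weight
    a = 1.0 / (hi - lo)
    b = -a * lo
    return [a * t + b for t in logs]


def normalized_lambda(user_weight: float,
                      initial_gradient: Sequence[float]) -> float:
    """User weight divided by the norm of the component's gradient at the
    initial surface estimate."""
    norm = math.sqrt(sum(g * g for g in initial_gradient))
    return user_weight / norm
```

Dividing by the initial gradient norm means that equal user weights give components of comparable pull at the start of the optimization, regardless of the image intensity scale.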

4.4  The  Optimization  Procedure 

The purpose of the optimization procedure is to
iteratively modify the surface $S$ so as to minimize
$\mathcal{E}(S)$, given some initial estimate $S^0$ and
some value for the weights $\lambda'_S$, $\lambda'_C$, and $\lambda'_D$
(where $\lambda'_S + \lambda'_C + \lambda'_D = 1$) defined in Equation 7.
Ideally, one would like to use as small a
value of the deformation weight $\lambda'_D$ as possible
so as to minimize the bias introduced by this
term. However, in practice, $\lambda'_D$ serves a dual
purpose. First, since the surface deformation
term is a quadratic function of the vertex coordinates,
it “convexifies” the energy landscape
and improves the convergence properties of the
optimization procedure. Second, as will be discussed
in the results section, in the absence of
a smoothing term, the objective function may
overfit the data and wrinkle the surface excessively.
Furthermore, the $c_k$ weights of Equations
6 and 7 are computed for the initial position of
the mesh and are only meaningful when it is
relatively close to the actual surface.

Figure 6: (a) Shaded image of a hemisphere of constant albedo. (b,c) Shaded views of randomized hemisphere
used as a starting point. (d,e) Shaded views of the same hemisphere after optimization using only the shading
component of the objective function. (f) The recovered albedo map.

Consequently, we use an optimization method
that is inspired by the heuristic technique
known as a continuation method [43, 28, 29, 30].
We first “turn off” the shading term by setting
$\lambda'_S$ (equation 7) to 0 and set $\lambda'_D$ to a value that
is large enough to sufficiently convexify the energy
landscape but small enough to allow curvature
in the surface. In this paper we take the
initial value of $\lambda'_D$ to be 0.5. Given the initial
estimate $S^0$, a local minimum of this approximate
objective function is found using a standard
optimization procedure. Then, $\lambda'_D$ is decreased
slightly, and the optimization procedure
is applied again, starting at the local minimum
found for the previous approximation. This cycle
is repeated until $\lambda'_D$ is decreased to the desired
value. Finally we “turn on” the shading
term, compute the $c_k$ weights and reoptimize.
In all examples shown in the result section we
use $\lambda'_C = \lambda'_S = 0.4$ and $\lambda'_D = 0.2$.
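
The schedule described above can be sketched as a list of weight triples that are visited in order, each stage starting from the minimum found by the previous one. Several details here are assumptions of the sketch and not taken from the paper: the number of intermediate steps, the linear spacing between them, and setting the stereo weight to one minus the deformation weight while shading is off. The `minimize` argument stands for any local optimizer and is hypothetical.

```python
from typing import Callable, List, Tuple, TypeVar

Surface = TypeVar("Surface")
Weights = Tuple[float, float, float]  # (deformation, stereo, shading) user weights


def continuation_schedule(start_d: float = 0.5, final_d: float = 0.2,
                          n_steps: int = 3, final_cs: float = 0.4) -> List[Weights]:
    """Stereo-only stages with decreasing deformation weight, followed by one
    final stage in which the shading term is turned on."""
    stages: List[Weights] = []
    for s in range(n_steps + 1):
        d = start_d + (final_d - start_d) * s / n_steps
        stages.append((d, 1.0 - d, 0.0))  # shading off (assumed stereo weight)
    stages.append((final_d, final_cs, final_cs))  # shading on
    return stages


def run_schedule(surface: Surface, stages: List[Weights],
                 minimize: Callable[[Surface, Weights], Surface]) -> Surface:
    """Apply the optimizer once per stage, warm-starting from the last result."""
    for weights in stages:
        surface = minimize(surface, weights)
    return surface
```

The warm start is the essential part: each stage only has to track a minimum that has moved slightly, which is what makes the large initial smoothing weight useful without biasing the final result.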

The stereo component effectively uses only
zero order information about the surface (i.e.,
the position of the vertices), whereas shading
uses first order information about the surface
(i.e., its surface normals). Thus, by optimizing
the stereo component first, we effectively
compute the zero order properties of the
surface and set up boundary conditions that the
shading component can then use to compute the
first order properties of the surface in textureless
regions. In section 5, we will show that this
leads to a significant improvement over using
the stereo component alone.

When dealing with surfaces for which motion
in one direction leads to more dramatic changes
than motions in others, as is typically the case
with the z direction in Digital Elevation Models
(DEMs), we have found the following
heuristic to be useful. We first fix the x and y
coordinates of vertices and adjust z alone. Once
the surface has been optimized, we then allow all
of the coordinates to vary simultaneously.

The optimization procedure we use at every
stage is a standard conjugate-gradient descent
procedure called FRPRMN (from [38]) in
conjunction with a simple line search algorithm.
The conjugate-gradient procedure requires
three inputs: 1) a function that returns
the value of the objective function for any $S$; 2)
a function that returns the gradient of $\mathcal{E}(S)$,
i.e., a vector whose elements are the partial
derivatives of $\mathcal{E}(S)$ with respect to the vertex
coordinates, evaluated at $S$; and 3) an initial
estimate $S^0$.
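
To illustrate the three-input interface only, here is a small nonlinear conjugate-gradient routine. It is not FRPRMN: it uses a Polak-Ribiere update clipped at zero and a simple backtracking line search, both chosen for brevity in this sketch, and the function name, iteration limit, and tolerances are assumptions.

```python
from typing import Callable, List

Vector = List[float]


def conjugate_gradient(f: Callable[[Vector], float],
                       grad: Callable[[Vector], Vector],
                       x0: Vector, iters: int = 500,
                       tol: float = 1e-10) -> Vector:
    """Minimize f from x0 given its value function and its gradient function."""
    x = list(x0)
    g = grad(x)
    d = [-gi for gi in g]
    for _ in range(iters):
        gg = sum(gi * gi for gi in g)
        if gg < tol:
            break
        slope = sum(gi * di for gi, di in zip(g, d))
        if slope >= 0.0:
            # Not a descent direction: restart along steepest descent.
            d = [-gi for gi in g]
            slope = -gg
        # Backtracking line search: halve the step until f decreases enough.
        step, fx = 1.0, f(x)
        x_new = [xi + step * di for xi, di in zip(x, d)]
        while f(x_new) > fx + 1e-4 * step * slope and step > 1e-12:
            step *= 0.5
            x_new = [xi + step * di for xi, di in zip(x, d)]
        g_new = grad(x_new)
        # Polak-Ribiere coefficient, clipped at zero.
        beta = max(0.0, sum(gn * (gn - go) for gn, go in zip(g_new, g)) / gg)
        d = [-gn + beta * di for gn, di in zip(g_new, d)]
        x, g = x_new, g_new
    return x
```

In the setting of this section, the vector would hold all free vertex coordinates of the mesh, `f` would evaluate the weighted objective, and `grad` would return its partial derivatives with respect to those coordinates.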

The gradient of $\mathcal{E}(S)$ is conceptually straightforward,
but is fairly complicated to derive manually.
We have used the Maple mathematical
package to derive some of the terms. We summarize
the calculation of the derivatives below
in general terms.

The  derivatives  of  the  stereo  term  are  lin¬ 
ear  combinations  of  image  intensity  derivatives 
and  of  derivatives  of  the  3-D  projections  of 
points onto the images. Since we use bilinear
interpolation of image values, the first derivatives
of image intensity are linear combinations
of  the  image  intensities  in  the  immediate  neigh¬ 
borhood  of  the  projection.  Since  sample-points 
are  linear  combinations  in  projective  space  of 
the  mesh  vertices,  their  projections  are  ratios 
of  linear  combinations  of  the  projections  of 
the  vertices,  which  themselves  depend  linearly 
on  the  vertex  coordinates.  Consequently,  the 
derivatives  of  these  projections  are  ratios  of  lin¬ 
ear  combinations  of  the  vertex  coordinates  and 
squares  of  linear  combinations  of  the  vertex  co¬ 
ordinates. 
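
The statement about bilinear interpolation can be made concrete with a short sketch: both the interpolated value and its two partial derivatives are linear combinations of the four surrounding pixel values. The function name, the row-major list-of-lists image layout, and the clamping at the image border are assumptions of this sketch.

```python
import math
from typing import List, Tuple

Image = List[List[float]]  # image[row][column]


def bilinear(image: Image, u: float, v: float) -> Tuple[float, float, float]:
    """Return (value, d/du, d/dv) at column u and row v."""
    c0 = min(max(int(math.floor(u)), 0), len(image[0]) - 2)
    r0 = min(max(int(math.floor(v)), 0), len(image) - 2)
    fu, fv = u - c0, v - r0
    g00, g01 = image[r0][c0], image[r0][c0 + 1]
    g10, g11 = image[r0 + 1][c0], image[r0 + 1][c0 + 1]
    value = ((1 - fu) * (1 - fv) * g00 + fu * (1 - fv) * g01
             + (1 - fu) * fv * g10 + fu * fv * g11)
    # Derivatives are linear in the same four neighboring intensities.
    d_du = (1 - fv) * (g01 - g00) + fv * (g11 - g10)
    d_dv = (1 - fu) * (g10 - g00) + fu * (g11 - g01)
    return value, d_du, d_dv
```

The derivative of the stereo term with respect to a vertex coordinate then follows from the chain rule, combining these image derivatives with the derivatives of the projected sample-point position.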

Similarly, the derivatives of the shading term
depend on the derivatives of the surface normal,
which can be easily derived analytically,
and on the derivative of the mean grey-level
in the facets. In this work, the shading term is
used mainly in the fairly uniform areas where
the latter derivative is assumed to be small and
therefore neglected.

5  Behavior  of  the  Objective 
Function  and  Results 

In previous sections, we have shown results of
the optimization procedure using only one or
the other of the image components of the objective
function. In this section, we first illustrate
the behavior of the complete objective function
using synthetic data. We then show that the
same behavior can be observed with real data,
allowing us to generate accurate 3-D reconstructions
of real surfaces from multiple images.

5.1  Synthetic  Data 

To demonstrate the properties of the objective
function of Equation 1 and the influence of the
coefficients defined in Equation 4, we use as input
the five synthetic images of a shaded hemisphere
with variable albedo shown at the bottom
of Figure 7, both with and without the addition
of white noise. Each column of the figure
illustrates the steps used in the creation of the
image at the bottom of the column. We begin
with a mesh and an albedo map, shown in
the top row. Then, for each view, two images
are produced. The first image (second row of
the figure) is the albedo map texture-mapped
onto the mesh from the final image’s point of
view. The second image (third row of the figure)
is a shaded view of the mesh, using a constant
albedo equal to one. The final image is the
point-by-point product of these two images because,
by Equation 3, the imaged intensity of a
Lambertian surface is the product of the albedo
(first image) and the inner product of the light
source and the surface normal (second image).
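
The last step of this construction is a per-pixel product, sketched below with a hypothetical function name and a list-of-lists image layout assumed for illustration.

```python
from typing import List

Image = List[List[float]]


def compose_lambertian(albedo_map: Image, unit_albedo_shading: Image) -> Image:
    """Pixel-wise product of an albedo image and a shaded view rendered with
    albedo equal to one, both taken from the same viewpoint."""
    return [[a * s for a, s in zip(albedo_row, shading_row)]
            for albedo_row, shading_row in zip(albedo_map, unit_albedo_shading)]
```

Because the product separates cleanly into an albedo factor and a shading factor, the synthetic images conform exactly to the Lambertian model that the shading component assumes.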

Figure 8 depicts graphically the result of our
experiments. In each experiment we randomized
the mesh by adding random numbers to
the coordinates of the mesh vertices, and added
different amounts of noise to the input images.
We then used our optimization procedure to estimate
the true hemispherical shape and true
albedo map. More precisely, starting from our
randomized initial estimate, we first use stereo
alone and progressively decrease the value of
the parameter $\lambda'_D$ of Equation 7 from 0.5 to
0. We then turn on the shading term by setting
both $\lambda'_C$ and $\lambda'_S$ to 0.4, compute the $c_k$s
of Equation 7 and optimize the full objective
function. To show the stability of the process,
we then recompute the $c_k$s for the optimized
mesh and perform a second optimization using
the updated values.

Figure 7: The making of synthetic images of a shaded hemisphere with variable albedo that conforms to
our Lambertian model.

The  first  column  of  Figure  8  is  for  experi¬ 
ments  using  only  the  first,  second,  and  third  im¬ 
ages  from  Figure  7,  where  there  is  little  self  oc¬ 
clusion.  The  second  column  is  for  experiments 
using  the  first,  fourth,  and  fifth  images,  where 


there  is  a  significant  amount  of  self  occlusion. 
Finally,  the  third  column  is  for  experiments  us¬ 
ing  all  five  images.  In  this  particular  set  of  ex¬ 
periments,  we  fixed  the  boundaries  of  the  mesh 
and  allowed  only  the  z  coordinates  of  the  ver¬ 
tices  to  vary.  However,  the  same  overall  be¬ 
haviors  can  be  observed  without  the  boundary 
conditions. 

The first row from the top of Figure 8 is
a graph of the average squared error in elevation
(the ordinate) versus decreasing $\lambda'_D$ (the
abscissa). To the left of the dotted vertical
line, only the intensity correlation component is
used. To the right, both the intensity correlation
and shading components are used. The different
curves are for different amounts of noise
in the input images. The bottom curve is when
there is no noise (other than quantization error),
the middle curve is for a noise variance of 4%
of the image dynamic range, and the top curve
is for a noise variance of 8%. The short vertical
lines along the curves indicate the standard
deviation of the average error over the 20 experiments
performed to derive each curve.

Figure 8: Graphs of the errors and objective function components while fitting a surface model to the
synthetic shaded hemisphere images of Figure 7 (these graphs are explained in detail in the text). (a,b,c)
Average error in recovered elevation. (d,e,f) Average error in recovered albedo. (g,h,i) Stereo component of
the energy. (j,k,l) Shading component of the energy.

The second row of Figure 8 is a graph of the
average error in computed albedo. The third
row is the average value of the intensity correlation
component, $\mathcal{E}_C(S)$, and the fourth row
is the average value of the shading component,
$\mathcal{E}_S(S)$.

Note that, as $\lambda'_D$ decreases and stereo alone
is used (i.e., as the abscissa is traversed rightwards
to the dotted vertical line), the average
elevation error decreases when there is no noise
in the input image (bottom curve), as does
the average albedo error and the two components
of the objective function. However, when
the images are noisy, the elevation error (first
row) stops decreasing and may even begin to
increase as we start fitting to the grey-level
noise, even though the value of the intensity
correlation component (third row) continues to
decrease (as it must). Furthermore, both the
albedo error (second row) and the shading component
(fourth row) also begin to increase when
the elevation error does. This is natural since
for smaller values of $\lambda'_D$ the surface becomes
rougher and its normals less well-behaved. As
a result, the estimated albedoes of Equation 4
become less reliable and noisier.

In  other  words,  an  increase  in  the  shading 
component  provides  us  with  a  warning  that  we 
are  starting  to  overfit  the  data.  This  is  a  valu¬ 
able  behavior  in  itself.  Furthermore,  by  turning 
on  the  shading  component  of  our  objective  func¬ 
tion  (those  parts  of  the  graphs  that  are  to  the 
right  of  the  vertical  dotted  line),  we  can  bring 
down both the error in albedo and the value
of the albedo component with at worst a modest
increase in the value of the stereo component,
resulting  in  an  overall  reduction  of  the  elevation 
error.  Even  when  there  is  nothing  but  quanti¬ 
zation  noise  in  the  image,  the  addition  of  the 
shading  component  can  make  a  small,  but  still 
noticeable  difference.  The  reasons  for  this  are 
twofold: 

1.  The  shading  component  averages  over 
whole  facets  and  is  therefore  less  sensitive 
to  uncorrelated  noise. 

2.  The  shading  component  uses  absolute  in¬ 
tensity  values  whereas  the  stereo  compo¬ 
nent  uses  intensity  differences.  Thus,  in  the 
presence  of  noise  in  textureless  areas,  the 
signal-to-noise ratio for the absolute values
(used  by  the  shading  component)  is  larger 
than  for  the  differences  (used  by  the  stereo 
component),  thereby  making  the  shading 
term  more  robust. 

However,  in  our  experience,  the  shading  term 
can  only  be  used  reliably  when  the  surface  is  rel¬ 
atively  close  to  the  correct  answer.  This  is  not 
surprising  since  the  stereo  deals  directly  with  el¬ 
evations  whereas  shading  deals  with  derivatives 
of  elevation.  Consequently  we  have  chosen  the 
optimization  “schedule”  described  above  where 
we  first  optimize  using  stereo  alone  and  turn  on 
shading  only  later. 


There is another important point to note
about these results. The elevation errors in the
second column, i.e., those generated using images 1,
4, and 5 with a lot of self occlusion, are very close
to those of the first column, i.e., those generated using
images 1, 2, and 3 with little self occlusion,
while those in the final column (using all five images)
are significantly better. Furthermore, in
this particular case, the results for images 1, 4,
and 5 are even slightly better than those for
images 1, 2, and 3 in the presence of noise because
the former correspond to larger baselines.
In  other  words,  having  the  same  number  of  im¬ 
ages,  but  with  significant  self-occlusions,  does 
not  hurt  our  procedure.  However,  adding  new 
images  that  contain  significant  self-occlusions 
actually  improves  the  results. 

We  now  turn  to  real  images  and  show  that 
the  same  properties  can  also  be  observed  there. 

5.2  Real  Images 

In  Figure  9  we  show  the  result  of  running  the 
stereo  component  of  our  objective  function  on  a 
real  stereo  pair  corresponding  to  the  same  site 
as the synthetic images of Figure 1. Note that
the radiometry of the left and right images is
actually slightly different. We correct for this
by first band-passing each image by taking the
difference between the image and its Gaussian
convolution.  This  is  approximately  equivalent 
to  replacing  the  simple  correlation  that  our  ob¬ 
jective  function  uses  by  a  normalized  correla¬ 
tion,  but  is  computationally  more  efficient.  We 
then  applied  the  optimization  using  exactly  the 
same schedule and parameters as in the synthetic
case, with the exception that $\lambda'_D$ is not
reduced  quite  as  much  for  the  real  images  as 
for  the  synthetic  ones  in  the  first  step  of  the 
procedure.  Note  that  the  recovered  ridge  is 
even  sharper  than  in  the  synthetic  case.  This 
is  because  the  Digital  Elevation  Model  used  to 
produce  the  synthetic  right  image  was  actually 
a  slightly  smoothed  version  of  the  terrain,  in 
which one side of the ridge is an almost vertical
cliff. Thus, even though we do not currently
have ground truth for the real case, the sharpness
of the recovered cliff, which matches what
is seen using a stereoscope, leads us to believe
that  the  algorithm  has  performed  well. 
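
The band-pass correction described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the 3-sigma kernel truncation, the edge-replication border handling, and the default `sigma` are all assumptions made for the example.

```python
import numpy as np

def gaussian_kernel(sigma):
    """1-D Gaussian kernel, truncated at 3 sigma and normalized to sum to 1."""
    radius = max(1, int(np.ceil(3.0 * sigma)))
    x = np.arange(-radius, radius + 1, dtype=float)
    k = np.exp(-0.5 * (x / sigma) ** 2)
    return k / k.sum()

def gaussian_blur(image, sigma):
    """Separable Gaussian blur with edge replication at the borders (an assumption)."""
    k = gaussian_kernel(sigma)
    radius = len(k) // 2
    padded = np.pad(np.asarray(image, dtype=float), radius, mode="edge")
    # Filter along rows, then along columns; 'valid' removes the padding again.
    rows = np.apply_along_axis(lambda v: np.convolve(v, k, mode="valid"), 1, padded)
    return np.apply_along_axis(lambda v: np.convolve(v, k, mode="valid"), 0, rows)

def band_pass(image, sigma=2.0):
    """Difference between the image and its Gaussian convolution."""
    return np.asarray(image, dtype=float) - gaussian_blur(image, sigma)
```

Because the kernel sums to one, a constant radiometric offset between the two images cancels exactly in `band_pass`, which is the property the correlation relies on here; a gain difference is only reduced, not removed.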
Figure 9: (a,b) A stereo pair of real images of the Martin-Marietta ALV test-site used in Figure 1. (c)
Intensity error image computed using the method described in Figure 1(c). (d,e) Shaded views of the mesh
after optimization. (f) Intensity error image after optimization. Note that the ridge is now very sharp. This
corresponds  accurately  to  the  almost  vertical  cliff  that  can  be  seen  when  viewing  the  stereo  pair  with  a 
stereoscope. 


In  Figure  10  we  show  three  triplets  of  images 
of faces. They have been produced using the INRIA
three camera system [13] that provides us
with  the  3  by  4  projection  matrices  we  need  to 
perform  our  computations.  In  this  case  it  is  es¬ 
sential  to  have  more  than  two  images  to  be  able 
to  reconstruct  both  sides  of  the  face  because  of 
self-occlusions.  For  each  triplet,  we  have  com¬ 
puted  disparity  maps  corresponding  to  images  1 
and  2  and  to  images  1  and  3  and  combined  them 
to  produce  the  depth  maps  shown  in  the  right¬ 
most  column  of  the  figure  using  the  algorithms 
described  in  [19,  15]. 

The  depth  maps  have  then  been  smoothed 
and  triangulated  to  produce  the  initial  surfaces 
shown  in  the  upper  left  corner  of  Figures  11, 
12, and 13. In the first row of these three figures,
we show the result of the optimization
using stereo alone as we progressively decrease
the smoothness constraint and allow all three
vertex coordinates to be adjusted. Note that
for the first two triplets (Figures 11 and 12),
we recover more and more detail until the surface
eventually starts to wrinkle, without apparent
improvement in accuracy. The third triplet
poses  an  even  more  difficult  problem:  there  are 
strong  specularities  on  both  the  forehead  and 
the nose that strongly violate our Lambertian
model.  Because  there  are  very  few  other  points 
that  can  be  matched  on  the  nose,  the  algorithm 
latches  on  to  these  specularities  and  yields  a 
poor  result. 

In the bottom row of Figures 11, 12, and
13, we show our final results obtained by turning
on the shading term and reoptimizing the
meshes. For these images we did not know a priori
the light source direction; we therefore
estimated it by choosing the direction that minimizes
the shading component of the objective
function given the surface optimized using only
the stereo component. In all three images, the
main features of the faces (nose, mouth and
eyes) have been correctly recovered. The improvement
is particularly striking in the case of
the face in Figure 13. The shading component
was able to achieve this result because it uses
Figure 10: Triplets of face images and corresponding disparity maps (courtesy of INRIA).


the monocular information around the specularities.
The stereo component cannot take advantage
of the information around the specularities
because  very  few  points  are  visible  in  at  least 
two  images  simultaneously,  and  because  there  is 
little  texture.  Of  course,  the  effect  of  the  spec¬ 
ularities  has  not  completely  disappeared  (there 
is  indeed  still  a  small  artifact  on  the  nose)  but 


has been outweighed by the surrounding information.
A more principled approach to solving
this problem would be to explicitly include a
specularity term in our shading model.

The  graphs  of  Figure  14  depict  the  behav¬ 
ior  of  the  stereo  and  shading  components  of  the 
objective  function  for  the  three  triplets.  The 
four values of the scores to the left of the thick
Figure 11: Results for the first triplet of Figure 10. (a) Shaded view of the mesh generated by smoothing and
triangulating the computed disparity map. We use it as the starting condition for our optimization procedure.
(b,c,d)  The  mesh  after  optimization  using  only  the  stereo  term,  with  progressively  less  smoothing.  (e,f,g) 
Several  views  of  the  mesh  after  optimization  using  both  stereo  and  shading,  (h)  The  recovered  albedo  map. 


dotted line, St0 to St3, correspond to the results
shown in the top row of Figures 11, 12,
and 13. The fifth value, St + Sh, corresponds
to the final results when shading is turned on.
These values have been scaled so that St0 is
equal  to  one  for  all  triplets.  As  in  the  synthetic 
case,  when  using  stereo  alone,  the  stereo  com¬ 
ponent  always  improves,  but  as  the  recovered 
surface  becomes  rougher  the  shading  term  de¬ 
grades  dramatically.  However,  when  we  turn  on 
the  shading  component,  the  overall  results  im¬ 
prove  significantly,  even  though  the  stereo  com¬ 
ponent  degrades  slightly. 

6  Summary  and  Conclusion 

In this paper we have presented a surface reconstruction
method that uses an object-centered
representation (a triangulated mesh) to recover
geometry and reflectance properties from multiple
images. It allows us to handle self-occlusions
while merging information from several
viewpoints, thereby allowing us to eliminate
blind spots and making the reconstruction
more  robust  where  more  than  one  view  is  avail¬ 
able.  The  reconstruction  process  relies  on  both 
monocular  shading  cues  and  stereoscopic  cues. 
We  use  these  cues  to  drive  an  optimization  pro¬ 
cedure  that  takes  advantage  of  their  respective 
strengths while eliminating some of their weaknesses.

Specifically,  stereo  information  is  very  ro¬ 
bust  in  textured  regions  but  potentially  unre¬ 
liable  elsewhere.  We  therefore  use  it  mainly  in 
such  areas  by  weighting  the  stereo  component 
most strongly for facets of the triangulation that
project into textured image areas. The component
compares the grey-levels of the points in
all of the images for which the projection of a
given point on the surface is visible, as determined
using a hidden-surface algorithm. This
comparison is done for a uniform sampling of
Figure 12: Results for the second triplet of Figure 10 presented in the same fashion as in Figure 11.


the  surface.  This  method  allows  us  to  deal  with 
arbitrarily  slanted  regions  and  to  discount  oc¬ 
cluded  areas  of  the  surface. 

On  the  other  hand,  shading  information  is 
mostly  helpful  in  textureless  areas.  Thus,  we 
weight  the  shading  component  most  strongly  for 
facets  that  project  into  such  areas.  The  com¬ 
ponent  uses  a  new  method  for  utilizing  shad¬ 
ing  information  that  does  not  need  the  tra¬ 
ditional  assumption  of  constant  albedo.  In¬ 
stead,  it  attempts  to  minimize  the  variation  in 
albedo  across  the  surface,  and  can  therefore  deal 
with  both  constant  albedo  surfaces  and  surfaces 
whose  albedo  varies  slowly.  However,  it  does  re¬ 
quire  the  boundary  conditions  that  are  provided 
by  the  stereo  information. 

We  have  developed  a  weighting  scheme  that 
allows our system to use each source of information
where it is most appropriate. As a result,
for the large class of surfaces that roughly satisfy
the Lambertian model, it performs significantly
better than if it were using either source
of  information  alone. 

Our surface model can be naturally augmented
to include specularities, shadows and
self-shadows.  It  can  also  support  more  complex 
topologies,  multiple  resolutions  and  the  shrink¬ 
ing  or  growing  of  the  surface  of  interest,  though 
in this paper we concentrated on a better understanding
of the behavior of the objective function.
These extensions will be the subject of
future  work. 
Abstract

A  new  approach  to  shape  from  shading  is  described, 
based on relating this problem to an "equivalent"
optimal control problem. The approach leads naturally
to an algorithm for surface reconstruction that
is simple, fast, provably convergent, and (under suitable
conditions) provably convergent to the correct
surface. The theoretical basis of the approach is developed
in this paper, extending some earlier partial
results.  In  addition,  a  new  reconstruction  algorithm 
is  presented  that  unlike  earlier  ones  can  reconstruct 
a  surface  with  no  a  priori  information  about  it. 

1  Introduction 

We have recently developed a new, practical approach
to shape from shading, based on relating
this problem to an "equivalent" optimal control
problem [Oliensis and Dupuis, 1992; Dupuis
and Oliensis, 1992a; Oliensis and Dupuis, 1991;
Rouy and Tourin, 1991a; Rouy and Tourin, 1991b;
Bichsel and Pentland, 1992]. The approach leads
naturally to an algorithm for surface reconstruction
that is simple, fast, provably convergent, and (under
suitable conditions) provably convergent to the correct
surface. In experiments on 200 x 200 and 128
x 128 real and synthetic images, convergence took
fewer than 15 iterations, and less than 10 seconds
on a DECstation 5000 [Oliensis and Dupuis, 1992;
Dupuis and Oliensis, 1992a; Bichsel and Pentland,
1992]. Typically, the number of iterations required
for reconstruction is expected to be approximately
constant independent of the image size. It has also
been proven that the reconstruction converges to the
correct continuous surface in the limit where the
pixel spacing is infinitely small [Dupuis and Oliensis,
1992b; Kushner and Dupuis, 1992]. These results are
to be contrasted with traditional shape from shading
algorithms, which typically require thousands of
iterations, and for which no convergence results are
known [Horn and Brooks, 1989].

In this paper we develop the theoretical basis of
our approach, extending some earlier partial results.
In addition, we describe in Section 4 a new algorithm
[Oliensis and Dupuis, 1993] that unlike earlier
ones can reconstruct a surface with no a priori
information about it. Our previous algorithms, which
we discuss in Sections 2-3, required a small amount
of information about the nature of the surface at
singular points (defined as maximally bright image
points) and the relative heights of the surface at a
subset of these points. ([Rouy and Tourin, 1991a;
Rouy and Tourin, 1991b] actually require more
data.) The algorithm presented in Section 4 computes
this information automatically. It is capable
of fast, robust reconstruction of a general surface
even in the presence of ±10% noise in the intensity.

Specifically, we prove in Section 2 that (under appropriate
conditions) the surface corresponding to a
shaded image has an explicit representation in terms
of an optimal control problem. Uniqueness of the
surface is an immediate consequence; thus, contrary
to previous belief, shape from shading is often a well-posed
problem and does not need regularization. In
Section 3, we derive two distinct shape reconstruction
algorithms. These are proven to converge for
both the Jacobi and the significantly faster Gauss-Seidel
iteration, and shown to give the same surface
reconstruction. It is an advantage of our approach that
a range of algorithms can be easily derived. Two
algorithms are convenient to work with since one is
more efficient computationally while the other has a
simpler theoretical interpretation.

2 The Representation Theorem

The purpose of this section is to prove the representation
theorem that connects the shape to be
reconstructed and a deterministic optimal control
problem. The main result, given in (2.3) and Theorem
2.1, is the explicit representation for the surface
corresponding to a shaded image. We consider
the idealized problem of shape from shading under
the usual assumptions. Note that although we assume
Lambertian surface reflectance and illumination
from a single direction, our results can be extended
easily to any "convex" reflectance function
as described below. Under these assumptions, the
intensity at an image point $r = (x, y)$ is given by the
image irradiance equation

\[ I(x, y) = L \cdot n, \]

where $L$ is a unit vector in the light source direction,
the optical axis is along the $-z$ direction, and
$n$ is the surface normal at the corresponding surface
point. $I(\cdot)$ is assumed to be defined on a bounded
open subset $\mathcal{D}$ of $\mathbb{R}^2$. We represent the surface by
the height function $z(x, y)$ and assume that $z(\cdot)$
is continuously differentiable (though this is not
essential).

When the illumination is from a general direction,
it is useful to represent the surface by its height $f(\cdot)$
measured along the light direction $L$:

\[ f(x, y) = L \cdot (x, y, z(x, y)). \]

For simplicity, and without loss of generality, we assume
that $L_y = 0$, $L_x > 0$. In terms of $f(\cdot)$, the image
irradiance equation can be rewritten after some
algebra as $H(r, \nabla f(r)) = 0$, where the Hamiltonian

\[ H(r, \alpha) = I(r) \left( 1 + \|\alpha\|^2 - 2 \alpha_x L_x \right)^{1/2} + \alpha_x L_x - 1. \]

Note that $H(r, \alpha)$ is a strictly convex function of $\alpha$.
The fact that the image irradiance equation can be
rewritten in terms of a strictly convex $H$ is the essential
property used below in the proof of the representation
theorem, and also in deriving algorithms.
Our results can be extended to essentially any image
irradiance equation which can be written in terms of
a strictly convex $H$.
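
As a numerical sanity check on this rewriting, the sketch below evaluates the Hamiltonian at the gradient of $f$ for an arbitrary surface slope and confirms that it vanishes. It assumes the form $H(r, \alpha) = I(r)(1 + \|\alpha\|^2 - 2\alpha_x L_x)^{1/2} + \alpha_x L_x - 1$ with $L_y = 0$ and the normal convention $n \propto (-z_x, -z_y, 1)$; the function names are hypothetical.

```python
import numpy as np

def light_vector(lx):
    """Unit light direction with L_y = 0 and L_z > 0 (assumed convention)."""
    return np.array([lx, 0.0, np.sqrt(1.0 - lx * lx)])

def intensity(zx, zy, L):
    """Lambertian intensity I = L . n for a surface with slopes (zx, zy)."""
    n = np.array([-zx, -zy, 1.0]) / np.sqrt(1.0 + zx * zx + zy * zy)
    return float(L @ n)

def grad_f(zx, zy, L):
    """Gradient of f(x, y) = L . (x, y, z(x, y))."""
    return np.array([L[0] + L[2] * zx, L[1] + L[2] * zy])

def hamiltonian(I, alpha, lx):
    """H(r, alpha) for intensity I at r, under the assumed form above."""
    ax, ay = alpha
    return I * np.sqrt(1.0 + ax * ax + ay * ay - 2.0 * ax * lx) + ax * lx - 1.0
```

For any slopes `(zx, zy)`, `hamiltonian(intensity(zx, zy, L), grad_f(zx, zy, L), L[0])` should be zero up to rounding, which is the statement $H(r, \nabla f(r)) = 0$.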

Singular points have been recognized as playing a
critical role in constraining the surface corresponding
to a shaded image [Oliensis and Dupuis, 1992;
Oliensis, 1991b; Saxberg, 1989b; Bruss, 1982], and
they are crucial in the discussion here as well. The
singular points are those image points where the
intensity achieves its maximal brightness $I(\cdot) = 1$.
Only at these points is the local surface orientation
determined from the intensity alone. Let $S$ denote
the set of singular points in the image.

It is easy to see from the form of $H(r, \alpha)$ that $I = 1$
implies $\alpha = 0$, and therefore $\nabla f = 0$ on the set of
singular points $S$. Thus $S$ includes all local maxima
and minima of $f(x, y)$. We will focus on those singular
points corresponding essentially to the local
minima, and use these in determining the surface
from its image. (Alternatively, our results could be
derived using the local maxima.) To specify precisely
the conditions under which our results hold,
we introduce some nonstandard terminology. We
say that a set $A \subset \mathbb{R}^2$ is smoothly connected if given
any two points $r$ and $r'$ in $A$ there is an absolutely
continuous ("smooth") path connecting the two. We
will assume that the set of singular points is a finite
collection of smoothly connected sets. Then since
$\nabla f = 0$ on $S$, $f(\cdot)$ is constant over each connected
component $S_c \subset S$.

We will refer to a connected subset $S_c$ as a set
of local minima if there exists an $\epsilon > 0$ such that
$d(r, S_c) < \epsilon$ implies $f(r) \ge f(r')$ for $r' \in S_c$, i.e., if
the "heights" $f$ of nearby points are no less than the
value of $f$ on $S_c$. We will refer to a point as a local
minimum of $f$ only if it is contained in such a connected
subset $S_c$. An analogous definition is used
for local maxima. A connected subset that is neither
a set of local maxima nor a set of local minima is called a set
of saddle points, even though some of the points in
this set may be local minima according to the usual
definition (they cannot be strict local minima). Let
$\mathcal{M}$ be the set of all the local minima in the above
sense.

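The Lagrangian corresponding to $H(\cdot)$ is:

\[ L(r, \beta) = \sup_{\alpha} \left[ -\alpha \cdot \beta - H(r, \alpha) \right] = L_z^2 - L_x \beta_x - L_z \left( I^2(r) - |\beta_y|^2 - |\beta_x + L_x|^2 \right)^{1/2} \tag{2.1} \]

if $|\beta_y|^2 + |\beta_x + L_x|^2 \le I^2(r)$. Otherwise, $L(\cdot) = \infty$.

Define

\[ \mathcal{U}(r) = \{ \beta : |\beta_y|^2 + |\beta_x + L_x|^2 \le I^2(r) \}, \]

the domain on which $L(r, \cdot)$ is finite.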
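
The closed form can be checked against the defining supremum numerically. The sketch below assumes the Hamiltonian $H(r, \alpha) = I(r)(1 + \|\alpha\|^2 - 2\alpha_x L_x)^{1/2} + \alpha_x L_x - 1$ and the closed form $L(r, \beta) = L_z^2 - L_x \beta_x - L_z (I^2 - |\beta_y|^2 - |\beta_x + L_x|^2)^{1/2}$; the brute-force grid search, its bounds, and the function names are illustrative choices only.

```python
import numpy as np

def hamiltonian(I, ax, ay, lx):
    """Assumed Hamiltonian; ax and ay may be scalars or arrays."""
    return I * np.sqrt(1.0 + ax**2 + ay**2 - 2.0 * ax * lx) + ax * lx - 1.0

def lagrangian(I, beta, lx):
    """Closed-form running cost; infinite outside the admissible set U(r)."""
    bx, by = beta
    lz = np.sqrt(1.0 - lx**2)
    s = I**2 - by**2 - (bx + lx) ** 2
    if s < 0.0:
        return np.inf
    return lz**2 - lx * bx - lz * np.sqrt(s)

def lagrangian_by_search(I, beta, lx, half_width=3.0, n=601):
    """Approximate sup over alpha of [-alpha . beta - H] on a regular grid."""
    a = np.linspace(-half_width, half_width, n)
    ax, ay = np.meshgrid(a, a, indexing="ij")
    values = -(ax * beta[0] + ay * beta[1]) - hamiltonian(I, ax, ay, lx)
    return float(values.max())
```

The grid value can only underestimate the supremum, so agreement to within the grid resolution supports the closed form for controls inside $\mathcal{U}(r)$. The same function also shows that the cost of standing still vanishes at a singular point, since `lagrangian(1.0, (0.0, 0.0), lx)` evaluates to zero.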

The Lagrangian $L$ serves as the running cost in the
optimal control problem that provides a representation
for the surface, which we now define. Consider
an arbitrary path $\phi$ in the image plane starting at
some $r$, and continuing for a time $\rho$. More precisely,
the path is defined by $\phi(0) = r$, $\dot{\phi} = u(t)$, where the
control $u : [0, \infty) \to \mathbb{R}^2$ is any integrable function.
For each such path, we define a cost which is the
sum of two terms: 1) the total running cost, given
by the integral of the running cost $L(\phi, u)$ over
the path, and 2) a terminal cost, which depends only
on the end point of the path, i.e. on $\phi(\rho)$. The control
problem is to find the path giving the minimal
total cost. The representation theorem states that
under appropriate conditions the infimal value of the
cost for starting point $r$ is just $f(r)$.

Assume we are given an upper bound $B$ for $\{f(r) : r \in \mathcal{D}\}$.
Then define the terminal cost

\[ g(r) = \begin{cases} f(r) & \text{for } r \in \mathcal{M}, \\ B & \text{for } r \notin \mathcal{M}. \end{cases} \tag{2.2} \]

The terminal cost imposes the large penalty $B$ on
any path terminating at a point $r \notin \mathcal{M}$. Finally, the
total cost is the sum of the running and terminal
costs

\[ V(r) = \inf \left[ \int_0^{\rho \wedge \tau} L(\phi(s), u(s)) \, ds + g(\phi(\rho \wedge \tau)) \right], \tag{2.3} \]

where $\rho \wedge \tau$ denotes $\min(\rho, \tau)$, $\tau = \inf\{t : \phi(t) \in \partial\mathcal{D} \cup \mathcal{M}\}$,
and the infimum is over all paths $\phi$ and stopping
times $\rho \in [0, \infty)$. Thus, $V$ is the "minimal" cost
over all finite time paths, where the path terminates
either at time $\rho$ determined by the controller, or else
at the first time that the path exits $\mathcal{D}$ or enters $\mathcal{M}$.
We want to show that $f(r) = V(r)$.

Preliminaries. For any $H$, including nonconvex
$H$, it is easy to show that the definition (2.1) implies
that the running cost $L(r, \cdot)$ is convex on $\mathcal{U}(r)$.
In our case, it is strictly convex. Moreover, a direct
calculation shows that $L$ is always nonnegative, and
$L(r, \beta) = 0$ only for $r \in S$ and $\beta = 0$. Also, since
$H(r, \cdot)$ is strictly convex, it follows by standard arguments
that

\[ H(r, \alpha) = \sup_{\beta \in \mathcal{U}(r)} \left[ -\alpha \cdot \beta - L(r, \beta) \right], \tag{2.4} \]

and for each $\alpha \in \mathbb{R}^2$ there exists a unique vector
$u(r, \alpha)$ such that

\[ H(r, \alpha) = -\alpha \cdot u(r, \alpha) - L(r, u(r, \alpha)). \]

Define $\bar{u}(r)$ for $r \in \mathcal{D}$ by

\[ 0 = H(r, \nabla f(r)) = -\nabla f(r) \cdot \bar{u}(r) - L(r, \bar{u}(r)). \tag{2.5} \]

From (2.4), $\bar{u}(r)$ is given by

\[ \nabla_{\alpha} H(r, \alpha) \big|_{\alpha = \nabla f(r)} = -\bar{u}(r). \]

If (as we assume) $\nabla f(r)$ is continuous, then the fact
that $H(r, \cdot)$ is continuously differentiable implies $\bar{u}(r)$ is continuous on $\mathcal{D}$.
An explicit calculation shows that $\bar{u}(r)$ is proportional
to the projection in the $(x, y)$ plane of the
steepest descent direction on the surface [Oliensis,
1991a], where "steepest descent" is defined with respect
to the light direction $L$, rather than the vertical
direction $(0, 0, 1)$.

We consider subsets $\mathcal{G}$ of $\mathcal{D}$ satisfying the following
assumption.

A2.1 Assume that $S$ consists of a finite collection of
disjoint, compact, smoothly connected sets, and that
$\nabla f(\cdot)$ is continuous on the closure of $\mathcal{D}$. Let $\mathcal{G} \subset \mathcal{D}$
be a compact set, and assume $\mathcal{G}$ is of the form
$\mathcal{G} = \cap_{j=1}^{J} \mathcal{G}_j$, $J < \infty$, where each $\mathcal{G}_j$ has a continuously
differentiable boundary. Let $\mathcal{M}$ be the set of local
minima of $f(\cdot)$ inside $\mathcal{G}$. Then we assume that the
value of $f(\cdot)$ is known at all points in $\mathcal{M}$. Let $\bar{u}$
denote the "steepest descent" direction given by (2.5)
above. Define $n_j(r)$ to be the inward (with respect to
$\mathcal{G}$) normal to $\partial\mathcal{G}_j$ at $r$. Then we also assume that
$\bar{u}(r) \cdot n_j(r) > 0$ for all $r \in \partial\mathcal{G} \cap \partial\mathcal{G}_j$, $j = 1, \ldots, J$.


As noted above, the minimizing trajectories for the
control problems we consider are the two dimensional
projections of the paths of steepest descent
on the surface. The assumption on $\mathcal{G}$ above states
that the steepest descent direction is never directed
out of $\mathcal{G}$, and thus guarantees that any minimizing
trajectory that starts in $\mathcal{G}$ stays in $\mathcal{G}$. When this assumption
is violated for some point $r \in \mathcal{D}$, i.e. when
the steepest descent trajectory starting at $r$ exits $\mathcal{D}$,
then $f(r)$ cannot be represented as the minimal cost
$V(r)$. However, if a steepest ascent path starting
at $r$ does remain in $\mathcal{D}$, then $f(r)$ can be computed
in terms of a maximum cost for an analogous optimal
control problem. If neither of these possibilities
holds, then the surface at $r$ is not well determined.
In general, this is expected to be the case only for
small sections of the image near the image boundary
[Oliensis, 1991a].

Theorem 2.1 Assume A2.1, and that $B$ is an upper
bound for $f(\cdot)$ on $\mathcal{G}$. Define $L(\cdot, \cdot)$ by (2.1), $g(\cdot)$
by (2.2), and $V(r)$ by (2.3). Then $V(r) = f(r)$ for
all $r \in \mathcal{G}$.


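Proof. We first show that $V(r) \ge f(r)$. Let $u(\cdot)$ be
any admissible control and define

\[ \phi(t) = r + \int_0^t u(s) \, ds, \qquad \tau = \inf\{t : \phi(t) \in \partial\mathcal{D} \cup \mathcal{M}\}. \tag{2.6} \]

Since $L$ is the Legendre transform of $H$ and since
$H(r, \nabla f(r)) = 0$ for $r \in \mathcal{D}$,

\[ 0 \ge -\nabla f(r) \cdot \beta - L(r, \beta) \]

for all $\beta \in \mathbb{R}^2$, and in particular

\[ -\nabla f(\phi(t)) \cdot u(t) \le L(\phi(t), u(t)) \]

for $t \in [0, \rho \wedge \tau]$. This implies that

\[ -f(\phi(\rho \wedge \tau)) + f(r) = -\int_0^{\rho \wedge \tau} \nabla f(\phi(t)) \cdot u(t) \, dt \le \int_0^{\rho \wedge \tau} L(\phi(t), u(t)) \, dt, \]

and thus

\[ \int_0^{\rho \wedge \tau} L(\phi(t), u(t)) \, dt + f(\phi(\rho \wedge \tau)) \ge f(r). \]

Since $g(\phi(\rho \wedge \tau)) \ge f(\phi(\rho \wedge \tau))$, we obtain $V(r) \ge f(r)$.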
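
The inequality behind this half of the proof, that the running cost accumulated along any admissible path is at least the drop in height, can be illustrated numerically. The sketch below makes several simplifying assumptions that are not in the text: the light is vertical ($L = (0, 0, 1)$, so $f = z$ and the running cost reduces to $1 - (I^2 - \|u\|^2)^{1/2}$), the surface is a hypothetical bowl $f = \frac{1}{2} K (x^2 + y^2)$, and paths are straight lines traversed at constant speed.

```python
import numpy as np

K = 0.5  # curvature of the hypothetical bowl f(x, y) = 0.5 * K * (x^2 + y^2)

def f(p):
    """Height of the bowl at image point p."""
    p = np.asarray(p, dtype=float)
    return 0.5 * K * float(p @ p)

def intensity(p):
    """Lambertian intensity under vertical light: I = 1 / sqrt(1 + |grad f|^2)."""
    g = K * np.asarray(p, dtype=float)
    return 1.0 / np.sqrt(1.0 + float(g @ g))

def running_cost(p, u):
    """Running cost for vertical light; infinite if u is outside the admissible set."""
    s = intensity(p) ** 2 - float(u @ u)
    return np.inf if s < 0.0 else 1.0 - np.sqrt(s)

def path_cost(start, end, duration, steps=2000):
    """Midpoint-rule integral of the running cost along a straight path."""
    start = np.asarray(start, dtype=float)
    end = np.asarray(end, dtype=float)
    u = (end - start) / duration
    dt = duration / steps
    total = 0.0
    for i in range(steps):
        total += running_cost(start + u * (i + 0.5) * dt, u) * dt
    return total
```

For any such path with a finite cost, `path_cost(start, end, T)` should be no smaller than `f(start) - f(end)`; equality would require following the steepest descent direction at the optimal speed, which a straight constant-speed path generally does not do.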

Next we show $V(r) \le f(r)$. In order to do so we will
verify that for each $\epsilon > 0$ there exists a control $u(\cdot)$
such that for $\phi$ and $\tau$ defined by (2.6) we have

\[ \int_0^{\tau} L(\phi(t), u(t)) \, dt + g(\phi(\tau)) \le f(r) + \epsilon. \tag{2.7} \]

Recall that $\mathcal{U}(r) = \{(\beta_x, \beta_y) : |\beta_y|^2 + |\beta_x + L_x|^2 \le I^2(r)\}$
and $L(r, 0) = 0$ for $r \in S$. Let $S_c$ be a maximal
smoothly connected component of $S$. For any two
points $r$, $r'$ in $S_c$, there exists a path $\phi(t)$ and a
time $t^* < \infty$ such that $\phi(t) \in S_c$ for $t \in [0, t^*]$ and
$r' = \phi(t^*)$. Write $\phi(t) = r + \int_0^t u(s) \, ds$ in terms of the
control $u(t)$. Define a new control $u_A(t) = A u(tA)$,
where $A > 0$ is a constant, and let $\phi_A(t) = \phi(tA)$ be
the corresponding path. Since

\[ \frac{L(r, \beta)}{\|\beta\|} \to 0 \quad \text{as } \|\beta\| \to 0 \]

for $r$ such that $I(r) = 1$, we can choose $A$ such that

\[ \int_0^{t^*/A} L(\phi_A(t), u_A(t)) \, dt = \int_0^{t^*} \frac{L(\phi(t), A u(t))}{A} \, dt \le \frac{\epsilon}{3}. \]

Further, since $|L_x| < 1$, there exists $a > 0$ such
that for any component $S_c$ as above, and $r$ such
that $d(r, S_c) < a$, we have the following. Let $r'$
be the point in $S_c$ closest to $r$. Then there exists
a time $t_a \in [0, \infty)$, constant control $u(\cdot) = (r' - r)/t_a$
and corresponding path $\phi(t) = r + \int_0^t u(s) \, ds$,
such that $\phi(t_a) = r'$ and $\int_0^{t_a} L(\phi(t), u(t)) \, dt \le \epsilon/3$.

Finally, this shows that for any $S_c$, and $r$, $r'$ such
that $d(r, S_c) < a$ and $d(r', S_c) < a$, there exists a
control $u_{rr'}(\cdot)$ and time $\sigma_{rr'} \in [0, \infty)$ such that for
the corresponding path $\phi_{rr'}(t)$ we have $\phi_{rr'}(0) = r$,
$\phi_{rr'}(\sigma_{rr'}) = r'$, and

\[ \int_0^{\sigma_{rr'}} L(\phi_{rr'}(t), u_{rr'}(t)) \, dt \le \epsilon. \]

Since $f$ is constant on $S_c$, then by choosing $a > 0$
smaller if need be we can also assume that
$|f(r) - f(r')| \le \epsilon$.

We now construct the control that satisfies (2.7). If
$r$ is a local minimum then we simply take $\tau = 0$ and
are done.

There are then two remaining cases: (1) $r$ is contained
in some $S_c$ with $S_c \cap \mathcal{M} = \emptyset$, or (2) $r \notin S$. If
case (1) holds then $S_c \cap \mathcal{M} = \emptyset$ implies the existence
of a point $r'$ such that $f(r') < f(r)$ and
$d(r', S_c) < a$. Since A2.1 implies $S \subset \mathcal{G}^{\circ}$ (the interior of $\mathcal{G}$), we
can assume that $r' \in \mathcal{G}$. In this case we will set
$u(t) = u_{rr'}(t)$ for $t \in [0, \sigma_{rr'})$.

Next consider the definition of the control for $t \ge \sigma_{rr'}$.
For $c > 0$ let $b = \inf\{L(r, u) : r \in \mathcal{D}, d(r, S) \ge c, u \in \mathbb{R}^2\}$.
The continuity of $I(\cdot)$ and the fact that
$I(r) < 1$ for $r \notin S$ imply $b > 0$. Consider any
solution (there may be more than one) to

\[ \dot{\phi}(t) = \bar{u}(\phi(t)), \qquad \phi(0) = r'. \tag{2.8} \]

According to (2.5), for any $t$ such that $\phi(t) \in \mathcal{G} \setminus S$
and $d(\phi(t), S) \ge c$,

\[ \frac{d}{dt} f(\phi(t)) = \nabla f(\phi(t)) \cdot \bar{u}(\phi(t)) = -L(\phi(t), \bar{u}(\phi(t))) \le -b. \tag{2.9} \]

A2.1 implies $\phi(t)$ cannot exit $\mathcal{G}$. Thus, since $f(r)$ is
bounded on $\mathcal{G}$, (2.9) implies that $\phi(t)$ must enter the
set $\{r : d(r, S) < c\}$ in finite time, for any $c$. If $\phi(t) \in S$
for some $t < \infty$ we define $\eta_{r'} = \inf\{t : \phi(t) \in S\}$
and $w = \phi(\eta_{r'})$. Otherwise, let $t_i$ be any sequence
tending to $\infty$ as $i \to \infty$. Since $\mathcal{G}$ is compact we can
extract a subsequence (again labeled by $i$) such that
$\phi(t_i) \to v$ for some $v \in S$. Let $i$ be large enough
that $\|\phi(t_i) - v\| < a$. Since $f(\phi(t_i)) \downarrow f(v)$, we have
$f(\phi(t_i)) \ge f(v)$. For this case we define $\eta_{r'} = t_i$ and
$w = \phi(\eta_{r'})$.

Integrating (2.9) gives

\[ f(r') - f(w) = \int_0^{\eta_{r'}} L(\phi(t), \bar{u}(\phi(t))) \, dt. \]

We then define the control $u(t)$ to be used for $t \in [\sigma_{rr'}, \sigma_{rr'} + \eta_{r'})$
to be $u(t) = \bar{u}(\phi(t - \sigma_{rr'}))$.

We now consider the point $w$. We first examine the
case in which the solution to (2.8) does not enter $S$ in
finite time. Since $\|w - v\| < a$, $u_{wv}(t)$ gives a control
such that the application of this control moves $\phi(\cdot)$
from $w$ to $v$ with accumulated running cost less than
or equal to $\epsilon$. We define $u(t) = u_{wv}(t - (\eta_{r'} + \sigma_{rr'}))$
for $t \in [\sigma_{rr'} + \eta_{r'}, \sigma_{rr'} + \eta_{r'} + \sigma_{wv})$. If the solution
to (2.8) reached $S$ in finite time we define $v = w$ and
$\sigma_{wv} = 0$. Let $\sigma = \sigma_{rr'} + \eta_{r'} + \sigma_{wv}$.

Let us summarize the results of this construction.
Given any point $r \in S$ that is not a local minimum
we have constructed a piecewise continuous control
$u(\cdot)$ and $\sigma < \infty$ such that if $\phi(t) = r + \int_0^t u(s) \, ds$,
then

\[ f(r) - f(\phi(\sigma)) = f(r) - f(r') + f(r') - f(w) + f(w) - f(v) \ge -2\epsilon + \int_0^{\sigma} L(\phi(t), u(t)) \, dt. \]

We have also shown that $f(r) > f(v) = f(\phi(\sigma))$ and
$\phi(\sigma) \in S$. Thus, either the component $S_c$ containing
$\phi(\sigma)$ satisfies $S_c \cap \mathcal{M} \ne \emptyset$, and we are done,
or we are back into case (1) above, and can repeat
the procedure. Let $K$ be the number of disjoint compact
connected sets that comprise $S$. Then the strict
decrease $f(r) > f(\phi(\sigma))$ and the fact that $f(\cdot)$ is
constant on each $S_c$ imply the procedure can be repeated
no more than $K$ times before reaching some
$S_c$ containing a point from $\mathcal{M}$. If case (2) holds we
can use the same procedure, save that the very first
step is omitted. Thus, in general, we have exhibited
a control $u(\cdot)$ such that

\[ \int_0^{\tau} L(\phi(t), u(t)) \, dt + g(\phi(\tau)) \le f(r) + (2K + 1)\epsilon. \]

Since $\epsilon > 0$ is arbitrary, the theorem is proved. ■
3 Shape Reconstruction Algorithms

In this section, we describe how algorithms for shape
reconstruction can be derived from control representations
such as that given in the previous section. It
is important to note that many different algorithms
can be derived, corresponding to the many possibilities
for rewriting the image irradiance equation in
terms of a Hamiltonian. Each choice for the Hamiltonian
leads in general to a different control representation
(at least formally), and a different algorithm.
Nevertheless, the different algorithms compute
the same surface approximation from the image.
Thus, for example, an algorithm can be generated
from the Hamiltonian of the previous section,
which we henceforth denote by $H^{(1)}$. Another possibility,
used previously in [Oliensis and Dupuis, 1992],
is to write the image irradiance equation in the form
$H^{(2)}(r, \nabla f(r)) = 0$, with $H^{(2)}(r, \alpha)$ given by

\[ H^{(2)}(r, \alpha) = \frac{1}{2} \left[ I^2 \alpha_y^2 + v \alpha_x^2 + 2 (1 - I^2) L_x \alpha_x - (1 - I^2) \right], \tag{3.10} \]

where $v(r) = I^2(r) - L_x^2$. Note that when $v(r) < 0$,
$H^{(2)}(r, \alpha)$ is not a convex function of $\alpha$. Nevertheless,
an algorithm can be derived from this form
of the Hamiltonian, which, although it differs from
the algorithm generated from $H^{(1)}$, reconstructs the
same surface approximation.
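
That the two Hamiltonians encode the same image irradiance equation can be checked numerically. The sketch below assumes $H^{(1)}(r, \alpha) = I (1 + \|\alpha\|^2 - 2\alpha_x L_x)^{1/2} + \alpha_x L_x - 1$ and the quadratic form $H^{(2)}(r, \alpha) = \frac{1}{2}[I^2 \alpha_y^2 + v \alpha_x^2 + 2(1 - I^2) L_x \alpha_x - (1 - I^2)]$ with $v = I^2 - L_x^2$; the factor $\frac{1}{2}$ is an assumption that does not affect where $H^{(2)}$ vanishes, and the function names are hypothetical.

```python
import numpy as np

def h1(I, ax, ay, lx):
    """Convex Hamiltonian of Section 2 (assumed form)."""
    return I * np.sqrt(1.0 + ax**2 + ay**2 - 2.0 * ax * lx) + ax * lx - 1.0

def h2(I, ax, ay, lx):
    """Quadratic Hamiltonian obtained by squaring the irradiance equation (assumed form)."""
    v = I**2 - lx**2
    return 0.5 * (I**2 * ay**2 + v * ax**2
                  + 2.0 * (1.0 - I**2) * lx * ax - (1.0 - I**2))

def image_and_gradient(zx, zy, lx):
    """Intensity I and gradient of f for surface slopes (zx, zy), with L_y = 0."""
    lz = np.sqrt(1.0 - lx**2)
    I = (lz - lx * zx) / np.sqrt(1.0 + zx**2 + zy**2)
    return I, lx + lz * zx, lz * zy
```

Both functions should vanish at `(I, fx, fy) = image_and_gradient(zx, zy, lx)` for any slopes. The loss of convexity when $v < 0$ shows up as a negative second difference of `h2` in `ax`.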

The  algorithms  are  derived  using  a  discrete  approx¬ 
imation  of  the  continuous  control  representation.  In 
this discrete control problem, the object is to minimize
the cost over all discrete paths on the grid of
pixels. A difficulty in doing this is that a discrete
trajectory, where at each time step the path jumps
between neighboring pixels, is usually a poor approximation
to a continuous trajectory. In order to
better approximate a continuous trajectory on a discrete
grid, an element of randomness is introduced.
Thus the continuous optimal control problem is approximated
by a discrete stochastic optimal control
problem,  and  the  cost  of  the  continuous  problem  is 
approximated  by  the  expectation  of  the  cost  for  the 
discrete  problem.  Note  that  the  algorithms  them¬ 
selves  are  deterministic,  even  though  the  discrete 
control  problem  involves  a  stochastic  rather  than  de¬ 
terministic  process. 

Thus, given a control $u = (u_x, u_y)$, we define the probabilities
for the path to jump to neighboring pixels so that on average the
discrete motion approximates the continuous motion $\dot{r} = u$.
Let $p(r, r' \mid u)$ denote the transition probability for the path
to move from $r$ to a 4-nearest neighbor site $r'$ in the current
time step. We define

$$p(r, r + \mathrm{sign}(u_x)(1,0) \mid u) = \frac{|u_x|}{|u_x| + |u_y|}, \qquad (3.11)$$

$$p(r, r + \mathrm{sign}(u_y)(0,1) \mid u) = \frac{|u_y|}{|u_x| + |u_y|}, \qquad (3.12)$$


with all other probabilities zero. We also define the size of the
time step to be $\Delta t(u) = 1/(|u_x| + |u_y|)$. With this
definition, and assuming for example $u_x, u_y > 0$, the average
motion is

$$(1,0)\,\frac{u_x}{u_x + u_y} + (0,1)\,\frac{u_y}{u_x + u_y} = u\,\Delta t(u),$$

which approximates the continuous motion. This definition actually
makes sense only when $u \neq 0$. For $u = 0$, we define
$p(r, r \mid 0) = 1$, and $\Delta t(0) = 1$.
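
The transition rule can be sketched in a few lines of Python. This is only an illustration, not the authors' implementation: the function names `transition` and `mean_step` are hypothetical, and the sketch assumes the jump probabilities are $|u_x|/(|u_x|+|u_y|)$ and $|u_y|/(|u_x|+|u_y|)$, the choice for which the mean step equals $u\,\Delta t(u)$.

```python
def sign(x):
    """Return -1, 0 or +1 according to the sign of x."""
    return (x > 0) - (x < 0)


def transition(u):
    """Return (probs, dt) for a control u = (ux, uy).

    probs maps a grid offset (dx, dy) to the probability of jumping
    by that offset; dt is the size of the time step.
    """
    ux, uy = u
    s = abs(ux) + abs(uy)
    if s == 0:
        # Zero control: stay in place for one unit of time.
        return {(0, 0): 1.0}, 1.0
    probs = {}
    if ux != 0:
        probs[(sign(ux), 0)] = abs(ux) / s
    if uy != 0:
        probs[(0, sign(uy))] = abs(uy) / s
    return probs, 1.0 / s


def mean_step(u):
    """Expected displacement of the path in one time step."""
    probs, _ = transition(u)
    mx = sum(p * dx for (dx, dy), p in probs.items())
    my = sum(p * dy for (dx, dy), p in probs.items())
    return mx, my
```

For example, `transition((3.0, -1.0))` gives probability 0.75 for the offset (1, 0), probability 0.25 for the offset (0, -1), and a time step of 0.25, so the mean step (0.75, -0.25) equals the control multiplied by the time step.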

For a given sequence of controls $\{u_i\}$, let
$\{\xi_i : \xi_0 = r\}$ denote the path starting at $r$ which evolves
at each time step $i$ as determined by the control sequence
$\{u_i\}$ and the transition probabilities. Then for the
representation and Lagrangian (now denoted $L^{(1)}$) of the
previous section, the infimal cost $V^{(1)}(r)$ of the approximating
stochastic control problem is given by

$$V^{(1)}(r) = \inf E_r \left[ \sum_{i=0}^{(N \wedge M)-1} L^{(1)}(\xi_i, u_i)\,\Delta t(u_i) + g(\xi_{N \wedge M}) \right],$$

where $N = \inf\{i : \xi_i \notin \mathcal{D} \text{ or } \xi_i \in \mathcal{M}\}$,
and the minimisation is over all control sequences $\{u_i\}$ and
stopping times $M$. See [Dupuis and Oliensis, 1992b] for the
description of the classes of allowed controls and stopping times.
$E_r$ denotes the expectation. Thus, $V^{(1)}(r)$ is the minimum of
the expectation of the cost over all finite length control
sequences, where the path terminates either at discrete time $M$
chosen by the controller, or else at the first time that the path
exits $\mathcal{D}$ or enters $\mathcal{M}$.


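Suppose that instead of considering paths of arbitrary length, we
consider paths continuing for at most $n$ time steps. The infimal
cost $V_n^{(1)}(r)$ is

$$V_n^{(1)}(r) = \inf E_r \left[ \sum_{i=0}^{(N \wedge M \wedge n)-1} L^{(1)}(\xi_i, u_i)\,\Delta t(u_i) + g(\xi_{N \wedge M \wedge n}) \right]. \qquad (3.13)$$

Then $V_n^{(1)}(r)$ is clearly nonincreasing in $n$ and
$V_n^{(1)}(r) \downarrow V^{(1)}(r)$ as $n \to \infty$. For
$a \in \{1, 2\}$, define

$$W^{(a)}(r, u, V^{(a)}) = L^{(a)}(r, u)\,\Delta t(u) + \sum_{r'} p(r, r' \mid u)\,V^{(a)}(r'), \qquad (3.14)$$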


where the sum is over 4-nearest neighbors of $r$. As discussed in
[Oliensis and Dupuis, 1992], it follows from the principle of
dynamic programming that $V_{n+1}^{(1)}(r)$ and $V_n^{(1)}$ are
related by

$$V_{n+1}^{(1)}(r) = \min\left[ \inf_{u \in \mathbb{R}^2} W^{(1)}(r, u, V_n^{(1)}),\; g(r) \right]. \qquad (3.15)$$

Clearly, we also have the initial condition $V_0^{(1)}(r) = g(r)$.
This, together with the recursive equation (3.15), gives an
algorithm which converges monotonically down to $V^{(1)}$.

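For the second control problem, we get a similar algorithm. As
before, $V_0^{(2)}(r) = g(r)$. If $v(r) > 0$,

$$V_{n+1}^{(2)}(r) = \min\left[ \inf_{u} W^{(2)}(r, u, V_n^{(2)}),\; g(r) \right],$$

else if $v(r) < 0$

$$V_{n+1}^{(2)}(r) = \min\left[ \sup_{u_x} \inf_{u_y} W^{(2)}(r, u, V_n^{(2)}),\; g(r) \right]. \qquad (3.16)$$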


The Lagrangian $L^{(2)}$ is derived from an equation analogous to
(2.1):

$$L^{(2)}(r, \beta) = \sup_{\alpha} \left[ -\alpha \cdot \beta - H^{(2)}(r, \alpha) \right] \quad \text{if } v(r) > 0,$$

$$L^{(2)}(r, \beta) = \inf_{\alpha_x} \sup_{\alpha_y} \left[ -\alpha \cdot \beta - H^{(2)}(r, \alpha) \right] \quad \text{if } v(r) < 0.$$

(The case $v(r) = 0$ is given by the appropriate limit as
$v(r) \to 0$ from either direction.) The difference from the
previous algorithm is due to the nonconvexity in the Hamiltonian for
image regions where $v(r) < 0$. For more detail, and experimental
results obtained with the second algorithm above, consult [Dupuis
and Oliensis, 1992b; Oliensis and Dupuis, 1992].

The algorithms described above are of the Jacobi type, with the
surface updated everywhere in parallel at each iteration. The
algorithms can also be shown to converge if implemented via
Gauss-Seidel, with updated surface estimates used as soon as they
are available. In fact, we show below that the Gauss-Seidel
algorithms converge for any sequence of pixel updates, as long as
each site is updated a sufficient number of times. For example, it
is possible to change the direction of the sweep across the image
after each pass [Bichsel and Pentland, 1992]. Our experiments show
that this produces a significant speedup, changing the number of
iterations required for convergence from order N to order 1 with a
small constant, where N is the linear dimension of the image.
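
The contrast between Jacobi updates and Gauss-Seidel updates with changing sweep direction can be sketched as follows. This is a toy illustration under stated assumptions, not the authors' code: the running cost `path_length_cost` is a hypothetical stand-in (the Euclidean length of the control) rather than the shading Lagrangian, the controls are restricted to an assumed set of eight directions, paths are simply forbidden to leave the grid, and the terminal cost `g` is zero at one seed pixel and a large constant elsewhere.

```python
import math

# Assumed finite control set: the eight compass directions.
CONTROLS = [(1, 0), (1, 1), (0, 1), (-1, 1),
            (-1, 0), (-1, -1), (0, -1), (1, -1)]


def sgn(x):
    return (x > 0) - (x < 0)


def path_length_cost(r, u):
    """Hypothetical running cost: Euclidean length of the control."""
    return math.hypot(u[0], u[1])


def update(V, g, r, cost):
    """One site update: min over controls of cost*dt + E[V], capped by g."""
    i, j = r
    n, m = len(V), len(V[0])
    best = g[i][j]
    for ux, uy in CONTROLS:
        s = abs(ux) + abs(uy)
        val = cost(r, (ux, uy)) / s            # running cost times dt = 1/s
        inside = True
        for di, dj, w in ((sgn(ux), 0, abs(ux) / s),
                          (0, sgn(uy), abs(uy) / s)):
            if w == 0:
                continue
            a, b = i + di, j + dj
            if not (0 <= a < n and 0 <= b < m):
                inside = False                 # simplification: stay on the grid
                break
            val += w * V[a][b]
        if inside:
            best = min(best, val)
    return best


def jacobi(g, cost):
    """Update every site in parallel until nothing changes."""
    V = [row[:] for row in g]
    passes = 0
    while True:
        passes += 1
        new = [[update(V, g, (i, j), cost) for j in range(len(V[0]))]
               for i in range(len(V))]
        if new == V:
            return V, passes
        V = new


def gauss_seidel(g, cost):
    """Update in place, cycling through four sweep directions."""
    V = [row[:] for row in g]
    n, m = len(V), len(V[0])
    orders = [(range(n), range(m)),
              (range(n), range(m - 1, -1, -1)),
              (range(n - 1, -1, -1), range(m - 1, -1, -1)),
              (range(n - 1, -1, -1), range(m))]
    passes = 0
    while True:
        rows, cols = orders[passes % 4]
        passes += 1
        changed = False
        for i in rows:
            for j in cols:
                v = update(V, g, (i, j), cost)
                if v < V[i][j]:
                    V[i][j] = v
                    changed = True
        if not changed:
            return V, passes
```

On a small grid with the seed in one corner, both routines reach the same values, but the Jacobi version needs a number of passes that grows with the grid size, while the sweeping Gauss-Seidel version stops after a handful of passes.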

Proposition 3.1 Consider either of the recursive algorithms derived
in (3.15) or (3.16). Let an initial condition $V_0^{(a)}$, where
$a \in \{1, 2\}$, be given and define the sequence
$\{V_i^{(a)}, i \in \mathbb{N}\}$ according to either the Jacobi
iteration [e.g. (3.15)] or the Gauss-Seidel iteration, where the
pixel sites are updated in an arbitrary sequence. Assume that
$V_0^{(a)}(r) \ge g(r)$ for all $r \in \mathcal{D}$. Then the
following conclusions hold.

1. For each $r \in \mathcal{D}$, $V_i^{(a)}(r)$ is nonincreasing in
$i$ and bounded from below. Define
$V^{(a)}(r) = \lim_{i \to \infty} V_i^{(a)}(r)$. Then the function
$V^{(a)}(\cdot)$ is a fixed point of (3.15) (or (3.16) if
appropriate).

2. The function $V^{(a)}(\cdot)$ can be uniquely characterised as the
largest fixed point of (3.15) (or (3.16) if appropriate) that
satisfies $V^{(a)}(r) \le V_0^{(a)}(r)$ for all $r \in \mathcal{D}$.

Remark. In using the algorithm, we always take
$V_0^{(a)}(r) = g(r)$, where $g(r)$ is defined by (2.2). It is
proven in [Dupuis and Oliensis, 1992b] that if H in (2.2) is an
upper bound for the surface height, then the surface reconstructed
by the algorithm above converges to the correct surface as the pixel
spacing goes to zero. Thus, the correct surface approximation is
obtained by taking the largest of all the fixed points of the
iterations in (3.15) or (3.16).

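Proof. For each fixed $r \in \mathcal{D}$, any of the Jacobi and
Gauss-Seidel iterations we have defined may be written in one of the
following forms. For $a \in \{1, 2\}$, define

$$W(r, u, V_{old}^{(a)}) = c(r, u)\,\Delta t(u) + \sum_{r'} p(r, r' \mid u)\,V_{old}^{(a)}(r'),$$

where $c(r, u)$ denotes the running cost for the given algorithm,
and $V_{old}^{(a)}(\cdot)$ represents the result of previous updates
of the algorithm. Then the result $V_{new}^{(a)}(r)$ of a new update
is given either by

$$V_{new}^{(a)}(r) = \min\left[ \inf_{u} W(r, u, V_{old}^{(a)}),\; g(r) \right] \qquad (3.17)$$

or

$$V_{new}^{(a)}(r) = \min\left[ \sup_{u_x} \inf_{u_y} W(r, u, V_{old}^{(a)}),\; g(r) \right]. \qquad (3.18)$$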

Note that for both (3.17) and (3.18) the right hand sides are
monotonically nondecreasing in $V_{old}^{(a)}(\cdot)$ if we use the
partial ordering of real valued functions on $\mathcal{D}$ defined
by $w_1(\cdot) \le w_2(\cdot)$ whenever $w_1(r) \le w_2(r)$ for all
$r \in \mathcal{D}$.

For either the Jacobi or Gauss-Seidel iteration, the first update at
the site $r$ will result in $V_{new}^{(a)}(r) \le V_0^{(a)}(r)$,
since $V_0^{(a)}(\cdot) \ge g(\cdot)$. Next, consider any subsequent
update of the site $r$. By induction, we can assume that all updates
are nonincreasing up to this time. $V_{new}^{(a)}(r)$ depends on
$V_{old}^{(a)}(r' \neq r)$, which by the induction assumption is
everywhere less than or equal to its value at the previous update at
$r$. Since $V_{new}^{(a)}(r)$ depends monotonically on
$V_{old}^{(a)}(\cdot)$, it follows that $V_{new}^{(a)}(r)$ satisfies
$V_{new}^{(a)}(r) \le V_{old}^{(a)}(r)$ for the current iteration.
This establishes the monotonicity of part 1.

Since for Control Problem 1 the running costs $c$ are nonnegative,
$V_i^{(1)}(r)$ is bounded from below, and the monotonicity proved
above establishes the existence of
$V^{(1)}(r) = \lim_{i \to \infty} V_i^{(1)}(r)$. This is also the
case for Control Problem 2 when $v(r) > 0$. When $v(r) < 0$ in
Control Problem 2, we first consider (3.18) assuming
$V_{old}^{(2)} \equiv 0$. A simple calculation shows that

$$\sup_{u_x} \inf_{u_y} c(r, u) > 0,$$

and therefore

$$\sup_{u_x} \inf_{u_y} \left( c(r, u)\,\Delta t(u) \right) > 0.$$

Since the probabilities are nonnegative, and
$\sum_{r'} p(r, r' \mid u) = 1$, this implies

$$\min_{r'} V_{old}^{(2)}(r') < \sup_{u_x} \inf_{u_y} \left[ c(r, u)\,\Delta t(u) + \sum_{r'} p(r, r' \mid u)\,V_{old}^{(2)}(r') \right].$$

Thus for Control Problem 2 and $v(r) < 0$, $V_i^{(2)}(r)$ is bounded
from below by $\min_r V_0^{(2)}(r)$. This gives part 1 of the
proposition.

We next turn to part 2. Let $\bar{V}^{(a)}$ be any fixed point of
(3.15) or (3.16) that satisfies
$\bar{V}^{(a)}(r) \le V_0^{(a)}(r)$ for all $r \in \mathcal{D}$. An
argument very similar to the one used to prove part 1 shows that

$$\bar{V}^{(a)}(r) \le V_i^{(a)}(r) \;\Rightarrow\; \bar{V}^{(a)}(r) \le V_{i+1}^{(a)}(r).$$

Therefore by induction $\bar{V}^{(a)}(r) \le V^{(a)}(r)$ for all
$r \in \mathcal{D}$. ■


Finally, we sketch the proof that the algorithms of (3.15) and
(3.16) converge to the same fixed point. For a complete proof, see
[Dupuis and Oliensis, 1992b]. The same argument generalises to show
that a wide range of algorithms corresponding to different choices
of the Hamiltonian all converge to the same fixed point.

For specificity, consider Control Problem 1, and let $w(r)$ be a
fixed point of the corresponding iteration (3.15). We will only
consider the case $r \notin S$ and $w(r) < g(r)$; the other cases
can be handled similarly. Then

$$0 = \inf_{u} \left[ L^{(1)}(r, u)\,\Delta t(u) + \sum_{r'} p(r, r' \mid u)\,\bigl( w(r') - w(r) \bigr) \right].$$

Substituting the transition probabilities from (3.11), (3.12) gives

$$0 = \min\left[ \inf_{u \neq 0} \frac{L^{(1)}(r, u) + u \cdot (\nabla w)^u(r)}{|u_x| + |u_y|},\; L^{(1)}(r, 0) \right], \qquad (3.19)$$

where the second expression on the right hand side corresponds to
$u = 0$, and $(\nabla w)^u$ is a forward or backward discrete
derivative depending on $u$:

$$(\nabla w)^u_x(r) = \begin{cases} w(r + (1,0)) - w(r) & \text{if } u_x > 0 \\ w(r) - w(r - (1,0)) & \text{if } u_x < 0, \end{cases}$$

$$(\nabla w)^u_y(r) = \begin{cases} w(r + (0,1)) - w(r) & \text{if } u_y > 0 \\ w(r) - w(r - (0,1)) & \text{if } u_y < 0. \end{cases}$$

Since we assume $r \notin S$, $L^{(1)}(r, \cdot) > 0$ and only the
first expression on the right hand side of (3.19) need be
considered. Then (3.19) is equivalent to

$$0 = \min_{u} \left[ L^{(1)}(r, u) + u \cdot (\nabla w)^u \right], \qquad (3.20)$$

where we have used the fact that $L^{(1)}$ has a strictly positive
lower bound for $r \notin S$, and that it is finite only on a
bounded domain. Also, the minimisation has been extended to $u = 0$
using the continuity of the bracketed expression in $u$.


Apart from the dependence of $(\nabla w)^u$ on $u$, eq. 3.20 is the
Legendre transform

$$H^{(1)}(r, \alpha) = -\min_{u} \left[ L^{(1)}(r, u) + u \cdot \alpha \right]. \qquad (3.21)$$

Therefore we have established a connection between the conditions
for $w(r)$ to be a fixed point and the equation
$H^{(1)}(r, (\nabla w)^u) = 0$, which is a discrete version of the
original image irradiance equation. The plan of the proof is to
relate in this way the fixed point conditions for any algorithm back
to the image irradiance equation. Since all algorithms are derived
from this equation in the first place, this will allow us to prove
that the fixed point conditions are the same for all algorithms.
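
As a worked illustration of the Legendre relation (3.21), consider a hypothetical quadratic Lagrangian, chosen only because the minimisation can be done by hand; it is not the shading Lagrangian of the text, and the constant $c > 0$ is an assumption of this example.

```latex
\begin{aligned}
L(u) &= \tfrac{1}{2}|u|^2 + c, \\
\min_{u}\left[ L(u) + u\cdot\alpha \right]
  &= \tfrac{1}{2}|\alpha|^2 + c - |\alpha|^2
   = c - \tfrac{1}{2}|\alpha|^2
   \qquad (\text{minimiser } u = -\alpha), \\
H(\alpha) &= -\min_{u}\left[ L(u) + u\cdot\alpha \right]
   = \tfrac{1}{2}|\alpha|^2 - c .
\end{aligned}
```

In this toy case $H(\alpha) = 0$ exactly when $|\alpha| = \sqrt{2c}$, so the fixed point condition $0 = \min_u [L(u) + u \cdot \alpha]$ selects gradients of a prescribed magnitude, which is the same mechanism by which the fixed point condition in the text recovers a discrete form of the image irradiance equation.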


For simplicity, we consider only points $r$ for which $v(r) > 0$. A
more complete discussion can be found in [Dupuis and Oliensis,
1992b]. Then $H^{(1)}(r, \alpha) = 0$ if and only if
$H^{(2)}(r, \alpha) = 0$, since both correspond to the image
irradiance equation. Moreover, since both Hamiltonians are convex in
$\alpha$ for $v(r) > 0$, $H^{(1)}(r, \alpha) > (<)\, 0$ if and only
if $H^{(2)}(r, \alpha) > (<)\, 0$, respectively.


By a similar argument to that above, it can be shown that the
algorithm for Control Problem 2 has a fixed point at a point
$r : v(r) > 0$ if and only if

$$0 = \min_{u} \left[ L^{(2)}(r, u) + u \cdot (\nabla w)^u \right]. \qquad (3.22)$$

Divide the $u$ plane into four quadrants $Q_i$ where each quadrant
is characterised by constant values of $\mathrm{sign}(u_x)$ and
$\mathrm{sign}(u_y)$. Then (3.20) and (3.22) can be rewritten as

$$0 = \min_{i} \left[ \min_{u \in Q_i} \left( L^{(a)}(r, u) + u \cdot (\nabla w)^i \right) \right],$$

where $a \in \{1, 2\}$, and $(\nabla w)^i$ is the appropriate choice
of the discrete derivative for the given quadrant. Without loss of
generality consider the quadrant $Q_1$ where $u_x, u_y > 0$ and
$(\nabla w)^1(r) = (w(r + (1,0)) - w(r),\; w(r + (0,1)) - w(r))$.
From a simple argument using the convexity of the $H^{(a)}$ it can
be proven (see for example [Rockafellar, 1970]) that

$$\min_{u \in Q_1} \left[ L^{(a)}(r, u) + u \cdot \alpha \right] = -\min_{\beta_x, \beta_y \ge 0} H^{(a)}(r, \alpha - \beta),$$
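where $a \in \{1, 2\}$. Then, since the signs of $H^{(1)}$ and
$H^{(2)}$ correspond,

$$\min_{u \in Q_i} \left[ L^{(1)}(r, u) + u \cdot (\nabla w)^i \right] \;\left\{ \begin{matrix} > \\ = \\ < \end{matrix} \right\}\; 0$$

if and only if, respectively,

$$\min_{u \in Q_i} \left[ L^{(2)}(r, u) + u \cdot (\nabla w)^i \right] \;\left\{ \begin{matrix} > \\ = \\ < \end{matrix} \right\}\; 0.$$

Since this is true for each quadrant, $w(r)$ is a fixed point for
Control Problem 1 if and only if it is one for Control Problem 2. ■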

4  A  General  Shape  Reconstruction 
Algorithm 

The algorithms above are in a sense local. They use the fact that a
singular point (defined as a maximally bright image point) uniquely
determines the surface within some image neighborhood [Bruss, 1982;
Saxberg, 1989a; Saxberg, 1989b; Oliensis, 1991b]. More precisely,
this is true for singular points at which the surface height has a
local minimum or local maximum. The algorithms (in the appropriate
minimum- or maximum-based version) are guaranteed to give the
correct surface up to an overall translation in depth near each such
singular point.

The surface over the whole image will also be determined, and
correctly reconstructed by the local algorithms, if it is known
which of the singular points in the image correspond to local minima
of the surface height, and if the surface heights at these points
are available. If this information is not known, then there is a
potential convex-concave-saddle ambiguity for the surface shape at
each singular point.

We briefly describe and present experimental results for a general
shape from shading algorithm which can determine this information
automatically. This algorithm reconstructs shape from shading with
no a priori information about the surface. It incorporates the
algorithms discussed in the previous sections, but also makes use of
a global consistency analysis of shape from shading similar to that
of [Oliensis, 1991b]. A detailed description can be found in
[Oliensis and Dupuis, 1993; Dupuis and Oliensis, 1992b]. Typical
reconstruction times are less than 30 seconds on a DECstation 5000
for 128 x 128 images. Our experiments with the global algorithm have
so far been carried out only for synthetic surfaces and images,
though these are rather complex. Nevertheless, this algorithm is
capable of reconstructing shape from general images with some degree
of robustness, and is orders of magnitude faster than traditional
variational algorithms [Horn and Brooks, 1989].


The new algorithm requires additional assumptions on the imaged
surface, the most important of which is that the surface height
function is twice differentiable (these assumptions do not
necessarily hold in our experiments). The strategy is as follows.
Suppose that a singular point $s$ is known to be an isolated local
minimum of the height. The results of [Dupuis and Oliensis, 1992b;
Oliensis and Dupuis, 1992] imply that the algorithms of the previous
sections, if provided $z(s)$ as an initial datum, will reconstruct
the correct surface in some neighborhood $A_s$ of $s$. In general,
from the arguments of [Oliensis, 1991b; Oliensis, 1991a], there will
be other singular points $s'$ on the boundary of the domain $A_s$.
By continuity of the surface [Oliensis and Dupuis, 1992], the
heights of these points will be correctly determined by the local
algorithm.

If it is possible to identify one of these points $s'$ as a local
maximum, then we are in the same situation as before. The local
algorithm can again be applied with the height provided as initial
datum at the new singular point $s'$, which extends the surface
reconstruction over the region $A_{s'} \cup A_s$. The arguments of
[Oliensis, 1991b; Oliensis, 1991a] show that iterating this process
will eventually yield $z(\cdot)$ at all local minima and maxima
singular points, and a correct surface reconstruction over the
entire image domain $\mathcal{D}$. For the above strategy to work,
the crucial requirement is a method for identifying which of the
points $s'$ are local maxima (or minima). This is nontrivial, since
the surface reconstructed by the local algorithm assuming just the
one singular point $s$ will in general reconstruct all other
singular points as inflection points. Such a method is described in
[Oliensis and Dupuis, 1993; Dupuis and Oliensis, 1992b].

A 128 x 128 synthetic surface is displayed in Figure 1. The surface
was generated by creating a random surface in frequency space,
masking it with a filter so as to reduce the high frequencies, and
then Fourier transforming to obtain the surface. The image was
generated with the light from the direction (0, .3, .95). Using no
boundary data other than $I(\cdot)$, the surface shown in Figure 2
was reconstructed. This reconstruction took less than 30 seconds of
CPU time on a DEC 5000 workstation. The surface shown is the result
of reconstructing with the algorithms given in previous sections,
based on the local maxima and their heights obtained with the
extended algorithm.
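
The kind of test data described above can be sketched in pure Python. This is a rough illustration only, under assumptions not taken from the text: Lambertian reflectance with unit albedo, orthographic projection, forward-difference surface slopes, and a light direction normalised to unit length. The helper names `synthetic_surface` and `render` are hypothetical, and a small sum of random low-frequency cosines stands in for the filtered random spectrum and Fourier transform used by the authors.

```python
import math
import random


def synthetic_surface(n, n_modes=6, max_freq=3, seed=0):
    """Smooth random height field: a few random low-frequency cosines."""
    rng = random.Random(seed)
    modes = [(rng.randint(0, max_freq), rng.randint(0, max_freq),
              rng.uniform(0, 2 * math.pi), rng.uniform(-1, 1))
             for _ in range(n_modes)]
    return [[sum(a * math.cos(2 * math.pi * (kx * x + ky * y) / n + ph)
                 for kx, ky, ph, a in modes) * n / 16.0
             for x in range(n)]
            for y in range(n)]


def render(z, light=(0.0, 0.3, 0.95)):
    """Lambertian image of height field z lit from direction `light`."""
    norm = math.sqrt(sum(c * c for c in light))
    lx, ly, lz = (c / norm for c in light)
    n = len(z)
    img = [[0.0] * (n - 1) for _ in range(n - 1)]
    for y in range(n - 1):
        for x in range(n - 1):
            p = z[y][x + 1] - z[y][x]      # slope along x
            q = z[y + 1][x] - z[y][x]      # slope along y
            # Unit normal is (-p, -q, 1) / sqrt(1 + p^2 + q^2).
            shade = (-p * lx - q * ly + lz) / math.sqrt(1 + p * p + q * q)
            img[y][x] = max(0.0, shade)    # clamp self-shadowed pixels
    return img
```

With these conventions a flat surface renders at the constant intensity given by the normalised vertical light component, and a plane whose normal is parallel to the light direction renders at the maximal intensity 1, which is the brightness that defines a singular point.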

Figure 3 illustrates the reconstruction error, the magnitude of the
difference between the original surface height and that of the
reconstruction. The reconstruction is good except near the edges of
the image. As with our earlier algorithms [Dupuis and Oliensis,
1992b; Oliensis and Dupuis, 1992], this is due in part to the fact
that the imaged surface does not obey the boundary condition A2.1
(or its analog for reconstruction based on local maxima); that is,
the steepest descent direction for the surface at the image boundary
is not everywhere into (respectively, out of) the image. This is
discussed further below. The average reconstruction error in the
interior of the image with $20 < x_2 < 105$ is 0.5 units, or about
2% of the height range.

We have also studied the effect of image noise. A second 128 x 128
surface obtained as before is displayed in Figure 4. The image was
obtained using the same illumination as before, but with a random
noise added at each pixel with a uniform distribution and a maximum
amplitude of 0.1. Since the maximum image intensity is only
$I = 1$, this is a large noise of ±10%. The reconstruction based on
the noisy image and using the recovered local minima is shown in
Figure 5. Although there are large errors in some parts of the
image, the reconstruction is still good over much of the image. The
error in the height is displayed in Figure 6, where saturated white
represents a height error of 3. The error is less than 3 units over
most of the image, in comparison to the overall height range for
this surface of 44 units. In the region of the image with
$127 > x_1, x_2 > 40$, the mean height error is just 1.6. This
represents a surprising immunity to the large image noise.

In Figure 7 the error is shown for a different reconstruction from
the same noisy image. It differs from the previous one in being
based on the recovered local maxima. As expected, near the boundary
of the image, the region of accurate reconstruction for the
maxima-based method is complementary to that of the minima-based
one. Since the image boundary does not respect A2.1 (for either
method), the maximum-based method does better at those points near
the boundary where the steepest descent direction is outward, while
the minima-based one does better where this direction is inward.
Together, the two methods give reconstruction with error less than
three units over most of the image.
Abstract

Work on the Image Understanding Architecture (IUA)* [Weems, 1989]
has advanced in three major areas in the preceding year: hardware,
software, and applications. In the area of hardware, the first
generation IUA (proof-of-concept prototype) has been up and running
for over a year, the second generation is nearing completion, and we
have started the evaluation and design process for the third
generation IUA. With regard to software, we have completed the
compiler for the parallel C++ class library for the low level of the
IUA (along with a sequential version for workstations and a parallel
implementation for the CM-2), developed most of the basic software
for the intermediate level, and transported the Apply compiler for
the low level to the second generation. We have transported several
algorithms and applications to the first generation IUA, developed
new parallel algorithms for depth from motion and extraction of
straight lines, developed a parallel multi-prefix algorithm for the
low-level processor, completed refinement of the DARPA IU Benchmark,
and started development of the next generation of the benchmark.

1.  Image  Understanding  Architecture 
Hardware  Development 

The IUA is a heterogeneous parallel processor, consisting of three
different, tightly-coupled, parallel processors. The processors are
designed to address the differing requirements of the low,
intermediate, and high levels of a knowledge-based computer vision
system. The low-level processor is a reconfigurable mesh-connected
array of bit-serial processors operating in Single Instruction
Multiple Data (SIMD) associative and multiassociative modes (in
multiassociative mode, the processors divide into groups that can
perform associative queries with independent comparands and
results). Its purpose is to process image data and extract features
and image data that can be represented in a symbolic form. The
intermediate-level processor operates in Single Program Multiple
Data (SPMD) and Multiple Instruction Multiple Data (MIMD) modes, and
consists of an array of digital signal processors. The low and
intermediate levels communicate through a layer of dual-ported
memory. The high-level processor is a coarse-grained MIMD system
that will support knowledge-based processing. Again, a layer of
dual-ported memory connects the intermediate- and high-level
processors. The low-level processor receives its instructions from
the Array Control Unit (ACU), which also directs the intermediate
level (when it is operating in SPMD mode) and coordinates
interactions between the low and intermediate levels. The ACU is
directed by the host and the high-level processors in a
coarse-grained manner. Figure 1 shows an overview of the IUA
hardware. The architecture of the first generation IUA is given a
detailed treatment in [Weems, 1989].

The IUA has gone through two generations of development over the
last six years. The first generation had the goal of developing a
proof-of-concept prototype hardware and software system. A machine
with 4K low-level processors, 64 intermediate-level processors, and
a single high-level processor was constructed. The software system
included both FORTH and C languages with a library of parallel
functions and the ability to write expressions on parallel
variables. The system has been up and running for more than a year.
The prototype is controlled by a very simple ACU which was only
intended to demonstrate and test the functionality of the system; it
was not intended to run full applications. However, software has
since been developed to run such applications via the simple
controller. The applications are coded in the C++ class library that
is being developed for the low level processor of the second
generation IUA. We have thus been able to develop applications for
the second generation and run them on the first generation hardware.
•  Controls low and intermediate levels.
•  Takes commands from high level.
•  Receives global summary info.

•  Knowledge base, blackboard.
•  RISC processors (MIMD).
•  Instantiation of schema strategies.
•  Construction of scene interpretation.
•  Top-down MIMD control of grouping.

•  Array of digital signal processors.
•  SPMD/MIMD operation.
•  Executes grouping processes.
•  Stores extracted image events.

•  Reconfigurable array of bit-serial processing elements.
•  SIMD Associative / Multi-associative.
•  Processes sensory data.
•  Stores up to 15 seconds of imagery.

Figure 1. Overview of IUA Hardware


1.1 Second Generation IUA

With the second generation, the goal has been to develop an updated
and enhanced IUA that can be embedded in the DARPA Unmanned Ground
Vehicle (UGV). The basic three-level structure of the architecture
has been retained, but the architecture of each level has been
enhanced, the ACU has been designed to support real applications,
and an I/O subsystem has been added.

The I/O subsystem has been designed to support the equivalent of 20
simultaneous sensor inputs at 512 x 512 x 8-bit resolution with
automatic mapping onto the processor virtualization scheme used for
the low level, with almost no latency. The I/O subsystem will also
support the selection of multiple regions of interest from an image.

At the low level, the density of the processors has increased so
that each chip now contains 256 elements instead of 64. The
increased processor density has enabled quadrupling the number of
processors while cutting the number of circuit boards in half.
Memory for the low-level processors has increased by a factor of
four, and I/O is greatly simplified. At the intermediate level there
are still 64 processors, but they are now 32-bit, 50 MFLOPS elements
with six integral 20 megabyte per second communication channels in
place of the first generation's 16-bit, 5 MIPS processors which had
only two 5 megabit per second channels. In the first generation, the
communication channels were to be connected by a centrally
controlled crossbar switch. In the second generation, each group of
four processors is directly connected to every other group. In place
of a single high-level processor, a commercial multiprocessor with
four or eight elements may be employed.

The second generation system is now nearing completion. The custom
chips used in the low-level processor have been fabricated and
tested. The 256 bit-serial processors on the chip, together with
their caches, consist of roughly 600,000 transistors. The ACU for
the second generation has been assembled and tested. It consists of
a macrocontroller (a SPARC-based single board computer) that directs
the microcontroller, which is a horizontal microengine and microcode
store with queued interfaces to the processor array. The backplane,
power supply and chassis for the system have been assembled and are
being tested. The processor and memory boards are currently being
fabricated. A design has been developed for an optional
daughterboard to enhance shared-memory access in the
intermediate-level processor. Hughes has indicated that the first
machine should be assembled by the end of March, but without the I/O
subsystem (whose construction has not yet been funded).

1.2 Third Generation IUA

Work has already begun on the analysis and design for the low-level
processor of the third generation IUA. UMass has developed a system
for capturing traces of programs written in the C++ class library as
they execute on an abstract parallel machine. The traces are then
fed to a simulation system that models hardware architectures with
different features and parameters. The system, called ENPASSANT
(Environment for Parallel System Simulation, Analysis and Test),
allows us to gather real performance data for different
architectural configurations, and to analyze the data statistically.
The performance data will then be contrasted with cost estimates for
the different configurations to produce a specification for the
third generation IUA. Figure 2 shows the structure of the ENPASSANT
system.


Connection Machine (CM-2) implementation using C++ and PARIS. The
implementation was carried out by a student who had no experience
with C++, the class library, or PARIS in roughly five months of
half-time effort.


[Figure 2. ENPASSANT System Structure. Recoverable box labels:
Low-level Vision Application; Target Machine Architectural
Specification; Input; Virtual Machine Constructor.]


2. IUA Compilers and System Software

Most of the system software development for the IUA is taking place
at Amerinex Artificial Intelligence Inc. (AAI), although the
University of Massachusetts is developing additional software for
the second generation IUA.

2.1 C++ Image Plane Class Library

AAI has completed development of the C++ class library for the low
level of the IUA, including the incorporation of additional
optimization code into the GNU C++ compiler. They continue to refine
the compiler optimization. The class library supports an image plane
data type. Planes may be specified to be of any size, and their
elements may be bit, byte, 16-bit integer, 32-bit integer, 16- and
32-bit unsigned values, and 32-bit floating point. If a plane is
larger than the size of the hardware array, it is automatically
mapped to a virtual processor array. In addition to arithmetic
expressions on planes, neighborhood operations, and reductions, the
class library also supports general routing with combining, and
multiassociative processing within regions called Coteries. An
automated test system has also been developed for the machine's
microcode library, to facilitate regression testing of new releases
of the class library run-time system.
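
The idea of mapping an oversized plane onto a virtual processor array can be illustrated with a small sketch. The actual IUA mapping scheme is not described here, so everything below is an assumption made for illustration: a simple block tiling, a square 128 x 128 physical array (16K elements, that is, the 4K first-generation array quadrupled), and the hypothetical helper names `virtual_tiles` and `to_physical`.

```python
def virtual_tiles(plane_h, plane_w, array_h, array_w):
    """Number of array-sized tiles needed to cover the plane (rows, cols)."""
    tiles_y = -(-plane_h // array_h)   # ceiling division
    tiles_x = -(-plane_w // array_w)
    return tiles_y, tiles_x


def to_physical(y, x, array_h, array_w):
    """Map plane element (y, x) to (tile index, processor coordinates)."""
    tile = (y // array_h, x // array_w)
    processor = (y % array_h, x % array_w)
    return tile, processor
```

Under these assumptions a 512 x 512 plane needs a 4 x 4 grid of tiles on a 128 x 128 array, so each physical processing element would stand in for 16 virtual ones.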

As  a  test  of  the  portability  of  the  C++  class  library  to 
other  parallel  architectures,  UMass  developed  a 


2.2  Intermediate  Level  Software 

For the intermediate-level processor, basic operating
system support, multitasking, and messaging have been
implemented on a TMS320C30 Single Board Computer
(SBC), and recently these were transported to another
SBC with two TMS320C40 processors that are
configured to simulate the intermediate level of the
IUA. The operating system is based on extending
Spectron Microsystems SPOX™ real-time OS for the
TMS320 series to support multiprocessing and
interprocessor communication. Programming at the
intermediate level is done with the Texas Instruments C
compiler for the TMS320C40, together with library
routines that support communication between processes
on different processors. A multiprocessor debugger,
based on GDB, has also been implemented for the
intermediate level. Work is now under way to transport
the KBVision™ system to the IUA, including the
Intermediate-level Symbolic Representation (ISR)
database.

2.3  Apply  Compiler  for  the  Low  Level 

UMass has implemented a version of the Apply
language [Harney, 1989] for the low-level processor of
the second generation IUA. Apply is a special-purpose
mini-language based on an Ada-like syntax, which
facilitates the programming of local-kernel image
operations. The compiler generates code compatible
with the C++ class library. It permits us to easily
import image processing operations written for the
CMU Warp or Intel iWarp machines. As part of testing
the compiler, we have also compiled and tested the Web
library of image processing routines, supplied with
Apply.
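
To make the idea of a local-kernel operation concrete, here is a minimal Python sketch (not Apply syntax, which is Ada-like). The programmer writes only the per-pixel function over a 3 x 3 window, and a driver applies it everywhere. The names `apply_kernel` and `box_mean` and the clamped border handling are assumptions made for illustration.

```python
def apply_kernel(image, pixel_fn):
    """Call pixel_fn on the 3 x 3 window around every pixel.

    Border pixels reuse the nearest valid row/column (clamping); this
    border policy is an assumption, not something the text specifies.
    """
    rows, cols = len(image), len(image[0])

    def clamp(index, size):
        return max(0, min(index, size - 1))

    out = []
    for r in range(rows):
        out_row = []
        for c in range(cols):
            window = [[image[clamp(r + dr, rows)][clamp(c + dc, cols)]
                       for dc in (-1, 0, 1)]
                      for dr in (-1, 0, 1)]
            out_row.append(pixel_fn(window))
        out.append(out_row)
    return out

def box_mean(window):
    """Per-pixel procedure: average of the nine window values."""
    return sum(sum(row) for row in window) / 9.0

smoothed = apply_kernel([[0, 0, 0], [0, 9, 0], [0, 0, 0]], box_mean)
```

Because the per-pixel function sees only its window, the same description can in principle be mapped onto one processor per pixel, which is the property that makes such a language architecture independent.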

3.  Applications 

Most  of  the  application  development  has  taken  place  at 
the  University  of  Massachusetts,  although  Hughes  has 
implemented  some  algorithms  for  ATR.  The  Hughes 
algorithms  are  proprietary  and  will  not  be  described 
here. However, they do demonstrate some of the
potential of the IUA when it is applied to real
problems.

3.1  DARPA  Benchmark 

As recommended by the DARPA IU Benchmark
Workshop participants, much of the benchmark
[Weems, 1988, 1990b] has been recoded as a set of
library routines which are called by the core of the
benchmark. It is thus possible for implementors to
make use of the image processing operations in other
applications, and thereby amortize the development cost
of implementing the benchmark. We have also begun
developing the second level benchmark, which will
incorporate tracking of moving objects over a sequence
of images.

In order to reuse as much of the static benchmark as
possible, the new benchmark will operate on the same
type of scenes (a 2.5 dimensional mobile of rectangles
with chaff), but in the new benchmark, the mobile and
chaff will be blown by an idealized wind. The goal of
the new benchmark is to test system performance over a
longer period of time so that, for example, caches and
page tables will be filled. The benchmark will also
explore I/O and real-time capabilities of the systems
under test, and involve more high-level processing.

3.2  Multi-Prefix  On  The  Low  Level 

The communication network in the low-level processor
of the IUA is a square mesh, augmented with a second
(reconfigurable) mesh, called the Coterie Network
[Weems, 1990a]. The Coterie Network allows the mesh
to  be  partitioned,  for  example,  into  areas  corresponding 
to  regions  in  an  image.  One  particularly  useful 
operation  is  the  ability  to  enumerate  elements  within  a 
partition  or  to  summarize  (reduce)  the  information  in  a 
partition  to  a  single  value.  Parallel  prefix  is  the  general 
form  of  this  type  of  operation.  It  is  especially  desirable 
to  carry  out  parallel  prefix  in  all  partitions  at  once,  i.e. 
to  perform  a  multi-prefix  operation. 
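
A sequential Python sketch may help fix what a multi-prefix operation computes; it says nothing about how the Coterie Network carries it out in parallel. Each element carries a partition label, and the function returns both the running (inclusive prefix) result inside each partition and the final reduction per partition. The function name and the list-based representation are assumptions for illustration.

```python
def multi_prefix(values, labels, op=lambda a, b: a + b):
    """Inclusive prefix of `op` computed independently within each partition.

    Returns (prefix, totals): prefix[i] combines all earlier elements that
    share labels[i]; totals maps each label to its full reduction.
    """
    running = {}
    prefix = []
    for value, label in zip(values, labels):
        running[label] = op(running[label], value) if label in running else value
        prefix.append(running[label])
    return prefix, running

# Enumerating elements: with every value equal to 1, the prefix numbers the
# members of each region and the totals give the region sizes.
ranks, sizes = multi_prefix([1, 1, 1, 1, 1], ["a", "b", "a", "a", "b"])
```

Passing `max` or `min` as `op` gives per-region extrema in the same way, which is the kind of region parameter that feature extraction needs.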

Since feature extraction algorithms typically process
thousands of regions during each of many passes over
an image, the efficient computation of region
parameters by a massively parallel processor requires
that those regions be processed simultaneously. The
difficulty when the processor is a SIMD array is in
orchestrating non-uniform data-dependent communication
using a single thread of control. In this section we
describe briefly how the method of "Coterie Structures"
can be used to overcome this difficulty and create
efficient region reduction and multi-prefix algorithms
for the low level. This work is presented in detail in
[Herbordt, 1992a] and [Herbordt, 1992b].

Previous reduction methods either use the standard mesh
connections to route and combine packets [Herbordt,
1993], or use the coterie network to iteratively merge
rectangles [Jenq, 1991]. The problem with the first method
is that it requires a number of data transfers between
neighboring processors that is proportional to the size
of the largest region. For many images, this factor is
the diameter of the entire array. The second method only
requires a number of broadcast operations proportional
to the logarithm of the size of the array. However, the
proportionality constant is at least 128, and broadcast
operations are somewhat more costly than neighbor
transfers. Rectangle merging is thus even slower than a
combining route for practical applications.

In  developing  efficient  reduction  algorithms  for  the 
CAAPP  it  is  necessary  to  use  the  coterie  network, 
because  it  reduces  the  complexity  from  being 
proportional  to  the  size  of  the  image  to  being 
proportional  to  the  log  of  the  size  of  the  image. 
However,  we  use  a  very  different  approach  than  the 
typical  method  of  iteratively  merging  partitions  of  the 
regions.  We  begin  by  characterizing  coterie  structures 
(configurations  of  contiguous  PEs  sharing  a  circuit)  by 
their  geometric  properties. 

Some simple coterie structures are horizontal and
vertical lines, and rectangles. These structures are
known to have the property of supporting reduction in
log(N) broadcast operations. To these we have added the
simple closed contour, boundary contour, singly
vertically (or horizontally) connected contour, spanning
tree, and the generic coterie structure, the coterie itself.
Examples of these structures, as derived from an actual
image segmentation, are shown in Figure 3. We have
used a two-part strategy: to create efficient reduction
algorithms for whatever structures we can, and to create
transformation algorithms to partition more complex
structures into simpler ones. Both parts of the strategy
necessarily depend on information that PEs can obtain
about the network configuration in constant time, so
that they can dynamically repartition the array.

Some  of  the  basic  results  we  have  obtained  in  our  study 
of  coterie  structures  are  as  follows: 

•  Reduction can be computed on a singly vertically (or
horizontally) connected contour using log(N)
broadcast operations, thus matching the performance
of this algorithm on simple rectangles.

•  Reduction can be computed on a spanning tree using k
*  log(N) broadcast operations, where k is a small
constant.

•  Any coterie can be partitioned into a minimal set of
singly vertically (or horizontally) connected contours
in constant time.

Figure 3. Example Coterie Structures: a) image of a road scene, b) segmentation of the image, c) 32 x 32 pixel
subimage of the segmentation, (d-h) coterie structures derived from the subimage: d) coteries corresponding
to regions, e) horizontal lines, f) boundary contours, g) singly vertically connected contours, h) spanning trees.

These  results  have  been  integrated  into  three  efficient 
algorithms  for  the  simultaneous  computation  of 
reductions  on  all  contiguous  aggregates  (of  arbitrary 
shape)  in  an  image: 

•  A deterministic algorithm using log(N) + 4S broadcast
operations, where S is a small constant for most
array partitions.

•  A randomized algorithm, using an approach similar to
that of Phillips [Phillips, 1989] and of Miller and Reif
[Miller, 1985], that uses k * log(N) broadcast
operations, where k is a small constant with high
probability for all images. We have also shown
that the algorithm is much more practical when the
coteries can be preprocessed into spanning trees.

•  A hybrid algorithm that combines the deterministic
algorithm with associative techniques and requires
only log(N) + T broadcast operations, where T has
been found to be less than 20 for virtually all
images.

These  algorithms  are  significant  in  that  they  are  likely 
to  be  the  fastest  available  for  reconfigurable  broadcast 
networks  for  many  images  derived  from  real-world 
scenes. 
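
The operation counts quoted in this section can be compared with a few lines of arithmetic. The sketch below assumes base-2 logarithms and takes N to be 512, neither of which is stated in the text; it uses the quoted lower bound of 128 for the rectangle-merging constant and the quoted upper bound of 20 for T. The function names are hypothetical.

```python
import math

def rectangle_merge_broadcasts(n, constant=128):
    """Broadcast count for iterative rectangle merging: constant * log(N)."""
    return constant * math.ceil(math.log2(n))

def deterministic_broadcasts(n, s):
    """Broadcast count for the deterministic algorithm: log(N) + 4S."""
    return math.ceil(math.log2(n)) + 4 * s

def hybrid_broadcasts(n, t=20):
    """Broadcast count for the hybrid algorithm: log(N) + T, with T bounded by 20."""
    return math.ceil(math.log2(n)) + t

n = 512  # assumed value of N, for illustration only
merge_cost = rectangle_merge_broadcasts(n)
hybrid_cost = hybrid_broadcasts(n)
```

With these assumptions, rectangle merging needs on the order of a thousand broadcasts while the hybrid algorithm needs a few tens, which is the gap the coterie-structure approach is meant to close. The figures are counts of broadcast operations only, not measured times.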

3.3  Depth  From  Motion 

UMass has developed a parallel algorithm for the IUA
that computes a dense depth map for a scene from a pair
of images taken by a moving sensor. The only
restrictions on the algorithm are that the motion of the
sensor must be roughly in a forward direction, that is,
the focus of expansion must be in or near the area of the
images. It is also assumed that the rotational
component of the motion is small (a few degrees)
between the images. This would be typical of imagery
obtained from a fixed camera mounted on a vehicle
moving forward through an environment. The algorithm
has an average error of about 8 percent in depth, as
computed from randomly sampling points
corresponding to objects in the scene with known
distances from 21 to 76 feet from the camera. The
algorithm also seems to be able to distinguish depths of
objects at distances beyond what was measured in
collecting the data; trees at a computed distance of 90
feet stand out clearly from the background, for example.

The  algorithm  begins  by  determining  pixel 
correspondences  using  a  3  X  3  correlation  applied  over 
the  entire  image  for  all  the  possible  displacements 
between  the  two  images,  within  a  specified  search 
window.  It  then  selects  the  best  image  displacements 
using  an  interest  operator,  and  partitions  the  image  into 
tiles.  Within  each  tile,  the  best  of  the  best 
displacements  is  selected  and  the  intersections  of  all  of 
the  displacement  vectors  are  computed.  A  Hough 
transform  is  applied  to  the  set  of  intersections  and  a 
focus  of  expansion  (FOE)  is  selected  from  the  Hough 
array.  Once  the  approximate  FOE  is  determined,  an 
optimization  process  is  used  to  estimate  the  best 
rotational  parameters  and  then  the  translational  motion 
is determined. Once the motion parameters are
determined, the image displacements, together with
camera parameters, are used to determine the depths of
points in the scene.

Figure 4. Result of Depth from Motion Algorithm: a) first image in sequence, b) second image, c) depth map
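
The final depth computation is not spelled out in this summary. As a hedged illustration only, the sketch below uses the standard pinhole-camera relationship for pure forward translation: a point at image distance r1 from the focus of expansion moves to distance r2 after the camera advances by T, and its depth at the first frame is Z = T * r2 / (r2 - r1). It assumes rotation has already been compensated; the function name and argument layout are hypothetical and this is not claimed to be the authors' formulation.

```python
import math

def depth_from_foe(p1, p2, foe, forward_motion):
    """Depth at the first frame of a point seen at p1, then at p2.

    Assumes a pinhole camera moving straight ahead by `forward_motion`
    (same units as the returned depth) with rotation already removed.
    Returns None when the point does not move away from the FOE, since
    the displacement then carries no usable depth information.
    """
    r1 = math.hypot(p1[0] - foe[0], p1[1] - foe[1])
    r2 = math.hypot(p2[0] - foe[0], p2[1] - foe[1])
    if r2 <= r1:
        return None
    return forward_motion * r2 / (r2 - r1)

# A point 40 feet away, camera advancing 4 feet: image radius grows from 25 to 250/9 pixels.
z = depth_from_foe((25.0, 0.0), (250.0 / 9.0, 0.0), (0.0, 0.0), 4.0)
```

Points close to the focus of expansion have small r1 and r2, so the denominator is dominated by measurement noise there; this is consistent with depth estimates being unreliable near the FOE.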

The experiments were done with fairly large
displacements (four feet of forward motion between the
images) so that a large (41 x 41 pixel) search window
was required to establish correspondences. This results
in 1681 image-to-image correlations being performed.
In simulations on the second generation IUA, it was
determined that the execution time will be about 0.54
seconds, of which 0.53 seconds is taken up solely by
the correlations. We are thus looking into approaches in
which an estimate of the motion is available or in
which a series of frames with smaller displacements can
be used (allowing the search window to be constrained).
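
The arithmetic behind these figures is simple enough to check directly: a 41 x 41 search window gives 41 * 41 = 1681 candidate displacements, so 0.53 seconds corresponds to roughly 0.3 milliseconds per whole-image correlation. The sketch below also estimates the effect of a smaller window, assuming that correlation time scales linearly with the number of displacements; that scaling, and the function names, are assumptions rather than results from the text.

```python
def correlation_count(window_side):
    """Number of image-to-image correlations for a square search window."""
    return window_side * window_side

def estimated_correlation_seconds(window_side, ref_side=41, ref_seconds=0.53):
    """Scale the reported 0.53 s (41 x 41 window) linearly with correlation count."""
    return ref_seconds * correlation_count(window_side) / correlation_count(ref_side)

total = correlation_count(41)                 # 1681 correlations
per_correlation = 0.53 / total                # about 0.000315 seconds each
half_window = estimated_correlation_seconds(21)
```

Under the linear assumption, roughly halving the window side to 21 pixels would cut the correlation time to about a quarter, which is why constraining the search window is attractive.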

Figure 4 shows two images from a motion sequence
taken at Carnegie Mellon University, and a depth map
that has been grey-coded into a small number of ranges.
The border around the depth map corresponds to areas in
the first image that have passed out of view in the
second image. The other black areas represent regions in
which there is a low confidence in the correspondence.
For example, the area in the upper right corner is near
the focus of expansion. It is possible to distinguish one
of the cones and two of the distant trees in the depth
map. Figure 5 shows an enlargement of the area around
the four cones in the right half of the scene in Figure 4.
The grey-coding of the depth map in Figure 5 permits
greater resolution of depth in that part of the map. Three
of the cones produce distinct features in depth. The cone
in the foreground is about 36 feet away, and the two
cones in the middle are about 56 feet away. The results
of this algorithm are described in detail in an article by
Dutta [Dutta, 1993], elsewhere in these proceedings.

3.4  Line  Extraction 

UMass  has  also  developed  a  parallel  algorithm  for 
extracting  straight  lines  from  an  image.  The  algorithm 
begins  with  a  Sobel  operation  to  compute  a  gradient 
field.  The  gradient  orientations  exceeding  a  threshold  are 
then  divided  into  two  sets  of  overlapping  buckets. 
Using  these  buckets,  the  image  is  segmented  into  two 
sets  of  regions  using  a  connected  component  labelling 
operation.  The  regions  correspond  to  areas  with  a  high 
gradient  magnitude  and  similar  orientation.  Thus,  they 
tend  to  be  long  and  narrow  areas  surrounding  candidate 
lines. These regions are naturally represented as Coteries
in the reconfigurable mesh of the IUA.
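
The use of two overlapping sets of orientation buckets can be sketched as follows. The bucket count of eight, the half-bucket offset between the two sets, and the reading of the threshold as a gradient-magnitude threshold are all assumptions for illustration; the text gives none of these parameters. The function name is hypothetical.

```python
import math

def orientation_buckets(gx, gy, threshold, n_buckets=8):
    """Return the pixel's bucket index in each of two overlapping partitions.

    The second partition is shifted by half a bucket width, so an edge whose
    orientation straddles a boundary in one partition falls cleanly inside a
    bucket of the other. Returns None for gradients at or below the threshold.
    """
    if math.hypot(gx, gy) <= threshold:
        return None
    width = 360.0 / n_buckets
    angle = math.degrees(math.atan2(gy, gx)) % 360.0
    first = int(angle // width) % n_buckets
    second = int(((angle + width / 2.0) % 360.0) // width) % n_buckets
    return first, second

# An orientation of 40 degrees sits near the 45-degree boundary of the first
# partition but well inside bucket 1 of the shifted partition.
near_boundary = orientation_buckets(math.cos(math.radians(40)),
                                    math.sin(math.radians(40)), 0.5)
```

Connected component labelling is then run once per partition, giving the two sets of regions that the algorithm goes on to merge.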

Figure 5. Enlargement of an Area in Figure 4 (panels a and b)

Figure 6. Results from the line algorithm executing on the IUA simulator: a) a road scene, b) the scene in Figure 4,
c) a low-resolution (64 x 64) image of part of a house, d) an indoor hallway scene with obstacles

The two sets of regions are then merged into a single
set by forming the intersection of the two sets. Pixels
in these overlapping portions of regions choose to
belong to the larger of the two regions, and the Coteries
are appropriately reorganized. The regions are then split
into three sections along their length. Within each
section, the three points with the greatest gradient
magnitude are selected. Then a line is fit to the nine
selected points within each region. Finally, the
endpoints for the lines are computed and output.
Optionally, a filter based on contrast and length can be
used to select different sets of lines for output. In
execution on the IUA simulator, the algorithm executes
in 31 milliseconds for images that map to the array
with a 1:1 virtualization ratio. The execution time will
decrease once special optimizations for the 1:1 case have
been added to the compiler. However, higher
virtualization ratios will be approximately direct
multiples of this time (i.e. a 4:1 factor results in a time
of approximately 124 milliseconds). Figure 6 shows the
results of applying the algorithm to six different images
taken from a variety of scenes. No filtering has been
applied to the lines in the figure.
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4.  Conclusions 

During the previous year, the Image Understanding
Architecture has substantially advanced toward the goal
of developing a ruggedized, embeddable, reproducible
hardware and software system for real-time image
understanding applications. The first generation system
is being used to test programs that have been developed
with the second generation software environment. The
second generation hardware is nearly complete, as is its
basic software environment. Porting of the KBVision™
system to the second generation has begun. Research
has continued in the areas of benchmarking,
architectural analysis and design, and the development of
both basic parallel algorithms and vision applications
for the IUA.

Acknowledgements 

The authors thank David Shu, Lap Chow, Dennis Finn,
Howard Neely, and David Schwart[z] at Hughes Research
Laboratories, and Charles [surname illegible] and Mitchell Smith at
VillaMar Inc. for their contributions to the IUA design
and development.

Bibliography 

[Dutta, 1993] Dutta, R., Weems, C.C., Riseman,
E.M., Parallel Dense Depth from Motion on the Image
Understanding Architecture, Proc. of the DARPA Image
Understanding Workshop, April, 1993.

[Harney, 1989] Harney, L.G.C., Webb, J.A., and Wu,
I-C., An Architecture Independent Programming
Language for Low-Level Vision, Computer Vision,
Graphics and Image Processing, Vol. 48, 1989, pp. 246-264.

[Herbordt, 1990] Herbordt, M.T., Weems, C.C.,
Message Passing Algorithms for a SIMD Torus with
Coteries, Proc. ACM Intl. Symposium on Parallel
Algorithms and Architectures, Crete, July, 1990.

[Herbordt, 1992a] M.C. Herbordt, C.C. Weems, M.J.
Scudder (1992): Non-Uniform Region Processing on
SIMD Arrays Using the Coterie Network, Machine
Vision and Applications, 5 (2), pp. 105-125.

[Herbordt, 1992b] M.C. Herbordt, C.C. Weems (1992):
Computing Reduction and Parallel Prefix Using Coterie
Structures, Proceedings of the 4th Symposium on the
Frontiers of Massively Parallel Computation, pp. 141-149.

[Herbordt, 1993] M.C. Herbordt, J.C. Corbett, J.
Spalding, C.C. Weems (1993): Practical Algorithms for
Online Routing on SIMD Meshes, to appear in the
Journal of Parallel and Distributed Computing.

[Jenq, 1991] J.-F. Jenq, S. Sahni (1991):
Reconfigurable Mesh Algorithms for the Area and
Perimeter of Image Components, Proceedings of the
20th International Conference on Parallel Processing,
Volume III, pp. 280-281.

[Miller, 1985] G.L. Miller, J.H. Reif (1985): Parallel
Tree Contraction and its Applications, Proceedings of the
28th IEEE Conference on the Foundations of Computer
Science.

[Phillips, 1989] C.A. Phillips (1989): Parallel Graph
Contraction, Proceedings of the 1st ACM Symposium
on Parallel Algorithms and Architectures, pp. 148-157.

[Weems, 1988] Weems, C.C., Hanson, A.R., Riseman,
E.M., Rosenfeld, A., An Integrated Image
Understanding Benchmark: Recognition of a 2-1/2D
"Mobile", Proc. IEEE Conf. on Computer Vision and
Pattern Recognition, Ann Arbor, MI, June 5-9, 1988.

[Weems, 1989] Weems, C.C., Levitan, S.P., Hanson,
A.R., Riseman, E.M., Shu, D.B., Nash, J.G., The
Image Understanding Architecture, Intl. Journal of
Computer Vision, 2, 251-282 (1989), Kluwer Academic
Publishers, Boston, MA.

[Weems, 1990a] Weems, C.C., Rana, D.,
Reconfiguration in the Low and Intermediate Levels of the Image
Understanding Architecture, in Reconfigurable SIMD
Parallel Processors, Hungwen Li (ed.), Prentice Hall,
Englewood Cliffs, NJ, 1990. Also, COINS TR# 90-10.

[Weems, 1990b] Weems, C.C., Riseman, E.M.,
Hanson, A.R., Rosenfeld, A., The DARPA Image
Understanding Benchmark for Parallel Computers,
Journal of Parallel and Distributed Computing, Vol. 11,
No. 1 (January, 1991), pp. 1-24.




Integrating the Lisp/CLOS-C/C++ Environments
An  Approach  to  Modular  Interface  Formats 


Jon  L  White 

Lucid,  Inc. 

tel:  415-329-8400  (ext.  5514) 
jonl@lucid.com 


Abstract 

Our goal is to produce a major step up in the degree of
integration of the C++-written and Lisp-written
components of a large software project. Towards this end,
we propose to make a set of coordinated, minor, but
non-proprietary, changes to the Lucid C and C++ compilers,
to the Lucid Common Lisp foreign loader, and to the
publicly available debugger GDB, and to extend the publicly
available EMACS text editor for increased
Editor-to-Lisp communication.

We believe it is preferable to co-develop a C++
(and/or C) compiler and debugger (i.e., GDB) along
with extending a Common Lisp's Foreign Language
Interface, rather than trying to focus the burden of
implementation entirely on one side or the other. Lucid, Inc.
is in a rather unique position amongst software vendors
to carry out this combined design and implementation,
as it has developed and is currently marketing both a
very high quality Common Lisp product and a very high
quality C/C++ compiler product. In addition, Lucid is
maintaining a version of GnuEmacs and is cooperating
with the evolution of GDB towards this same end.

1  Overview 

First, we plan to alter Lucid’s C++ compiler in a way
which, while not altering the semantics of the languages
being compiled, will greatly increase the availability of
typing and structural information both at the time of
loading in the compiled object file, and at the time of
execution of the program. This will provide better
debugging with tools such as GDB and the Lisp “Inspector”,
and allow increased automation in generating the
interface descriptions necessary for Lisp’s “Foreign Language
Interface” (generally referred to as the “Foreign
Function Interface”, or FFI, but for years it has been more
general than merely a Function-to-Function protocol).
These changes will add new information to a section of
the compiled object-file called debug-info, and will thus
not impact standard linkers and loaders.

Coordinated changes in the Lucid Common Lisp
foreign loader will take advantage of this information to
provide automatic generation of Lisp-side class
declarations. The loading of a compiled C++ class will
automatically give rise to a CLOS class as a Lisp-side handle,
which will be of a metaclass belonging to
FOREIGN-METACLASS. These CLOS classes will have an
instance creation and initialization protocol on the Lisp
side which invokes the corresponding new object
allocators, and constructors, on the foreign side.
Furthermore, they will support optimizable SLOT-VALUE and
MAKE-INSTANCE protocols (meaning: under
conditions similar to those stated for metaclass
STANDARD-CLASS, the SLOT-VALUE calls can be open-coded to
achieve an access timing not significantly more than a
few memory cycles worse than optimal open-coded
C-struct access; and similarly they will support Lisp-side
optimizations such as are currently evident in the Lucid
4.0 product for the metaclass STANDARD-CLASS etc.).
C++ class members returned by calls out to foreign code
will be easily recognized as belonging to this metaclass,
and as being members of their own direct CLOS class,
upon which CLOS methods may specialize.
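
The idea of generating a handle class automatically from a loaded class description can be sketched by analogy in Python, whose metaclasses play a role loosely comparable to CLOS metaclasses. This is not Lucid's design: the dictionary format standing in for debug-info, the `ForeignMetaclass` and `class_from_debug_info` names, and the example classes are all hypothetical, and no foreign allocator or constructor is actually invoked.

```python
class ForeignMetaclass(type):
    """Stand-in for FOREIGN-METACLASS: tags and registers generated handle classes."""
    registry = {}

    def __new__(mcls, name, bases, namespace):
        cls = super().__new__(mcls, name, bases, namespace)
        mcls.registry[name] = cls
        return cls

def class_from_debug_info(info):
    """Build a handle class from a description {'name', 'bases', 'fields'}.

    Base classes must already have been generated, mirroring the idea that
    superclass chains are reflected on the handle side.
    """
    bases = tuple(ForeignMetaclass.registry[b] for b in info["bases"])
    inherited = [f for base in bases for f in base.foreign_fields]
    all_fields = tuple(inherited + list(info["fields"]))

    def __init__(self, **values):
        unknown = set(values) - set(all_fields)
        if unknown:
            raise TypeError(f"unknown fields: {sorted(unknown)}")
        for field in all_fields:
            setattr(self, field, values.get(field))

    namespace = {"foreign_fields": all_fields, "__init__": __init__}
    return ForeignMetaclass(info["name"], bases or (object,), namespace)

# Hypothetical descriptions, as a loader might extract them from an object file.
Shape = class_from_debug_info({"name": "Shape", "bases": [], "fields": ["id"]})
Circle = class_from_debug_info({"name": "Circle", "bases": ["Shape"], "fields": ["radius"]})
circle = Circle(id=1, radius=2.0)
```

Because every generated class shares one metaclass, code can recognize a returned object as foreign by inspecting the class of its class, while ordinary dispatch still specializes on the direct class.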

Additional advantage will be taken of the extra debug-info
to provide consistency checking of the
Lisp-side interface declarations existing at the time of
object-file loading; and under an option, such
declarations can even be automatically generated also, since
the object files will contain a sufficiently rich amount of
type-declaration information about functions and their
arguments. Metaobject information will be available
also to assure a Lisp-side reflection of the C++
superclass chains, and also a dynamic INSPECT capability
for C++ objects.

We will improve and automate the interface between
GDB and Lisp. From the GDB side, GDB will keep
and extend normal symbol-table information for each file
that is “foreign-loaded” into Lisp; and GDB will support
extended commands that know about the format of Lisp
data structures, so that “poking around” from GDB into
a stopped Lisp image will be relatively easy to do. For
example, GDB will mimic the part of the Lisp package
system and PRINT and READ functions, to offer a
Lisp-style interface instead of merely octal or hex “dumps”;
it will be possible to set GDB breakpoints using Lisp
function names, Lisp function names will show up on
the stack backtrace, and Lisp data will be printed out
in a format similar to what PRINT would do. From the
Lisp side, the foreign loader will prepare the necessary
incremental symbol-table information for GDB, and the
Lisp debugger will integrate the foreign function frames
into its backtrace.

We will solidify the existing practice of Emacs-to-Lisp
interfaces by extending the current style of such
interfaces and embedding it into our Lisp product. A
programmer desires of his development environment some
aids during the step of program preparation (for
Common Lisp, this is usually done using a text-editor tool).
A number of different approaches towards this end have
already been tried out by end-users and by other Lisp
vendors; we will integrate the better ideas, and try to
work for some sort of de facto standard with the other
vendors, in that the protocol on the EMACS side will
be freely-available EmacsLisp code using a network-like
protocol. Indeed, any program (another vendor’s Lisp, or
a random application) can respond to such a protocol.

2  Partitioning  into  “Protocols” 

Since this work involves the integration of familiar
tools (Emacs, GDB, Common Lisp, and C++
compilers), the question arises as to whether the
communication between tools is an open protocol. We would
like to make it so, and plan the following steps to
assure this. In addition, we are taking this opportunity to
see how much similarity we can find between the
“Foreign Function Interfaces” of some of the major vendors
of Common Lisp on Unix workstations.

•  We plan the specification of a minimal, de facto
standard for a Foreign Language Interface for
Common Lisp. Primarily, this will be worked out with
Harlequin, Ltd., whose basic FFI resembles Lucid’s
closely. We will work together to extend this as
much as possible for C++ and for areas yet
lacking in Lucid’s published designs; in particular, the
notion of a FOREIGN-METACLASS for CLOS will
be spelled out.

•  We plan the specification of a minimal, de facto
standard for a user-interface for the Emacs/Lisp
communication necessary for an integrated
debugging environment. Primarily, this will be worked
out with Franz, Inc., who has already placed some
EmacsLisp code into the public domain for a part
of an Emacs side of such an interface. The intention
is that extending Emacs protocols will provide for
a common interface for symbolic debugging both of
Lisp and of C++.

•  We plan the specification of a GDB protocol for
improved interfacing and communications with a Lisp
subprocess, and for making incremental additions to
its symbol table. This work will generally be done in
conjunction with GDB improvements being
concurrently made by Lucid, Inc. (for improvements to the
Energize product), by Cygnus, Inc. (under contract
from Lucid), and by the Free Software Foundation.

•  We plan the specification of additional debug-info to be
output by a C and/or C++ compiler, in support of
improved typing at runtime for use by
standard debugging tools like GDB and DBX or by
a Common Lisp environment. Furthermore,
techniques will be documented for specific extensions to
Lucid’s C and C++ compilers which will guarantee
a necessary minimum of such debug-info in a
compiled object-file, such that Lucid’s object-file loader
within the Lisp environment (the so-called “Foreign
Loader”) can mechanically extract necessary
interfacing information for the Lisp side of the Foreign
Language Interface. These would be implemented
in a product-level version of Lucid’s C++ compiler,
as well as in a product-level version of the Common
Lisp product.

3  Coordinated Extensions of the C++ Compiler and Lisp Environment

Why extend the debug-info from a C/C++ compiler,
rather than tracking the compiler's decisions in a
parallel, possibly Lisp-written, program? This question
almost answers itself when one considers the complexity of
fully correct parsing of C++ and the current turbulence
in C++ compiler technologies. It is safe to say that in
general one may not inter-mix the object files from two
independent C++ compilers, although one may
generally do this for C compilers on any particular platform
(the so-called common object file formats work because
there is a common notion of how to represent
nomenclatures in the object files). It is easy and straightforward
for a C++ compiler to emit a modest amount of extra
information into a pre-defined debug-info format; it is
easy for a Lisp’s foreign loader module to continue
parsing and remembering this extra information. But trying
to track the compiler’s decisions in a parallel parsing
program will certainly be subject to the usual problems
of parallel versions: trying to keep them in lock-step for
every internal decision about runtime representations.
An additional debug-info protocol emitted by the
compiler is the safe step here.

Furthermore, persons working totally in the C++
world often wish that their debugger had access to
more of the kinds of runtime typing information that
Lisp users are accustomed to. One of Lucid’s compiler
wizards, himself a member of X3J16, the standardization
body for C++, notes that there are continuing
suggestions to that body requesting more of the runtime
typing structures and capabilities in compiled C++
modules (see also Stroustrup 1992). In short, extensions
to C++ for increased availability at runtime of
typing information, at least in some sort of debugging
mode, will likely be of as much use and desirability to
the general C++ community as to the Lisp community.
While this goal is, in the long run, much greater than
one project and one compiler vendor can handle, we feel
that the Common-Lisp community could have much to
contribute to these discussions, especially in view of a
compiled Lisp/CLOS-C/C++ integration.

Finally, the use of a conventional debugger like GDB
has until now been at a disadvantage because it (1) has
no access to the symbols of the dynamically foreign-loaded
modules, and (2) has no understanding of Lisp
data formats including stack layout. We have a pilot
version of a communication mechanism between Lucid
Common Lisp and GDB whereby GDB’s symbol table
may be extended, by explicit user request, to incorporate
the Lisp Foreign Loader’s symbol table information
also. In the pilot, the Lisp user explicitly requests the
dumping of the entire “foreign” symbol table into a file, and
then from GDB the user explicitly overloads GDB’s symbol
table with this file. Lucid is currently developing
protocols in GDB for it to communicate with Lucid’s
Energize(TM) product through a network socket protocol;
we plan to extend these protocols so that it can automate
the preparation and acquisition, incrementally, of
the symbol-table information kept by the Lisp foreign loader.

4  Other  Research  in  Related 
Areas 

While most other vendors of Lisp have some form of
defined techniques for interfacing to foreign code, typically
with limitations similar to those found in the Lucid
product for the past five years, there have always
been nagging gaps: for example, incomplete ability to
interface to foreign data. Indeed, Lucid’s earlier
implementations, like several other vendors’, only provided an
interface to functions; by extending the work found in
DEC’s VAXLISP product, Lucid came up with a foreign
types interface [see Sexton 1988], and automated more
of the necessary data representation conversions.


About three years ago, Lucid and Sun Microsystems
investigated what it would take to do what might be
called a “seamless” integration of Lisp and C coding,
by re-implementing a Lisp system from the beginning
with that goal in mind. The effort was known by the
acronym NCL, meaning “Native Common Lisp”, and the
outline of such a design was submitted to DARPA in
response to BAA 90-15. Although the NCL work was
not continued, ideas similar to it abound today in the
form of one- or two-person research projects which are
geared towards the demonstration of the feasibility of
one or more components of it. See for example Bartlett
(1988); see also Muller and Rose (1992) and Hennessey
(1992).

Except for the work of Hennessey, most of these
research projects have been undertaken in Scheme rather
than in Common Lisp. The main attractiveness of
Scheme as a vehicle for experimentation is that it
can be taken to be a very small language; although most
commercially available Scheme implementations have
extensive development beyond the formal standard
specification (e.g., macros, etc.), a very small defined set of
capabilities is within the scope of a one- or two-man-year
project. By contrast, a commercially rugged Common
Lisp effort is well beyond that in development cost.
(The successful Common Lisp vendors probably have
more than 20 man years each invested in their products;
and probably a guess of 100 man years might not
be off the mark.) Simply because of the added complexity
of real Common Lisp systems, and the variety of
factors competing for trade-offs in the efficiency arena,
it is difficult to say how, and even if, the limited success
of these Scheme prototypes would generalize to an
industrial-quality Common Lisp.

Although  such  radical  re-implementation  approaches 
are  a  fertile  ground  for  research  projects  in  advanced 
programming  language  features,  we  do  not  believe  this 
approach  to  be  suitable  to  the  needs  of  the  RADIUS 
projects.  Our  current  proposal  is  an  evolutionary  ap¬ 
proach  with  much  less  associated  risk. 

A similar, evolutionary approach has been started
by Harlequin, Ltd., which offers a modest extension of
their foreign-function interface to do some aspects of a
C++ interface, including an extension to their CLOS
class system to be able to reflect C++ instances into
the Lisp class hierarchy. The generation of such interfaces
manually (as required by the Harlequin approach)
is well within currently accepted FFI technology; but
still there are a good many unresolved problems with
this approach, and with the Harlequin facility in particular.
The most serious mismatches and unresolved
problems are due to the lack of uniform runtime typing
on the C++ side, and due to the need to write Lisp
programs which try to second-guess what an increasing
number of incompatible C++ compilers are doing about
name mangling, about representation changes for
optimizing efficiency, etc.
Abstract

This  paper  describes  the  design  and  implementa¬ 
tion  of  a  Single  Instruction  Multiple  Data  (SIMD) 
depth-from-motion  algorithm  on  the  Image  Un¬ 
derstanding  Architecture  Simulator.  Correspon¬ 
dences  are  established  in  parallel  for  two  tempo¬ 
rally  separated  images  through  correlation.  The 
correspondences  are  used  to  determine  the  trans¬ 
lational  and  rotational  motion  parameters  of  the 
camera  through  a  parallel  motion  algorithm.  This  is 
done  by  first  determining  the  approximate  transla¬ 
tional  parameters  and  then  constraining  the  search 
for  the  exact  translational  and  rotational  param¬ 
eters.  Finally,  the  dense  depth  map  is  computed 
from the image correspondences and the computed
motion parameters. Results are analysed for three
image sequences acquired from mobile vehicles (the
Autonomous Land Vehicle, the Carnegie-Mellon
NAVLAB, and the UMass Denning Robot). Depths
are obtained at an average accuracy of about 8% in
outdoor  image  sequences.  The  depth  maps  are  pro¬ 
cessed  to  locate  relatively  small  obstacles  like  cans 
and  cones  to  a  distance  of  about  60  feet.  Larger 
obstacles  like  hills  are  located  even  when  they  are 
much  further  away.  Issues  related  to  the  speedup 
and  accuracy  of  the  computationally  intensive  prob¬ 
lem  of  motion  analysis  are  explored  in  the  context 
of  the  algorithm. 

1  Introduction  and  Motivation 

Motion  analysis  is  one  of  the  most  computationally 
intensive  tasks  in  computer  vision.  Usually  motion 
algorithms  have  relied  on  some  form  of  point  or 
feature  correspondences  between  two  or  more  per¬ 
spective  views  [1-9].  These  correspondence-based 
approaches  take  advantage  of  the  image  displace¬ 
ments  induced  by  egomotion.  Most  such  methods 
match a few hundred points or features in two temporally
separated images and quantitatively measure
the image displacements. A consistent set of
motion parameters is then determined to explain
these displacements. Once the motion parameters
have been determined, the depth of environmental
points can be found by using their individual image
displacements.

* This work was supported in part by the Defense Advanced
Research Projects Agency (via Harry Diamond Labs) by
contract no. DAAL03-91-K-0047 and NSF grant CDA-8933572.

For  autonomous  navigation  it  is  not  enough  to 
compute  depth  at  a  few  hundred  isolated  points  in 
the  image.  In  order  to  detect  and  avoid  obstacles  it 
is  necessary  to  find  depth  at  a  dense  set  of  points. 
In  addition,  for  practical  scenarios  it  is  desirable  to 
process  a  large  number  of  high  resolution  images 
within a small period of time. The example below
calculates  the  number  of  pairs  of  frames  that  must 
be  processed  per  second  in  a  typical  scenario  for 
motion  analysis  (through  the  use  of  point-based  or 
feature-based  correspondence  methods). 

Let us assume that our camera has a resolution
of 256 x 256 pixels and a field of view of 45°.¹

Let us further assume that the vehicle is moving
with a speed of 50 km/hour and it is necessary
to determine depth to a distance of about 10-30
metres at about 10% accuracy (except in the region
immediately surrounding the focus of expansion
(FOE)² where errors in depth are necessarily
high). For correspondence-based methods it can
be proved theoretically [16] that in order to achieve
this accuracy the vehicle must move about 2 m
forward between the processing of successive pairs
of frames. Since 50 km/hr is about 14 m/sec, the
motion processing system should therefore be able
to process a maximum of about 7 pairs of frames
per sec and thus the computation associated with
each pair of frames should take approximately 140
milliseconds.

¹ For better recovery of rotational parameters it is best to
have large field of view cameras with high image resolution.
However, large field of view lenses give rise to various
distortions and lower the effective resolution of the image. Our
choice of camera parameters is typical of commonly available
image processing systems.

² The FOE is the location in the image plane where the
translational component of the image displacement is zero. If the
camera moves straight ahead along its optical axis with no
rotations, then the FOE is at the centre of the image.
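
The arithmetic of the example above can be checked with a few lines of Python. This is only a sketch of the calculation; the function name and argument names are illustrative and do not come from the paper.

```python
def required_frame_pair_rate(speed_kmh, baseline_m):
    """Return (speed in m/s, frame pairs per second, time budget per pair in ms)."""
    speed_ms = speed_kmh * 1000.0 / 3600.0   # km/h -> m/s
    pairs_per_sec = speed_ms / baseline_m    # one frame pair per `baseline_m` of travel
    budget_ms = 1000.0 / pairs_per_sec       # time available for each pair
    return speed_ms, pairs_per_sec, budget_ms


speed_ms, pairs_per_sec, budget_ms = required_frame_pair_rate(50.0, 2.0)
print(f"{speed_ms:.1f} m/s, {pairs_per_sec:.1f} pairs/s, {budget_ms:.0f} ms per pair")
```

With the figures used in the text (50 km/hour and 2 m of travel between pairs) this gives roughly 13.9 m/sec, 6.9 pairs per second and 144 ms per pair, in line with the rounded values of 14 m/sec, 7 pairs and 140 milliseconds.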

In spite of this severe requirement for speed,
the problem of time complexity has not been adequately
emphasized in the area of depth determination.
Almost any reasonable motion algorithm
needs to solve complicated non-linear equations related
to the 5 independent parameters of motion.
The sequential algorithms have concentrated on the
use of approximations and search space reduction
for the solution of these equations. Furthermore,
they have not attempted to compute depth
at more than a few hundred locations in the image.
However, when 7 complete pairs of frames need to
be processed per second and the computation associated
with each frame pair involves many floating
point computations, a parallel design and implementation
is essential. If we could completely parallelize
the problem, theoretically we could achieve a
speedup by a factor of 65,536 (one processor per pixel)
over the equivalent sequential version with a 256 x 256
array of processors.

None of the earlier work in parallel motion processing
[10-15] attempts a comprehensive solution
to the problem of dense depth determination from
a sequence of images. In the last few years a variety
of processor arrays have emerged for solving low
level processing in computer vision [17]. The fine
grained SIMD array computers have proved particularly
versatile for solving such low level tasks.
The AMT DAP series, the Connection Machine
and the lowest level of the IUA are three machines
which embody this computational paradigm. The
SIMD class of machine makes it feasible to attempt
real-time solutions to the dense depth-from-motion
problem. The algorithm presented in this paper is
implemented on the Image Understanding Architecture
(IUA) simulator. The IUA is a three layered
parallel machine specifically designed for image
analysis. The lowest layer of the IUA is the
CAAPP, a two dimensional grid of 1-bit serial processing
elements, operating in the SIMD mode.

A  common  scenario  in  ground-based  navigation 
occurs  when  the  vehicle  moves  forward  by  undergo¬ 
ing  primarily  translational  motion  along  with  small 
rotations.  In  this  case  the  FOE  is  within  the  field  of 
view.  The  algorithm  presented  in  this  paper  is  de¬ 
signed  to  take  advantage  of  this  situation.  It  first 
determines  the  approximate  translation  and  then 
constrains  the  search  for  the  exact  translational  and 
rotational  parameters.  In  contrast  to  other  methods 
which  have  not  demonstrated  their  ability  to  recover 
dense  depth  maps  and  locate  obstacles,  our  algo¬ 
rithm  is  fast,  simple  and  robust. 

We  start  by  providing  a  brief  description  of  the 
IUA which emphasizes the features most pertinent
to  our  application. 


2  The Image Understanding Architecture (IUA)

We provide a brief description of the IUA [18] with
particular emphasis on the lowermost level of the
machine. The IUA is made up of three levels, each
having a particular type of processor:

1.  Low Level consisting of the Content Addressable
Array Parallel Processor (CAAPP).

2.  Intermediate  Level  consisting  of  the  Interme¬ 
diate  Communications  Associative  Processor 
(ICAP). 

3.  High  Level  consisting  of  the  Symbolic  Process¬ 
ing  Array  (SPA). 

The  CAAPP  and  ICAP  levels  are  controlled  by 
a  dedicated  Array  Control  Unit  which  is  directed 
from  the  SPA  level.  The  low  level  processors  are 
ideal  for  fine-grained  SIMD  computing,  whereas  the 
intermediate  and  high  level  processors  are  ideal  for 
Multiple  Instruction  Multiple  Data  (MIMD)  com¬ 
puting.  Our  algorithm  uses  only  the  low  level  of  the 
IUA because of the nature of the task. The low level
or  CAAPP  level  is  a  256  x  256  square  grid  array  of 
custom  1-bit  serial  processors  with  local  memory, 
one-bit  registers,  backing  store,  an  ALU  and  data 
routing  circuitry.  The  bit-serial  processing  elements 
are  linked  through  a  four  way  (North,  South,  East, 
West)  communications  grid.  Intra-level  communi¬ 
cation  within  the  CAAPP  can  take  place  in  several 
ways  [18]. 

3  Depth  from  Image  Displacement 

This  section  discusses  the  mathematical  formula¬ 
tion  for  the  algorithm.  Figure  1  shows  a  right- 
handed  coordinate  system  fixed  with  respect  to  the 
camera. Let us also assume the right hand rule for
rotations and consider the case where the camera
is undergoing motion. As can be seen from Figure 1
the environmental point P, with world coordinate
(X, Y, Z), is projected onto point p, in the
image plane with image coordinates (x, y). Let f
be the focal length of the camera, and denote by
T = (T_1, T_2, T_3), Ω = (Ω_1, Ω_2, Ω_3) the translational
and rotational rigid motion of the camera
(This  implies  that  P'  =  —RP  -  T  where  R  is 
the  rotation  matrix  and  P'  is  the  new  position  of  P 
after  undergoing  rigid  motion  to  the  next  frame). 

We shall use the small rotation³ motion equations,
and for simplicity use the following abbreviations:

    \gamma = 1 + \frac{\Omega_2 x}{f} - \frac{\Omega_1 y}{f}

    J = -\Omega_2 f + \Omega_3 y + x (1 - \gamma)

    K = \Omega_1 f - \Omega_3 x + y (1 - \gamma)

    \alpha = -f T_1 + x T_3

    \beta = -f T_2 + y T_3.

³ This means that the magnitude of rotation |θ| ≪ 1. Also,
sin(θ) ≈ θ and cos(θ) ≈ 1 to order O(θ²). Using the approximation
|θ| ≪ 1 we note that even if |θ| = 0.1 radians (i.e. 6°),
the relative error incurred is 0.2% for sin(θ) and 0.5% for
cos(θ). The small angle assumption is not a restrictive one in
practical situations because large rotations induce such large
image displacements that correspondence algorithms are unable to
handle them reliably.

Figure 1: Coordinate System

With the above abbreviations the image displacement
I, induced by the motion of the camera, is
given by

    I = u \hat{x} + v \hat{y} = (u, v)                      (1)

where \hat{x} and \hat{y} are unit vectors along the x-axis and
y-axis respectively, and

    u = \frac{J + (\alpha / Z)}{\gamma - (T_3 / Z)}         (2)

    v = \frac{K + (\beta / Z)}{\gamma - (T_3 / Z)}          (3)

The depth Z of an environmental point P can be
determined from either equation (2) or equation (3).
We denote by Z_x the depth determined from equation
(2) and by Z_y the depth determined from equation
(3). Hence,

    Z_x = \frac{T_3 u + \alpha}{\gamma u - J}               (4)

    Z_y = \frac{T_3 v + \beta}{\gamma v - K}                (5)

Of  course  for  perfect  data,  these  would  be  equal; 
however,  in  the  presence  of  noise,  they  will  in  gen¬ 
eral  be  unequal. 

It is also possible to write the depth in terms of
the displacement vector I and the motion parameters
by using equations (1), (2), and (3). However,
the resulting expressions are cumbersome to manipulate.
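
Equations (4) and (5) can be evaluated directly once the abbreviations are known. The following is a minimal sketch, not the authors' implementation. It assumes the small-rotation model P' = P - Ω × P - T and the expansions J = -Ω_2 f + Ω_3 y + x(1 - γ) and K = Ω_1 f - Ω_3 x + y(1 - γ), which follow from projecting that model; both the model and the function name are assumptions made for illustration.

```python
import numpy as np


def directional_depths(x, y, u, v, T, omega, f):
    """Depths Z_x and Z_y from equations (4) and (5).

    ASSUMPTION: gamma, J and K are the small-rotation expansions obtained
    from P' = P - omega x P - T; they are not copied from the paper.
    """
    T1, T2, T3 = T
    O1, O2, O3 = omega
    gamma = 1.0 + (O2 * x - O1 * y) / f
    J = -O2 * f + O3 * y + x * (1.0 - gamma)
    K = O1 * f - O3 * x + y * (1.0 - gamma)
    alpha = -f * T1 + x * T3
    beta = -f * T2 + y * T3
    Zx = (T3 * u + alpha) / (gamma * u - J)   # equation (4)
    Zy = (T3 * v + beta) / (gamma * v - K)    # equation (5)
    return Zx, Zy


# Synthetic check: project a point before and after a known motion.
f = 500.0
P = np.array([1.0, 2.0, 10.0])
T = np.array([0.2, -0.1, 1.0])
omega = np.array([0.01, -0.02, 0.005])
P_new = P - np.cross(omega, P) - T
x, y = f * P[0] / P[2], f * P[1] / P[2]
x_new, y_new = f * P_new[0] / P_new[2], f * P_new[1] / P_new[2]
Zx, Zy = directional_depths(x, y, x_new - x, y_new - y, T, omega, f)
print(Zx, Zy)
```

For noise-free synthetic data generated with the same model, both directional depths recover the original depth of the point, which is the "perfect data" case mentioned in the text.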

In this experimental scenario the vehicle is moving
forward into the scene. The motion is mostly
translational with small rotations. For this kind
of “approximate translational motion” the approximate
location of the focus of expansion [19, 20] can
be found quite easily. This gives an initial approximate
estimate for motion parameters. If the focal
length of the camera is unknown, but the focus of
expansion is known, then the time-to-adjacency [19]
relationship is sometimes used to compute rough
depth from the approximate estimate for motion
parameters. The time-to-adjacency relationship
gives

    \frac{Z}{T_3} = \frac{D}{d}                             (6)

where

•  D = distance in pixels of point p_i from the
FOE.

•  d = displacement of the point p_i.
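
The time-to-adjacency relationship of equation (6) amounts to a ratio of two pixel distances. The sketch below is illustrative only; the function name and the way points are passed in are assumptions, and the result is the ratio Z / T_3, so a depth in physical units still requires the forward translation T_3 between frames.

```python
import math


def time_to_adjacency(point, displaced_point, foe):
    """Return D / d of equation (6), i.e. an estimate of Z / T_3."""
    D = math.hypot(point[0] - foe[0], point[1] - foe[1])
    d = math.hypot(displaced_point[0] - point[0], displaced_point[1] - point[1])
    if d == 0.0:
        raise ValueError("zero displacement: depth cannot be estimated")
    return D / d


# A point 30 pixels from the FOE that moves 3 pixels between frames.
ratio = time_to_adjacency((100.0, 55.0), (103.0, 55.0), (70.0, 55.0))
rough_depth_feet = ratio * 2.0   # assuming 2 feet of forward motion per frame
print(ratio, rough_depth_feet)
```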

To  find  better  depths  a  more  elaborate  scheme 
is  needed.  From  equations  (4)  and  (5)  we  can  see 
that  there  are  two  ways  of  determining  depth  for 
a  particular  set  of  motion  parameters  and  image 
displacement.  Zx  is  the  depth  that  is  determined 
from  the  x-component  (u)  of  the  image  displace¬ 
ment  and  Zy  is  the  depth  that  is  determined  from 
the  y-component  (v)  of  the  image  displacement.  As 
u  becomes  close  to  0,  Zx  becomes  hard  to  deter¬ 
mine.  Similarly,  as  v  becomes  close  to  0,  Zy  be¬ 
comes  hard  to  determine.  When  the  movement  of 
the  vehicle  is  mostly  forward  a  major  portion  of  the 
image  has  significant  magnitudes  of  both  u  and  v 
and  either  Zx  or  Zy  can  give  good  depths. 

For any point i in the image the depth Z_i is computed
as the average of Z_x and Z_y (except in pathological
cases where either u or v is zero). It is possible
to arrive at an estimate of the reliability of Z_i
by noting the difference of Z_x and Z_y from each
other. The reliability ζ is defined as

where

    U(x, y) = | x + y |    if x < 0 and y < 0
            = | x - y |    otherwise                        (8)

The reliability scale varies from 0 to √2 (no matter
what the values of the two depths) with 0 being
the best reliability. Qualitatively, we are testing
how well the depths computed from the two components
of the displacement vectors match. If both
depths are positive and equal then ζ is zero (i.e.
high reliability). If both depths are negative
then ζ is high (low reliability). In general, when
one depth is positive and the other depth is negative
then reliability is poor, although not as poor
as the case when both are negative.

Let ζ_i be the reliability for depth obtained at
point i in the image. Let n be the total number of
displacement vectors. Z_{x_i} and Z_{y_i} are the depths
computed by using the x and y-components respectively
of the i-th displacement vector and the hypothesized
values of the motion parameters. Then

    \kappa = \frac{1}{n} \sum_{i=1}^{n} \zeta_i             (9)

The motion parameters for which κ is minimum
are the best set of motion parameters. Unlike several
other approaches this optimization criterion allows
us to avoid the rather difficult problem of eliminating
the depths from the error functions (which has
to be done somehow when the deviation between
the actual and predicted image displacements must
be minimized). We call the error function κ,
given by equation (9), the normalized absolute
deviation in directional depths. The nice property
of κ is that it varies between 0 and √2. The lower
the value of κ the better the minimization.
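
The behaviour of the reliability measure and of κ can be illustrated with a short sketch. Two points are assumptions rather than statements of the paper: U is normalized here by the Euclidean norm of the two depths (chosen only because it reproduces the stated range of 0 to √2), and κ is taken as the mean of the per-point reliabilities. The function names are hypothetical.

```python
import numpy as np


def reliability(zx, zy):
    """Per-point reliability: 0 is best, sqrt(2) is worst.

    ASSUMPTION: U(zx, zy) is divided by sqrt(zx**2 + zy**2); this
    normalization is a guess that matches the stated 0..sqrt(2) range.
    Points where both depths are exactly zero are not handled.
    """
    zx = np.asarray(zx, dtype=float)
    zy = np.asarray(zy, dtype=float)
    both_negative = (zx < 0) & (zy < 0)
    U = np.where(both_negative, np.abs(zx + zy), np.abs(zx - zy))
    return U / np.hypot(zx, zy)


def kappa(zx, zy):
    """ASSUMPTION: mean reliability over all displacement vectors."""
    return float(np.mean(reliability(zx, zy)))


print(reliability(10.0, 10.0))    # equal positive depths
print(reliability(-5.0, -5.0))    # both depths negative
print(kappa([10.0, -5.0], [10.0, -5.0]))
```

Under these assumptions equal positive depths score 0, equal negative depths score √2, and κ stays within the same range because it is an average of values in that range.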

We  follow  this  section  with  the  algorithm  for 
depth  determination. 

4  Depth  Determination  Algorithm 

The  objective  is  to  recover  reliable  depths  of  envi¬ 
ronmental  points  over  as  large  a  part  of  the  image 
as  possible.  The  parallel  algorithm  works  in  the 
following  stages: 

1.  Determination  of  image  correspondence. 

2.  Selection  of  the  best  image  displacements  be¬ 
tween  frames. 

3.  Determination  of  the  approximate  transla¬ 
tional  motion  parameters. 

4.  Determination  of  the  exact  translational  and 
rotational  motion  parameters. 

5.  Depth  determination. 

4.1  Image  Correspondence 

The  algorithm  for  establishing  image  correspon¬ 
dence  takes  as  input  two  temporally  displaced  im¬ 
ages  and  the  maximum  possible  displacement  at 
any  pixel.  Restricting  the  maximum  displacement 
does  not  significantly  limit  the  efficacy  of  the  imple¬ 
mentation  since  it  is  usually  possible  to  predict  it  in 
advance.  The  row  and  column  displacements  at  ev¬ 
ery  pixel  are  computed  through  correlation  match- 
ing. 

The parallel algorithm for computing image displacements
works as follows:

Parallel Correlation

  Store Frame-1 and Frame-2 of the image sequence in
  local processing element (PE) memory.

  FOR hypothesized displacements within the maximum
      displacement range DO
  BEGIN

    Shift each pixel in Frame-2 by the negative of the
    hypothesized displacement.

    Compute the sum-of-absolute-differences correlation
    between the pixels of Frame-1 and the shifted
    Frame-2 in a spiral pattern [21].

    If the correlation at a pixel is the minimum obtained
    until now, then store the hypothesized displacement
    and the current minimum correlation value in the
    local PE memory.

  END

  After all hypothesized displacements have been
  considered, the displacement corresponding to the
  minimum correlation at each pixel is the image
  displacement at that pixel. The value of the
  correlation gives a measure of the reliability of
  the match.


It  should  be  noted  that  using  sum-of-absolute  dif¬ 
ferences  rather  than  the  sum-of-squared  differences 
as  the  correlation  measure  saves  time  by  avoid¬ 
ing  multiplications.  The  repeated  correlation  com¬ 
putations  are  the  most  computationally  intensive 
part  of  the  algorithm  and  this  is  an  important  ef¬ 
ficiency  consideration.  Integer  representations  are 
also  preferable  because  floating  point  computations 
are  costly  in  a  bit-serial  processor. 
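
The structure of the parallel correlation can be imitated on a conventional machine with whole-array operations, where each array element plays the role of one PE. This is a sketch under stated assumptions: NumPy's `np.roll` (which wraps around at the image border) stands in for the grid shifts of the CAAPP, a plain square window replaces the spiral accumulation pattern of [21], and the function names and the `radius` parameter are illustrative.

```python
import numpy as np


def box_sum(a, radius):
    """Sum of `a` over a (2*radius+1)^2 window, zero-padded at the borders."""
    padded = np.pad(a, radius, mode="constant")
    out = np.zeros_like(a)
    h, w = a.shape
    for dr in range(2 * radius + 1):
        for dc in range(2 * radius + 1):
            out += padded[dr:dr + h, dc:dc + w]
    return out


def sad_displacements(frame1, frame2, max_disp, radius=2):
    """Per-pixel (row, column) displacement minimizing the windowed SAD."""
    f1 = frame1.astype(np.int32)   # integer arithmetic only, as in the text
    f2 = frame2.astype(np.int32)
    best = np.full(f1.shape, np.iinfo(np.int32).max, dtype=np.int32)
    best_dr = np.zeros(f1.shape, dtype=np.int32)
    best_dc = np.zeros(f1.shape, dtype=np.int32)
    for dr in range(-max_disp, max_disp + 1):
        for dc in range(-max_disp, max_disp + 1):
            # Shift Frame-2 by the negative of the hypothesized displacement.
            shifted = np.roll(f2, shift=(-dr, -dc), axis=(0, 1))
            sad = box_sum(np.abs(f1 - shifted), radius)
            better = sad < best            # "minimum obtained until now"
            best[better] = sad[better]
            best_dr[better] = dr
            best_dc[better] = dc
    return best_dr, best_dc, best


rng = np.random.default_rng(0)
frame1 = rng.integers(0, 256, size=(32, 32))
frame2 = np.roll(frame1, shift=(2, -1), axis=(0, 1))   # known displacement
rows, cols, score = sad_displacements(frame1, frame2, max_disp=3)
```

The final minimum SAD value returned per pixel is the quantity the text uses as the reliability of the match.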

4.2  Selection  of  best  image  displacements 

The  coterie  network  of  the  CAAPP  is  used  to  par¬ 
tition  the  image  into  equal  divisions  [18].  For  ex¬ 
ample,  in  one  experiment  the  256  x  256  image  was 
divided  into  64  sub-regions  of  32  x  32  pixels  each. 
In  each  sub-region  the  pixel  which  has  the  most 
reliable  displacement  vector  is  selected  in  parallel. 
Hence  there  are  as  many  selected  pixels  as  there 
are  sub-regions.  The  displacement  vectors  at  these 
selected  pixels  are  called  “selected”  displacement 
vectors.  The  FOE  and  motion  parameters  are  de¬ 
termined  with  the  selected  displacement  vectors. 
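
The selection step can be sketched with array reshaping in place of the coterie network. The sketch assumes that the pixel with the lowest correlation (SAD) value in a sub-region is the most reliable one, and that the image dimensions are multiples of the block size; the function name is illustrative.

```python
import numpy as np


def select_best_per_block(sad, block=32):
    """Row and column of the lowest-SAD pixel in every block x block sub-region."""
    h, w = sad.shape
    bh, bw = h // block, w // block
    # Index order after the reshapes: [block row, block col, pixel within block].
    blocks = sad.reshape(bh, block, bw, block).swapaxes(1, 2)
    blocks = blocks.reshape(bh, bw, block * block)
    flat = blocks.argmin(axis=2)
    rows = np.arange(bh)[:, None] * block + flat // block
    cols = np.arange(bw)[None, :] * block + flat % block
    return rows, cols


sad = np.full((64, 64), 100, dtype=np.int32)
sad[5, 40] = 0                      # best match inside sub-region (0, 1)
rows, cols = select_best_per_block(sad, block=32)
```

For a 256 x 256 image and 32 x 32 sub-regions the two returned arrays have shape 8 x 8, i.e. 64 selected pixels, as in the experiment described above.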

4.3  Approximate  Translation 

In  dynamic  imaging  situations  where  the  sensor  is 
undergoing  primarily  translational  motion  with  a 
relatively  small  rotational  component,  approximate 
translational  motion  algorithms  may  be  effective  in 
determining  approximate  depth  [20].  By  restrict¬ 
ing  the  processing  to  the  two  dimensions  of  trans¬ 
lational  motion,  there  is  a  tremendous  reduction 
in  complexity  from  the  five  dimensions  (excluding 
the  scaling  component  of  sensor  velocity)  of  general 
motion. 

The FOE recovery algorithm works as follows.
First it draws a line through each selected displacement
vector in the image. Let these lines be called
“extended” displacement vectors. Then it finds all
the possible intersections of the extended displacement
vectors and votes in a Hough transform array
corresponding to each intersection [22]. The parameter
space for the Hough transform is given by
the image coordinates where the extended displacement
vectors can intersect each other. The number
of votes for each intersection is an increasing function
of the length and confidences of the displacement
vectors forming the intersection. This ensures
that longer displacement vectors are more heavily
weighted and that more reliable vectors are also
weighted more heavily. The point where the maximum
number of intersections occurs is the approximate
FOE. A contiguous region which surrounds
the approximate FOE and includes at least a fraction
p of the votes is the region where the exact FOE
is likely to be present.

The  parameter  space  for  the  Hough  transform 
is  spread  over  all  the  processing  elements  of  the 
CAAPP. The intersections have an x-coordinate
and  a  y-coordinate.  Since  the  CAAPP  is  a  two 
dimensional  array,  it  is  easy  to  map  each  possible 
intersection  into  a  distinct  PE.  The  local  memory 
of  each  PE  contains  the  number  of  votes  for  an  ap¬ 
proximate  FOE.  The  contiguous  region  where  the 
exact  FOE  might  be  located  is  determined  by  sum¬ 
ming  up  in  parallel  the  votes  gathered  by  neighbors 
[22]. 
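
A sequential sketch of the voting scheme is given below. It is not the CAAPP implementation: the accumulator is an ordinary array with one cell per pixel, and each intersection receives the product of the two vectors' weights, which is only one possible "increasing function" of length and confidence. The function and parameter names are assumptions.

```python
import numpy as np


def approximate_foe(points, disps, weights, shape):
    """Vote for pairwise intersections of extended displacement vectors.

    points, disps: sequences of (x, y) positions and (u, v) displacements.
    weights: one weight per vector (ASSUMPTION: e.g. length times confidence).
    shape: (rows, columns) of the accumulator, one cell per pixel.
    """
    votes = np.zeros(shape, dtype=float)
    n = len(points)
    for i in range(n):
        for j in range(i + 1, n):
            (x1, y1), (u1, v1) = points[i], disps[i]
            (x2, y2), (u2, v2) = points[j], disps[j]
            det = u1 * v2 - v1 * u2
            if abs(det) < 1e-9:          # parallel lines never intersect
                continue
            # Solve p1 + s * d1 = p2 + t * d2 for s.
            s = ((x2 - x1) * v2 - (y2 - y1) * u2) / det
            col = int(round(x1 + s * u1))
            row = int(round(y1 + s * v1))
            if 0 <= row < shape[0] and 0 <= col < shape[1]:
                votes[row, col] += weights[i] * weights[j]
    row, col = np.unravel_index(votes.argmax(), votes.shape)
    return (int(col), int(row)), votes


# Radial flow expanding from (70, 55): every extended vector passes through it.
true_foe = np.array([70.0, 55.0])
points = [(100.0, 55.0), (70.0, 100.0), (120.0, 105.0), (20.0, 30.0)]
disps = [tuple(0.1 * (np.array(p) - true_foe)) for p in points]
weights = [float(np.hypot(*d)) for d in disps]
foe, votes = approximate_foe(points, disps, weights, shape=(128, 128))
```

The contiguous region containing a given fraction of the votes can then be grown around the returned cell by summing neighbouring accumulator values.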


4.4  Exact  Translation  and  Rotation 

Once  the  region  in  which  the  exact  FOE  can  lie  is 
determined  the  exact  translational  and  rotational 
parameters  can  be  computed  by  the  optimization 
method  stated  in  equation  (9). 

For example let the approximate FOE be at
(70, 55) with the exact FOE lying within 10 pixels
of the approximate FOE. Then a square whose sides
are 20 pixels is formed with (70, 55) as the intersection
point of its two diagonals. Then FOEs are
hypothesized at each pixel bounded by the square
(i.e. starting at (60, 45) and ending at (80, 65)).
At each hypothesized FOE the normalized absolute
deviation in directional depths, κ, is computed for
the three dimensions of rotations. The rotations
corresponding to the minimum κ are the optimal
rotational parameters corresponding to the hypothesized
FOE. The minimum κ is determined among
all the hypothesized FOEs. The translation and
rotation corresponding to the minimum κ are the exact
motion parameters.
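
The search described above can be sketched as two nested loops over hypothesized FOEs and candidate rotations. The sketch makes several assumptions: image coordinates are measured from the principal point, the translation direction is taken as (x_FOE, y_FOE, f) normalized to unit length (the point where α and β vanish), the rotation candidates are supplied as a finite list, and `cost` is any callable returning κ for a given translation and rotation. All names are illustrative.

```python
import itertools

import numpy as np


def translation_from_foe(foe, f):
    """Unit translation direction whose FOE is `foe` (x = f*T1/T3, y = f*T2/T3)."""
    t = np.array([foe[0], foe[1], f], dtype=float)
    return t / np.linalg.norm(t)


def search_exact_motion(approx_foe, half_width, rotations, f, cost):
    """Return (minimum cost, FOE, rotation) over the square around approx_foe."""
    best = (np.inf, None, None)
    x0, y0 = approx_foe
    xs = range(x0 - half_width, x0 + half_width + 1)
    ys = range(y0 - half_width, y0 + half_width + 1)
    for x, y in itertools.product(xs, ys):
        T = translation_from_foe((x, y), f)
        for omega in rotations:
            k = cost(T, omega)
            if k < best[0]:
                best = (k, (x, y), omega)
    return best


# Toy cost with a known minimum at FOE (72, 50) and rotation (0.01, 0, 0).
f = 500.0
rotations = [(0.0, 0.0, 0.0), (0.01, 0.0, 0.0), (0.02, 0.0, 0.0)]


def toy_cost(T, omega):
    return (abs(f * T[0] / T[2] - 72.0) + abs(f * T[1] / T[2] - 50.0)
            + abs(omega[0] - 0.01))


k_min, foe, omega = search_exact_motion((70, 55), 10, rotations, f, toy_cost)
```

In the algorithm of the paper the cost would be κ evaluated on the selected displacement vectors; the toy cost above exists only to exercise the search.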

4.5  Depth  Determination 

Once  the  motion  parameters  are  obtained  the  depth 
at  each  point  in  the  image  can  be  found  by  using 
the  image  displacements  and  the  intrinsic  camera 
parameters.  Since  each  pixel  of  the  image  is  rep¬ 
resented  by  one  processing  element  in  the  CAAPP, 
this  stage  is  totally  parallelized.  The  equations  used 
for  this  purpose  have  been  described  in  Section  3. 


Figure 2: First frame of the Sequence.

Figure 3: Computed depth map.

Object      True depth    Computed Depth
            (feet)        Sample #1   Sample #2   Sample #3

cone1       21            22          23          25
cone2       36            35          44          33
cone3       56            45          53          54
cone4       56            54          55          56
can         46            41          44          43
cone5(*)    76            96          66          70
cone6       76            62          67          55
Tree1       -             50          51          51
Tree2       -             58          59          57
Tree3       -             90          91          91

Table 1: Depth values of some environmental
points. (*) Cone5 is near the FOE and its depths
are erroneous.
5  Experiments

The  algorithm  for  parallel  depth  determination  was 
coded on the IUA simulator. This section contains
the  results  obtained  on  several  image  sequences. 

Carnegie Mellon NAVLAB sequence - The
first set of experiments used a sequence of twenty
images. The images were collected on the Carnegie
Mellon NAVLAB. The first image of the sequence is
shown in Figure 2. The vehicle was made to move
in an approximately straight line such that the distance
between frames was 2 feet. The field of view
of the camera was 45° and 256 x 256 images
were collected. In order to determine the ground
truth for environmental objects, traffic cones and
a can were placed at measured distances (ranging
from 21 to 76 feet). Obviously with a total movement
of the vehicle of 40 feet, some of the cones
disappeared from the scene in later frames. Figure
2 shows environmental objects whose depths had
been hand measured (black dots), along with the
rest of the scene with objects whose distances were
not measured (e.g. trees). It should be noted that
in general the scene is quite complex because of the
presence of large homogeneously textured regions
like road and grass, and the occlusion of the distant
buildings by trees.

The quantitative results for the known environmental objects and some
unknown objects are shown in Table 1. Three visually selected pixels
were marked on each object. The depth values returned by the
algorithm at these pixels are recorded in Table 1. These recordings
are referred to as “samples” in the table. From the table, the
average error in depth for the known objects is computed to be about
8%. This corresponds quite well to the theoretical limits on depth
determination presented in our previous work [16].
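As a rough check on the 8% figure, the error can be recomputed from
the entries of Table 1. The sketch below assumes that the error for
an object is the absolute difference between the mean of its three
samples and its measured depth, divided by the measured depth. The
text does not spell out the exact formula, nor whether cone5 was
excluded, so both variants are computed. The names TABLE_1,
relative_error and average_error are illustrative only.

```python
# Table 1 as printed: object -> (measured depth in feet, three samples).
TABLE_1 = {
    "cone1": (21, (22, 23, 25)),
    "cone2": (36, (35, 44, 33)),
    "cone3": (56, (45, 53, 54)),
    "cone4": (56, (54, 55, 56)),
    "can":   (46, (41, 44, 43)),
    "cone5": (76, (96, 66, 70)),  # near the FOE; flagged as erroneous
    "cone6": (76, (62, 67, 55)),
}


def relative_error(true_depth, samples):
    """Relative error of the mean sample against the measured depth."""
    mean_depth = sum(samples) / len(samples)
    return abs(mean_depth - true_depth) / true_depth


def average_error(names):
    """Average of the per-object relative errors (assumed definition)."""
    errors = [relative_error(*TABLE_1[name]) for name in names]
    return sum(errors) / len(errors)


all_known = list(TABLE_1)
without_cone5 = [name for name in TABLE_1 if name != "cone5"]

print(f"all known objects: {100 * average_error(all_known):.1f}%")
print(f"excluding cone5:   {100 * average_error(without_cone5):.1f}%")
```

Under these assumptions the two averages come out at roughly 7.8% and
8.8%, both consistent with the quoted figure of about 8%.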

Note: The mean depth of each known object is the average of the three
samples shown in Table 1.

Note: The correspondences in the experimental section are from a
hierarchical correlation-based algorithm which has not been
completely parallelized on the IUA. The completely parallel IUA
implementation gives somewhat inferior results for depth.

The depth map obtained by using the algorithm between the first and
third frames of the sequence is shown in Figure 3. Complete
separation of obstacles (e.g. cones, can) is not possible from the
depth map because part of the surroundings of an obstacle in the
depth map is at almost the same depth as the obstacle. This is very
different from an intensity image, where all regions of an obstacle
usually have a different image intensity from their surroundings. The
darker the color, the closer the environmental point is to the
camera. (Since the gray-level representation of depth on a printed
page is poor, the results will be presented in various
representations. On a high resolution display device the depth maps
appear a lot better to the human eye.) It should be noted that white
on the depth map indicates points at which depths are over 120 feet.
Black indicates points where depth cannot be computed reliably (e.g.
the periphery of the image that is not visible in both images, or
points where displacement vectors are small or erroneous, thereby
giving rise to wrong depths). Parts of the cones, can and trees stand
out in the depth map in subsequent representations.
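The display convention just described (darker is closer, white beyond
120 feet, black where depth is unreliable) can be sketched as a
simple mapping from depth to gray level. The linear ramp and the
reservation of levels 0 and 255 are assumptions made for this
illustration; the paper does not state how the intermediate gray
levels were assigned.

```python
WHITE, BLACK = 255, 0
MAX_DEPTH_FEET = 120.0  # depths beyond this are drawn white, as in the text


def depth_to_gray(depth_feet):
    """Map one depth value to an 8-bit gray level.

    None (or a non-positive value) stands for "depth not reliable".
    """
    if depth_feet is None or depth_feet <= 0:
        return BLACK
    if depth_feet > MAX_DEPTH_FEET:
        return WHITE
    # Assumed linear ramp over levels 1..254, so that valid depths never
    # collide with the two reserved levels.
    return 1 + round(253 * depth_feet / MAX_DEPTH_FEET)


row = [None, 21.0, 46.0, 76.0, 150.0]
print([depth_to_gray(d) for d in row])
```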

Figure 4(a)-(l) shows the two frames, some intermediate results and
several depth maps at different regions of the image. An explanation
is necessary for the legends which illustrate the depth maps. The
legends are histograms of depth maps. The x-axis shows the depth
values and the y-axis indicates the number of pixels with a
particular depth value. The shading of the histogram corresponds to
the shading of the depth map. For example, in Figure 4(e) (whose
histogram is shown in Figure 4(f)), the lightest regions show depths
of over 80 feet. The tops of the cones and the can stand out in the
depth maps. Observers unaccustomed to seeing depth images often
attempt to compare them with intensity images and draw erroneous
conclusions. In the case of a cone standing upright on the ground in
3-D, the 2-D image will have the following characteristics:

•  The  top  of  the  cone  is  surrounded  in  the  image 
by  locations  which  are  much  further  away  than 
the  cone. 

•  The  bottom  of  the  cone  has  almost  the  same 
depth  as  the  ground  in  front  of  it  and  to  its 
sides. 

Hence, in the depth map the bottoms of the cones and the can merge
with the surrounding ground whereas the tops clearly stand out from
their surroundings. Virtually all the major obstacles at least
partially stand out in a magnified depth map. We have not magnified
all the obstacles in Figure 4 because of space limitations.
Nevertheless, it is quite clear from the figures that portions of the
two trees, the can and the cones clearly stand out in the depth map.
Thus, in addition to the quantitative depths, the obstacles are also
detected. It can be seen that even small obstacles like the cones and
the can are detected quite robustly by this algorithm to a distance
of about 60 feet. Larger structures like trees can obviously be
detected more easily even when they are quite far away.

Autonomous Land Vehicle sequence - A second experiment was done on a
sequence collected via the Autonomous Land Vehicle. The data
collection process is detailed in [23]. Preliminary results on this
sequence are shown in Figure 5. It can be seen that the mountain,
which is rather far away, is clearly identified.

UMass Denning Robot sequence - A third experiment was done on a
sequence collected via the Denning robot at the University of
Massachusetts at Amherst. This sequence was taken indoors under poor
lighting conditions. The robot moved 1.95 feet between frames. The
results for the depth maps are shown in Figure 5.

Figure 4 panels:
(a) First image.
(b) Second image.
(c) Image displacements.
(d) Hough Transform.
(e) Depth map of full image with legend in fig. f.
(f) Legend for fig. e.
    Note the gradual increase in depth with the trees standing out.
(g) Magnified can from fig. a.
(h) Depth map of fig. g.
(i) Legend for fig. h.
    The top of the marked can stands out whereas the bottom merges
    with the ground.
(j) Magnified cones from fig. a.
(k) Depth map of fig. j.
(l) Legend for fig. k.
    The tops of the three cones stand out whereas the bottoms merge
    with the ground.

Timings 

Before presenting the timings let us caution the reader that these
are results obtained by running our algorithm on the simulator of an
experimental parallel machine under development. For the Carnegie
Mellon NAVLAB sequence the PE cycles taken for the various stages are
as follows:

Stage                                      PE cycles

Correspondence stage                         5315983
Trans. stage after vector selection            78513
Trans. and Rot. stage with depth               35318

Sum of the above three stages                5429814


This gives us an estimate of about 0.54 sec. for running the
algorithm on an IUA running at 10 MHz. The majority of the time in
our algorithm is consumed by the computation of correspondences in
searching over a 41 x 41 window for possible displacements. Reducing
the size of the search window, for example by assuming constraints on
the translational magnitude and direction, along with some
rudimentary knowledge of depth, can make the algorithm much faster.
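The arithmetic behind the 0.54 sec. estimate, and the reason the
search window dominates, can be reproduced in a few lines. The cycle
counts and the 10 MHz clock are taken from the text; the 21 x 21
window is a hypothetical smaller window chosen only to show how the
number of candidate displacements scales.

```python
# PE cycle counts as reported for the NAVLAB sequence.
cycles = {
    "correspondence": 5315983,
    "translation after vector selection": 78513,
    "translation and rotation with depth": 35318,
}
CLOCK_HZ = 10_000_000  # 10 MHz, as stated in the text

total_cycles = sum(cycles.values())
seconds = total_cycles / CLOCK_HZ
correspondence_share = cycles["correspondence"] / total_cycles


def candidates(window_side):
    """Number of candidate displacements in a square search window."""
    return window_side * window_side


print(total_cycles, round(seconds, 2), round(correspondence_share, 3))
print(candidates(41), candidates(21))  # 21 x 21 is a hypothetical window
```

The correspondence stage accounts for about 98% of the cycles, and a
41 x 41 window contains 1681 candidate displacements per pixel, so
any constraint that shrinks the window reduces the dominant cost
roughly in proportion to the window area.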

It should be noted that with sophisticated mechanical devices like
land navigation systems, gyroscopes, inertial navigation systems etc.
it is often possible to constrain the estimates of the motion
parameters. If such knowledge is available then this algorithm can be
sped up considerably because the range for hypothesized FOEs and the
range of rotational parameters become smaller.

Note: The timings given are estimates and are subject to change. The
estimates include only the time taken by the CAAPP. The time required
for front-end processing is not available. Since the rotation
computation part is still in the process of development, there are
some sequential aspects involving front-end processing. This will be
reduced in subsequent implementations.

6 Conclusions

This research demonstrates the feasibility of fast depth
determination through the use of motion algorithms under the SIMD
mode of processing. The algorithm is relatively simple because it has
been designed specifically for the case of a vehicle moving mostly
ahead with small rotations, a common scenario in navigation. Even
though it is algorithmically simple, the depth computations are
robust. Furthermore, there is usually no manual selection among
multiple minima for computing motion parameters. Some widely used
algorithms (like that proposed by Horn [4]) frequently require manual
selection of the optimal motion parameters because of the presence of
multiple minima. In this algorithm only the region around the
approximately determined FOE is searched for computing optimal motion
parameters. This does not normally give rise to multiple minima for
small rotations.

The system can compute approximate depth even when the focal length
of the camera is not known. This is done by using the
time-to-adjacency relationship after determining the approximate FOE.
The drawback of the system is that the large number of floating point
operations is expensive for a SIMD machine. However, the large number
of processors more than compensates for this. The algorithm is being
refined to obtain faster and better correspondences. Methods for
handling larger rotations are also being investigated.
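To illustrate why the focal length is not needed, the following
sketch uses the time-to-adjacency relation in its usual
pure-translation form, Z / dZ = D2 / (D2 - D1), where D1 and D2 are
the image distances of a point from the FOE in the two frames, dZ is
the translation along the optical axis, and Z is the depth at the
first frame. The equations of Section 3 are not reproduced in this
part of the paper, so this form, the assumption that rotation has
already been compensated, and all function names are assumptions of
the sketch rather than the authors' implementation.

```python
import math


def distance_from_foe(foe, point):
    """Image distance (pixels) between a point and the focus of expansion."""
    return math.hypot(point[0] - foe[0], point[1] - foe[1])


def depth_at_first_frame(foe, p1, p2, forward_motion):
    """Depth from the time-to-adjacency relation; no focal length needed.

    Returns None when the point barely moves away from the FOE, in
    which case the depth cannot be computed reliably.
    """
    d1 = distance_from_foe(foe, p1)
    d2 = distance_from_foe(foe, p2)
    delta_d = d2 - d1
    if delta_d < 1e-6:
        return None
    return forward_motion * d2 / delta_d


def project(point_3d, focal_length, foe):
    """Pinhole projection, used here only to synthesize a check case."""
    x, y, z = point_3d
    return (foe[0] + focal_length * x / z, foe[1] + focal_length * y / z)


foe = (128.0, 128.0)
depths = []
for focal_length in (300.0, 500.0):  # two arbitrary focal lengths
    p1 = project((3.0, 1.0, 50.0), focal_length, foe)
    p2 = project((3.0, 1.0, 48.0), focal_length, foe)  # moved 2 feet ahead
    depths.append(depth_at_first_frame(foe, p1, p2, 2.0))
print(depths)
```

Both focal lengths give the same depth of 50 feet up to rounding,
because the focal length cancels in the ratio of image distances.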

To summarize, the key contribution of this system is that it is able
to recover dense depth maps and locate obstacles quickly, simply and
robustly. With reasonable assumptions, the algorithm can run in real
time on the IUA.

Abstract

Experimental systems which must function in real-time application
domains place significant constraints on the underlying
infrastructure and support tools and on the design of these tools.
One of the more important tradeoffs involves efficiency versus
flexibility. This paper discusses the design philosophy of MoPLSE -
the Mobile Perception Laboratory Software Environment designed to
support real-time autonomous navigation.

1.  Introduction 

The convergence of computer vision and robotics has created the need
for a new kind of software environment which supports real-time
computing. The need is becoming particularly acute as applications in
autonomous vehicles and flexible manufacturing become more
sophisticated. Although development environments for computer vision,
such as KBVision™, ImageCalc and Khoros (and soon the DARPA IUE) do
exist, these environments are not well-suited to real-time
applications. Conversely, robotic environments, such as VxWorks, do
not supply much support for computer vision.

This paper describes the preliminary design of a portion of MoPLSE,
the software environment supporting computer vision research on board
the University of Massachusetts' Mobile Perception Laboratory (MPL;
see below). More important than MoPLSE itself, however, are the
design decisions that underlie it. Designing a software environment
for real-time vision involves cost/benefit tradeoffs. On the one
hand, because the environment is for research, it should be general
enough to support not only currently available vision algorithms, but
unanticipated future advances as well. On the other hand, one cannot
afford a profligate generality: mobile robots and autonomous vehicles
must act and react in real time, so their software environments must
be highly efficient.

The challenge of MoPLSE, therefore, was to build an environment that
satisfies the needs of the growing real-time vision community without
getting "bogged down" with rarely-used features or excessive
overhead. This required making hard choices about what features are
necessary for real-time vision. To some extent, these decisions
depended on our application and on our general approach to computer
vision. We therefore describe MoPLSE within the context of MPL and
its algorithms. In a larger sense, however, the intuitions behind
MoPLSE are based on fifteen years of experience in building
experimental vision systems at the Univ. of Massachusetts [Draper
89], and in building tools to support that research [Kohler 82,
Brolio 89]. MoPLSE differs from previous vision environments because
the constraints of real-time vision forced us to prioritize needs
more strongly than in previous systems, and although the details of
MoPLSE will probably change over time, the need to prioritize will
remain.

2.  The  Mobile  Perception  Laboratory 

The Mobile Perception Laboratory (MPL) is a modified U.S. Army HMMWV
with computer-controlled steering, acceleration and braking. MPL was
built as part of DARPA's Unmanned Ground Vehicle (UGV) project to
test algorithms for directing an autonomous military vehicle under
real-world conditions. Because of the generality of the autonomous
operation task, the algorithms being tested on the MPL include many
of the basic visual tasks of interest to the image understanding
community. In particular, the MPL was designed to support UMass
research in landmark-based navigation [Fennema 91, Kumar 92, Thomas
93a,b, Dutta 93], obstacle detection and automatic model acquisition
[Sawhney 92a,b, 93], and massively parallel computation on the Image
Understanding Architecture [Weems 89]. The landmark navigation task
requires that the MPL recognize landmarks (stationary recognizable
features and objects) from a map and determine the position and
orientation of the vehicle relative to the landmark (i.e. pose
determination).

The obstacle detection task, whether on-road or off-road, requires
that MPL model the surfaces in front of the vehicle and determine
whether they are flat enough to traverse. The model acquisition task
has two subtasks. The first is to automatically acquire new landmarks
from an initial (and possibly sparse) set of landmarks. The second is
to improve the vehicle's map of an area as it drives through (e.g.
for use by subsequent vehicles) or, in the extreme case, to build up
a map of an area from scratch. In addition, MPL also serves as a
secondary test vehicle for research being conducted at other sites on
road following, off-road navigation and stereo.


Figure 1. The UMass Mobile Perception Laboratory


Physically, MPL is a modification of the ambulance (996) model of the
HMMWV (see Figure 1). In front it has a pair of forward-looking,
black-and-white CCD cameras for road following and obstacle
detection. On top it has a pan-and-tilt controllable stabilized
platform, with a 360 degree field of view, that carries two color CCD
cameras and an infra-red sensor. One of the color cameras has a
wide-angle lens for rapidly finding landmarks, while the other has a
telephoto lens for focusing on a landmark and determining its pose.
In the back, MPL contains a complete computer laboratory, including
power, air conditioning and facilities for two researchers (in
addition to the safety driver). On-board computing resources include
a Motorola 68030 processor for directly controlling speed and
direction, a Datacube MaxVideo 20 image processor, and a
four-processor 340GX workstation from Silicon Graphics. Space and
power are also provided for the Image Understanding Architecture, a
massively-parallel heterogeneous processor being developed jointly by
the University of Massachusetts and Hughes Research Laboratory [Weems
89].

3.  Software  Environment  Specifications 

The first step in designing any software environment is to determine
what features are needed. In the case of the MPL, we needed to
integrate many different styles of visual algorithms within a single
real-time environment. The ALVINN road-following program [Pomerleau
90,92], for example, is a neural-net algorithm developed at CMU that
takes a reduced (30x32) image as input and returns a steering angle.
The centerpiece of the landmark recognition system, on the other
hand, is a geometric model matcher that matches 2D line segments at
the highest resolution available, extracted from an image, to 3D
model edges and determines the pose of the camera relative to the
landmark. It was obvious, therefore, that we needed to support a wide
range of algorithms and data structures.

The decision was made, therefore, to create an environment based on
the idea of visual modules. ALVINN, for example, would be one module,
while the geometric model matcher [Beveridge 92] would be another.
Additional modules were quickly defined for camera control, line
extraction [Burns 86], pixel-based tracking, symbolic line tracking
[Williams 88, Sawhney 92a], and many other visual tasks. Each module
is defined in terms of its input and output, and a set of tuning
parameters that allow the on-board researchers to modify the
performance of a module. The primary function of the on-board
software environment, then, is to facilitate the assembly of existing
modules into aggregates which perform high level functions and to
“optimize” the performance of this new function by tuning the
parameters of the component processes.
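The notion of a visual module, defined by its input, its output and a
set of tuning parameters, and of aggregates built from existing
modules, can be sketched as follows. This is only an illustration of
the idea: the class VisualModule, the helper chain and the two toy
modules are hypothetical and are not part of MoPLSE.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict


@dataclass
class VisualModule:
    """A named function plus the tuning parameters a researcher may adjust."""
    name: str
    run: Callable[..., Any]
    params: Dict[str, Any] = field(default_factory=dict)

    def __call__(self, data):
        return self.run(data, **self.params)


def chain(*modules):
    """Assemble modules into an aggregate that feeds each output forward."""
    def aggregate(data):
        for module in modules:
            data = module(data)
        return data
    return aggregate


# Two toy modules standing in for real visual algorithms.
def keep_long_lines(lines, min_length=10.0):
    return [line for line in lines if line["length"] >= min_length]


def mean_length(lines):
    return sum(line["length"] for line in lines) / len(lines) if lines else 0.0


line_filter = VisualModule("line_filter", keep_long_lines, {"min_length": 10.0})
summary = VisualModule("mean_length", mean_length)
pipeline = chain(line_filter, summary)

lines = [{"length": 4.0}, {"length": 12.0}, {"length": 20.0}]
first = pipeline(lines)                   # runs with min_length = 10.0
line_filter.params["min_length"] = 15.0   # retune without editing any module
second = pipeline(lines)
print(first, second)
```

Changing a parameter alters the behavior of the aggregate without
touching the code of any module, which is the kind of on-board tuning
the text describes.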

Part of the reason for this decision was the realization that the
on-board environment did not have to support the development of new
modules. In general, basic algorithm development occurs in the indoor
laboratory using data collected from previous experiments. The long
hours of algorithm design, implementation and initial debugging do
not occur on the vehicle. Only when modules have been implemented and
initially debugged are they put on the MPL, where they can be tested,
revised and have any newly discovered bugs removed. As a result, the
module development environment is different from the on-board
real-time environment, and is not constrained by its real-time
demands.

Within the laboratory, therefore, vision researchers are free to use
any of the existing computer vision environments. At the University
of Massachusetts the most common choice for module development is
KBVision, although the Khoros environment is also available. As
discussed below, tools have been developed to help researchers
convert their algorithms from KBVision or Khoros to run in the MoPLSE
environment.

The design of MoPLSE therefore focused on how to support the
controlled invocation of visual modules, and how to pass data from
one module to another. Before discussing these decisions, however, it
is helpful to review the KBVision and Khoros development environments
to see how many of MoPLSE's features are actually real-time
adaptations of features already available in these products.

3.1.  KBVision 

KBVision [Williams 90] is a computer vision software development
environment developed by Amerinex Artificial Intelligence (AAI).
Although KBVision supports the development of visual modules in
several ways, its most distinctive and salient features are the ISR
visual database and the Image Examiner graphics tool.

The ISR is a symbolic database for visual data originally developed
at the Univ. of Massachusetts [Brolio 89]. ISR was motivated by the
belief that computer vision requires more than image-like arrays of
numerical data; it depends on symbolic representations of abstract
image events such as regions, lines, and surfaces^ and on mechanisms
for efficiently accessing the objects under various types of
constraints (such as spatial proximity). Visual modules operate on
these symbolic records, called tokens, or on sets of tokens, and
generally either produce new tokens, fill in fields of existing ones,
or both.

^ The data exchange format of the IUE reflects a similar belief.

The ISR serves as the data repository at the center of the system.
Visual modules access data, in the form of tokens, in ISR and create
new tokens which they add to the database. (Strictly speaking, it is
a misnomer to call ISR a database; although it does have facilities
for writing data sets to files and reading them back in again, file
access is too slow a data transfer mechanism for real-time vision.
The ISR keeps its data in memory, and should therefore be called a
data store.) What separates ISR from other systems is that its data
access routines have been optimized for computer vision.

In particular, ISR must provide functions for accessing tokens by
spatial location and by feature value, as well as by name. Such
access functions are important because tokens are rarely isolated
entities: typically, they have important symbolic, numeric and
logical relations to each other. Hierarchies of tokens are important,
as when surfaces are defined by ordered sequences of lines, and lines
are defined by pairs of endpoints. It is therefore important to be
able to access all the points in a given model, for example (a form
of indirect named access). Just as frequently, tokens are organized
less rigidly into sets, such as the set of lines extracted from a
single image. In such cases, it is important for a visual module to
be able to access subsets of these tokens, for example all the lines
in the upper corner of an image or all the long lines in an image,
examples of spatial and associative access.

Furthermore, it is often convenient for visual modules to operate on
tokens as elements of a set. One form of this is when a user wants to
combine multiple forms of access, for example by accessing the long
lines in the upper corner of an image. Other times a visual module
may need to operate over the tokens in a set, for example to find the
average of the length of lines in a set. For these reasons, ISR
includes operations for iterating over the tokens in a set, as well
as taking the union, intersection and differences of sets^.
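The three forms of access (by name, by spatial location, by feature
value) and the set operations described above can be illustrated
with a small in-memory token set. The sketch below is not the ISR
interface: the class and method names are invented for the example,
spatial access is a linear scan over line midpoints rather than an
optimized index, and the coordinates assume an image origin at the
upper left.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LineToken:
    name: str
    x1: float
    y1: float
    x2: float
    y2: float

    @property
    def length(self):
        return ((self.x2 - self.x1) ** 2 + (self.y2 - self.y1) ** 2) ** 0.5

    @property
    def midpoint(self):
        return ((self.x1 + self.x2) / 2, (self.y1 + self.y2) / 2)


class TokenSet:
    """A set of tokens keyed by name (hypothetical stand-in for an ISR set)."""

    def __init__(self, tokens=()):
        self._by_name = {t.name: t for t in tokens}

    def __iter__(self):
        return iter(self._by_name.values())

    def __len__(self):
        return len(self._by_name)

    def names(self):
        return set(self._by_name)

    def by_name(self, name):                      # named access
        return self._by_name[name]

    def where(self, predicate):                   # associative access
        return TokenSet(t for t in self if predicate(t))

    def in_region(self, xmin, ymin, xmax, ymax):  # spatial access
        def inside(t):
            x, y = t.midpoint
            return xmin <= x <= xmax and ymin <= y <= ymax
        return self.where(inside)

    def union(self, other):
        return TokenSet(list(self) + list(other))

    def intersection(self, other):
        return TokenSet(t for t in self if t.name in other._by_name)

    def difference(self, other):
        return TokenSet(t for t in self if t.name not in other._by_name)


lines = TokenSet([
    LineToken("a", 10, 10, 100, 10),
    LineToken("b", 20, 20, 30, 30),
    LineToken("c", 150, 200, 250, 200),
])
long_lines = lines.where(lambda t: t.length >= 50)
corner = lines.in_region(0, 0, 128, 128)
long_in_corner = long_lines.intersection(corner)   # combined access
mean_corner_length = sum(t.length for t in corner) / len(corner)
print(sorted(long_in_corner.names()), round(mean_corner_length, 2))
```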

^ The complement of a set is not well defined, since there is an
infinite universe of possible tokens.

The Image Examiner is a very different kind of tool. It provides the
visual module designer with graphics support for displaying images
and data. As such, it is only one of many tools for examining images,
including displaying them at different scales or with altered color
maps. What distinguishes the Image Examiner from most other tools
(e.g. XV) is that it is integrated with ISR and includes routines for
displaying most common forms of visual symbolic tokens. It is
therefore trivial, for example, to overlay a set of lines on top of
an image, or on top of a symbolic image region. Tokens can also be
pseudo-colored according to feature ranges, allowing the user to
visually inspect symbolic data.


3.2. Khoros

Khoros [Argiro 91], although often compared to KBVision, is in fact a
very different type of system, designed for different uses. Like
KBVision, Khoros is organized around a set of visual modules. But
whereas KBVision was designed as a high-level image understanding
environment, with facilities for reasoning about lines, surfaces and
other symbolic events, Khoros was designed as an image processing
environment. As a result, it has no equivalent to the ISR and
manipulates only image data. (It provides a set of tools which are
similar in function to what the Image Examiner provides KBVision, but
again they apply only to image data.)

Where Khoros is effective, however, is in its user interface. Khoros
provides an execution environment, called Cantata, in which visual
modules are represented by graphical icons. Clicking on an icon pops
up a menu of the module's parameters, and allows the user to adjust
those parameters without having to recompile the module or the
system. Even more conveniently, visual modules can be chained
together simply by graphically connecting the output of one module to
the input of another (see Figure 2). Loops and branching sequences of
modules can also be graphically created. Adding a new module into the
graphical interface is easily accomplished by tools in the Khoros
system called ghostwriter.

Figure 2. An Example Cantata Interface from Khoros


4.  MoPLSE 

Unfortunately, neither KBVision nor Khoros satisfied the needs of the
MPL project. KBVision has a central data store (ISR) and convenient
graphics tools, but does not provide a flexible enough graphical
interface for sequencing modules or putting them in loops. Khoros has
a flexible user interface, but only provides support for image-like
data. Moreover, neither system was originally designed for real-time
research, in that both systems pass data through files and execute
each module in its own UNIX process. If used in a real-time
application, the costs of file access and process creation would
prove unacceptable.

A new software environment was therefore needed for the MPL. MoPLSE
was built around an improved version of the ISR database and Khoros's
user interface, with graphics facilities built on the XV system
developed at the University of Pennsylvania. What follows is a brief
description of MoPLSE in terms of these three components.

4.1.  ISR3 

The centerpiece of MoPLSE is the ISR. MoPLSE's ISR (called ISR3) is
conceptually similar to the earlier versions built at the University
of Massachusetts [Brolio 89] and to the ISR embedded in KBVision™.
Both systems provide facilities for defining symbolic token types,
adding new tokens to the database, and accessing tokens by name,
value and spatial location (see above). ISR3 differs from the earlier
versions, however, in that it stores native C data structures, is
memory-based rather than file-based, and provides primitives for
memory management and multiprocess synchronization. However, ISR3
does provide facilities for storing data to files for debugging.

Studying  each  of  these  improvements  in  turn,  we 
begin  with  ISR3's  ability  to  store  native  C  structs 
rather  than  special,  ISR-defined  records. 
KBVision's  ISR  includes  a  language  for  defining 
token  types.  Besides  the  conceptual  overhead  of
learning  another  syntax,  this  forces  users  to  access
fields  of  tokens  through  special  ISR  access 
functions,  and  at  times  to  copy  data  into  C 
structures.  In  a  real-time  environment,  speed  of 
data  access  is  crucial,  and  copying  data  should  be 
avoided.  In  ISR3,  tokens  are  C  structures  stored  in 
shared  memory,  and  in  order  to  define  a  new  token 
type,  a  user  simply  adds  a  new  structure  definition 
to  a  file.  By  storing  native  C  structures  and 
returning  pointers  to  them,  ISR3  obviates  the  need 
to  copy  data  (other  than  for  making  local 
destructive  changes).  Just  as  importantly,  access 
to  C  structures  is  optimized  by  most  commercial  C 
compilers. 
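The benefit of returning pointers to native structures rather than
copies can be imitated in Python with ctypes. In this sketch a
bytearray stands in for the shared-memory segment and LineRecord
plays the role of a user-defined C struct; both, together with the
helper token_at, are assumptions made for the illustration and do not
reflect the actual ISR3 interface.

```python
import ctypes


class LineRecord(ctypes.Structure):
    """Plays the role of a user-defined C struct for a line token."""
    _fields_ = [
        ("x1", ctypes.c_float), ("y1", ctypes.c_float),
        ("x2", ctypes.c_float), ("y2", ctypes.c_float),
        ("contrast", ctypes.c_float),
    ]


SLOTS = 4
# A bytearray stands in for the shared-memory segment holding the tokens.
store = bytearray(ctypes.sizeof(LineRecord) * SLOTS)


def token_at(index):
    """Return a view onto slot `index` of the store; nothing is copied."""
    return LineRecord.from_buffer(store, index * ctypes.sizeof(LineRecord))


producer_view = token_at(2)
producer_view.x1, producer_view.y1 = 3.0, 4.0   # written in place

consumer_view = token_at(2)                     # a second "pointer" to the slot
print(consumer_view.x1, consumer_view.y1)
```

Because both views refer to the same bytes, the consumer sees the
producer's fields without any copy or conversion step, which is the
property the text attributes to storing native C structures.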

By default, ISR3 structures are stored not in a file but in shared
memory. (Users can also specify that a token is local to a process
and should be allocated in local memory.) As a result, when one
process needs access to tokens created by another, it simply uses the
ISR3 token access functions to get a pointer to the appropriate
structures in memory. This is much faster than in KBVision, where one
process has to write the data out to a file and the other has to read
it back into its local memory. Moreover, ISR3 allows two processes to
share data, which is not possible in KBVision.

Shared tokens in turn bring us to multiprocess synchronization. One
problem with shared visual databases in a multiprocess environment is
how to stop processes from accidentally overwriting or destroying
each other's data. In a typical application, hundreds or thousands of
tokens may be in memory at any one time. If modules have uncontrolled
access to these tokens and modify them, unpredictable interactions
may cause hard-to-find bugs. On the other hand, it is not uncommon
for a process to access a hundred tokens at a time, for example when
a grouping routine looks for pencils of converging lines in vanishing
point analysis. If such a process has to lock and unlock a token each
time it reads a feature value, then given the relatively slow speed
of semaphores under UNIX, protection becomes unacceptably expensive.

Our  compromise,  therefore,  was  to  associate 
semaphores  with  sets  of  tokens.  When  a  set  of 
related  tokens  are  created,  for  example  the  set  of 
lines  from  a  single  image,  a  semaf^re  is  allocated 
to  protect  those  tokens.  Any  ISR3  function  that 
accesses  that  set  of  tokens  will  first  check  the 
semaphore;  if  users  access  them  surreptitiously 
through  C  pointers,  they  are  expected  to  lock  the 
semaphore  before  accessing  the  first  token  and  to 
unlock  it  after  the  last  token  access.  All  ISR3 
access  functions  that  create  subsets  of  a  set  of
tokens  will  assign  the  same  semaphore  to  the 
subset  that  is  used  for  the  parent  set. 
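
The locking discipline described above can be sketched in C. This is an illustration only, not ISR3 code: the names LineToken, TokenSet, token_set_init, token_subset and total_length are hypothetical, and a POSIX mutex stands in for the UNIX semaphore mentioned in the text. The sketch shows the two key ideas: one lock protects a whole set, so a reader pays for one lock and one unlock however many tokens it visits, and a subset shares the lock of its parent set.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical token: a line segment with one derived feature value. */
typedef struct {
    double x0, y0, x1, y1;
    double length;
} LineToken;

/* Hypothetical token set. The lock is held by pointer so that a subset
   can share the lock of its parent set. */
typedef struct {
    LineToken       *tokens;
    size_t           count;
    pthread_mutex_t *lock;   /* stand-in for the per-set semaphore */
} TokenSet;

/* Bind a set to caller-owned token storage and a caller-owned lock. */
static void token_set_init(TokenSet *set, LineToken *tokens, size_t count,
                           pthread_mutex_t *lock)
{
    set->tokens = tokens;
    set->count  = count;
    set->lock   = lock;
}

/* A subset views part of the parent's tokens and inherits its lock. */
static TokenSet token_subset(const TokenSet *parent, size_t first, size_t count)
{
    assert(first + count <= parent->count);
    TokenSet sub = { parent->tokens + first, count, parent->lock };
    return sub;
}

/* Lock once, read every token through plain C pointers, unlock once. */
static double total_length(const TokenSet *set)
{
    double sum = 0.0;
    pthread_mutex_lock(set->lock);
    for (size_t i = 0; i < set->count; i++)
        sum += set->tokens[i].length;
    pthread_mutex_unlock(set->lock);
    return sum;
}
```

Because a subset and its parent share one lock, a process that modifies the subset also excludes readers of the parent, which is the conservative behaviour the text describes.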

Finally,  ISR3  provides  a  level  of  memory 
management.  For  non-real-time,  file-based 
systems,  memory  management  is  not  a  critical 
issue;  for  continuous,  real-time  systems,  however, 
it  is  crucial.  ISR3  applications  operate  in  real-time 
loops,  allocating  new  tokens  on  each  iteration. 
Memory  allocation  must  be  rapid.  More 
importantly,  memory  must  be  recycled,  with  space 
allocated  to  old  tokens  being  reassigned  to  new 
ones  once  the  old  data  is  no  longer  needed. 
Unknown  to  the  user,  ISR3  provides  buffers  for 
tokens  of  every  declared  type,  with  freed  tokens 
being  returned  to  the  appropriate  buffer.  For  users 
who  store  tokens  in  hierarchies,  ISR3  also 
provides  functions  for  tracing  through  a  hierarchy 
and  freeing  all  the  tokens  in  it,  so  that  it  is  easy,
for  example,  to  free  all  the  memory  associated 
with  a  given  image  once  that  image  is  no  longer 
current. 
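
One common way to recycle token memory is a free list per token type, sketched below in C. The paper does not say how the ISR3 buffers are implemented, so this is an assumption made for illustration; TokenPool, pool_init, pool_alloc, pool_free and pool_destroy are hypothetical names. A freed token is pushed onto its pool's list, and the next allocation of that type pops it off instead of calling the system allocator.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* A freed token's storage is reused to hold the free-list link. */
typedef struct FreeNode { struct FreeNode *next; } FreeNode;

/* Hypothetical pool for one declared token type. */
typedef struct {
    size_t    token_size;  /* bytes per token, at least sizeof(FreeNode) */
    FreeNode *free_list;   /* recycled tokens awaiting reuse */
} TokenPool;

static void pool_init(TokenPool *pool, size_t token_size)
{
    pool->token_size = token_size < sizeof(FreeNode) ? sizeof(FreeNode)
                                                     : token_size;
    pool->free_list  = NULL;
}

/* Reuse a recycled token if one is available; otherwise ask malloc. */
static void *pool_alloc(TokenPool *pool)
{
    if (pool->free_list != NULL) {
        FreeNode *node = pool->free_list;
        pool->free_list = node->next;
        return node;
    }
    return malloc(pool->token_size);
}

/* Return a token to its pool rather than to the system allocator. */
static void pool_free(TokenPool *pool, void *token)
{
    assert(token != NULL);
    FreeNode *node = token;
    node->next = pool->free_list;
    pool->free_list = node;
}

/* Release all recycled tokens, for example at shutdown. */
static void pool_destroy(TokenPool *pool)
{
    while (pool->free_list != NULL) {
        FreeNode *node = pool->free_list;
        pool->free_list = node->next;
        free(node);
    }
}
```

In a real-time loop that frees one image's tokens and then allocates the next image's, allocation in steady state becomes a pointer pop, which is why this kind of scheme suits continuous operation.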

4.2.  A  Modified  Cantata 

MoPLSE's  graphical  user  interface  is  a 
modification  of  the  interface  found  in  Khoros's
Cantata  program.  As  in  Cantata,  an  icon  is  created
for  each  visual  module  which,  when  clicked  on,
gives  the  user  access  to,  and  an  ability  to  change,
that  module's  parameters.  A  module  can  be
executed  simply  by  clicking  on  the  icon's  run
button,  and  libraries  of  available  modules  can  be
selected  with  the  mouse  (see  Figure  2).

More  importantly,  MoPLSE  borrows  Cantata's
facilities  for  sequencing  modules.  If  process  B 
uses  data  created  by  process  A,  then  the  user 
simply  draws  a  line  from  the  output  of  A  to  the 
input  of  B.  This  tells  the  execution  monitor  to 
execute  process  A  before  process  B,  and  to  route 
its  output  accordingly.  Users  can  also  create 
infinite  or  counted  loops  in  which  B  follows  A  and
A  follows  B,  or  branching  sequences  of  control  in 
which  the  results  of  one  module  are  used  to  select 
one  of  two  control  flow  branches. 
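
The sequencing rule (a module runs only after every module wired to its input) can be sketched as a small scheduler in C. This is not the Cantata or MoPLSE execution monitor; Workspace, connect_modules and schedule are hypothetical names, and the sketch covers only the acyclic case. It does not model the looping or branching constructs, and it reports a wiring cycle by scheduling fewer modules than the workspace holds.

```c
#include <assert.h>
#include <stdbool.h>

enum { MAX_MODULES = 16 };

/* Hypothetical workspace: feeds[a][b] is true when the output of
   module a has been wired to the input of module b. */
typedef struct {
    int  count;
    bool feeds[MAX_MODULES][MAX_MODULES];
} Workspace;

static void connect_modules(Workspace *ws, int from, int to)
{
    assert(from >= 0 && from < ws->count);
    assert(to >= 0 && to < ws->count);
    ws->feeds[from][to] = true;
}

/* Write an execution order into order[] and return how many modules
   were scheduled. A result smaller than ws->count means the wiring
   contains a cycle that this sketch does not handle. */
static int schedule(const Workspace *ws, int order[])
{
    int  pending[MAX_MODULES] = {0};   /* unfinished producers per module */
    bool done[MAX_MODULES] = {false};
    int  scheduled = 0;

    for (int a = 0; a < ws->count; a++)
        for (int b = 0; b < ws->count; b++)
            if (ws->feeds[a][b])
                pending[b]++;

    for (;;) {
        int next = -1;
        for (int m = 0; m < ws->count; m++)
            if (!done[m] && pending[m] == 0) { next = m; break; }
        if (next < 0)
            break;                     /* nothing runnable remains */
        done[next] = true;
        order[scheduled++] = next;
        for (int b = 0; b < ws->count; b++)
            if (ws->feeds[next][b])
                pending[b]--;          /* one producer has finished */
    }
    return scheduled;
}
```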

The  problems  with  the  original  Cantata,  as 
mentioned  before,  are  1)  that  it  only  provides 
facilities  for  passing  image-like  data  from  one 
module  to  the  next,  2)  it  executes  each  module  as  a 
separate  process,  and  3)  data  is  passed  from  one 
module  to  the  next  through  files.  MoPLSE 
addresses  the  first  problem  by  integrating  ISR3
into  the  execution  controller.  Modules  may  output 
a  token  or  set  of  tokens  of  any  type  known  to 
ISR3.  Graphically,  these  tokens  are  passed  to 
other  modules  just  as  image  data  is  passed  from 
module  to  module  in  Cantata  -  by  connecting  the 
output  of  one  module  to  the  input  of  another.  The 
interface  knows  enough  about  ISR3  to  check  that 
the  token  types  match,  and  if  they  do  allows  the 
connection  to  be  made. 
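
The connection check can be sketched as a comparison of declared token types. How the MoPLSE interface actually represents ports and types is not stated in the paper, so the Port structure, the type-name strings and can_connect below are assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical description of one module input or output. */
typedef struct {
    const char *name;        /* label shown on the module's icon */
    const char *token_type;  /* name of a declared token type */
} Port;

/* Allow a connection only when the producer's token type matches
   the type the consumer expects. */
static bool can_connect(const Port *output, const Port *input)
{
    assert(output != NULL && input != NULL);
    return strcmp(output->token_type, input->token_type) == 0;
}
```

Under these assumptions, an output declared as "line2d" could be wired to an input that expects "line2d", while an attempt to wire it to an input that expects "image" would be refused before any module runs.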

ISR3  also  solves  Cantata's  third  problem  of
passing  data  through  files.  In  MoPLSE,  data  is
passed  as  pointers  to  tokens  in  shared  memory,
eliminating  the  time  delay  related  to  files^.  The
second  drawback  to  Khoros  is  removed  by  having
the  execution  environment  execute  visual  modules
as  subroutines  rather  than  separate  processes.
Although  there  are  disadvantages  associated  with
using  subroutines,  the  gain  in  eliminating  process
creation  overhead  more  than  makes  up  for  them.

4.3.  Graphics

The  third  and  final  component  of  MoPLSE  is  its
graphics  tools.  Khoros  provides  a  complete  set  of 
graphics  tools,  but  they  are  applicable  only  to 
image-like  data.  Facilities  for  drawing  2D  line 
segments,  for  example,  or  projecting  3D  lines  are 
not  provided.  KBVision's  Image  Examiner 
includes  the  ability  to  display  lines,  regions  and 
other  symbolic  tokens,  but  currently  it  cannot  be 
expanded  to  display  user-defined  token  types.  It 
only  displays  the  basic  set  of  token  types  defined 
by  Amerinex.  Since  many  of  the  token  types  used
in  research  aboard  MPL,  such  as  surfaces  and  3D
lines,  are  not  currently  included  among  AAI's
predefined  token  types,  another  set  of  graphics
tools  is  included  in  MoPLSE.

^  Caching  techniques  sometimes  make  files  equivalent
to  passing  pointers  in  memory,  but  this  cannot  be  relied
on,  particularly  with  large  data  sets.

Fortunately,  many  other  research  institutions  have 
developed  graphics  tools  before  us.  One  such  tool 
is  XV,  developed  at  the  University  of 
Pennsylvania.  XV  is  a  portable  graphics  facility,
with  available  source  code,  for  displaying  images 
under  X  windows.  Like  Cantata's  graphics  tools, 
however,  it  is  limited  to  displaying  image-like 
data.  In  order  to  build  a  facility  for  displaying 
both  standard  and  novel  token  types,  we  divided 
the  graphics  process  into  three  steps:  rasterization,
overlay  and  display.  MoPLSE  has  modules  for
rasterizing  lines,  regions,  displacement  vectors,
and  other  types  of  tokens.  These  modules  produce
image-like  data  from  symbolic  tokens,  and  in 
some  cases  multiple  modules  exist  for  rasterizing  a 
single  type  of  token.  Lines,  for  example,  can  be 
displayed  with  or  without  arrows  indicating  their 
gradient  direction.  Users  who  wish  to  display  their 
own  user-defined  token  types  need  to  develop 
modules  for  rasterizing  them  according  to  their 
unique  semantics.  Another  module  is  provided  for 
overlaying  rasterized  data,  so  that,  for  example, 
lines  can  easily  be  displayed  over  the  image  they 
were  extracted  from.  The  overlay  module  is
actually  capable  of  any  logical  combination  of  two
rasterized  data  images,  but  simple  overlay  is  by  far
the  most  commonly  used.  Finally,  a  display
module  invokes  XV  to  display  the  rasterized,  and
possibly  overlaid,  data. 
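
The first two steps, rasterization and overlay, can be sketched in C as follows. This is not MoPLSE code: Raster, rasterize_line, OverlayOp and overlay are hypothetical names, the fixed 8 x 8 binary raster is chosen only to keep the example small, and Bresenham's algorithm is used as one standard way to rasterize a line segment. The display step is omitted because it depends on an external viewer.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

enum { W = 8, H = 8 };

/* Hypothetical image-like output of a rasterizing module. */
typedef struct { unsigned char px[H][W]; } Raster;

/* Rasterize a line token with Bresenham's algorithm, clipping to the raster. */
static void rasterize_line(Raster *r, int x0, int y0, int x1, int y1)
{
    int dx = abs(x1 - x0), sx = x0 < x1 ? 1 : -1;
    int dy = -abs(y1 - y0), sy = y0 < y1 ? 1 : -1;
    int err = dx + dy;
    for (;;) {
        if (x0 >= 0 && x0 < W && y0 >= 0 && y0 < H)
            r->px[y0][x0] = 1;
        if (x0 == x1 && y0 == y1)
            break;
        int e2 = 2 * err;
        if (e2 >= dy) { err += dy; x0 += sx; }
        if (e2 <= dx) { err += dx; y0 += sy; }
    }
}

typedef enum { OVERLAY_OR, OVERLAY_AND, OVERLAY_XOR } OverlayOp;

/* Combine two rasters pixel by pixel with a logical operation.
   OVERLAY_OR corresponds to the simple overlay case. */
static void overlay(const Raster *a, const Raster *b, OverlayOp op, Raster *out)
{
    for (int y = 0; y < H; y++) {
        for (int x = 0; x < W; x++) {
            unsigned char p = a->px[y][x], q = b->px[y][x];
            switch (op) {
            case OVERLAY_OR:  out->px[y][x] = p | q; break;
            case OVERLAY_AND: out->px[y][x] = p & q; break;
            case OVERLAY_XOR: out->px[y][x] = p ^ q; break;
            }
        }
    }
}
```

Keeping rasterization separate from overlay means a new token type needs only its own rasterizing routine; the overlay and display stages can stay unchanged.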